Difference between revisions of "UDPipe"

From Apertium
==First things first==

This section describes how to set up and train/test UDPipe. If you've already done this, then you can move on to the next part.

;Get the code!
git clone https://github.com/ufal/udpipe
cd udpipe/src
make

Now copy the <code>udpipe/src/udpipe</code> binary executable to somewhere in your <code>$PATH</code>.

;Get some data!
git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal

;Train a default model

With tokeniser and tagger:
cat no_bokmaal-ud-train.conllu | udpipe --train nob.udpipe

Without tokeniser and tagger:
cat no_bokmaal-ud-train.conllu | udpipe --tokenizer none --tagger none --train nob.udpipe

; Parse some input

With gold standard POS tags:
cat no_bokmaal-ud-dev.conllu |cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g'| udpipe --parse nob.udpipe > output
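The <code>cut</code>/<code>sed</code> pipeline above keeps the first six CoNLL-U columns (ID to FEATS) and blanks the remaining four, so that the parser only sees the gold tags and not the gold tree. A minimal Python sketch of the same idea follows. The function name <code>strip_parse_columns</code> is made up for illustration, and unlike the shell pipeline it leaves comment lines untouched.

```python
def strip_parse_columns(conllu_text):
    """Replace HEAD, DEPREL, DEPS and MISC (columns 7-10) with '_' on token lines."""
    out = []
    for line in conllu_text.split("\n"):
        if not line or line.startswith("#"):
            out.append(line)  # blank lines and comments are kept as they are
            continue
        cols = line.split("\t")
        out.append("\t".join(cols[:6] + ["_"] * 4))
    return "\n".join(out)


sample = (
    "# text = En bok\n"
    "1\tEn\ten\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tbok\tbok\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
print(strip_parse_columns(sample))
```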

Full pipeline:
echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe

; Calculate accuracy

udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu

;Get more stuff!

You'll also need a couple of scripts from https://github.com/ftyers/ud-scripts

git clone https://github.com/ftyers/ud-scripts.git


For playing with the parameters we're going to try a smaller treebank:

git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish

The Turkish UD treebank has a fairly high proportion of non-projective dependencies, on the order of 15%, so it makes a good test case for trying out different options.

===Default parsing options===

First of all try training the parser with the default options. These are:

* Parsing algorithm is <code>projective</code>
* Number of training iterations is 10
* Hidden layer size is 200

udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu

You can also download a pretrained model for Turkish trained using the default parsing options here:

* http://ilazki.thinkgeek.co.uk/tur.proj.udpipe

===Using the <code>swap</code> algorithm===

If we want to support parsing non-projective trees we can use the <code>swap</code> algorithm:

udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu

(This will take around 15 minutes)

If you want to see how many trees are non-projective before and after, you can use the script: <code>conllu-extract-non-projective.py</code>

python3 conllu-extract-non-projective.py < tr-ud-train.conllu > tr-ud-train.nonproj.conllu
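To make the notion concrete, here is a minimal sketch of a projectivity check. It is not the code of <code>conllu-extract-non-projective.py</code>. It assumes the usual definition: an arc from head ''h'' to dependent ''d'' is projective if every token between them is a descendant of ''h''. All function names are hypothetical.

```python
def read_heads(sentence):
    """Map token ID -> head ID for one CoNLL-U sentence (plain integer IDs only)."""
    heads = {}
    for line in sentence.splitlines():
        cols = line.split("\t")
        # Skip comments, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) == 10 and cols[0].isdigit():
            heads[int(cols[0])] = int(cols[6])
    return heads


def dominates(heads, ancestor, node):
    """True if `ancestor` is reached when walking from `node` up to the root (0)."""
    seen = set()
    while node not in seen:
        if node == ancestor:
            return True
        if node == 0:
            return False
        seen.add(node)
        node = heads[node]
    return False  # a cycle was found, so this is not a well-formed tree


def is_projective(heads):
    for dep, head in heads.items():
        for between in range(min(head, dep) + 1, max(head, dep)):
            if not dominates(heads, head, between):
                return False
    return True


def non_projective_share(conllu_text):
    """Percentage of sentences that contain at least one non-projective arc."""
    trees = [read_heads(s) for s in conllu_text.strip().split("\n\n")]
    trees = [t for t in trees if t]
    if not trees:
        return 0.0
    return 100.0 * sum(not is_projective(t) for t in trees) / len(trees)


print(is_projective({1: 2, 2: 0, 3: 2}))
print(is_projective({1: 3, 2: 0, 3: 2}))
```

In the second tree the arc from token 3 to token 1 spans token 2, which is the root and so not a descendant of token 3.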

===Using external embeddings===

For calculating the word embeddings we'll use <code>word2vec</code>. UDPipe can directly use this kind of embedding file.

; Compile word2vec
git clone https://github.com/dav/word2vec.git
cd word2vec/src
make

Now copy the word2vec binaries somewhere in your $PATH.

; Get a corpus and tokenise
wget "http://opus.lingfil.uu.se/download.php?f=SETIMES2/en-tr.txt.zip" -O en-tr.txt.zip
unzip en-tr.txt.zip
cat SETIMES2.en-tr.tr | sed 's/[][,;:!?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt
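The intent of the <code>sed</code> step is to put spaces around punctuation and then squeeze runs of spaces into one. As a rough illustration, the same rule can be sketched in Python. The function name <code>tokenise</code> is hypothetical and the punctuation set is copied from the command above.

```python
import re

# Same punctuation set as the sed command: brackets, comma, semicolon, colon,
# exclamation and question marks, straight and curly double quotes, parentheses, braces.
PUNCT = re.compile(r'([\[\],;:!?"“”(){}])')


def tokenise(line):
    spaced = PUNCT.sub(r" \1 ", line)   # put a space on both sides of each mark
    return re.sub(r" +", " ", spaced)   # collapse runs of spaces


print(tokenise("Merhaba, dünya!"))
```

Like the shell version, this leaves full stops attached to the preceding word and may leave a trailing space.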

; Train word2vec:
word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
-threads 12 -binary 0 -iter 15 -min-count 2

These are the settings suggested by the UDPipe documentation. Training shouldn't take more than 4 to 5 minutes.

;Now use the embeddings with UDPipe

udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu
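Before training it can be worth checking that <code>tur.vec</code> looks sane. With <code>-binary 0</code>, word2vec writes a text file whose first line holds the vocabulary size and the vector dimension, followed by one word and its vector per line. The sketch below reads such a file and reports what it finds. The function name is hypothetical, and the demo uses an in-memory file rather than the real <code>tur.vec</code>.

```python
import io


def check_word2vec_text(handle):
    """Return (declared vocabulary size, dimension, rows actually read)."""
    vocab_size, dim = map(int, handle.readline().split())
    rows = 0
    for line in handle:
        parts = line.split()  # split() also drops the trailing space word2vec writes
        if not parts:
            continue
        if len(parts) != dim + 1:
            raise ValueError(f"row {rows + 1}: expected {dim} values, got {len(parts) - 1}")
        rows += 1
    return vocab_size, dim, rows


demo = io.StringIO("2 3\nbir 0.1 0.2 0.3 \nve 0.4 0.5 0.6 \n")
print(check_word2vec_text(demo))
```

For the real file you would call <code>check_word2vec_text(open("tur.vec", encoding="utf-8"))</code> and expect a dimension of 50, matching <code>-size 50</code> above.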

===Classifier settings===

;Increase the size of the hidden layer

udpipe --tokenizer none --tagger none --parser "hidden_layer=300" --train tur.h300.udpipe < tr-ud-train.conllu

The default size is 200. Try increasing it to 300 or decreasing it to 100, and see what happens.

;Train for a different number of iterations/epochs

udpipe --tokenizer none --tagger none --parser "iterations=5" --train tur.iter5.udpipe < tr-ud-train.conllu

The default is 10. Try some values in the range [1, 15].
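If you want to try several settings, it helps to generate the training commands rather than type them. The following sketch only builds and prints command lines. It assumes that several parser options can be joined with a semicolon inside one <code>--parser</code> argument, and the model file names are invented for illustration.

```python
import itertools
import shlex


def training_commands(train_file, hidden_layers, iterations):
    """Build one udpipe training command per (hidden_layer, iterations) pair."""
    commands = []
    for hidden, iters in itertools.product(hidden_layers, iterations):
        # Assumption: options inside --parser are separated by ';'.
        parser_opts = f"hidden_layer={hidden};iterations={iters}"
        model = f"tur.h{hidden}.iter{iters}.udpipe"  # hypothetical naming scheme
        argv = ["udpipe", "--tokenizer", "none", "--tagger", "none",
                "--parser", parser_opts, "--train", model]
        commands.append(shlex.join(argv) + " < " + shlex.quote(train_file))
    return commands


for command in training_commands("tr-ud-train.conllu", [100, 200, 300], [5, 10]):
    print(command)
```

The printed lines can be pasted into a shell or saved as a script, and each resulting model can then be scored with <code>udpipe --accuracy --parse</code> as shown earlier.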

==Parser combination==


The important scripts for this section are:

* <code>conllu-voting.py</code>: Runs Chu-Liu-Edmonds over a weighted graph assembled from CoNLL-U files
* <code>conllu-eval.py</code>: Calculates LAS and UAS.

;Spanning tree algorithm

python3 conllu-voting.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu samples/example.1.3.conllu
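The idea behind the voting step can be sketched as follows: every parser output contributes its head-to-dependent arcs to a weighted graph, and the combined tree is then chosen from that graph. The sketch below builds the graph with equal weight per parser (an assumption) and then simply takes the best-scoring head for each token. That greedy choice is a simplification and not Chu-Liu-Edmonds: it can produce cycles or several roots, which is the problem the spanning tree algorithm solves.

```python
from collections import Counter


def vote_edges(parses):
    """Count how many parsers propose each (head, dependent) arc."""
    weights = Counter()
    for heads in parses:
        for dep, head in heads.items():
            weights[(head, dep)] += 1
    return weights


def greedy_heads(weights):
    """Pick the highest-weighted head per token; ties go to the lowest head ID."""
    best = {}
    for (head, dep), weight in sorted(weights.items()):
        if dep not in best or weight > best[dep][1]:
            best[dep] = (head, weight)
    return {dep: head for dep, (head, _) in best.items()}


# Three hypothetical parser outputs for a three-token sentence (token -> head).
parses = [
    {1: 2, 2: 0, 3: 2},
    {1: 2, 2: 0, 3: 1},
    {1: 3, 2: 0, 3: 2},
]
weights = vote_edges(parses)
print(greedy_heads(weights))
```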

Now try it on your own data.

==Using the CoNLL evaluation script==

Open this link: http://universaldependencies.org/conll17/evaluation.html

Then download the evaluation script [http://universaldependencies.org/conll17/eval.zip here].

Use the evaluation script like this:

python3 evaluation_script/conll17_ud_eval.py gold_treebank.conllu system_output.conllu
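Two of the scores reported for parsing are UAS (the share of tokens with the correct head) and LAS (the share with both the correct head and the correct relation label). As a rough sketch of what these numbers mean, and not a replacement for the official script, the function below computes both under the simplifying assumption that the gold and system files have identical tokenisation.

```python
def token_rows(conllu_text):
    """Return (HEAD, DEPREL) pairs for every ordinary token line."""
    rows = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            rows.append((cols[6], cols[7]))
    return rows


def attachment_scores(gold_text, system_text):
    """Return (UAS, LAS) as percentages."""
    gold, system = token_rows(gold_text), token_rows(system_text)
    if len(gold) != len(system):
        raise ValueError("gold and system output differ in number of tokens")
    if not gold:
        return 0.0, 0.0
    uas = sum(g[0] == s[0] for g, s in zip(gold, system))
    las = sum(g == s for g, s in zip(gold, system))
    return 100.0 * uas / len(gold), 100.0 * las / len(gold)


gold = ("1\tEn\t_\t_\t_\t_\t2\tdet\t_\t_\n"
        "2\tbok\t_\t_\t_\t_\t0\troot\t_\t_\n")
system = ("1\tEn\t_\t_\t_\t_\t2\tnmod\t_\t_\n"
          "2\tbok\t_\t_\t_\t_\t0\troot\t_\t_\n")
print(attachment_scores(gold, system))
```

In this example both heads are right but one label is wrong, so UAS is higher than LAS.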

==More ideas==

* Try training only on projective data. You can use the script <code>conllu-extract-projective.py</code> to make a subset of your treebank that only has projective trees.
* Download a treebank of a language you are interested in, and find out what percentage of trees are non-projective.
* Find a bug in one of the scripts and report it on GitHub.
* Describe an alternative weighting method for parser combination. Could better results be had with learnt weights?


Latest revision as of 19:43, 9 March 2020
