UDPipe
Latest revision as of 19:43, 9 March 2020
First things first
This section describes how to set up and train/test UDPipe. If you've already done this, then you can move on to the next part.
- Get the code!
git clone https://github.com/ufal/udpipe
cd udpipe/src
make
Now copy the udpipe/src/udpipe binary executable to somewhere in your $PATH.
- Get some data!
git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal
- Train a default model
With tokeniser and tagger:
cat no_bokmaal-ud-train.conllu | udpipe --train nob.udpipe
Without tokeniser and tagger:
cat no_bokmaal-ud-train.conllu | udpipe --tokenizer none --tagger none --train nob.udpipe
- Parse some input
With gold standard POS tags:
cat no_bokmaal-ud-dev.conllu |cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g'| udpipe --parse nob.udpipe > output
Full pipeline:
echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe
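The gold-standard POS tag pipeline above keeps the first six CoNLL-U columns and blanks the rest, so the parser cannot see the gold heads and relations. A minimal Python sketch of the same idea follows. The function name strip_dependencies is made up for this illustration, and unlike the cut/sed pipeline it leaves comment lines unchanged.

```python
def strip_dependencies(conllu_text):
    """Keep columns 1-6 (ID..FEATS) and blank HEAD, DEPREL, DEPS and MISC."""
    out = []
    for line in conllu_text.split("\n"):
        if not line or line.startswith("#"):
            # Sentence breaks and comment lines pass through unchanged.
            out.append(line)
            continue
        cols = line.split("\t")
        out.append("\t".join(cols[:6] + ["_"] * 4))
    return "\n".join(out)


if __name__ == "__main__":
    import sys
    # Usage (assumed): python3 strip_deps.py < no_bokmaal-ud-dev.conllu | udpipe --parse nob.udpipe
    sys.stdout.write(strip_dependencies(sys.stdin.read()))
```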
- Calculate accuracy
udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu
- Get more stuff!
You'll also need a couple of scripts from https://github.com/ftyers/ud-scripts
git clone https://github.com/ftyers/ud-scripts.git
Parameters
For playing with the parameters we're going to try a smaller treebank:
git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish
The Turkish UD treebank has a fairly high proportion of non-projective dependencies, on the order of 15%, so it makes a good test case for trying out different options.
Default parsing options
First of all, try training the parser with the default options. These are:
- Parsing algorithm is projective
- Number of training iterations is 10
- Hidden layer size is 200
udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu
You can also download a pretrained model for Turkish trained using the default parsing options here:
- http://ilazki.thinkgeek.co.uk/tur.proj.udpipe
Using the swap algorithm
If we want to support parsing non-projective trees we can use the swap algorithm:
udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu
(This will take around 15 minutes)
If you want to see how many trees are non-projective before and after, you can use the script: conllu-extract-non-projective.py
python3 conllu-extract-non-projective.py < tr-ud-train.conllu > tr-ud-train.nonproj.conllu
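To make the notion of a non-projective tree concrete, here is a small self-contained sketch that counts them. It is not the code of conllu-extract-non-projective.py. The function names are made up for this illustration, and it assumes every basic token has a numeric HEAD in column 7. A tree is treated as non-projective if any two arcs cross, with the artificial root placed at position 0.

```python
def read_heads(conllu_text):
    """Return one list of HEAD values per sentence (basic tokens only)."""
    sentences, heads = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            if heads:
                sentences.append(heads)
                heads = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Skip multiword ranges such as 1-2 and empty nodes such as 1.1.
            if cols[0].isdigit():
                heads.append(int(cols[6]))
    if heads:
        sentences.append(heads)
    return sentences


def is_projective(heads):
    """heads[i] is the head of token i+1; 0 is the artificial root."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:  # the two arcs cross
                return False
    return True


def percent_non_projective(conllu_text):
    """Percentage of sentences whose tree has at least one crossing arc."""
    sentences = read_heads(conllu_text)
    if not sentences:
        return 0.0
    bad = sum(1 for heads in sentences if not is_projective(heads))
    return 100.0 * bad / len(sentences)
```

The quadratic arc comparison is slow for very long sentences but is the easiest version to check by hand.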
Using external embeddings
For calculating the word embeddings we'll use word2vec. UDPipe can directly use this kind of embedding file.
- Compile word2vec
git clone https://github.com/dav/word2vec.git
cd word2vec/src
make
Now copy the word2vec binaries somewhere in your $PATH.
- Get a corpus and tokenise
wget "http://opus.lingfil.uu.se/download.php?f=SETIMES2/en-tr.txt.zip" -O en-tr.txt.zip
unzip en-tr.txt.zip
cat SETIMES2.en-tr.tr | sed 's/[][,;:!?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt
- Train word2vec
word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
  -threads 12 -binary 0 -iter 15 -min-count 2
These are the settings suggested by the UDPipe documentation. It shouldn't take more than 4-5 minutes to train.
- Now use the embeddings with UDPipe
udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu
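Before training with the embeddings, it can be worth checking that tur.vec looks sane. With -binary 0, word2vec writes a text file whose first line holds the vocabulary size and the vector dimension, followed by one word and its values per line. The reader below is a minimal sketch of that check. The function name is made up for this illustration, and it is not part of UDPipe or word2vec.

```python
def read_word2vec_text(lines):
    """Parse word2vec text output into (vocab_size, dim, {word: vector})."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError("unexpected number of fields: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


if __name__ == "__main__":
    # Assumes tur.vec was produced by the word2vec command above.
    with open("tur.vec", encoding="utf-8") as f:
        vocab_size, dim, vectors = read_word2vec_text(f)
    print(vocab_size, dim, len(vectors))
```

With the settings above, the reported dimension should be 50, matching -size 50.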
Classifier settings
- Increase the size of the hidden layer
udpipe --tokenizer none --tagger none --parser "hidden_layer=300" --train tur.h300.udpipe < tr-ud-train.conllu
The default size is 200. Try increasing it to 300 or decreasing it to 100, and see what happens.
- Train for a different number of iterations/epochs
udpipe --tokenizer none --tagger none --parser "iterations=5" --train tur.iter5.udpipe < tr-ud-train.conllu
The default is 10. Try some values in the range [1, 15].
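If you want to try several settings in one go, a short script can write out the training commands for you. The sketch below only builds command strings from the options already used in this section. The model naming scheme is made up for this illustration, and combining two parser options with a semicolon is an assumption you should check against the UDPipe manual.

```python
from itertools import product


def training_commands(train_file, hidden_layers, iterations):
    """Return one udpipe training command per (hidden_layer, iterations) pair."""
    commands = []
    for h, it in product(hidden_layers, iterations):
        model = f"tur.h{h}.iter{it}.udpipe"          # hypothetical naming scheme
        options = f"hidden_layer={h};iterations={it}"  # assumed ';' separator
        commands.append(
            f'udpipe --tokenizer none --tagger none --parser "{options}" '
            f"--train {model} < {train_file}"
        )
    return commands


if __name__ == "__main__":
    for command in training_commands("tr-ud-train.conllu", [100, 200, 300], [5, 10, 15]):
        print(command)
```

Printing the commands rather than running them lets you inspect them first and paste them into a shell.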
Parser combination
- Prerequisites
The important scripts for this section are:
- conllu-voting.py: Runs Chu-Liu-Edmonds over a weighted graph assembled from CoNLL-U files
- conllu-eval.py: Calculates LAS and UAS.
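As a reminder of what the two scores mean: UAS is the percentage of tokens that get the correct head, and LAS is the percentage that get both the correct head and the correct relation label. The sketch below is not the code of conllu-eval.py. It assumes gold and system output share the same tokenisation, and the function name is made up for this illustration.

```python
def attachment_scores(gold, system):
    """gold and system are lists of (head, deprel) pairs, one per token.

    Returns (UAS, LAS) as percentages.
    """
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same number of tokens")
    if not gold:
        raise ValueError("no tokens to score")
    correct_heads = sum(1 for g, s in zip(gold, system) if g[0] == s[0])
    correct_both = sum(1 for g, s in zip(gold, system) if g == s)
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total
```

LAS can never be higher than UAS, since every token counted for LAS also has the correct head.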
- Spanning tree algorithm
python3 conllu-voting.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu samples/example.1.3.conllu
Now try it on your own data.
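To get a feel for the voting step, here is a deliberately simplified sketch. It adds up a weight for every head each parser proposes and then picks the highest-scoring head for each token. This is plain per-token voting, not Chu-Liu-Edmonds, so the result is not guaranteed to be a tree (it may contain cycles). The function name and the weights argument are made up for this illustration.

```python
def vote_heads(parses, weights=None):
    """parses: one list of heads per parser, all over the same tokens.

    Returns the highest-weighted head for each token.
    """
    if weights is None:
        weights = [1.0] * len(parses)
    voted = []
    for tok in range(len(parses[0])):
        scores = {}
        for heads, w in zip(parses, weights):
            scores[heads[tok]] = scores.get(heads[tok], 0.0) + w
        # max() keeps the first maximal key, so ties go to the earlier parser.
        voted.append(max(scores, key=scores.get))
    return voted
```

Changing the weights, for example to each parser's LAS on the development set, is one simple way to experiment with alternative weighting.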
Using the CoNLL evaluation script
Open this link: http://universaldependencies.org/conll17/evaluation.html
Then download the evaluation script from http://universaldependencies.org/conll17/eval.zip
Use the evaluation script like this:
python3 evaluation_script/conll17_ud_eval.py gold_treebank.conllu system_output.conllu
More ideas
- Try training only on projective data. You can use the script conllu-extract-projective.py to make a subset of your treebank that only has projective trees.
- Download a treebank of a language you are interested in, and find out what percentage of trees are non-projective.
- Find a bug in one of the scripts and report it on GitHub.
- Describe an alternative weighting method for parser combination. Could better results be had with learnt weights?