{{TOCD}}

==First things first==

This section describes how to set up and train/test UDPipe. If you've already done this, then you can move on to the next part.


;Get the code!

<pre>
git clone https://github.com/ufal/udpipe
cd udpipe/src
make
</pre>

Now copy the <code>udpipe/src/udpipe</code> binary executable to somewhere in your <code>$PATH</code>.

;Get some data!

<pre>
git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal
</pre>

;Train a default model

With tokeniser and tagger:

<pre>
cat no_bokmaal-ud-train.conllu | udpipe --train nob.udpipe
</pre>

Without tokeniser and tagger:

<pre>
cat no_bokmaal-ud-train.conllu | udpipe --tokenizer none --tagger none --train nob.udpipe
</pre>

;Parse some input

With gold standard POS tags:

<pre>
cat no_bokmaal-ud-dev.conllu | cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g' | udpipe --parse nob.udpipe > output
</pre>

Full pipeline:

<pre>
echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe
</pre>

;Calculate accuracy

<pre>
udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu
</pre>
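In ''Parse some input'' above, the cut/sed pipeline keeps the first six CoNLL-U columns (gold tokenisation, lemmas, tags and features) and blanks out the columns the parser is going to fill in. If the sed incantation is hard to read, here is a rough Python equivalent (an illustrative sketch, not part of UDPipe):

<pre>
import sys

# Keep the first six CoNLL-U columns (ID..FEATS) and blank out
# HEAD, DEPREL, DEPS and MISC, so the parser starts from gold
# tokenisation and tags only.
for line in sys.stdin:
    line = line.rstrip('\n')
    if not line or line.startswith('#'):
        print(line)                      # pass comments and sentence breaks through
        continue
    cols = line.split('\t')
    print('\t'.join(cols[:6] + ['_'] * 4))
</pre>

Piping its output into <code>udpipe --parse nob.udpipe</code> gives the same kind of input as the pipeline above.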
;Get more stuff!


You'll also need a couple of scripts from https://github.com/ftyers/ud-scripts


<pre>
git clone https://github.com/ftyers/ud-scripts.git
</pre>

==Parameters==

For playing with the parameters we're going to try a smaller treebank:

<pre>
git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish
</pre>

The Turkish UD treebank has a fairly high number of non-projective dependencies, on the order of 15%, so it makes a good test case for trying out different options.
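If you'd like to check that figure yourself: a tree is non-projective when two dependency arcs cross. The following is a small illustrative sketch (not one of the ud-scripts) that counts the share of non-projective trees in a CoNLL-U file on standard input:

<pre>
import sys

def is_nonprojective(heads):
    """heads[i] is the head of token i+1; 0 is the root."""
    arcs = [(min(i + 1, h), max(i + 1, h)) for i, h in enumerate(heads)]
    # Two arcs cross when one starts strictly inside the other's
    # span and ends strictly outside it.
    return any(a < c < b < d for (a, b) in arcs for (c, d) in arcs)

total = nonproj = 0
heads = []
for line in sys.stdin:
    line = line.strip()
    if not line:                         # blank line ends a sentence
        if heads:
            total += 1
            nonproj += is_nonprojective(heads)
        heads = []
    elif not line.startswith('#'):       # skip sentence-level comments
        cols = line.split('\t')
        if cols[0].isdigit():            # skip multiword tokens and empty nodes
            heads.append(int(cols[6]))   # column 7 is HEAD

print('%d/%d trees non-projective (%.1f%%)' % (nonproj, total, 100.0 * nonproj / total))
</pre>

Run it as, for example, <code>python3 count-nonproj.py < tr-ud-train.conllu</code> (the script name here is hypothetical).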


===Default parsing options===

First of all, try training the parser with the default options. These are:

* Parsing algorithm is <code>projective</code>
* Number of training iterations is 10
* Hidden layer size is 200


<pre>
udpipe --tokenizer none --tagger none --train tur.proj.udpipe < UD_Turkish/tr-ud-train.conllu
</pre>

You can also download a pretrained model for Turkish, built with the default parsing options, here:

* http://ilazki.thinkgeek.co.uk/tur.proj.udpipe


===Using the <code>swap</code> algorithm===

If we want to support parsing non-projective trees we can use the <code>swap</code> algorithm:

<pre>
udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu
</pre>

(This will take around 15 minutes.)

If you want to see how many trees are non-projective before and after, you can use the script <code>conllu-extract-non-projective.py</code>:


<pre>
python3 conllu-extract-non-projective.py < tr-ud-train.conllu > tr-ud-train.nonproj.conllu
</pre>


===Using external embeddings===

For calculating the word embeddings we'll use <code>word2vec</code>. UDPipe can directly use this kind of embedding file.


; Compile word2vec

<pre>
git clone https://github.com/dav/word2vec.git
cd word2vec/src
make
</pre>

Now copy the <code>word2vec</code> binaries somewhere in your <code>$PATH</code>.

; Get a corpus and tokenise

<pre>
wget http://opus.lingfil.uu.se/download.php?f=SETIMES2/en-tr.txt.zip -O en-tr.txt.zip
unzip en-tr.txt.zip
cat SETIMES2.en-tr.tr | sed 's/[\[,;:!\]?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt
</pre>

; Train word2vec

<pre>
word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
 -threads 12 -binary 0 -iter 15 -min-count 2
</pre>


These are the settings suggested by the UDPipe documentation. It shouldn't take more than 4&ndash;5 minutes to train.
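With <code>-binary 0</code>, word2vec writes a plain-text file: a header line with the vocabulary size and vector dimension, then one word per line followed by its vector components. Before plugging the file into UDPipe you can sanity-check it by looking at nearest neighbours; here is an illustrative sketch (the probe word <code>bir</code> is just assumed to be a frequent token in the corpus):

<pre>
import numpy as np

# Load the plain-text word2vec output: a "<vocab> <dim>" header,
# then "word v1 v2 ... vdim" on each line.
def load_vectors(path):
    vecs = {}
    with open(path, encoding='utf-8') as fh:
        next(fh)                                  # skip the header line
        for line in fh:
            parts = line.rstrip().split(' ')
            vecs[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vecs

vecs = load_vectors('tur.vec')
target = vecs['bir']
# Cosine similarity against every other word; related words should rank high.
sims = {w: float(np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target)))
        for w, v in vecs.items() if w != 'bir'}
print(sorted(sims, key=sims.get, reverse=True)[:10])
</pre>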


;Now use the embeddings with UDPipe

<pre>
udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu
</pre>

===Classifier settings===

;Increase the size of the hidden layer


<pre>
udpipe --tokenizer none --tagger none --parser "hidden_layer=300" --train tur.h300.udpipe < tr-ud-train.conllu
</pre>


The default size is 200; try increasing it to 300 or decreasing it to 100 and see what happens.

;Train for a different number of iterations/epochs

<pre>
udpipe --tokenizer none --tagger none --parser "iterations=5" --train tur.iter5.udpipe < tr-ud-train.conllu
</pre>

The default is 10; try some numbers in the range [1, 15].
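If you want to compare several settings without retyping the command, a small driver script helps. This is a sketch under the assumption that <code>udpipe</code> is on your <code>$PATH</code> and <code>tr-ud-train.conllu</code> is in the current directory:

<pre>
import subprocess

# Train one model per iteration count; compare them afterwards with
# `udpipe --accuracy` (or conllu-eval.py) on held-out data.
for iters in (1, 5, 10, 15):
    model = 'tur.iter%d.udpipe' % iters
    with open('tr-ud-train.conllu') as train:
        subprocess.run(['udpipe', '--tokenizer', 'none', '--tagger', 'none',
                        '--parser', 'iterations=%d' % iters,
                        '--train', model],
                       stdin=train, check=True)
    print('trained', model)
</pre>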


==Parser combination==

;Prerequisites

The important scripts for this section are:

* <code>conllu-voting.py</code>: runs Chu-Liu-Edmonds over a weighted graph assembled from CoNLL-U files
* <code>conllu-eval.py</code>: calculates LAS and UAS

;Spanning tree algorithm

<pre>
python3 conllu-voting.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu samples/example.1.3.conllu
</pre>
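To get a feel for what <code>conllu-voting.py</code> is doing, here is an illustrative sketch of just the graph-assembly step: every input file is one parser's analysis of the same sentences, and each (dependent, head) arc gets one vote per parser. The actual script then runs Chu-Liu-Edmonds over these weights to extract the best tree; that part is omitted here:

<pre>
import sys
from collections import Counter

def read_heads(path):
    """Return one list of heads per sentence; heads[i] is the head of token i+1."""
    sents, heads = [], []
    for line in open(path, encoding='utf-8'):
        line = line.strip()
        if not line:
            if heads:
                sents.append(heads)
            heads = []
        elif not line.startswith('#'):
            cols = line.split('\t')
            if cols[0].isdigit():
                heads.append(int(cols[6]))
    if heads:
        sents.append(heads)
    return sents

# Each file on the command line is one parser's output for the same input.
outputs = [read_heads(p) for p in sys.argv[1:]]
for sent_id, versions in enumerate(zip(*outputs)):
    weights = Counter()
    for heads in versions:
        for dep, head in enumerate(heads, start=1):
            weights[(dep, head)] += 1     # one vote per parser per arc
    print(sent_id, dict(weights))
</pre>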


Now try it on your own data.

==Using the CoNLL evaluation script==

Open this link: http://universaldependencies.org/conll17/evaluation.html

Then download the evaluation script [http://universaldependencies.org/conll17/eval.zip here].

Use the evaluation script like this:

<pre>
$ python3 evaluation_script/conll17_ud_eval.py gold_treebank.conllu system_output.conllu
</pre>
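The headline numbers it reports are UAS (the proportion of tokens with the correct head) and LAS (head and dependency relation both correct). A minimal sketch of the idea, leaving out the alignment logic the official script uses to cope with tokenisation differences:

<pre>
def attachment_scores(gold, system):
    """gold and system are token-aligned lists of (head, deprel) pairs."""
    uas = sum(g[0] == s[0] for g, s in zip(gold, system))
    las = sum(g == s for g, s in zip(gold, system))
    return uas / len(gold), las / len(gold)

# Toy example: one five-token sentence with a single relation error.
gold   = [(2, 'nsubj'), (0, 'root'), (4, 'det'), (2, 'obj'), (2, 'punct')]
system = [(2, 'nsubj'), (0, 'root'), (4, 'amod'), (2, 'obj'), (2, 'punct')]
print(attachment_scores(gold, system))   # (1.0, 0.8)
</pre>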

==More ideas==

* Try training only on projective data. You can use the script <code>conllu-extract-projective.py</code> to make a subset of your treebank that only has projective trees.
* Download a treebank of a language you are interested in, and find out what percentage of trees are non-projective.
* Find a bug in one of the scripts and report it on GitHub.
* Describe an alternative weighting method for parser combination. Could better results be had with learnt weights?


[[Category:Tools|*]]