UDPipe

First things first

Get the code!
git clone https://github.com/ufal/udpipe
cd udpipe/src
make

Now copy the udpipe/src/udpipe binary to somewhere in your $PATH.

Get some data!
git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal

Train a default model

With tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe  --train nob.udpipe                  

Without tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe  --tokenizer none --tagger none --train nob.udpipe                  

Parse some input

With gold standard POS tags:

cat no_bokmaal-ud-dev.conllu  |cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g'| udpipe --parse nob.udpipe  > output               
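
The cut and sed pipeline above works because a CoNLL-U line has ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC. cut -f1-6 keeps the gold tokenisation and tags, and the sed commands blank out the last four columns, so the parser receives gold POS tags but no trees. The same column-stripping as a minimal Python sketch (the script name is just an example):

# strip_heads.py: keep gold tokenisation and tags, blank the parse columns.
import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    if line == '' or line.startswith('#'):
        # Sentence breaks and comment lines pass through untouched.
        print(line)
        continue
    cols = line.split('\t')
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    print('\t'.join(cols[:6] + ['_'] * 4))

Run it as: python3 strip_heads.py < no_bokmaal-ud-dev.conllu | udpipe --parse nob.udpipe > output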

Full pipeline:

echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe

Calculate accuracy
udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu

Get more stuff!

You'll also need a couple of scripts from https://github.com/ftyers/ud-scripts

git clone https://github.com/ftyers/ud-scripts.git

Parameters

To play with the parameters we're going to use a smaller treebank:

git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish

The Turkish UD treebank has a fairly high number of non-projective dependencies, on the order of 15%, so it makes a good test case for trying different parsing options.
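
An arc is non-projective when another arc crosses it, i.e. exactly one endpoint of the second arc falls strictly between the endpoints of the first. Here is a rough sketch of counting such trees (illustrative only; the ud-scripts repository cloned above has proper tools for this, used below):

# count_nonproj.py: count trees containing at least one pair of crossing arcs.
import sys

def sentences(stream):
    """Yield one list of (head, dependent) pairs per sentence."""
    sent = []
    for line in stream:
        line = line.rstrip('\n')
        if not line:
            if sent:
                yield sent
            sent = []
        elif not line.startswith('#'):
            cols = line.split('\t')
            if cols[0].isdigit():  # skip multiword token ranges like "1-2"
                sent.append((int(cols[6]), int(cols[0])))
    if sent:
        yield sent

def is_nonprojective(arcs):
    spans = [(min(h, d), max(h, d)) for h, d in arcs]
    for lo1, hi1 in spans:
        for lo2, hi2 in spans:
            # Two arcs cross iff their spans strictly interleave.
            if lo1 < lo2 < hi1 < hi2:
                return True
    return False

total = nonproj = 0
for sent in sentences(sys.stdin):
    total += 1
    nonproj += is_nonprojective(sent)
print(nonproj, '/', total, 'trees are non-projective')

Run it as: python3 count_nonproj.py < tr-ud-train.conllu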

Default parsing options

First of all, try training the parser with the default options. These are:

  • Parsing algorithm is projective
  • Number of training iterations is 10
  • Hidden layer size is 200
udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu 

You can also download a pretrained model for Turkish trained using the default parsing options here:

Using the swap algorithm

If we want to support parsing non-projective trees, we can use the swap algorithm:

udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu 

(This will take around 15 minutes)

If you want to see how many trees are non-projective before and after parsing, you can use the conllu-extract-non-projective.py script:

python3 conllu-extract-non-projective.py < yourfile.conllu > yourfile.nonproj.conllu

Using external embeddings

To calculate the word embeddings we'll use word2vec. UDPipe can read this kind of embedding file directly.

Compile word2vec
git clone https://github.com/dav/word2vec.git
cd word2vec/src
make

Now copy the word2vec binaries somewhere in your $PATH.

Get a corpus and tokenise
wget http://opus.lingfil.uu.se/download.php?f=SETIMES2/en-tr.txt.zip -O en-tr.txt.zip
unzip en-tr.txt.zip
cat SETIMES2.en-tr.tr | sed 's/[][,;:!?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt
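
The first sed command puts spaces around punctuation and the second squeezes the repeated spaces. The same crude tokenisation in Python, for reference:

# tokenise.py: space out punctuation, mirroring the sed pipeline above.
import re
import sys

for line in sys.stdin:
    line = re.sub(r'([][,;:!?"“”(){}])', r' \1 ', line)
    print(re.sub(r'  +', ' ', line).strip())
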
Train word2vec
word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
 -threads 12 -binary 0 -iter 15 -min-count 2

These are the settings suggested by the UDPipe documentation. Training shouldn't take more than 4–5 minutes.
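
The text format produced with -binary 0 is what UDPipe reads: a header line giving the vocabulary size and vector dimension, then one line per word holding the word and its components, space-separated. A quick sanity check of the file:

# peek_vec.py: inspect the header and first rows of a word2vec text file.
with open('tur.vec', encoding='utf-8') as f:
    vocab_size, dim = map(int, f.readline().split())
    print(vocab_size, 'words of dimension', dim)
    for _ in range(3):
        fields = f.readline().split()
        word, vector = fields[0], [float(x) for x in fields[1:]]
        assert len(vector) == dim
        print(word, vector[:4], '...')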

Now use the embeddings with UDPipe
udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu 

Classifier settings

Increase the size of the hidden layer
udpipe --tokenizer none --tagger none --parser "hidden_layer=300" --train tur.h300.udpipe < tr-ud-train.conllu 

The default size is 200; try increasing it to 300, or decreasing it to 100, and see what happens.

Train for a different number of iterations/epochs
udpipe --tokenizer none --tagger none --parser "iterations=5"  --train tur.iter5.udpipe < tr-ud-train.conllu 

The default is 10; try some values in the range [1, 15].
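
Rather than re-running these commands by hand, the sweep can be scripted. Here is a sketch using the same udpipe invocations as above (the dev-set filename is an assumption; adjust it to your treebank):

# sweep.py: train one model per hidden-layer size and report dev accuracy.
import subprocess

for size in (100, 200, 300):  # the same pattern works for "iterations=..."
    model = 'tur.h%d.udpipe' % size
    with open('tr-ud-train.conllu') as train:
        subprocess.run(
            ['udpipe', '--tokenizer', 'none', '--tagger', 'none',
             '--parser', 'hidden_layer=%d' % size, '--train', model],
            stdin=train, check=True)
    result = subprocess.run(
        ['udpipe', '--accuracy', '--parse', model, 'tr-ud-dev.conllu'],
        capture_output=True, text=True, check=True)
    print(size, result.stdout.strip())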

Parser combination

Prerequisites

The important scripts for this section are:

  • conllu-voting.py: Runs the Chu-Liu-Edmonds maximum spanning tree algorithm over a weighted graph assembled from several CoNLL-U files
  • conllu-eval.py: Calculates the labelled and unlabelled attachment scores (LAS and UAS).
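
Both scores compare each token's predicted head against the gold standard; LAS additionally requires the dependency label to match. A minimal sketch of the computation (not the actual conllu-eval.py, and it assumes both files share the same tokenisation):

# eval_sketch.py: compute UAS and LAS over two parallel CoNLL-U files.
import sys

def rows(path):
    """Yield (head, deprel) for each syntactic word in a CoNLL-U file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            cols = line.rstrip('\n').split('\t')
            if len(cols) == 10 and cols[0].isdigit():
                yield cols[6], cols[7]

gold_file, system_file = sys.argv[1], sys.argv[2]
total = uas = las = 0
for (gh, gl), (sh, sl) in zip(rows(gold_file), rows(system_file)):
    total += 1
    if gh == sh:
        uas += 1
        if gl == sl:
            las += 1
print('UAS: %.2f  LAS: %.2f' % (100.0 * uas / total, 100.0 * las / total))

Run it as: python3 eval_sketch.py gold.conllu output.conllu
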
Spanning tree algorithm
python3 conllu-voting.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu samples/example.1.3.conllu

Now try it on your own data.
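
To see what the voting is doing: every input parse contributes one vote per arc, the votes become edge weights in a graph, and conllu-voting.py then extracts the maximum spanning tree with Chu-Liu-Edmonds. The sketch below does only the graph-assembly half, picking each token's most-voted head greedily; unlike Chu-Liu-Edmonds it is not guaranteed to return a well-formed tree:

# vote_sketch.py: tally head votes across several parses of the same text.
import sys
from collections import Counter

def heads(path):
    """Return one list of head ids per sentence."""
    sents, sent = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            cols = line.rstrip('\n').split('\t')
            if len(cols) == 10 and cols[0].isdigit():
                sent.append(int(cols[6]))
            elif not line.strip() and sent:
                sents.append(sent)
                sent = []
    if sent:
        sents.append(sent)
    return sents

parses = [heads(path) for path in sys.argv[1:]]
for versions in zip(*parses):  # the same sentence from each parser
    merged = [Counter(token_heads).most_common(1)[0][0]
              for token_heads in zip(*versions)]
    print(merged)

Run it as: python3 vote_sketch.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu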

More ideas

  • Try training only on projective data. You can use the script conllu-extract-projective.py to make a subset of your treebank that only has projective trees.
  • Download a treebank of a language you are interested in, and find out what percentage of trees are non-projective.