Difference between revisions of "UDPipe"

Revision as of 09:25, 25 March 2017

First things first

Get the code!

git clone https://github.com/ufal/udpipe
cd udpipe/src
make

Now copy the udpipe binary (udpipe/src/udpipe) to a directory in your $PATH.

Get some data!

git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal

Train a default model

With tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe  --train nob.udpipe

Without tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe  --tokenizer none --tagger none --train nob.udpipe

Parse some input

With gold standard POS tags:

cat no_bokmaal-ud-dev.conllu  |cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g'| udpipe --parse nob.udpipe  > output
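The cut/sed pipeline above keeps the first six CoNLL-U columns (ID to FEATS) and blanks the last four (HEAD, DEPREL, DEPS, MISC), so the parser only sees the gold tags. The same idea can be sketched in Python. This is a minimal illustration that assumes standard 10-column CoNLL-U input; the function name strip_parse_columns is made up for this example and is not part of UDPipe or ud-scripts. Unlike the sed version, it passes comment lines through untouched.

```python
def strip_parse_columns(lines):
    """Yield CoNLL-U lines with columns 7-10 replaced by '_'.

    Hypothetical helper for illustration; assumes tab-separated,
    10-column CoNLL-U token lines.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Blank lines separate sentences and '#' lines are comments:
        # both are passed through unchanged.
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        # Keep ID, FORM, LEMMA, UPOS, XPOS, FEATS; blank the rest.
        yield "\t".join(cols[:6] + ["_"] * 4)


sample = [
    "# text = Det ligger",
    "1\tDet\tdet\tPRON\t_\t_\t2\texpl\t_\t_",
    "2\tligger\tligge\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
for out_line in strip_parse_columns(sample):
    print(out_line)
```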

Full pipeline:

echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe

Calculate accuracy

udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu

Get more stuff!

You'll also need a couple of scripts from https://github.com/ftyers/ud-scripts

git clone https://github.com/ftyers/ud-scripts.git

Parameters

For playing with the parameters we're going to try a smaller treebank:

git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish

Default parsing options

udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu

Using the `swap` algorithm

If we want to support parsing non-projective trees we can use the swap algorithm:

udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu

(This will take around 15 minutes)

If you want to see how many trees are non-projective before and after, you can use the script: conllu-extract-non-projective.py

python3 conllu-extract-non-projective.py < yourfile.conllu > yourfile.nonproj.conllu
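A tree is non-projective when two of its arcs cross once the words are laid out in sentence order. A minimal sketch of such a check follows. It assumes the heads are given as a list in which heads[i] is the 1-based head of token i+1 and 0 marks the root, and that multiword tokens and empty nodes have already been dropped. The name is_projective is hypothetical, and this is not necessarily how conllu-extract-non-projective.py does it.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based head of token i+1; 0 marks the root.
    The arc from the artificial root (position 0) is included, so an
    arc that spans over the root word also counts as crossing.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Arcs (a, b) and (c, d) cross when c lies strictly inside
            # (a, b) and d lies strictly to the right of b.
            if a < c < b < d:
                return False
    return True


# Token 1 attaches to token 3 across the root word (token 2).
print(is_projective([3, 0, 2]))
# Both dependents attach directly to the root word.
print(is_projective([2, 0, 2]))
```

The quadratic loop is fine for sentence-length inputs; a treebank-scale tool would probably want something faster.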

Using external embeddings

Compile word2vec

git clone https://github.com/dav/word2vec.git
cd word2vec/src
make

Now copy the word2vec binaries to a directory in your $PATH.

Get a corpus and tokenise

wget "http://opus.lingfil.uu.se/download.php?f=SETIMES2/en-tr.txt.zip" -O en-tr.txt.zip
unzip en-tr.txt.zip
cat SETIMES2.en-tr.tr | sed 's/[][,;:!?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt

Train word2vec

word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
 -threads 12 -binary 0 -iter 15 -min-count 2

These are the settings suggested by the UDPipe documentation.

Now use the embeddings with UDPipe

udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu

Classifier settings

Increase the size of the hidden layer

udpipe --tokenizer none --tagger none --parser "hidden_layer=300" --train tur.h300.udpipe < tr-ud-train.conllu

The default size is 200; try increasing it to 300 and see what happens.

Parser combination

Prerequisites

The important scripts for this section are:

conllu-voting.py: Runs Chu-Liu-Edmonds over a weighted graph assembled from CoNLL-U files.
conllu-eval.py: Calculates the labelled and unlabelled attachment scores (LAS and UAS).
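UAS is the share of words that get the correct head; LAS additionally requires the correct relation label. The sketch below shows the arithmetic, assuming the gold and system files have identical tokenisation. The names read_arcs and attachment_scores are invented for this example, and the code is not the implementation of conllu-eval.py, which may differ in details such as how punctuation is treated.

```python
def read_arcs(conllu_text):
    """Return one (HEAD, DEPREL) pair per syntactic word."""
    arcs = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        arcs.append((cols[6], cols[7]))
    return arcs


def attachment_scores(gold_text, system_text):
    """Return (UAS, LAS) as fractions between 0 and 1."""
    gold = read_arcs(gold_text)
    system = read_arcs(system_text)
    if not gold or len(gold) != len(system):
        raise ValueError("gold and system must contain the same, non-empty set of words")
    pairs = list(zip(gold, system))
    # UAS compares heads only; LAS compares head and label together.
    uas = sum(g[0] == s[0] for g, s in pairs) / len(gold)
    las = sum(g == s for g, s in pairs) / len(gold)
    return uas, las
```

To use it, read the gold file and the parser output into strings and pass both to attachment_scores.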

Spanning tree algorithm

python3 conllu-voting.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu samples/example.1.3.conllu
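The weighted graph that the voting script works over can be pictured as follows: every parser casts one vote per token for a (head, relation) arc, and an arc's weight is its number of votes. The sketch below builds those weights and then takes the top-voted arc for each token independently. That last step is a deliberate simplification and is not Chu-Liu-Edmonds: picking arcs greedily can yield cycles, which the spanning tree algorithm exists to avoid. The names vote_arcs and greedy_heads are hypothetical and do not come from conllu-voting.py.

```python
from collections import Counter


def vote_arcs(parses):
    """Count votes per token.

    parses is a list of parser outputs for one sentence; each output is
    a list with one (head, deprel) pair per token. Returns one Counter
    per token mapping each proposed arc to its number of votes.
    """
    n = len(parses[0])
    if any(len(p) != n for p in parses):
        raise ValueError("all parses must cover the same tokens")
    return [Counter(p[i] for p in parses) for i in range(n)]


def greedy_heads(weights):
    """Pick the most-voted arc per token (may not form a tree)."""
    return [votes.most_common(1)[0][0] for votes in weights]


parses = [
    [(2, "nsubj"), (0, "root"), (2, "obj")],
    [(2, "nsubj"), (0, "root"), (2, "obl")],
    [(3, "nsubj"), (0, "root"), (2, "obj")],
]
weights = vote_arcs(parses)
print(greedy_heads(weights))
```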

Now try it on your own data.

More ideas

Try training only on projective data. You can use the script conllu-extract-projective.py to make a subset of your treebank that only has projective trees.

