UDPipe

First things first

This section describes how to set up and train/test UDPipe. If you've already done this, then you can move on to the next part.

Get the code!

git clone https://github.com/ufal/udpipe
cd udpipe/src
make

Now copy the udpipe/src/udpipe binary to a directory in your $PATH.

Get some data!

git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal

Train a default model

With tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe  --train nob.udpipe

Without tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe  --tokenizer none --tagger none --train nob.udpipe

Parse some input

With gold standard POS tags:

cat no_bokmaal-ud-dev.conllu | cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g' | udpipe --parse nob.udpipe > output
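The cut/sed pipeline above keeps the first six CoNLL-U columns (ID to FEATS) and fills the last four (HEAD, DEPREL, DEPS, MISC) with underscores, so the parser cannot see the gold tree. Here is a rough Python sketch of the same idea, in case the one-liner is hard to read. The function name is made up for illustration, and unlike the sed version it passes comment lines through untouched.

```python
def strip_parse_columns(lines):
    """Blank out columns 7-10 (HEAD, DEPREL, DEPS, MISC) of CoNLL-U token lines.

    Hypothetical helper, not part of UDPipe or ud-scripts.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        # Sentence separators and comments are passed through unchanged.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        # Keep ID, FORM, LEMMA, UPOS, XPOS, FEATS; replace the rest with "_".
        out.append("\t".join(cols[:6] + ["_"] * 4))
    return out


# Small made-up sentence, just to show the shape of the output.
sample = [
    "# text = Hei",
    "1\tHei\thei\tINTJ\t_\t_\t0\troot\t_\t_",
    "",
]
for row in strip_parse_columns(sample):
    print(row)
```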

Full pipeline:

echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe

Calculate accuracy

udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu

Get more stuff!

You'll also need a couple of scripts from https://github.com/ftyers/ud-scripts

git clone https://github.com/ftyers/ud-scripts.git

Parameters

For playing with the parameters we're going to try a smaller treebank:

git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish

The Turkish UD treebank has a fairly high proportion of non-projective dependencies, on the order of 15%, so it makes a good test case for trying out different options.

Default parsing options

First of all try training the parser with the default options. These are:

Parsing algorithm is projective
Number of training iterations is 10
Hidden layer size is 200

udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu

You can also download a pretrained model for Turkish trained using the default parsing options here:

http://ilazki.thinkgeek.co.uk/tur.proj.udpipe

Using the `swap` algorithm

If we want to support parsing non-projective trees we can use the swap algorithm:

udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu

(This will take around 15 minutes)

If you want to see how many trees are non-projective before and after, you can use the conllu-extract-non-projective.py script:

python3 conllu-extract-non-projective.py < yourfile.conllu > yourfile.nonproj.conllu
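If you're curious what "non-projective" means in practice, here is a minimal sketch of a projectivity check. It is not the code from conllu-extract-non-projective.py; the function names are made up, and it assumes well-formed CoNLL-U with integer HEAD values. A tree is treated as non-projective if any two arcs cross, with the artificial root at position 0 included.

```python
def is_projective(heads):
    """heads[i] is the HEAD of token i+1; 0 marks the root."""
    # Store each arc as (left end, right end), including the arc from node 0.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one end of the second lies inside the first.
            if a < c < b < d:
                return False
    return True


def read_heads(lines):
    """Yield one list of HEAD values per sentence from CoNLL-U lines."""
    heads = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if heads:
                yield heads
                heads = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (1-2) and empty nodes (1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        heads.append(int(cols[6]))
    if heads:
        yield heads


def non_projective_share(lines):
    """Fraction of trees in the input that are non-projective."""
    trees = list(read_heads(lines))
    if not trees:
        return 0.0
    return sum(1 for t in trees if not is_projective(t)) / len(trees)


print(is_projective([2, 0, 2]))
print(is_projective([3, 4, 0, 3]))
```

To run it on a treebank, something like `non_projective_share(open("tr-ud-train.conllu", encoding="utf-8"))` should give the share of non-projective trees.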

Using external embeddings

For calculating the word embeddings we'll use word2vec. UDPipe can directly use this kind of embedding file.

Compile word2vec

git clone https://github.com/dav/word2vec.git
cd word2vec/src
make

Now copy the word2vec binaries somewhere in your $PATH.

Get a corpus and tokenise

wget "http://opus.lingfil.uu.se/download.php?f=SETIMES2/en-tr.txt.zip" -O en-tr.txt.zip
unzip en-tr.txt.zip
cat SETIMES2.en-tr.tr | sed 's/[][,;:!?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt

Train word2vec

word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
 -threads 12 -binary 0 -iter 15 -min-count 2

These are the settings suggested by the UDPipe documentation. Training shouldn't take more than 4-5 minutes.

Now use the embeddings with UDPipe

udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu
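Before training with the embeddings it can be worth checking that the file looks right. With -binary 0, word2vec writes a text file whose first line holds the vocabulary size and the vector dimension, followed by one line per word with the word and its values. The sketch below checks that shape; the function name is made up, and the tiny sample stands in for a real tur.vec.

```python
import io


def check_embeddings(handle):
    """Return (vocab_size, dim) after checking that every row has dim values.

    Hypothetical helper for word2vec text-format files, not part of UDPipe.
    """
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    rows = 0
    for line in handle:
        parts = line.split()
        if not parts:
            continue
        if len(parts) != dim + 1:
            raise ValueError(
                f"row {rows + 1}: expected {dim} values, got {len(parts) - 1}"
            )
        # Make sure every value parses as a number; float() raises otherwise.
        for value in parts[1:]:
            float(value)
        rows += 1
    if rows != vocab_size:
        raise ValueError(f"header says {vocab_size} words, file has {rows}")
    return vocab_size, dim


# Made-up three-dimensional sample; a file trained with -size 50 should report 50.
sample = io.StringIO("2 3\nbir 0.1 0.2 0.3 \nve 0.4 0.5 0.6 \n")
print(check_embeddings(sample))
```

For the real file, pass `open("tur.vec", encoding="utf-8")` instead of the sample.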

Classifier settings

Increase the size of the hidden layer

udpipe --tokenizer none --tagger none --parser "hidden_layer=300" --train tur.h300.udpipe < tr-ud-train.conllu

The default size is 200. Try increasing it to 300 or decreasing it to 100, and see what happens.

Train for a different number of iterations/epochs

udpipe --tokenizer none --tagger none --parser "iterations=5"  --train tur.iter5.udpipe < tr-ud-train.conllu

The default is 10. Try some values in the range [1, 15].
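If you want to try several settings in one go, a small script can write out the training commands for you. This is only a sketch: the function name and the model naming scheme are made up, and it assumes that several parser options can be combined in one --parser string separated by a semicolon, which is how I read the UDPipe manual. It prints the commands rather than running them.

```python
import shlex


def training_commands(train_file, layers=(100, 200, 300), iterations=(5, 10, 15)):
    """Build one udpipe training command per (hidden_layer, iterations) pair."""
    cmds = []
    for size in layers:
        for iters in iterations:
            # Assumption: options in one --parser string are separated by ";".
            opts = f"hidden_layer={size};iterations={iters}"
            # Hypothetical naming scheme for the output models.
            model = f"tur.h{size}.iter{iters}.udpipe"
            cmds.append(
                "udpipe --tokenizer none --tagger none "
                f"--parser {shlex.quote(opts)} "
                f"--train {shlex.quote(model)} < {shlex.quote(train_file)}"
            )
    return cmds


for cmd in training_commands("tr-ud-train.conllu"):
    print(cmd)
```

Each model takes a while to train, so it is probably best to start with two or three combinations rather than the full grid.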

Parser combination

Prerequisites

The important scripts for this section are:

conllu-voting.py: Runs Chu-Liu-Edmonds over a weighted graph assembled from CoNLL-U files
conllu-eval.py: Calculates LAS and UAS.
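As a reminder of what the two scores mean: UAS is the share of words that get the correct head, and LAS is the share that get both the correct head and the correct dependency label. The sketch below computes both from two CoNLL-U files with identical tokenisation. It is not the code in conllu-eval.py, and the function names are made up.

```python
def read_arcs(lines):
    """Return a list of (HEAD, DEPREL) pairs, one per syntactic word."""
    arcs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (1-2) and empty nodes (1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))
    return arcs


def attachment_scores(gold_lines, system_lines):
    """Return (UAS, LAS) as fractions between 0 and 1."""
    gold = read_arcs(gold_lines)
    system = read_arcs(system_lines)
    if len(gold) != len(system):
        raise ValueError("gold and system files have different numbers of words")
    if not gold:
        return 0.0, 0.0
    # UAS: head matches. LAS: head and label both match.
    uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / len(gold)
    las = sum(g == s for g, s in zip(gold, system)) / len(gold)
    return uas, las


# Made-up three-word example: one wrong label, one wrong head.
gold = [
    "1\ta\t_\t_\t_\t_\t2\tdet\t_\t_",
    "2\tb\t_\t_\t_\t_\t0\troot\t_\t_",
    "3\tc\t_\t_\t_\t_\t2\tobj\t_\t_",
]
system = [
    "1\ta\t_\t_\t_\t_\t2\tnmod\t_\t_",
    "2\tb\t_\t_\t_\t_\t0\troot\t_\t_",
    "3\tc\t_\t_\t_\t_\t1\tobj\t_\t_",
]
print(attachment_scores(gold, system))
```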

Spanning tree algorithm

python3 conllu-voting.py samples/example.1.0.conllu samples/example.1.1.conllu samples/example.1.2.conllu samples/example.1.3.conllu

Now try it on your own data.
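To get a feel for what the weighted graph looks like, here is a small sketch of the vote-counting step only. Every parser output contributes its weight to each (head, dependent) edge it proposes. The last function simply picks the heaviest head for each token, which is not Chu-Liu-Edmonds and can produce cycles or several roots; that is exactly why conllu-voting.py runs a spanning tree algorithm over the graph instead. All names here are made up, and the uniform weights are an assumption.

```python
from collections import Counter


def vote_edges(parses, weights=None):
    """Sum the weight of every (head, dependent) edge over all parses.

    parses: one list of HEAD values per parser, all for the same sentence.
    weights: one weight per parser; defaults to 1.0 each (an assumption).
    """
    if weights is None:
        weights = [1.0] * len(parses)
    graph = Counter()
    for heads, weight in zip(parses, weights):
        for dep, head in enumerate(heads, start=1):
            graph[(head, dep)] += weight
    return graph


def heaviest_heads(graph, n_tokens):
    """Pick the highest-weighted head per token.

    Naive per-token choice, NOT Chu-Liu-Edmonds: the result is not
    guaranteed to be a tree.
    """
    result = []
    for dep in range(1, n_tokens + 1):
        candidates = {h: w for (h, d), w in graph.items() if d == dep}
        result.append(max(candidates, key=candidates.get))
    return result


# Three made-up parses of a three-word sentence; they disagree on token 1.
parses = [[2, 0, 2], [2, 0, 2], [3, 0, 2]]
graph = vote_edges(parses)
print(graph[(2, 1)], graph[(3, 1)])
print(heaviest_heads(graph, 3))
```

Passing different weights, for example each parser's LAS on the development set, is one simple way to experiment with alternative weighting.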

More ideas

Try training only on projective data. You can use the script conllu-extract-projective.py to make a subset of your treebank that only has projective trees.
Download a treebank of a language you are interested in, and find out what percentage of trees are non-projective.
Find a bug in one of the scripts and report it on GitHub.
Describe an alternative weighting method for parser combination. Could better results be had with learnt weights?
