Difference between revisions of "UDPipe"

From Apertium

Revision as of 14:01, 23 March 2017

First things first

Get the code!
git clone https://github.com/ufal/udpipe
cd udpipe/src
make

Now copy the udpipe/src/udpipe binary to a directory on your $PATH.

Get some data!
git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal

Train a default model

With tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe --train nob.udpipe

Without tokeniser and tagger:

cat no_bokmaal-ud-train.conllu | udpipe --tokenizer none --tagger none --train nob.udpipe

Parse some input

With gold standard POS tags:

cat no_bokmaal-ud-dev.conllu | cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g' | udpipe --parse nob.udpipe > output
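To see what the cut and sed steps above do to a single token, here is a minimal sketch on one made-up CoNLL-U line. It keeps the first six columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS) and replaces the remaining four with underscores, so the parser sees the gold tags but not the gold tree. The sample line and the variable names are assumptions for illustration, and a literal tab is used in the sed replacement so the sketch does not depend on GNU sed's `\t` escape.

```shell
# One tab-separated CoNLL-U token line with 10 columns (values are made up).
line=$(printf '1\tDet\tdet\tPRON\t_\tGender=Neut\t2\tnsubj\t_\t_')

# A literal tab character, so the sed below also works where '\t' is not understood.
tab=$(printf '\t')

# Keep columns 1-6 and append four "_" columns in place of HEAD, DEPREL, DEPS, MISC.
stripped=$(printf '%s\n' "$line" | cut -f1-6 | sed "s/\$/${tab}_${tab}_${tab}_${tab}_/")

printf '%s\n' "$stripped"
```

The printed line still has ten columns, with the head and relation of the original token blanked out.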

Full pipeline:

echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe

Calculate accuracy

udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu

Parameters

To experiment with the parameters, we'll use a smaller treebank:

git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish

Default parsing options

udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu

Using the swap algorithm

If we want to support parsing non-projective trees we can use the swap algorithm:

udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu 

(This will take around 15 minutes)

Using external embeddings

Compile word2vec
git clone https://github.com/dav/word2vec.git
cd word2vec/src
make

Get a corpus and tokenise
cat tur.crp.txt | sed 's/[][,;:!?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt
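As a quick check of the tokenisation step, the sketch below runs the same two sed passes on a single made-up sentence. It assumes a bracket expression with `]` placed first (inside a POSIX bracket expression a backslash is a literal character and does not escape `]`), and for simplicity it uses only the ASCII subset of the punctuation set. The sample sentence and variable names are illustrative only.

```shell
# A made-up sample sentence standing in for one line of tur.crp.txt.
sentence='Merhaba, dünya! (test)'

# First pass: put a space on both sides of each punctuation character.
# Second pass: squeeze runs of spaces into a single space.
tokens=$(printf '%s\n' "$sentence" | sed 's/[][,;:!?"(){}]/ & /g' | sed 's/  */ /g')

printf '%s\n' "$tokens"
```

Each punctuation mark ends up as its own space-separated token, which is the form word2vec expects for its training text. Note that a full stop is not in the set, so it stays attached to the preceding word.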

Train word2vec
word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 -threads 12 -binary 0 -iter 15 -min-count 2

Now use the embeddings with UDPipe
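This revision of the page does not yet give a command for this step. The following is only a sketch of how it might look, assuming the parser option `embedding_form_file` (the name recalled from the UDPipe 1 user's manual, so check it against your version), that `tur.vec` has been copied next to the treebank, and an illustrative output model name `tur.embed.udpipe`. The embedding dimension of 50 matches the `-size 50` used for word2vec above.

```shell
# Parser options are name=value pairs separated by ';'.
# ASSUMPTION: embedding_form_file is the option name as recalled from the UDPipe 1 manual.
parser_opts="embedding_form=50;embedding_form_file=tur.vec"

# Train only if udpipe and both input files are available; otherwise just report it.
if command -v udpipe >/dev/null 2>&1 && [ -f tur.vec ] && [ -f tr-ud-train.conllu ]; then
    # tur.embed.udpipe is a hypothetical output name chosen for this sketch.
    udpipe --tokenizer none --tagger none --parser "$parser_opts" \
        --train tur.embed.udpipe < tr-ud-train.conllu
else
    echo "udpipe, tur.vec or tr-ud-train.conllu not found; skipping training" >&2
fi
```

The resulting model can then be scored with `udpipe --accuracy --parse` in the same way as the models trained earlier.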