UDPipe
First things first
- Get the code!
git clone https://github.com/ufal/udpipe
cd udpipe/src
make
Now copy the udpipe/src/udpipe binary executable to somewhere in your $PATH.
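For example, assuming /usr/local/bin is on your $PATH (any directory already on your $PATH will do):
sudo cp udpipe /usr/local/bin/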
- Get some data!
git clone https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
cd UD_Norwegian-Bokmaal
- Train a default model
With tokeniser and tagger:
cat no_bokmaal-ud-train.conllu | udpipe --train nob.udpipe
Without tokeniser and tagger:
cat no_bokmaal-ud-train.conllu | udpipe --tokenizer none --tagger none --train nob.udpipe
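UDPipe can also be given heldout data to evaluate against during training; a sketch, assuming your build supports the --heldout training option and using the dev file shipped with the treebank:
cat no_bokmaal-ud-train.conllu | udpipe --tokenizer none --tagger none --heldout no_bokmaal-ud-dev.conllu --train nob.udpipe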
- Parse some input
With gold standard POS tags (cut keeps the first six CoNLL-U columns, the first sed pads each line back to ten columns with blank dependency fields, and the second sed restores the empty sentence-separator lines):
cat no_bokmaal-ud-dev.conllu | cut -f1-6 | sed 's/$/\t_\t_\t_\t_/g' | sed 's/^\t.*//g' | udpipe --parse nob.udpipe > output
Full pipeline:
echo "Det ligger en bok på bordet." | udpipe --tokenize --tag --parse nob.udpipe
- Calculate accuracy
udpipe --accuracy --parse nob.udpipe no_bokmaal-ud-dev.conllu
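In principle --accuracy can be combined with the other processing flags to score more than just the parser, e.g. tagging and parsing together (assuming the model was trained with a tagger):
udpipe --accuracy --tag --parse nob.udpipe no_bokmaal-ud-dev.conllu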
Parameters
For playing with the parameters we're going to try a smaller treebank:
git clone https://github.com/UniversalDependencies/UD_Turkish
cd UD_Turkish
Default parsing options
udpipe --tokenizer none --tagger none --train tur.proj.udpipe < tr-ud-train.conllu
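It is worth recording a baseline score before changing any parameters, so that later models can be compared against it; a sketch, assuming the treebank's development file is called tr-ud-dev.conllu:
udpipe --accuracy --parse tur.proj.udpipe tr-ud-dev.conllu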
Using the swap algorithm
If we want to support parsing non-projective trees, we can use the swap algorithm:
udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tur.swap.udpipe < tr-ud-train.conllu
(This will take around 15 minutes)
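You can then score the swap model on the same development data (again assuming it is called tr-ud-dev.conllu) and compare it with the projective baseline:
udpipe --accuracy --parse tur.swap.udpipe tr-ud-dev.conllu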
Using external embeddings
- Compile word2vec
git clone https://github.com/dav/word2vec.git
cd word2vec/src
make
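The commands below call word2vec as if it were on your $PATH, so copy the compiled binary somewhere suitable; where the Makefile puts it can vary between forks, so adjust the path if it ended up in a bin/ directory instead of src/:
sudo cp word2vec /usr/local/bin/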
- Get a corpus and tokenise
cat tur.crp.txt | sed 's/[\[,;:!\]?"“”(){}]/ & /g' | sed 's/  */ /g' > tur-tok.crp.txt
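Here tur.crp.txt stands for whatever raw Turkish corpus you have to hand. If you have nothing else, a (much smaller) stand-in can be extracted from the FORM column of the treebank itself; it is already tokenised, so the sed step above is not needed for it:
awk -F'\t' '/^[0-9]+\t/ {printf "%s ", $2} /^$/ {print ""}' tr-ud-train.conllu > tur-tok.crp.txt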
- Train word2vec
word2vec -train tur-tok.crp.txt -output tur.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 \
-threads 12 -binary 0 -iter 15 -min-count 2
- Now use the embeddings with UDPipe
udpipe --tokenizer none --tagger none --parser "embedding_form_file=tur.vec" --train tur.embeds.udpipe < tr-ud-train.conllu
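Finally, score the model trained with external embeddings on the development data (again assuming tr-ud-dev.conllu) to see whether the embeddings helped:
udpipe --accuracy --parse tur.embeds.udpipe tr-ud-dev.conllu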