Difference between revisions of "Training perceptron tagger"

Latest revision as of 19:39, 5 January 2018

In this article, I will describe the pipeline for training the Perceptron Tagger.

Convert UD-Tree dataset into Apertium

First, convert the .conllu file into Apertium format using the UdTree2Apertium tool.

Start by producing a raw Apertium file. Example for English:

cat en-ud-train.conllu | grep -e '^$' -e '^[0-9]' | cut -f2 | sed 's/$/¶/g' | 
apertium-destxt | lt-proc -w ~/source/apertium//languages/apertium-eng/eng.automorf.bin | apertium-retxt | sed 's/¶//g' > en-ud-train.apertium
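The first three stages of that pipeline need only standard tools, so they can be tried without an Apertium installation. The sketch below runs them on a made-up two-token CoNLL-U fragment; the function name `extract_forms` and the sample sentence are illustrative assumptions, not part of UdTree2Apertium.

```shell
# Keep token lines and blank sentence separators, take the FORM column
# (field 2), and mark every line end with a pilcrow so that line breaks
# can be recognised again after morphological analysis.
extract_forms() {
    grep -e '^$' -e '^[0-9]' | cut -f2 | sed 's/$/¶/g'
}

# Hypothetical sample: one comment line, two tokens, one blank separator.
printf '# text = Dogs bark\n1\tDogs\tdog\tNOUN\n2\tbark\tbark\tVERB\n\n' | extract_forms
# prints:
# Dogs¶
# bark¶
# ¶
```

The comment line is dropped because it starts with neither a digit nor an empty line, and the blank separator survives as a lone pilcrow.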

Then run the converter:

python3 converter.py tags/eng.csv en-ud-train.apertium en-ud-train.conllu eng.tagged

Preparing data for tagger

First, extract the raw text from your hand-tagged files. Run:

cat eng.tagged | cut -f2 -d'^' | cut -f1 -d'/' > eng.tagged.txt
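To see what the two `cut` calls do, here is a small sketch on a made-up sample. It assumes one token per line in the form `^surface/lemma<tags>$`; the function name `surface_forms` and the sample tokens are illustrative only.

```shell
# First cut: drop everything up to and including the leading '^'.
# Second cut: keep only the text before the first '/', i.e. the surface form.
surface_forms() {
    cut -f2 -d'^' | cut -f1 -d'/'
}

# Hypothetical hand-tagged lines, not taken from the real corpus.
printf '%s\n' '^Dogs/dog<n><pl>$' '^bark/bark<vblex><pres>$' | surface_forms
# prints:
# Dogs
# bark
```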

Next, create the ambiguous tag file (a tagged text with all the possible options). Run:

apertium -d ~/apertium-eng eng-morph eng.tagged.txt eng.untagged

Delete multiword sentences

Next, remove multiword tokens from the dataset, for example:

^for years/for years<adv>$
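Before cleaning, it can help to check how many such tokens the data contains. The following is a minimal sketch, assuming one token per line; it is not what clean_multiwords.py does internally, and the name `find_multiwords` is made up for this example.

```shell
# Print tokens whose surface form (the text between '^' and the first '/')
# contains a space.
find_multiwords() {
    grep -E '^\^[^/]* [^/]*/'
}

# Hypothetical sample: one multiword token and one ordinary token.
printf '%s\n' '^for years/for years<adv>$' '^dogs/dog<n><pl>$' | find_multiwords
# prints:
# ^for years/for years<adv>$
```

Piping the result through `wc -l` gives a quick count of the affected tokens.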

First, extract a clean version of the untagged dataset (tokens only, without tags):

cat eng.untagged | cut -f2 -d'^' | cut -f1 -d'/' > eng.untagged.txt

Then use clean_multiwords.py from UdTree2Apertium:

python clean_multiwords.py eng.tagged eng.untagged.txt cleaned_eng.tagged

Finally, generate the .untagged file for the cleaned dataset:

cat cleaned_eng.tagged | cut -f2 -d'^' | cut -f1 -d'/' > cleaned_eng.tagged.txt
apertium -d ~/apertium-eng eng-morph cleaned_eng.tagged.txt cleaned_eng.untagged
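A quick sanity check before training is to compare the line counts of the two cleaned files. This is only a sketch: it assumes one token per line and that the trainer expects the files to line up token for token, which this page does not state explicitly. The helper name `same_line_count` and the sample data are made up.

```shell
# Succeeds only if both files have the same number of lines.
same_line_count() {
    [ $(wc -l < "$1") -eq $(wc -l < "$2") ]
}

# Illustrative data: two tagged tokens and their two ambiguous analyses.
tagged=$(mktemp)
untagged=$(mktemp)
printf '%s\n' '^dogs/dog<n><pl>$' '^bark/bark<vblex><pres>$' > "$tagged"
printf '%s\n' '^dogs/dog<n><pl>$' '^bark/bark<n><sg>/bark<vblex><pres>$' > "$untagged"

if same_line_count "$tagged" "$untagged"; then
    echo "aligned"
else
    echo "line counts differ"
fi
rm -f "$tagged" "$untagged"
```

With the real data, the call would be `same_line_count cleaned_eng.tagged cleaned_eng.untagged`.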

Train tagger

Copy the mtx file into the working directory. You can read about mtx files on the Perceptron tagger page (http://wiki.apertium.org/wiki/Perceptron_tagger).

Now you are ready to train the tagger. Run:

apertium-tagger -xs 10 eng.prob cleaned_eng.tagged cleaned_eng.untagged apertium-lang.lang.mtx

Support

If you have any questions about this pipeline, you can write to me at alxmamaev@ya.ru
