Training perceptron tagger
Revision as of 19:38, 5 January 2018
In this article, I will describe the pipeline for training the perceptron tagger.
==Convert UD-Tree dataset into Apertium==
First, you need to convert the ''.conllu'' format into the Apertium format using the UdTree2Apertium tool.

To begin, you need to get a raw Apertium file. Example for English:
<pre>
cat en-ud-train.conllu | grep -e '^$' -e '^[0-9]' | cut -f2 | sed 's/$/¶/g' | apertium-destxt | lt-proc -w ~/source/apertium//languages/apertium-eng/eng.automorf.bin | apertium-retxt | sed 's/¶//g' > en-ud-train.apertium
</pre>
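The token-extraction half of this pipeline (the <code>grep … | cut -f2</code> part) can be sketched in Python. This is only an illustration of the idea, not part of the actual toolchain; the sample sentence is invented.

```python
def conllu_surface_forms(conllu_text):
    """Yield the FORM column (column 2) of every token line in a
    CoNLL-U string, skipping comments and blank lines."""
    for line in conllu_text.splitlines():
        # Token lines start with a numeric index; comments start with '#'.
        if line and line[0].isdigit():
            yield line.split("\t")[1]

# Invented two-token sentence for illustration.
sample = "# sent_id = 1\n1\tThe\tthe\tDET\n2\tcat\tcat\tNOUN\n"
print(list(conllu_surface_forms(sample)))  # ['The', 'cat']
```

The shell pipeline then feeds these surface forms through <code>lt-proc</code> to get morphological analyses.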
Then you need to run this utility:
<pre>
python3 converter.py tags/eng.csv en-ud-train.apertium en-ud-train.conllu eng.tagged
</pre>
==Preparing data for the tagger==
First, you need to extract the raw text from your hand-tagged files. Run:
<pre>
cat eng.tagged | cut -f2 -d'^' | cut -f1 -d'/' > eng.tagged.txt
</pre>
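The <code>cut -f2 -d'^' | cut -f1 -d'/'</code> idiom pulls the surface form out of each Apertium lexical unit (<code>^surface/analysis$</code>). A small Python sketch of the same extraction, assuming the stream format shown later in this article:

```python
import re

def surface_forms(stream_line):
    """Pull surface forms out of an Apertium stream line such as
    '^For/for<pr>$ ^example/example<n><sg>$': everything between
    each '^' and the first following '/'."""
    return [m.group(1) for m in re.finditer(r"\^([^/$]+)/", stream_line)]

line = "^For/for<pr>$ ^example/example<n><sg>$"
print(surface_forms(line))  # ['For', 'example']
```

Note that the shell version only keeps the first unit on each line, which is enough when the corpus has one lexical unit per line.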
Next, create the ambiguous tag file (a tagged text with all the possible options). Run:
<pre>
apertium -d ~/apertium-eng eng-morph eng.tagged.txt eng.untagged
</pre>
==Delete multiword sentences==
Then we need to clean the dataset of sentences containing multiword tokens, for example:
<pre>
^for years/for years<adv>$
</pre>
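A multiword unit can be recognised by a space in its surface form (the part before the first <code>/</code>). A minimal Python check, assuming the <code>^surface/analysis$</code> format of the example above:

```python
def is_multiword(lexical_unit):
    """A lexical unit like '^for years/for years<adv>$' is a multiword
    if its surface form (before the first '/') contains a space."""
    surface = lexical_unit.strip("^$").split("/", 1)[0]
    return " " in surface

print(is_multiword("^for years/for years<adv>$"))  # True
print(is_multiword("^cat/cat<n><sg>$"))            # False
```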
You need to get a clean untagged dataset (only tokens, without tags):
<pre>
cat eng.untagged | cut -f2 -d'^' | cut -f1 -d'/' > eng.untagged.txt
</pre>
Then use ''clean_multiwords.py'' from UdTree2Apertium:
<pre>
python clean_multiwords.py eng.tagged eng.untagged.txt cleaned_eng.tagged
</pre>
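The idea behind this cleaning step can be sketched as follows. This is only a guess at what the script does, written for illustration; the real ''clean_multiwords.py'' may behave differently.

```python
def drop_multiword_lines(tagged_lines, untagged_lines):
    """Keep only the line pairs whose tagged side contains no multiword
    unit (i.e. no surface form with a space in it). A simplified,
    hypothetical sketch of a multiword-cleaning step."""
    kept_tagged, kept_untagged = [], []
    for t, u in zip(tagged_lines, untagged_lines):
        # Surface form of each '^surface/analysis$' unit on the line.
        surfaces = [unit.split("/", 1)[0] for unit in t.split("^")[1:]]
        if not any(" " in s for s in surfaces):
            kept_tagged.append(t)
            kept_untagged.append(u)
    return kept_tagged, kept_untagged

tagged = ["^for years/for years<adv>$", "^cat/cat<n><sg>$"]
untagged = ["for years", "cat"]
print(drop_multiword_lines(tagged, untagged))
# (['^cat/cat<n><sg>$'], ['cat'])
```

Both files are filtered in lockstep so the tagged and untagged datasets stay aligned line by line.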
And last, you need to get the ''.untagged'' file for the cleaned dataset:
<pre>
cat cleaned_eng.tagged | cut -f2 -d'^' | cut -f1 -d'/' > cleaned_eng.tagged.txt
apertium -d ~/apertium-eng eng-morph cleaned_eng.tagged.txt cleaned_eng.untagged
</pre>
==Train tagger==
You need to copy the ''mtx'' file into the directory; you can read about ''mtx'' files on [http://wiki.apertium.org/wiki/Perceptron_tagger this page].
Now you are ready to train the tagger. Run:
<pre>
apertium-tagger -xs 10 eng.prob cleaned_eng.tagged cleaned_eng.untagged apertium-lang.lang.mtx
</pre>