Difference between revisions of "Training perceptron tagger"

From Apertium

Revision as of 18:50, 5 January 2018

In this article, I will describe the pipeline for training the perceptron tagger.


Convert a UD treebank into Apertium format

First, convert the .conllu file into Apertium format using the UdTree2Apertium tool.

Start by producing a raw Apertium file. Example for English:

cat en-ud-train.conllu | grep -e '^$' -e '^[0-9]' | cut -f2 | sed 's/$/¶/g' | 
apertium-destxt | lt-proc -w ~/source/apertium//languages/apertium-eng/eng.automorf.bin | apertium-retxt | sed 's/¶//g' > en-ud-train.apertium
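
To see what the first half of this pipeline hands to apertium-destxt, the text-only stages can be run on their own. The sketch below uses a tiny made-up CoNLL-U fragment (not taken from en-ud-train.conllu): grep keeps blank lines and lines starting with a digit, cut takes the second tab-separated column (the word form), and sed appends ¶ to the end of every line.

```shell
# Made-up two-token sentence followed by the blank line that ends it.
forms=$(printf '# text = Dogs bark\n1\tDogs\tdog\tNOUN\n2\tbark\tbark\tVERB\n\n' \
  | grep -e '^$' -e '^[0-9]' \
  | cut -f2 \
  | sed 's/$/¶/')

# Comment lines are dropped; every remaining line, including the blank one, ends in ¶.
printf '%s\n' "$forms"
# Dogs¶
# bark¶
# ¶
```

Note that CoNLL-U range lines such as 1-2 also start with a digit, so this filter keeps them along with ordinary token lines.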

Then run the conversion script:

python3 converter.py tags/eng.csv en-ud-train.apertium en-ud-train.conllu eng.tagged


Preparing data for tagger

First, extract the raw text from your hand-tagged files. Run:

cat eng.tagged | cut -f2 -d'^' | cut -f1 -d'/' > eng.tagged.txt
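
The two cut calls rely on the Apertium stream format, in which a lexical unit is written as ^surface/analysis$. A minimal sketch on made-up lines, assuming eng.tagged holds one lexical unit per line (the sample analyses are illustrative only):

```shell
# Two hypothetical hand-tagged lexical units, one per line.
words=$(printf '^The/the<det><def><sp>$\n^dogs/dog<n><pl>$\n' \
  | cut -f2 -d'^' \
  | cut -f1 -d'/')

# The first cut drops the leading ^, the second keeps the text before the first /.
printf '%s\n' "$words"
# The
# dogs
```

If a line held several lexical units, cut -f2 would keep only the first one, so this extraction presupposes one unit per line.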

Next, create the ambiguous tag file (the same text with every possible analysis listed for each word). Run:

apertium -d ~/apertium-eng eng-morph eng.tagged.txt eng.untagged


Delete sentences with multiwords