Supervised tagger training

Supervised tagger training is the manual way of training the tagger. It takes more time than unsupervised tagger training, but is also more effective.

==What you need==

===A makefile===

It is named like this: <code>lang1-lang2-supervised.make</code>, for example, <code>en-eo-supervised.make</code>. If you don't already have one in the language pair directory, you can copy the one from the en-eo pair. You will need to modify it to fit your language pair; this usually means editing the first few lines.
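
If the en-eo pair is checked out next to yours (hypothetical paths; adjust them to your own setup), copying and renaming the makefile might look like this:

<pre>
# Hypothetical layout: both pairs checked out side by side.
cp ../apertium-en-eo/en-eo-supervised.make ./lang1-lang2-supervised.make
</pre>

Then edit the first few lines of the copy so that the language codes and paths match your pair.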

===Tagger data directory===

This is a directory called <code>lang1-tagger-data</code>, for example, <code>en-tagger-data</code> in the case of English. This directory should be inside the language pair directory. Create it if it is not there.
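
For example, to create the directory for English, run this from inside the language pair directory:

<pre>
mkdir en-tagger-data
</pre>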

===A handtagged corpus===

The [https://svn.code.sf.net/p/apertium/svn/languages/ languages repository] contains handtagged files for many languages; these end with <code>.handtagged.txt</code>. In some cases there may be more than one handtagged version of a file. You can check the difference between these and choose the most correct one. Combine the chosen handtagged files into one file, and save it as <code>lang1.tagged</code> inside <code>lang1-tagger-data</code>.
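
For example (hypothetical file names), you might compare two handtagged versions of the same text and then concatenate the files you chose:

<pre>
# Compare two handtagged versions of the same text:
diff -u story.handtagged.txt story.handtagged2.txt

# Concatenate the chosen files into one training corpus:
cat story.handtagged.txt novel.handtagged.txt > en-tagger-data/en.tagged
</pre>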

==Procedure==

First, you need to extract raw text from your handtagged files. Enter <code>lang1-tagger-data</code> and run:

<pre>
cat lang1.tagged | cut -f2 -d'^' | cut -f1 -d'/' > lang1.tagged.txt
</pre>

Replace <code>lang1</code> with the corresponding language code.
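
To see what the pipeline does, assume the usual Apertium stream format with one analysis per line (a made-up English sample): the first <code>cut</code> drops everything up to the <code>^</code>, and the second keeps only the surface form before the <code>/</code>:

<pre>
$ head -2 en.tagged
^The/the<det><def><sp>$
^cat/cat<n><sg>$
$ cut -f2 -d'^' en.tagged | cut -f1 -d'/' | head -2
The
cat
</pre>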

Next, <code>cd ..</code> up into your language pair directory and run:

<pre>
make -f lang1-lang2-supervised.make
</pre>

If everything is set up correctly, this will generate a new file called <code>lang1.untagged</code> inside <code>lang1-tagger-data</code>. This file is the tagger's interpretation of your corpus -- the machine-tagged file (although it is named <code>untagged</code>).
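
As a rough illustration (hypothetical analyses for the English word ''can''): <code>lang1.tagged</code> holds the single analysis you chose by hand, while <code>lang1.untagged</code> lists every analysis the system considers possible for that form:

<pre>
# lang1.tagged -- one hand-chosen analysis per word:
^can/can<vaux><pres>$

# lang1.untagged -- all analyses considered possible:
^can/can<vaux><pres>/can<n><sg>/can<vblex><inf>$
</pre>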

Normally, the <code>make</code> command above will end with an error. This is expected: it happens because the words in <code>lang1.tagged</code> and <code>lang1.untagged</code> do not match each other. For example, a group of words could be a multiword according to the handtagged file, but not according to the machine-tagged file, or vice versa. The error will only be resolved when the tags in <code>lang1.tagged</code> match one of the possible analyses listed in <code>lang1.untagged</code>.
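
For instance (hypothetical analyses), a multiword mismatch could look like this, with the handtagged file treating ''instead of'' as one token while the machine-tagged file splits it in two:

<pre>
# lang1.tagged -- one multiword token:
^instead of/instead of<pr>$

# lang1.untagged -- two separate tokens:
^instead/instead<adv>$
^of/of<pr>$
</pre>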

There are two common ways of solving this:

* Edit the <code>lang1.tagged</code> file, so that it matches the tags expected by <code>lang1.untagged</code> (see the sketch after this list).
* Edit the dix, so that the tagger understands the new wordform and tags it correctly the next time.
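
Continuing the hypothetical multiword example above, the first option would mean replacing the single multiword line in <code>lang1.tagged</code> with the two tokens that <code>lang1.untagged</code> expects:

<pre>
# before, in lang1.tagged:
^instead of/instead of<pr>$

# after, matching lang1.untagged:
^instead/instead<adv>$
^of/of<pr>$
</pre>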

Which method to choose depends on the particular words in question, and it is up to you to decide. In any case, do not edit the <code>lang1.untagged</code> file. It is autogenerated, and all your changes will be lost anyway.

Once the mismatch is solved, run the above command again to check whether it worked. If you get a different error from last time, it worked. Keep solving the mismatches until there are no more errors. When the command finishes without errors, the training is complete.