Perceptron tagger
The perceptron part-of-speech tagger implements part-of-speech tagging using the averaged, structured perceptron algorithm. Some information about the implementation is available [https://github.com/frankier/perceptron-tagger-slides/raw/master/presentation.pdf in this presentation]. The implementation is based on the references in the final slide.
 
==What you need==
===Training directory===
While training can be done directly in the language directory, it is a better idea to train the tagger with copies of the files in another directory.
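For example, a minimal setup might look like this (the directory name is arbitrary; copy in the files described in the following subsections):

<pre>
mkdir tagger-training
cd tagger-training
# now copy in the tagged corpus, the MTX file and the compiled analyser
</pre>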
===A handtagged corpus===
[https://svn.code.sf.net/p/apertium/svn/languages/ This repo] contains handtagged files for many languages. These end with <code>.handtagged.txt</code>. In some cases there may be more than one handtagged version of a file; you can check the differences between them and choose the most correct one. Combine the chosen handtagged files into a single file and save it as <code>lang.tagged</code> inside the training directory (replace <code>lang</code> with the corresponding language code).
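For instance, assuming the chosen handtagged files sit in a directory <code>texts/</code> and the language code is <code>eng</code> (both illustrative), they could be combined like this:

<pre>
cat texts/*.handtagged.txt > eng.tagged
</pre>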
===An MTX file===
The perceptron tagger uses an MTX file to define macros and operations over wordoids. If your language does not have an MTX file yet, you can get one from [https://svn.code.sf.net/p/apertium/svn/branches/apertium-tagger/experiments/mtx/ here]. <code>spacycoarsetags.mtx</code> is a good starting point (make sure you modify it to point to a TSX file). Save it as <code>apertium-lang.lang.mtx</code> in your training directory.
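For example, with the illustrative language code <code>eng</code>, the sample file can be fetched and renamed in one step, since <code>svn export</code> works on single files:

<pre>
svn export https://svn.code.sf.net/p/apertium/svn/branches/apertium-tagger/experiments/mtx/spacycoarsetags.mtx apertium-eng.eng.mtx
</pre>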
For more information on the MTX format, see [[MTX format]].
===A morphological analyzer===
Compile the morphological analyzer for your language and save it in the training directory as <code>lang.automorf.bin</code>.
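For an lttoolbox-based language this is a single <code>lt-comp</code> call; for example (again with the illustrative code <code>eng</code>; HFST-based languages have their own build step):

<pre>
lt-comp lr apertium-eng.eng.dix eng.automorf.bin
</pre>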
== Procedure ==
First, extract the raw text from your handtagged corpus. Run:
<pre>
cat lang.tagged | cut -f2 -d'^' | cut -f1 -d'/' > lang.tagged.txt
</pre>
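To illustrate (a made-up line; real corpora vary), each line of the handtagged file holds one word in the Apertium stream format, and the pipeline keeps only the surface form:

<pre>
# a line of lang.tagged (hypothetical):
^dogs/dog<n><pl>$
# the corresponding line of lang.tagged.txt:
dogs
</pre>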
Next, create the ambiguous tag file (a tagged text with all the possible analyses for each word). Run:
<pre>
cat lang.tagged.txt | lt-proc -w 'lang.automorf.bin' > lang.untagged
</pre>
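Continuing the hypothetical example, <code>lt-proc -w</code> re-analyses each surface form, so an ambiguous word now carries every analysis the morphology provides:

<pre>
# a line of lang.untagged (hypothetical analyses):
^dogs/dog<n><pl>/dog<vblex><pri><p3><sg>$
</pre>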
Now you are ready to train the tagger. Run:
<pre>
apertium-tagger [--skip-on-error] -xs [ITERATIONS] lang.prob lang.tagged lang.untagged apertium-lang.lang.mtx
</pre>
This will generate the <code>.prob</code> file for your language. Use <code>--skip-on-error</code> to discard sentences for which the tagged and untagged corpora do not match (this often happens when the tagged corpus gets out of sync with the morphology). A reasonable value for ITERATIONS is 10.
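For example, a concrete invocation for the illustrative <code>eng</code> setup used above might be:

<pre>
apertium-tagger --skip-on-error -xs 10 eng.prob eng.tagged eng.untagged apertium-eng.eng.mtx
</pre>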
If your tagged and untagged files are not aligned, the training process will fail. You must then edit the handtagged file so that its tags match one of the analyses generated automatically by Apertium. Never edit the Apertium-generated file!
Keep editing the handtagged file until everything is fully aligned, or (when using <code>--skip-on-error</code>) until the number of skipped sentences is very low. Congratulations, you have trained the tagger!
== Using the perceptron tagger ==
Once the tagger has been trained, you can use it in the pipeline like this:
<pre>
apertium-tagger -gx lang.prob
</pre>
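For instance, a minimal analyse-then-tag pipeline (a sketch using the illustrative <code>eng</code> file names; in a real mode a deformatter would normally precede <code>lt-proc</code>) would be:

<pre>
echo "the dogs bark" | lt-proc -w eng.automorf.bin | apertium-tagger -gx eng.prob
</pre>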
== Getting more information ==
Getting detailed information about the operation of the tagger is useful both for debugging the tagger itself and for designing new feature templates.
{| class="wikitable"
  +
! Tool
  +
! Description
  +
|-
  +
| apertium-tagger --tagger --debug
  +
| Traces the tagging process.
  +
|-
  +
| apertium-perceptron-trace model MODEL_FILE
  +
| Output the data from MODEL_FILE including the feature bytecode/disassembly and the model weights.
  +
|-
  +
| apertium-perceptron-trace path MTX_FILE UNTAGGED_CORPUS TAGGED_CORPUS
  +
| Generates features for every possible wordoid as if tagging were taking place and outputting features from TAGGED_CORPUS.
  +
|}
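For example, to inspect a trained model (the output can be long, so pipe it through a pager):

<pre>
apertium-perceptron-trace model eng.prob | less
</pre>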
== Potential improvements ==
'''Speed''': Some quick benchmarking with [http://stackoverflow.com/a/378024/678387 this method] suggests that the two biggest bottlenecks are copying stack values, which could be ameliorated by using reference-counted pointers, and tag coarsening, where there might be room to reuse some of the objects/machinery. In fact, copying objects where a reference (either managed or not) would do is a deficiency in other places too.
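The linked method is the "poor man's profiler": repeatedly grab stack traces of the running process and see which frames recur. A rough sketch (assuming a long-running <code>apertium-tagger</code> process):

<pre>
# sample the stack of a running tagger a few times and eyeball the hot frames
for i in $(seq 1 10); do
  gdb -batch -ex "set pagination 0" -ex "thread apply all bt" -p "$(pidof apertium-tagger)"
  sleep 1
done > stacks.txt
</pre>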
[[Category:Documentation in English]]
