Perceptron tagger

The perceptron part-of-speech tagger implements part-of-speech tagging using the averaged, structured perceptron algorithm. Some information about the implementation is available in this presentation. The implementation is based on the references in the final slide.

What you need

Training directory

While training can be done directly in the language directory, it is better to train the tagger on copies of the files in a separate directory.
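
For example, a possible setup looks like the following (the directory name and paths are placeholders; copy whichever files your language directory actually provides):

mkdir ~/tagger-training
cd ~/tagger-training
cp /path/to/apertium-lang/*.handtagged.txt .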

A handtagged corpus

This repo also contains many handtagged files for every language. These end with .handtagged.txt. In some cases, there may be more than one handtagged version of a file; compare the versions and choose the most accurate one. Combine the chosen handtagged files into a single file and save it as lang.tagged inside the training directory (replace lang with the corresponding language code).
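
For instance, assuming two chosen files named file1.handtagged.txt and file2.handtagged.txt (hypothetical names), combining them is a single concatenation:

cat file1.handtagged.txt file2.handtagged.txt > lang.tagged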

An MTX file

The perceptron tagger uses an MTX file to define macros and operations with wordoids. If your language does not have an MTX file yet, you can get one from here; spacycoarsetags.mtx is a good start (make sure you modify it to point to a TSX file for your language). Save it as apertium-lang.lang.mtx in your training directory.
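
Assuming you have downloaded spacycoarsetags.mtx into the training directory, putting it in place under the expected name is a single copy:

cp spacycoarsetags.mtx apertium-lang.lang.mtx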

For more information on the MTX format, see MTX format.

A morphological analyzer

Compile the morphological analyzer for your language and save it in the training directory as lang.automorf.bin.
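
For an lttoolbox-based analyzer, for instance (an assumption; languages with HFST analyzers are compiled differently), compilation looks like:

lt-comp lr apertium-lang.lang.dix lang.automorf.bin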

Procedure

First, you need to extract the raw text from your handtagged corpus. Run:

cat lang.tagged | cut -f2 -d'^' | cut -f1 -d'/' > lang.tagged.txt
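
To illustrate what this pipeline does, a single tagged lexical unit (the Spanish form vino here is purely illustrative) reduces to its surface form:

echo '^vino/vino<n><m><sg>$' | cut -f2 -d'^' | cut -f1 -d'/'
# prints: vino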

Next, create the ambiguously tagged file (the text with every possible analysis for each word). Run:

cat lang.tagged.txt | lt-proc -w 'lang.automorf.bin' > lang.untagged
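
In the resulting lang.untagged, each surface form carries all the analyses the analyzer knows for it. An illustrative (not corpus-derived) ambiguous entry might look like:

^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$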

Now you are ready to train the tagger. Run:

apertium-tagger [--skip-on-error] -xs [ITERATIONS] lang.prob lang.tagged lang.untagged apertium-lang.lang.mtx

This will generate the .prob file for your language. Use --skip-on-error to discard sentences for which the tagged and untagged corpora do not match (this often happens when the tagged corpus gets out of sync with the morphology). A reasonable value for ITERATIONS is 10.
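
Putting that together, a typical invocation with the suggested values would be:

apertium-tagger --skip-on-error -xs 10 lang.prob lang.tagged lang.untagged apertium-lang.lang.mtx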

If your tagged and untagged files are not aligned, the training process will fail. You must then edit the handtagged file so that its tags are correct choices from among the analyses generated automatically by Apertium. Never edit the Apertium-generated file!

Keep editing the handtagged file until everything is fully aligned, or (when using --skip-on-error) until the number of skipped sentences is very low. Congratulations, you have trained the tagger!

Using the perceptron tagger

Once the tagger has been trained, you can use it in the pipeline like this:

apertium-tagger -gx lang.prob
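
As a minimal sketch of a full pipeline (the input text is a placeholder, and a real Apertium mode would normally run a deformatter first):

echo 'some text' | lt-proc lang.automorf.bin | apertium-tagger -gx lang.prob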

Getting more information

Detailed information about the operation of the tagger is useful both for debugging the tagger itself and for designing new feature templates.

apertium-tagger --tagger --debug
    Traces the tagging process.

apertium-perceptron-trace model MODEL_FILE
    Outputs the data from MODEL_FILE, including the feature bytecode/disassembly and the model weights.

apertium-perceptron-trace path MTX_FILE UNTAGGED_CORPUS TAGGED_CORPUS
    Generates features for every possible wordoid as if tagging were taking place, and outputs the features from TAGGED_CORPUS.
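
For example, to inspect the feature bytecode and weights of the model trained above:

apertium-perceptron-trace model lang.prob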

Potential improvements

Speed: Some quick benchmarking with this method has suggested that the two biggest bottlenecks might be copying stack values, which could be ameliorated by using reference-counted pointers, and coarsening tags, where there might be room to reuse some of the objects/machinery. In fact, copying objects where a reference (either managed or not) would suffice is a deficiency in other places too.