The perceptron part-of-speech tagger implements part-of-speech tagging using the averaged, structured perceptron algorithm. Some information about the implementation is available in this presentation. The implementation is based on the references in the final slide.
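For readers unfamiliar with the algorithm, the following is a minimal sketch of averaged perceptron training. It is not the Apertium implementation: it is simplified to classify each token independently, whereas a structured perceptron scores whole tag sequences. The function names, the feature strings and the toy data are all made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Weights = Dict[Tuple[str, str], float]  # (feature, tag) -> weight


def predict(weights: Weights, features: Sequence[str], tags: Sequence[str]) -> str:
    """Return the highest-scoring tag; ties go to the earliest tag in `tags`."""
    def score(tag: str) -> float:
        return sum(weights.get((f, tag), 0.0) for f in features)
    return max(tags, key=score)


def train(examples: List[Tuple[List[str], str]],
          tags: Sequence[str],
          iterations: int = 10) -> Weights:
    """Train on (features, gold_tag) pairs and return the averaged weights."""
    weights: Weights = {}
    totals: Weights = {}  # running sum of the weights after every example
    steps = 0
    for _ in range(iterations):
        for features, gold in examples:
            guess = predict(weights, features, tags)
            if guess != gold:
                # Perceptron update: reward the gold tag, penalise the guess.
                for f in features:
                    weights[(f, gold)] = weights.get((f, gold), 0.0) + 1.0
                    weights[(f, guess)] = weights.get((f, guess), 0.0) - 1.0
            # Naive averaging: accumulate the current weights at every step.
            for key, w in weights.items():
                totals[key] = totals.get(key, 0.0) + w
            steps += 1
    return {key: total / steps for key, total in totals.items()}


# Toy data (hypothetical features and tags).
TAGS = ["det", "n", "vblex"]
EXAMPLES = [(["w=the"], "det"), (["w=dog"], "n"), (["w=runs"], "vblex")]
AVERAGED = train(EXAMPLES, TAGS, iterations=10)
```

Averaging the weights over all training steps, rather than keeping only the final weights, makes the model less sensitive to the last few examples it saw. Real implementations usually compute the average lazily instead of summing every weight at every step.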
Step by step
The process is mostly the same as in Supervised tagger training, except that you need an MTX file (and, optionally, a TSX file) instead of just a TSX file.
- Get an MTX file: Copy an MTX file into your language directory and optionally modify it (or start from scratch). See MTX format.
- Get a tagged corpus
- Train the tagger like so:
apertium-tagger [--skip-on-error] -xs [ITERATIONS] MODEL_FILE TAGGED_CORPUS UNTAGGED_CORPUS MTX_FILE
This writes the resulting model to MODEL_FILE. You can put this in a Makefile. Use --skip-on-error to discard sentences for which TAGGED_CORPUS and UNTAGGED_CORPUS do not match (this often happens when the tagged corpus gets out of sync with the morphology). A reasonable value for ITERATIONS is 10.
- Run the tagger like so:
apertium-tagger --tagger --perceptron MODEL_FILE
You can put this in your modes.xml.
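The two invocations above can also be assembled from a script. The sketch below only builds the argument lists, using the flags exactly as shown in the steps; the helper names and the file names in the example are hypothetical, and passing ITERATIONS as a separate argument after -xs is an assumption based on the usage line.

```python
from typing import List, Optional


def train_command(model_file: str,
                  tagged_corpus: str,
                  untagged_corpus: str,
                  mtx_file: str,
                  iterations: Optional[int] = 10,
                  skip_on_error: bool = False) -> List[str]:
    """Build the argv list for training a perceptron tagger model."""
    cmd = ["apertium-tagger"]
    if skip_on_error:
        # Discard sentences where the tagged and untagged corpora disagree.
        cmd.append("--skip-on-error")
    cmd.append("-xs")
    if iterations is not None:
        cmd.append(str(iterations))
    cmd += [model_file, tagged_corpus, untagged_corpus, mtx_file]
    return cmd


def tag_command(model_file: str) -> List[str]:
    """Build the argv list for running the tagger with a trained model."""
    return ["apertium-tagger", "--tagger", "--perceptron", model_file]


# Hypothetical file names for a language pair directory.
TRAIN = train_command("xyz.prob", "xyz.tagged", "xyz.untagged", "xyz.mtx",
                      iterations=10, skip_on_error=True)
TAG = tag_command("xyz.prob")
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where apertium-tagger is installed.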
Getting more information
Detailed information about the operation of the tagger is useful both for debugging the tagger itself and for designing new feature templates.
|apertium-tagger --tagger --debug||Traces the tagging process.|
|apertium-perceptron-trace model MODEL_FILE||Outputs the data from MODEL_FILE, including the feature bytecode/disassembly and the model weights.|
|apertium-perceptron-trace path MTX_FILE UNTAGGED_CORPUS TAGGED_CORPUS||Generates features for every possible wordoid as if tagging were taking place, and outputs the features from TAGGED_CORPUS.|
Speed: Some quick benchmarking with this method has revealed that the two biggest bottlenecks might be copying stack values, which could be ameliorated by using reference-counted pointers, and coarsening tags, where there might be room to reuse some of the objects/machinery. In fact, copying objects where a reference (either managed or not) would do is a deficiency in other places too.