Ideas for Google Summer of Code/Improvements to target-language tagger training


Enhance source segmentation used during target-language tagger training and improve the disambiguation path pruning algorithm

apertium-tagger-training-tools is a program for target-language tagger training: it tunes the parameters of a hidden Markov model (HMM) according to the quality of the translations produced by the whole system. To do so, it segments the source-language training corpus, taking into account the patterns detected by the structural transfer module, and translates all possible disambiguation paths of each source-language segment into the target language. To avoid translating every disambiguation path, a pruning method based on the a-priori likelihood of each disambiguation path is implemented. Note, however, that this pruning method still requires computing the a-priori likelihood of all possible disambiguation paths.
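To make the cost of the current approach concrete, here is a minimal sketch of exhaustive pruning. It is not the code of apertium-tagger-training-tools: the function names, the toy tag set and the probabilities are all hypothetical, the likelihood uses tag transitions only (a real HMM would also include emission probabilities), and keeping a fixed number k of paths is an assumed pruning criterion. The point is that every path has to be scored before any can be discarded.

```python
from itertools import product

def path_likelihood(path, trans, start):
    """A-priori likelihood of one tag sequence (transition probabilities only)."""
    p = start.get(path[0], 0.0)
    for prev, cur in zip(path, path[1:]):
        p *= trans.get((prev, cur), 0.0)
    return p

def prune_exhaustively(segment, trans, start, k):
    """Score every disambiguation path of the segment, then keep the k best."""
    scored = [(path_likelihood(path, trans, start), path)
              for path in product(*segment)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Hypothetical three-word segment: each word lists its candidate tags.
segment = [("det",), ("noun", "verb"), ("noun", "verb")]
start = {"det": 1.0}
trans = {
    ("det", "noun"): 0.9, ("det", "verb"): 0.1,
    ("noun", "noun"): 0.3, ("noun", "verb"): 0.7,
    ("verb", "noun"): 0.6, ("verb", "verb"): 0.4,
}

# All 1 * 2 * 2 = 4 paths are scored, although only 2 are kept.
kept = prune_exhaustively(segment, trans, start, k=2)
for likelihood, path in kept:
    print(path, likelihood)
```

The number of paths is the product of the number of candidate tags of each word, so it grows exponentially with the length of the segment.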

The project consists of two parts.

  • The first part consists of making apertium-tagger-training-tools able to segment using structural transfer rules of any level (right now, it only "understands" one-level, shallow-transfer rules).
  • The second part consists of implementing a k-best Viterbi algorithm to avoid computing the a-priori likelihood of all paths before pruning; in this way, only the k best disambiguation paths are translated to the target language.
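The second part can be illustrated with a minimal k-best Viterbi sketch. As before, this is only an illustration under stated assumptions, not the project's implementation: the function name `k_best_viterbi`, the toy tags and the probabilities are hypothetical, the segment is assumed to be non-empty, and the model uses tag transitions only (no emission probabilities). For each word and each candidate tag it keeps at most k partial paths, so the full set of paths is never enumerated.

```python
import heapq

def k_best_viterbi(segment, trans, start, k):
    """Return the k most likely tag paths without enumerating every path.

    segment: list of tuples, the candidate tags of each word (non-empty).
    trans:   dict mapping (previous_tag, tag) to a transition probability.
    start:   dict mapping a tag to its initial probability.
    """
    # beams[tag] holds up to k (likelihood, path) pairs ending in that tag.
    beams = {tag: [(start.get(tag, 0.0), (tag,))] for tag in segment[0]}
    for candidates in segment[1:]:
        new_beams = {}
        for tag in candidates:
            # Extend every surviving partial path with the current tag.
            extended = [
                (score * trans.get((prev, tag), 0.0), path + (tag,))
                for prev, hyps in beams.items()
                for score, path in hyps
            ]
            new_beams[tag] = heapq.nlargest(k, extended, key=lambda h: h[0])
        beams = new_beams
    finals = [hyp for hyps in beams.values() for hyp in hyps]
    return heapq.nlargest(k, finals, key=lambda h: h[0])

# Hypothetical three-word segment: each word lists its candidate tags.
segment = [("det",), ("noun", "verb"), ("noun", "verb")]
start = {"det": 1.0}
trans = {
    ("det", "noun"): 0.9, ("det", "verb"): 0.1,
    ("noun", "noun"): 0.3, ("noun", "verb"): 0.7,
    ("verb", "noun"): 0.6, ("verb", "verb"): 0.4,
}

best = k_best_viterbi(segment, trans, start, k=2)
for likelihood, path in best:
    print(path, likelihood)
```

Each step handles at most k partial paths per previous tag, so the work grows linearly with the length of the segment rather than exponentially. A practical implementation would probably work with log-probabilities to avoid numerical underflow on long segments.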

This task would also require switching the default Perl-based language model to either IRSTLM or RandLM (or both!).

Further reading: