Ideas for Google Summer of Code/User-friendly lexical selection training

From Apertium
Revision as of 13:46, 29 March 2021 by Unhammer

Our bilingual dictionaries allow for ambiguous translations; selecting the right one in context is handled by our Lexical selection module, apertium-lex-tools. We can either write rules manually, or word-align a corpus and train on it to infer rules automatically. Unfortunately, the training procedure is a bit messy: it involves various scripts that require lots of manual tweaking, and many third-party tools that need to be installed, e.g. irstlm, moses, gizapp.

The goal of this task is to make the training procedure as streamlined and user-friendly as possible. Ideally, we should only have to write a simple config file, and a driver script would take care of the rest. There should also be regression tests for the driver script, to ensure it keeps working as the third-party tools are updated.
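As a sketch of what such a config might contain, a TOML file could look like the following. All section and key names here are hypothetical, invented for illustration; no such format exists yet:

```toml
# Hypothetical config for lexical selection training.
# Key names are illustrative only, not an existing Apertium format.
[pair]
name = "eng-spa"

[corpus]
# Parallel corpus (for giza/fast_align-based training); leaving out
# `target` could mean monolingual training with irstlm only.
source = "corpus/europarl.eng.txt"
target = "corpus/europarl.spa.txt"

[tools]
# Where to find (or install) third-party tools
fast_align = "/usr/local/bin/fast_align"

[output]
lrx = "eng-spa.autolex.lrx"
```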

For some documentation, see Lexical selection and onwards.

To get a feel for how lexical selection is used in translation, read How to get started with lexical selection rules, although that page is aimed more at language-pair developers writing rules manually.


  • Create a simple config format (e.g. toml-based) that includes all necessary information for the training process
  • Create a driver script that will:
    • validate configuration
    • ensure third party tools are downloaded, configured and built
    • preprocess corpora
    • run training
    • finally produce an .lrx file
    • and preferably allow for evaluation of the .lrx file on a held-out test corpus
    • do this for both parallel corpora (with giza) and non-parallel corpora (just irstlm)
  • Create regression tests for driver script
  • Dog-food the work:
    • run the training on language pairs that don't have (many) lexical selection rules
    • check if it improves translation quality (using parallel corpora)
    • if it does, add rules to the pair (in cooperation with pair maintainers)
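The driver script outlined above could be sketched roughly as below. This is a minimal sketch under assumed names: the config schema (`pair`, `corpus`, `output`) is hypothetical, the config is shown as a plain dict (as parsed from e.g. a TOML file), and each pipeline step would in reality shell out to the external tools:

```python
# Hypothetical driver-script skeleton -- section names and structure
# are illustrative, not an existing Apertium config format.
REQUIRED_SECTIONS = ("pair", "corpus", "output")

def validate_config(config: dict) -> list:
    """Return a list of error messages; an empty list means valid."""
    errors = ["missing section: " + s
              for s in REQUIRED_SECTIONS if s not in config]
    if "corpus" in config and "source" not in config.get("corpus", {}):
        errors.append("corpus.source is required")
    return errors

def run_pipeline(config: dict) -> None:
    """Drive the training steps in order.  Each step would call the
    real tools (fast_align/giza, irstlm, apertium-lex-tools); the
    bodies are omitted in this sketch."""
    errors = validate_config(config)
    if errors:
        raise SystemExit("\n".join(errors))
    # 1. ensure third-party tools are downloaded, configured, built
    # 2. preprocess corpora (tagger output / biltrans output)
    # 3. run word alignment and training
    # 4. write the final .lrx file
    # 5. optionally evaluate on a held-out test corpus
```

Keeping validation separate from execution makes the schema checkable in regression tests without running the (slow) external tools.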

Coding challenges

  • Make a simple program to read a config file, check that it's valid and output some values from the config
  • Word-align a bilingual corpus with moses+giza
    • Or use fastalign – probably better / easier
  • Run lexical selection training for a language pair
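fast_align expects its parallel input as one sentence pair per line, with the source and target sides separated by ` ||| `. A small helper to produce that format from a sentence-aligned corpus (the function name is ours, just for illustration):

```python
def to_fastalign_format(source_lines, target_lines):
    """Interleave a sentence-aligned corpus into fast_align's
    one-pair-per-line `source ||| target` input format, skipping
    pairs where either side is empty."""
    pairs = []
    for src, trg in zip(source_lines, target_lines):
        src, trg = src.strip(), trg.strip()
        if src and trg:  # fast_align cannot handle empty sides
            pairs.append(src + " ||| " + trg)
    return pairs
```

For example, `to_fastalign_format(["the house"], ["la casa"])` returns `["the house ||| la casa"]`.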

Frequently asked questions

  • none yet, ask us something! :)

In the meantime, here's a top-level overview of training:

spectie[m] It's basically parallel corpus  [15:43]
spectie[m] One side is -tagger output
spectie[m] Other side is -biltrans output
spectie[m] Remove unnecessary tags using the bidix
spectie[m] Then align them
spectie[m] Get the word alignments and extract examples where you can
           theoretically get the right result (e.g. one of the biltrans words
           aligns to a word in the target)  [15:44]
spectie[m] Then extract n-grams around those target words and count them
spectie[m] Throw those counts into some kind of maxent learner  [15:45]
spectie[m] Extract features + target + translation + weight
spectie[m] But nowadays I'd use word embeddings  [15:46]
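The n-gram counting step spectie describes could be sketched like this. This is a toy illustration, not the actual apertium-lex-tools implementation; the feature shapes and function names are invented:

```python
from collections import Counter

def context_features(tokens, idx, n=2):
    """Yield n-grams of surrounding words (up to `n` on each side)
    around the ambiguous word at position `idx`."""
    for width in range(1, n + 1):
        if idx - width >= 0:
            yield ("left", tuple(tokens[idx - width:idx]))
        if idx + width < len(tokens):
            yield ("right", tuple(tokens[idx + 1:idx + 1 + width]))

def count_features(examples):
    """Count (context-feature, chosen-translation) pairs over training
    examples of the form (sentence_tokens, target_index, translation).
    These counts are what would be fed to a maxent learner."""
    counts = Counter()
    for tokens, idx, translation in examples:
        for feat in context_features(tokens, idx):
            counts[(feat, translation)] += 1
    return counts
```

A maxent learner over such counts would then assign each (feature, translation) pair a weight, which is what ends up in the generated rules.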


  • There has been some work on this already.
  • It would be awesome to set something up so that users could specify any number of SL words to generate rules for, instead of getting rules for the entire language.

See also