Ideas for Google Summer of Code/Monolingual and bilingual data decoupling

From Apertium

Develop a method (scripts) to allow monolingual and bilingual data in Apertium to be decoupled, leaving each language pair with only the necessary bilingual data.

At the moment, Apertium has a separate module for each language pair. Each pair is self-contained, with a copy of both the monolingual data (e.g. POS tagger probabilities and monolingual dictionaries) and bilingual data (e.g. transfer rules and dictionaries). The method should be tested with es-ca, es-pt and pt-ca. After decoupling, all pairs should pass testvoc.

Tasks

  • Extend the lt-comp compiler with a mode for compiling analysers that, while building each entry, checks a bilingual dictionary to see whether the current prefix is also present there (see "Automatically trimming a monodix" for a preliminary implementation).
  • Decouple some language pairs.
Optional
  • Edit each of the Apertium stages that come after the POS tagger so that they accept input that still carries the source-language surface form.
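
The prefix check described in the first task can be sketched outside lt-comp. The following is a minimal illustration in Python, not the lt-comp implementation: it assumes the analyser is available as a list of (surface form, lexical form) pairs and the source side of the bilingual dictionary as a list of lexical-form prefixes such as "casa<n>". The function names and the sample entries are hypothetical.

```python
# Sketch only: trims a toy analyser against bilingual-dictionary prefixes.
# The data layout and function names are assumptions for illustration.

END = ""  # sentinel key marking the end of a bilingual entry in the trie


def build_trie(entries):
    """Build a character trie from the bilingual dictionary's source side."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def is_covered(lexical, trie):
    """Return True if some bilingual entry is a prefix of the lexical form."""
    node = trie
    for ch in lexical:
        if END in node:          # a complete bilingual entry has been matched
            return True
        node = node.get(ch)
        if node is None:         # the current prefix is not shared
            return False
    return END in node


def trim_analyser(analyses, bidix_entries):
    """Keep only the analyses whose lexical form the bilingual data covers."""
    trie = build_trie(bidix_entries)
    return [(surface, lexical) for surface, lexical in analyses
            if is_covered(lexical, trie)]


if __name__ == "__main__":
    analyses = [
        ("casa", "casa<n><f><sg>"),
        ("cases", "casa<n><f><pl>"),
        ("casar", "casar<vblex><inf>"),
        ("gos", "gos<n><m><sg>"),
    ]
    for surface, lexical in trim_analyser(analyses, ["casa<n>", "gos<n>"]):
        print(surface, lexical)
```

In this toy data, "casar" is dropped because no bilingual entry is a prefix of its lexical form, while both forms of "casa" survive. A real implementation would work on the transducer during compilation rather than on an expanded list.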

Coding challenge

  • Install Apertium (see Minimal installation from SVN).
  • Install the language pairs es-ca, pt-ca, es-pt.
  • Testvoc those language pairs, paying attention to variants (pt_BR/pt and ca_valencia/ca).
  • Try copying the es-ca.ca.dix and the es-pt.pt.dix to pt-ca and see what happens.
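
One likely outcome of the last step is that the copied monolingual dictionary contains lemmas that the pair's bilingual dictionary does not cover. The sketch below, in Python, shows one rough way to list such lemmas before running testvoc. It assumes the usual dix layout (an lm attribute on monolingual <e> entries, and <l>/<r> sides in bilingual entries); the inline dictionaries, paradigm names and function names are invented for illustration, and multiword lemmas, tags and paradigms are ignored.

```python
# Sketch only: compares lemmas, not full lexical forms, so it is much
# cruder than testvoc. The sample dictionaries below are made up.
import xml.etree.ElementTree as ET

MONODIX_CA = """<dictionary>
  <section id="main" type="standard">
    <e lm="casa"><i>cas</i><par n="cas/a__n"/></e>
    <e lm="gos"><i>gos</i><par n="gos__n"/></e>
    <e lm="finestra"><i>finestr</i><par n="cas/a__n"/></e>
  </section>
</dictionary>"""

BIDIX_PT_CA = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>casa<s n="n"/></l><r>casa<s n="n"/></r></p></e>
    <e><p><l>cão<s n="n"/></l><r>gos<s n="n"/></r></p></e>
  </section>
</dictionary>"""


def monodix_lemmas(xml_text):
    """Collect the lm attribute of every <e> entry that has one."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iter("e") if e.get("lm")}


def bidix_lemmas(xml_text, side):
    """Collect the leading text of each <l> or <r> element (side is 'l' or 'r')."""
    root = ET.fromstring(xml_text)
    return {(node.text or "").strip() for node in root.iter(side)}


def uncovered_lemmas(monodix_xml, bidix_xml, side):
    """Lemmas in the monolingual dictionary with no bilingual counterpart."""
    return monodix_lemmas(monodix_xml) - bidix_lemmas(bidix_xml, side)


if __name__ == "__main__":
    # Catalan is assumed to be the right-hand side of the pt-ca dictionary.
    for lemma in sorted(uncovered_lemmas(MONODIX_CA, BIDIX_PT_CA, "r")):
        print(lemma)
```

With the sample data, only "finestra" is reported, since it appears in the Catalan monolingual dictionary but not on the Catalan side of the bilingual one.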

Frequently asked questions

  • none yet, ask us something! :)
