Ideas for Google Summer of Code/Monolingual and bilingual data decoupling
From Apertium
Develop a method (scripts) to allow monolingual and bilingual data in Apertium to be decoupled, leaving each language pair with only the necessary bilingual data.
At the moment, Apertium has a separate module for each language pair. Each pair is self-contained, with a copy of both the monolingual data (e.g. POS tagger probabilities and monolingual dictionaries) and bilingual data (e.g. transfer rules and dictionaries). The method should be tested with es-ca, es-pt and pt-ca. After decoupling, all pairs should pass testvoc.
Tasks
- Extend the lt-comp compiler with a mode for compiling analysers that checks the bilingual dictionary to see whether the current prefix is shared (see, for example, automatically trimming a monodix for a preliminary implementation).
- Decouple some language pairs.
- Optional
- Edit each Apertium stage that comes after the POS tagger so that it accepts input that includes the original-language surface form.
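The prefix check in the first task can be illustrated with a small sketch. This is not how lt-comp is implemented; it only assumes that analyses and bilingual-dictionary entries are written as a lemma followed by tags (e.g. `house<n><pl>`), and that an analysis is kept when some bilingual entry matches it up to a tag boundary. The names `PrefixTrie`, `tokenise` and `trim` are hypothetical.

```python
import re
from typing import Dict, Iterable, List


def tokenise(form: str) -> List[str]:
    """Split 'house<n><pl>' into ['h','o','u','s','e','<n>','<pl>'] (tags are single symbols)."""
    return re.findall(r"<[^>]+>|.", form)


class PrefixTrie:
    """Minimal symbol trie holding the bilingual-dictionary entries (hypothetical helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTrie"] = {}
        self.is_entry = False  # True if a bilingual entry ends at this node

    def add(self, entry: str) -> None:
        node = self
        for sym in tokenise(entry):
            node = node.children.setdefault(sym, PrefixTrie())
        node.is_entry = True

    def covers(self, analysis: str) -> bool:
        """True if a stored entry is a prefix of `analysis` ending at a tag boundary."""
        node = self
        for sym in tokenise(analysis):
            # An entry ended here and only tags follow: the prefix is shared.
            if node.is_entry and sym.startswith("<"):
                return True
            nxt = node.children.get(sym)
            if nxt is None:
                return False
            node = nxt
        return node.is_entry


def trim(analyses: Iterable[str], bidix_entries: Iterable[str]) -> List[str]:
    """Keep only the analyses that the bilingual dictionary can translate."""
    trie = PrefixTrie()
    for entry in bidix_entries:
        trie.add(entry)
    return [a for a in analyses if trie.covers(a)]


if __name__ == "__main__":
    monolingual = ["house<n><pl>", "house<vblex><inf>", "catalog<n>", "cat<n><sg>"]
    bilingual = ["house<n>", "cat"]
    print(trim(monolingual, bilingual))
```

With the sample data, only `house<n><pl>` and `cat<n><sg>` survive: the verb reading of "house" has no bilingual entry, and "catalog" is rejected because the entry "cat" does not end at a tag boundary.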
Coding challenge
- Install Apertium (see Minimal installation from SVN).
- Install the language pairs es-ca, pt-ca, es-pt.
- Run testvoc on those language pairs, paying attention to variants (pt_BR/pt and ca_valencia/ca).
- Try copying es-ca.ca.dix and es-pt.pt.dix to pt-ca and see what happens.
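When experimenting with the last step, it may help to list the lemmas that a copied monolingual dictionary contains but the pair's bilingual dictionary does not, since those are likely candidates for testvoc failures. The sketch below is only a rough aid: it assumes the usual dix layout (`<e lm="...">` entries in the monodix, `<l>`/`<r>` sides in the bidix), compares lemmas only and ignores tags, and the function names and sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Set


def monodix_lemmas(xml_text: str) -> Set[str]:
    """Collect the lm attribute of every <e> entry in a monolingual dictionary."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iter("e") if e.get("lm")}


def bidix_lemmas(xml_text: str, side: str) -> Set[str]:
    """Collect lemmas from one side ('l' or 'r') of a bilingual dictionary."""
    root = ET.fromstring(xml_text)
    lemmas = set()
    for el in root.iter(side):
        parts = [el.text or ""]
        for child in el:
            if child.tag == "b":  # <b/> stands for a space inside a multiword lemma
                parts.append(" ")
            parts.append(child.tail or "")
        lemma = "".join(parts).strip()
        if lemma:
            lemmas.add(lemma)
    return lemmas


# Made-up sample data standing in for a copied ca monodix and a pt-ca bidix.
MONODIX = """<dictionary><section id="main" type="standard">
<e lm="casa"><i>cas</i><par n="cas/a__n"/></e>
<e lm="gos"><i>gos</i><par n="gos__n"/></e>
</section></dictionary>"""

BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>casa<s n="n"/></l><r>casa<s n="n"/></r></p></e>
</section></dictionary>"""

if __name__ == "__main__":
    # Lemmas analysed by the monodix but missing from the right side of the bidix.
    print(sorted(monodix_lemmas(MONODIX) - bidix_lemmas(BIDIX, "r")))
```

On real dictionaries, read the files from disk and pass their contents to the two functions; entries reported as missing are the ones that trimming would need to remove.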
Frequently asked questions
- none yet, ask us something! :)

