Ideas for Google Summer of Code/Apertium FST CG
The purpose of this task is to create a replacement for the use of Constraint Grammar (CG) as the first step of Apertium disambiguation, before the part-of-speech tagger.
Currently, many language pairs use Constraint Grammar as a pre-disambiguator for the Apertium tagger, which allows finer-grained constraints than would otherwise be possible. However, the current implementation of CG is much slower than most of the other modules in the Apertium pipeline, and its syntax is also very different from that of other Apertium modules (dictionaries, lexical selection, transfer rules, etc.).
There have been a few attempts to create FST versions of CG (see User:David_Nemeskey/GSOC_progress_2013), but they haven't succeeded. The hypothesis is that a simpler version of CG that supports the main features of CG (there is no need for feature parity) would have better adoption and integration within the Apertium pipeline.
The Constraint-based lexical selection module could be used as a reference implementation, as it handles a similar type of rule. The difference is that lexical selection looks at the source language (left side) to decide on the target language (right side), while in the disambiguation module both the left and the right side can be used to decide on the right side.
- Extract common use cases of Constraint Grammar in Apertium languages
- Create a prototype in a scripting language that allows for simple disambiguation rules (select/remove)
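
As a rough illustration of what such a prototype might look like, here is a minimal Python sketch of select/remove rules over ambiguous readings. The `Cohort` and `Rule` classes, their fields, and the single-tag context condition are hypothetical simplifications made up for this example; they are not the CG-3 rule format or any existing Apertium API.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One surface form with its remaining candidate readings."""
    surface: str
    readings: list  # list of frozensets of tags, e.g. frozenset({"n", "pl"})


@dataclass
class Rule:
    """Hypothetical simplified rule: act on `target` if `context` is found at `offset`."""
    action: str   # "select" or "remove"
    target: str   # tag the rule acts on
    offset: int   # relative position of the context cohort (-1 = left, 1 = right)
    context: str  # tag that some reading of the context cohort must carry


def apply_rule(sentence, rule):
    """Apply one rule to every cohort of the sentence, in place."""
    for i, cohort in enumerate(sentence):
        j = i + rule.offset
        if not 0 <= j < len(sentence):
            continue  # context position falls outside the sentence
        if not any(rule.context in r for r in sentence[j].readings):
            continue  # context condition not met
        if rule.action == "select":
            kept = [r for r in cohort.readings if rule.target in r]
        else:  # "remove"
            kept = [r for r in cohort.readings if rule.target not in r]
        if kept:  # never delete the last remaining reading
            cohort.readings = kept


def disambiguate(sentence, rules):
    """Apply the rules in order; each rule sees the result of the previous ones."""
    for rule in rules:
        apply_rule(sentence, rule)
    return sentence


# "walks" is ambiguous between a plural noun and a present-tense verb.
sentence = [
    Cohort("the", [frozenset({"det"})]),
    Cohort("walks", [frozenset({"n", "pl"}), frozenset({"vblex", "pres"})]),
]
# Remove the verb reading when the word to the left can be a determiner.
rules = [Rule("remove", "vblex", -1, "det")]
disambiguate(sentence, rules)
for cohort in sentence:
    print(cohort.surface, [sorted(r) for r in cohort.readings])
```

In this sketch the verb reading of "walks" is discarded because the cohort to its left has a determiner reading. A rule never removes the last remaining reading of a cohort, so every word keeps at least one analysis for the tagger that runs afterwards.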