Ideas for Google Summer of Code/lint for Apertium
Make a program which tests Apertium data files for suspicious or discouraged constructs (which are likely to be bugs). When several people work on the same code, entries can end up repeated, and beginners can make changes that go against recommended practice. A lint tester would help people write standard code for dictionaries and transfer files.
The lint tool should at least check lttoolbox (dix) dictionary data, perhaps also transfer rules. The Apertium New Language Pair HOWTO should introduce most of the terminology and background you need (but Monodix basics and Contributing to an existing pair may also be useful).
Tasks
Coding challenge
- Write a program which parses a .dix file and for each (surface form, lexical form) pair, lists entries/paradigms which generate this pair.
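One possible starting point for the challenge is sketched below. It is a minimal sketch under several assumptions: it only understands `<i>`, `<p>` with `<l>`/`<r>`, `<par>`, `<s>` and `<b/>`; it ignores regular-expression entries, restrictions and other features of the format; and the dictionary in `SAMPLE_DIX`, like the function names `side_text`, `expand` and `index_pairs`, is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy dictionary invented for illustration; not taken from a real language pair.
SAMPLE_DIX = """
<dictionary>
  <pardefs>
    <pardef n="full__adj">
      <e><p><l>o</l><r>o<s n="adj"/><s n="m"/></r></p></e>
      <e><p><l>a</l><r>o<s n="adj"/><s n="f"/></r></p></e>
    </pardef>
    <pardef n="short__adj">
      <e><p><l>o</l><r>o<s n="adj"/><s n="m"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="foo"><i>fo</i><par n="full__adj"/></e>
    <e lm="foo"><i>fo</i><par n="short__adj"/></e>
  </section>
</dictionary>
"""


def side_text(elem):
    """Flatten an <l>, <r> or <i> element; <s n="x"/> is rendered as <x>."""
    if elem is None:
        return ""
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "s":
            out.append("<%s>" % child.get("n"))
        elif child.tag == "b":
            out.append(" ")
        out.append(child.tail or "")
    return "".join(out)


def expand(entry, pardefs):
    """Return the set of (surface form, lexical form) pairs of one <e>."""
    results = {("", "")}
    for part in entry:
        if part.tag == "i":
            text = side_text(part)
            step = {(text, text)}
        elif part.tag == "p":
            step = {(side_text(part.find("l")), side_text(part.find("r")))}
        elif part.tag == "par":
            # Paradigms may refer to other paradigms, hence the recursion.
            step = set()
            for par_entry in pardefs.get(part.get("n"), []):
                step |= expand(par_entry, pardefs)
        else:
            continue  # anything else is outside the scope of this sketch
        results = {(s + s2, l + l2) for s, l in results for s2, l2 in step}
    return results


def index_pairs(dix_source):
    """Map each (surface, lexical) pair to the entries that generate it."""
    root = ET.fromstring(dix_source)
    pardefs = {p.get("n"): p.findall("e") for p in root.iter("pardef")}
    index = defaultdict(list)
    for section in root.iter("section"):
        for entry in section.findall("e"):
            pars = ",".join(p.get("n") for p in entry.findall("par"))
            label = "%s [%s]" % (entry.get("lm"), pars)
            for pair in expand(entry, pardefs):
                index[pair].append(label)
    return index
```

Calling `index_pairs(SAMPLE_DIX)` yields a mapping in which any pair listed under more than one label is a candidate for closer inspection. A real tool would probably also want to report line numbers, which `xml.etree.ElementTree` does not expose by default.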
Examples
Redundant entries: There may be two (or more) entries in a monolingual dictionary which generate the same lexical forms. More commonly, one entry generates a subset of the lexical forms generated by another entry. For instance, a few weeks ago we found the entries:
<e lm="soleado"><i>solead</i><par n="absolut/o__adj"/></e>
<e lm="soleado" a="prompsit"><i>solead</i><par n="abstract/o__adj"/></e>
The first one generates all the forms of the adjective for masculine/feminine and singular/plural. The second one generates the same forms and, in addition, the superlative forms. The two entries are clearly redundant, so it would be useful to detect this phenomenon and decide which entry to keep.
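The subset check described here could be sketched as follows, assuming each entry has already been expanded into its set of (surface form, lexical form) pairs. The function name `find_redundant`, the labels, and the forms in `SAMPLE` are illustrative only; in particular, the tag sequences are a guess and not the actual output of the two paradigms.

```python
from itertools import combinations

# Illustrative expansions only; the real paradigms generate more forms,
# and the tag order shown here is an assumption.
BASIC = {
    ("soleado", "soleado<adj><m><sg>"),
    ("soleada", "soleado<adj><f><sg>"),
}
SAMPLE = {
    "soleado [absolut/o__adj]": BASIC,
    "soleado [abstract/o__adj]": BASIC
    | {("soleadísimo", "soleado<adj><sup><m><sg>")},
}


def find_redundant(expansions):
    """Compare every pair of entries and report duplicates and subsets.

    `expansions` maps an entry label to its set of (surface, lexical) pairs.
    Returns (smaller, larger, kind) tuples, kind being "duplicate" or "subset".
    """
    findings = []
    for (a, forms_a), (b, forms_b) in combinations(expansions.items(), 2):
        if not forms_a or not forms_b:
            continue  # an empty expansion is trivially a subset; skip it
        if forms_a == forms_b:
            findings.append((a, b, "duplicate"))
        elif forms_a < forms_b:  # proper subset
            findings.append((a, b, "subset"))
        elif forms_b < forms_a:
            findings.append((b, a, "subset"))
    return findings
```

Comparing all pairs of entries is quadratic in the size of the dictionary, so a practical version would likely group entries by lemma first and only compare within each group.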
Repeated tags
A user attempting to write …<s n="n"/><s n="nt"/>… might miss the last "t" and end up with …<s n="n"/><s n="n"/>…. Repeated tags should be fairly easy to detect.
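A check for immediately repeated tags might look like the sketch below. It assumes the dictionary is available as an XML string and treats two adjacent `<s>` elements with the same `n` attribute and no text between them as a repetition; the function name `repeated_tags` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: the first entry repeats the "n" tag, the second is fine.
SAMPLE_DIX = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>casa</l><r>casa<s n="n"/><s n="n"/></r></p></e>
    <e><p><l>gato</l><r>gato<s n="n"/><s n="m"/></r></p></e>
  </section>
</dictionary>
"""


def repeated_tags(dix_source):
    """Return (parent tag, tag name) for each pair of adjacent identical <s>."""
    root = ET.fromstring(dix_source)
    hits = []
    for parent in root.iter():
        children = list(parent)
        # Walk over neighbouring children of the same parent element.
        for first, second in zip(children, children[1:]):
            if (first.tag == "s" and second.tag == "s"
                    and first.get("n") == second.get("n")
                    and not (first.tail or "").strip()):
                hits.append((parent.tag, first.get("n")))
    return hits
```

This only catches exact adjacent repetitions. Whether a tag that recurs later in the same sequence should also be flagged is a design decision left open here.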
- Are there other easily detectable mistaggings? (We could assume some Apertium-wide standards, such as never having gender tags right after the lemma, but that might quickly become very language-pair-specific.)