Ideas for Google Summer of Code/Easy dictionary maintenance
The task is to build an application that reads the open-class (noun, adjective, verb) single-word part of a dictionary, which is amenable to simple, database-like treatment, and sets aside the remaining (hard-to-treat) part. The application lets the user easily add words, together with their inflection paradigms, through a friendly user interface. It then combines the extended single-word data with the remaining data, without loss of formatting information (e.g. XML comments), into Apertium monolingual and bilingual dictionaries ready to be compiled.
Ideas and code from Apertium-dixtools could be useful.
The interface for adding new words could be a web application.
Dixtools
apertium-dixtools is a Java-based package which provides an easy-to-use API to manipulate dictionaries; it includes several tasks callable from the command line, which allow you to sort dictionaries, reformat, merge, remove duplicate entries and paradigms, import from and export to a limited number of formats, etc.
As dixtools already has code to handle all of the common and uncommon cases in Apertium's dictionaries, it would make an ideal base for a dictionary maintenance tool.
Code challenges
We have a number of ideas for short challenges to create useful tasks for dixtools, which will allow us to assess the ability of the applicant better, while also being relevant to the larger task:
- Stem checker for monodix entries
- One of the most common errors in creating monolingual dictionary entries is not removing the suffix from the stem. This task is to go through each entry in each section, find entries whose paradigm name contains a "/" (such as "bab/y__n"), and check that the <i> element does not contain the same text as the lm attribute:
<e lm="baby"><i>baby</i><par n="bab/y__n"/></e>
- should be:
<e lm="baby"><i>bab</i><par n="bab/y__n"/></e>
- A warning, including line number, is all that is expected. For bonus points, correct the entry by removing the suffix (this should be an option to the tool).
src/dictools/DicFix.java
can be used as an example for this task; we only expect simple entries to be considered (e.children.size()==2, being 'i' and 'par')
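The stem check described above can be sketched independently of dixtools. The following is a minimal illustration using only the JDK's SAX parser, not the dixtools API. The class name StemChecker and its methods are hypothetical. It also assumes that the suffix is the text between "/" and "__" in the paradigm name, which matches the "bab/y__n" example but is not guaranteed for every paradigm. Only simple entries (exactly one 'i' followed by one 'par') are considered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone checker; a real dixtools task would use its own entry classes.
public class StemChecker {

    /** Returns one warning per simple entry whose <i> text equals its lm attribute. */
    public static List<String> check(String dixXml) throws Exception {
        List<String> warnings = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;
            private String lm;           // lm attribute of the current <e>
            private String par;          // n attribute of the <par> child
            private StringBuilder stem;  // text content of the <i> child
            private boolean inEntry, inI, simple;
            private int children;        // elements seen inside the current <e>
            private int line;            // line on which the <e> start tag ends

            @Override public void setDocumentLocator(Locator l) { locator = l; }

            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                if (qName.equals("e")) {
                    inEntry = true; simple = true; children = 0;
                    lm = atts.getValue("lm"); par = null; stem = null;
                    line = locator.getLineNumber();
                    return;
                }
                if (!inEntry) return;
                if (inI) { simple = false; return; }  // markup nested inside <i>
                children++;
                if (qName.equals("i") && children == 1) {
                    inI = true; stem = new StringBuilder();
                } else if (qName.equals("par") && children == 2) {
                    par = atts.getValue("n");
                } else {
                    simple = false;                    // anything else: not a simple entry
                }
            }

            @Override public void characters(char[] ch, int start, int length) {
                if (inI) stem.append(ch, start, length);
            }

            @Override public void endElement(String uri, String local, String qName) {
                if (qName.equals("i")) { inI = false; return; }
                if (!qName.equals("e")) return;
                inEntry = false;
                if (simple && children == 2 && lm != null && par != null
                        && par.contains("/") && stem.toString().equals(lm)) {
                    String fix = suggestStem(lm, par);
                    warnings.add("line " + line + ": <i>" + stem + "</i> repeats lm=\""
                            + lm + "\" but paradigm \"" + par + "\" adds a suffix"
                            + (fix == null ? "" : "; suggested stem \"" + fix + "\""));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(dixXml)), handler);
        return warnings;
    }

    /** Assumption: the suffix is the part of the paradigm name between "/" and "__". */
    public static String suggestStem(String lm, String par) {
        int slash = par.indexOf('/');
        int end = par.indexOf("__", slash);
        String suffix = par.substring(slash + 1, end < 0 ? par.length() : end);
        return lm.endsWith(suffix) ? lm.substring(0, lm.length() - suffix.length()) : null;
    }
}
```

Calling StemChecker.check on the text of a monodix returns the list of warnings. The suggested stem could back the optional correction mode, though rewriting the file without losing formatting is a separate problem that this sketch does not attempt.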