Ideas for Google Summer of Code/Easy dictionary maintenance

From Apertium

This involves building an application that parses the open-class (noun, adjective, verb) single-word part of the dictionaries, which is amenable to simple, database-like treatment, and sets aside the remaining (hard-to-treat) part. The application lets the user easily add words (together with their inflection paradigms) through a friendly user interface, and then combines the extended single-word data with the remaining data, without loss of formatting information (e.g. XML comments), into Apertium monolingual and bilingual dictionaries ready to be compiled.

Ideas and code from Apertium-dixtools could be useful.

The interface for adding new words could be a web application. It would also be interesting to do this with MediaWiki somehow: for example, by setting up a MediaWiki installation where "paradigms" are templates, "categories" are sections and "articles/pages" are words. In MediaWiki, templates can be applied recursively, as can paradigms. Both import and export would be needed.

Dixtools

apertium-dixtools is a Java-based package which provides an easy-to-use API for manipulating dictionaries. It includes several tasks callable from the command line, which allow you to sort, reformat and merge dictionaries, remove duplicate entries and paradigms, import and export from a limited number of formats, etc.

As dixtools already has code to handle all of the common and uncommon cases in Apertium's dictionaries, it would make an ideal base for a dictionary maintenance tool.

Code challenges

We have a number of ideas for short challenges to create useful tasks for dixtools, which will allow us to assess the ability of the applicant better, while also being relevant to the larger task:

Stem checker for monodix entries
One of the most common errors in creating monolingual dictionary entries is not removing the suffix. This task is to go through each entry in each section, find those whose paradigm name contains a "/" (such as "bab/y__n"), and check that the <i> element does not contain the same text as the lm attribute:
  <e lm="baby"><i>baby</i><par n="bab/y__n"/></e>
should be:
  <e lm="baby"><i>bab</i><par n="bab/y__n"/></e>
A warning, including line number, is all that is expected. For bonus points, correct the entry by removing the suffix (this should be an option to the tool).
src/dictools/DicFix.java can be used as an example for this task; we only expect simple entries to be considered (e.children.size()==2, being 'i' and 'par')
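
The check described above can be illustrated with a small standalone sketch. It is written in Python for brevity and does not use the dixtools API. It assumes each simple entry sits on a single line, and that the suffix to strip is the text between "/" and "__" in the paradigm name. The names check_stems and paradigm_suffix are hypothetical.

```python
import re

# Matches only simple one-line entries: <e lm="..."><i>...</i><par n="..."/></e>
SIMPLE_ENTRY = re.compile(
    r'<e\s+lm="(?P<lm>[^"]*)"\s*>\s*<i>(?P<stem>[^<]*)</i>\s*'
    r'<par\s+n="(?P<par>[^"]*)"\s*/>\s*</e>'
)


def paradigm_suffix(par_name):
    """Return the text between "/" and "__" in a paradigm name, or None.

    Assumption: names follow the "bab/y__n" pattern shown in the text.
    """
    if "/" not in par_name:
        return None
    after_slash = par_name.split("/", 1)[1]
    return after_slash.split("__", 1)[0]


def check_stems(text, fix=False):
    """Scan dictionary text and report entries whose <i> equals lm.

    Returns (warnings, new_text). Each warning is a tuple of
    (line number, lemma, paradigm name). With fix=True the suffix is
    removed from the <i> element when the stem really ends with it.
    """
    warnings = []
    out_lines = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = SIMPLE_ENTRY.search(line)
        if m:
            suffix = paradigm_suffix(m.group("par"))
            if suffix and m.group("stem") == m.group("lm"):
                warnings.append((lineno, m.group("lm"), m.group("par")))
                if fix and m.group("stem").endswith(suffix):
                    new_stem = m.group("stem")[: -len(suffix)]
                    line = (line[: m.start("stem")] + new_stem
                            + line[m.end("stem"):])
        out_lines.append(line)
    return warnings, "\n".join(out_lines)
```

Calling check_stems(text) only collects warnings, while check_stems(text, fix=True) also returns the corrected text, which mirrors the idea of making the correction an option. A real dixtools task would work on the parsed entries instead of raw lines, so that multi-line entries and comments are handled properly.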