Dictionary maintenance

From Apertium
Revision as of 18:08, 8 June 2007 by Youssefsan (talk | contribs) (→‎Suggestions: #Splitting the data of monodix in several files : paradigms and lemmas or lemmas according to categories (verbs, nouns, adjectives, etc).)
Jump to navigation Jump to search

Including parts of dictionaries

The problem is that some parts of dictionaries that are standard between dictionaries in a pair are not kept in one file, but several (for example symbol definitions).


  • Use XInclude + xmllint to preprocess two xml files into a .dix file, then validate and compile the .dix file (cy-en, en-af use this)

Different registers/varieties/standards

In some pairs, e.g. Catalan, Portuguese, there is support for generating a particular standard of a language (e.g. Brazilian Portuguese, Valencian). The way this is done may need to be looked at.


Main article: Metadix


Keeping monodix updated

Problem managing conflict edits for monodix

There are more and more pairs, for examples 7 pairs with English and the monodix en is copied in every pair.

What happens if developer A and developer B edit both the en monodix, say in en-fr and en-es for example? Answer: another developer has to look on both version, look the diff and try to merge. Most of the time developer A tells developer to wait a few minutes or hours, then he commits and tells to developer B that he may now copy his version and starts working.

That is time consuming. For the near future it is manageable, since there are now only a half dozen developers that regulary go on irc to solve these issues. But in the long-term, it would become harder and harder. Imagine if there are 20 or 50 pairs with English. Imagine that all developers do not want to wait. There would be different monodix.


  • Language specific sections of monodix files.

Ideas to solve

  • Language specific parts could be split out into separate files, and then XIncluded, such as currently happens in several pairs (e.g. cy-en, en-af) with the symbol definitions <sdefs>.


  1. Table of contents for paradigms (It would be nice to have a kind of table of content of paradigms generated by script with a list of all paradigms in a monodix. For example "wo/nen, k/omen, etc". Words would be put in several categories : nouns, adjectives, verbs, etc.)
  2. Interface to add new words (It would be nice to have an inteface to add new words. That would attract non-geeks)
  3. Re-ordering of items in the dictionary (pardefs -> alphabetical order, sections -> alphabetical order, POS order etc.)
  4. Splitting the data of monodix in several files : paradigms and lemmas or lemmas according to categories (verbs, nouns, adjectives, etc).