Contributing to an existing pair
How to add linguistic data to an existing language pair in Apertium.
Apertium has data for many language pairs. These linguistic data mainly comprise dictionaries (monolingual and bilingual), structural transfer rules that perform grammatical and other transformations between the two languages involved, and lexical data for the part-of-speech tagger, which disambiguates the source-language text.
All the linguistic data for a pair are kept in a single directory, for example apertium-es-ca for the Spanish-Catalan pair. The files found in these directories are described next.
Example file layout
For the Spanish–Catalan pair (apertium-es-ca):
- apertium-es-ca.es.dix : Spanish monolingual dictionary, containing 11,800 entries (as of 17 November 2005).
- apertium-es-ca.ca.dix : Catalan monolingual dictionary, containing 11,800 entries.
- apertium-es-ca.es-ca.dix : Spanish-Catalan bilingual dictionary, containing 12,800 entries (correspondences Spanish-Catalan).
- apertium-es-ca.trules-es-ca.xml : Structural transfer rules for the translation from Spanish to Catalan.
- apertium-es-ca.trules-ca-es.xml : Structural transfer rules for the translation from Catalan to Spanish.
- apertium-es-ca.es.tsx : Tagger definition file for Spanish
- apertium-es-ca.ca.tsx : Tagger definition file for Catalan
- apertium-es-ca.post-es.dix : Post-generation dictionary for Spanish, with 25 entries and 5 paradigms (applies when translating from Catalan to Spanish)
- apertium-es-ca.post-ca.dix : Post-generation dictionary for Catalan, with 16 entries and 57 paradigms (applies when translating from Spanish to Catalan)
- directory es-tagger-data : Contains data needed for the Spanish tagger (corpora, etc.)
- directory ca-tagger-data : Contains data needed for the Catalan tagger (corpora, etc.)
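The naming pattern in the listing above can be sketched as a small helper. This is only an illustration: the function names pair_files and missing_files are hypothetical, and the pattern is inferred from the apertium-es-ca listing, so other pairs may not follow it exactly.

```python
from pathlib import Path


def pair_files(src, tgt):
    """Return the data file names listed above, for the pair src-tgt."""
    base = f"apertium-{src}-{tgt}"
    return [
        f"{base}.{src}.dix",               # source monolingual dictionary
        f"{base}.{tgt}.dix",               # target monolingual dictionary
        f"{base}.{src}-{tgt}.dix",         # bilingual dictionary
        f"{base}.trules-{src}-{tgt}.xml",  # transfer rules, src to tgt
        f"{base}.trules-{tgt}-{src}.xml",  # transfer rules, tgt to src
        f"{base}.{src}.tsx",               # tagger definition, src
        f"{base}.{tgt}.tsx",               # tagger definition, tgt
        f"{base}.post-{src}.dix",          # post-generation dictionary, src
        f"{base}.post-{tgt}.dix",          # post-generation dictionary, tgt
    ]


def missing_files(pair_dir, src, tgt):
    """List the expected files that are not present in pair_dir."""
    root = Path(pair_dir)
    return [name for name in pair_files(src, tgt) if not (root / name).is_file()]
```

For example, missing_files("apertium-es-ca", "es", "ca") returns an empty list when all nine files are in place. The two tagger data directories (es-tagger-data and ca-tagger-data) are not checked here.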
Adding words to the dictionaries
When extending or adapting Apertium, the most common task is extending its dictionaries; it is far more common than adding transfer or post-generation rules.
IMPORTANT: Every time you modify any of the dictionaries, the modules have to be recompiled. Type make in the directory where the linguistic data are saved (apertium-es-ca, apertium-es-gl or whichever applies) so that the system generates the new binary files.
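If the recompilation step is driven from a script rather than typed by hand, it might look like the sketch below. The helper name recompile and its command parameter are hypothetical; the only thing taken from the text above is that make is run inside the pair directory.

```python
import subprocess
from pathlib import Path


def recompile(pair_dir, command=("make",)):
    """Run the build command inside pair_dir; return True if it succeeded."""
    pair_dir = Path(pair_dir)
    if not pair_dir.is_dir():
        raise FileNotFoundError(f"no such directory: {pair_dir}")
    # Equivalent to typing `make` after changing into the pair directory.
    result = subprocess.run(list(command), cwd=pair_dir)
    return result.returncode == 0
```

Calling recompile("apertium-es-ca") after editing a dictionary runs make in that directory. The command parameter exists only so that the build command can be swapped, for instance when testing.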
If you want to add a new word to Apertium, you need to add three entries in the dictionaries. Suppose you are working with the Spanish-Catalan pair. In this case, you have to add:
- an entry in the Spanish monolingual dictionary: so that the translator can analyze ("understand") the word when it finds it in a text, and generate it when translating this word into Spanish.
- an entry in the bilingual dictionary: so that you can tell Apertium how to translate this word from one language to the other.
- an entry in the Catalan monolingual dictionary: so that the translator can analyze ("understand") the word when it finds it in a text, and generate it when translating this word into Catalan.
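As a rough checklist for the three entries above, the following sketch reports which of the three dictionaries do not yet mention a word. It is a crude plain-text search, not a parser of the dictionary format. The function name files_missing_word is hypothetical, and the code assumes that the files are UTF-8 encoded and that the word appears literally in each file.

```python
from pathlib import Path


def files_missing_word(pair_dir, src, tgt, src_word, tgt_word):
    """Return the dictionaries that do not yet mention the given words."""
    base = f"apertium-{src}-{tgt}"
    # Each dictionary, paired with the word forms it should mention.
    wanted = {
        f"{base}.{src}.dix": [src_word],                  # source monolingual
        f"{base}.{src}-{tgt}.dix": [src_word, tgt_word],  # bilingual
        f"{base}.{tgt}.dix": [tgt_word],                  # target monolingual
    }
    missing = []
    for name, words in wanted.items():
        path = Path(pair_dir) / name
        # Assumption: UTF-8 files; a missing file counts as lacking the word.
        text = path.read_text(encoding="utf-8") if path.is_file() else ""
        if not all(word in text for word in words):
            missing.append(name)
    return missing
```

For instance, files_missing_word("apertium-es-ca", "es", "ca", "cósmico", "còsmic") lists the files that still need an entry for this adjective. An empty result only means the strings occur somewhere in each file, not that the entries are correct.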
Go to the directory containing the XML dictionaries (for the Spanish-Catalan pair, this is apertium-es-ca) and open the three dictionary files mentioned above in a text editor or a specialized XML editor: apertium-es-ca.es.dix, apertium-es-ca.es-ca.dix and apertium-es-ca.ca.dix. The entries you need to create in these three dictionaries share a common structure.
Monolingual dictionary (Spanish)
You may want, for example, to add the Spanish adjective "cósmico", whose equivalent in Catalan is "còsmic". The first step is to add this word to the Spanish monolingual dictionary. You will see that a monolingual dictionary has basically two types of data: paradigms (in the "<pardefs>" section of the dictionary, each paradigm inside a <pardef> element) and word entries (in the main