Prefixes and infixes

From Apertium
Revision as of 07:50, 28 May 2007 by (talk) (a bit on normalization and de-normalization)

Apertium was initially designed for languages in which word inflection manifests itself as changes in the suffix of words. For instance, in Spanish, cantar (to sing), cantarías (you would sing), cantábamos (we sang), etc., all share the initial stem cant-. Therefore, both Apertium's tagger and structural transfer assume that the lexical forms corresponding to these surface forms consist of a lemma (cantar) followed by a series of morphological symbols. For instance, the lexical form of cantábamos would be (cantar, lexical verb, imperfect indicative, 1st person, plural).

In other languages, however, inflection is realized through prefixes or infixes. For instance, in Swahili kitabu means book and vitabu means books, so a natural way to represent their lexical forms would be sg.kitabu.n and pl.kitabu.n, or perhaps sg.n.kitabu and pl.n.kitabu. "Natural" here means that the morphemes in the lexical forms appear in the same order as in the surface forms, so one could use this ordering to build paradigms: the same singular/plural alternation is found in many other Swahili nouns, such as kisu/visu (knife) and kijiko/vijiko (spoon).
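
As an illustration of how such a prefix paradigm could be shared across nouns, here is a minimal Python sketch. It is not Apertium code: the names KI_VI_PARADIGM, STEMS and analyse are hypothetical, the stem list contains only the three nouns mentioned above, and the singular form is assumed to serve as the lemma.

```python
# Hypothetical prefix paradigm for the Swahili ki-/vi- alternation
# (illustration only, not an Apertium data format).
KI_VI_PARADIGM = {"ki": "sg", "vi": "pl"}

# Assumed stem list: only the three nouns mentioned in the text.
STEMS = {"tabu", "su", "jiko"}


def analyse(surface, stems=STEMS):
    """Return dotted lexical forms (number.lemma.n) for a surface form."""
    analyses = []
    for prefix, number in KI_VI_PARADIGM.items():
        stem = surface[len(prefix):]
        if surface.startswith(prefix) and stem in stems:
            lemma = "ki" + stem  # assumption: the singular form is the lemma
            analyses.append(f"{number}.{lemma}.n")
    return analyses


for word in ("kitabu", "vitabu", "visu", "kijiko"):
    print(word, analyse(word))
```

The single paradigm covers all three nouns, and the number morpheme comes first in the lexical form, mirroring its position in the surface form.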

Such forms are difficult to handle in Apertium as it is now, so if we want Apertium to be used for more languages, we need to modify the part-of-speech tagger and the structural transfer module.

  1. One possible solution would be to treat lexical forms as sets rather than sequences, so that, for example, pl.n.kitabu and pl.kitabu.n would be equivalent (Swahili). A normalization would have to take place somewhere, and the structural transfer module would then have to be able to reorder (de-normalize) these tags into the order expected by the morphological generator. A suitable way of normalizing and de-normalizing would be to have a (source-language dependent) file which specifies a 'canonical order' used by the tagger and the transfer, and another file which specifies the 'standard order' of morphemes in the target language. The bilingual dictionary would be in 'normalized form'. Something similar is already performed by the pretransfer module, which normalizes split lemmas such as take.vblex.sep.past_off to take_off.vblex.sep.past.
  2. Another possibility is to generalize the part-of-speech tagger and the transfer so that they can detect and deal with lexical forms in which the lemma is split or appears in any position whatsoever. As before, the person writing the tagger definition or the structural transfer rules would be responsible for managing these correctly.
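
The normalization and de-normalization described in the first option can be sketched in Python as follows. This is only an illustration under stated assumptions, not an existing Apertium module: the two order "files" are reduced to in-memory lists, the placeholder LEMMA, the names CANONICAL_ORDER, TARGET_ORDER and reorder, and the tiny tag inventory are all hypothetical, and any element that is not a known tag is assumed to be the lemma.

```python
# Hypothetical stand-ins for the two order files; "LEMMA" marks the lemma slot.
CANONICAL_ORDER = ["LEMMA", "n", "sg", "pl"]  # used by tagger, transfer, bilingual dictionary
TARGET_ORDER = ["sg", "pl", "LEMMA", "n"]     # assumed order expected by the generator


def reorder(lexical_form, order):
    """Treat a dotted lexical form as a set and emit it in the given order."""
    tags = set(order) - {"LEMMA"}
    parts = lexical_form.split(".")
    # Assumption: the single element that is not a known tag is the lemma.
    lemmas = [p for p in parts if p not in tags]
    if len(lemmas) != 1:
        raise ValueError(f"cannot identify a unique lemma in {lexical_form!r}")
    present = set(parts)
    out = []
    for slot in order:
        if slot == "LEMMA":
            out.append(lemmas[0])
        elif slot in present:
            out.append(slot)
    return ".".join(out)


# Both surface-like orders normalize to the same canonical form.
print(reorder("pl.n.kitabu", CANONICAL_ORDER))
print(reorder("pl.kitabu.n", CANONICAL_ORDER))
# De-normalization restores the order assumed for the generator.
print(reorder("kitabu.n.pl", TARGET_ORDER))
```

Because the input is handled as a set, the same function serves for both directions; only the order list changes. A real implementation would also need a way to tell lemmas apart from tags that does not depend on a closed tag list.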