
Prefixes and infixes

Revision as of 08:50, 28 May 2007

Apertium was initially designed for languages in which word inflection manifests itself as changes in the suffix of words. For instance, in Spanish, cantar (to sing), cantarías (you would sing), cantábamos (we used to sing), etc., all share the initial stem cant-. Therefore, both Apertium's tagger and structural transfer assume that the lexical forms corresponding to these surface forms consist of a lemma (cantar) followed by a series of morphological symbols. For instance, cantábamos would be cantar.vblex.pii.p1.pl (cantar, lexical verb, imperfect indicative, 1st person, plural).
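
The "lemma first, then symbols" assumption can be illustrated with a minimal Python sketch. It uses the dot-separated notation of this page (not Apertium's actual stream format), and the function name split_lexical_form is hypothetical, introduced only for illustration.

```python
def split_lexical_form(lexical_form):
    """Split a dot-separated lexical form, assuming the lemma comes first.

    This mirrors the suffix-oriented assumption described above: the first
    field is taken to be the lemma, everything after it a morphological symbol.
    """
    lemma, *tags = lexical_form.split(".")
    return lemma, tags


lemma, tags = split_lexical_form("cantar.vblex.pii.p1.pl")
print(lemma)  # cantar
print(tags)   # ['vblex', 'pii', 'p1', 'pl']
```

Under this assumption, any form whose first field is not the lemma would be misread, since whatever comes first is treated as the lemma.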

But in other languages inflection is expressed through prefixes or infixes. For instance, in Swahili kitabu means book and vitabu means books, so a natural way to represent their lexical forms would be sg.kitabu.n and pl.kitabu.n, or perhaps sg.n.kitabu and pl.n.kitabu. "Natural" here means that the morphemes in lexical forms would appear in the same order as in surface forms, and one could use this to form paradigms: for instance, the same singular/plural forms are found in many other Swahili nouns, such as kisu/visu (knife), kijiko/vijiko (spoon), etc.
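
As a rough sketch of how such a prefix paradigm could work, the following Python snippet analyses the ki-/vi- nouns mentioned above into prefix-ordered lexical forms. The table KI_VI_PREFIXES, the stem set STEMS and the function analyse are all hypothetical names made up for this example; they are not part of Apertium.

```python
# Hypothetical paradigm: number symbol -> surface prefix for ki-/vi- nouns.
KI_VI_PREFIXES = {"sg": "ki", "pl": "vi"}

# Hypothetical stem list covering only the examples given in the text.
STEMS = {"tabu", "su", "jiko"}


def analyse(surface, stems=STEMS):
    """Return a lexical form such as 'pl.kitabu.n', or None if unknown.

    The number symbol is placed first, in the same position as the prefix
    that expresses it in the surface form; the lemma is the singular form.
    """
    for number, prefix in KI_VI_PREFIXES.items():
        if surface.startswith(prefix):
            stem = surface[len(prefix):]
            if stem in stems:
                lemma = KI_VI_PREFIXES["sg"] + stem
                return f"{number}.{lemma}.n"
    return None


for word in ["kitabu", "vitabu", "visu", "vijiko"]:
    print(word, "->", analyse(word))
```

A single paradigm entry thus covers every noun in the class, which is the benefit the paragraph above attributes to keeping morphemes in surface order.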

Such prefixed or infixed forms are difficult to handle in Apertium as it is now, so if we want Apertium to be used for more languages, we need to modify the part-of-speech tagger and the structural transfer.

  1. One possible solution would be to treat lexical forms as sets rather than sequences, so that, for example, pl.n.kitabu and pl.kitabu.n would be equivalent (Swahili). A normalization step would have to take place somewhere (for instance, to kitabu.n.pl), and the structural transfer module would then have to be able to reorder (de-normalize) these tags into the order expected by the morphological generator. A suitable way of normalizing and de-normalizing would be to have a (source-language-dependent) file specifying the 'canonical order' used by the tagger and the transfer, and another file specifying the 'standard order' of morphemes in the target language. The bilingual dictionary would be in 'normalized form'. Something similar is already performed by the pretransfer module, which normalizes split lemmas such as take.vblex.sep.past_off to take_off.vblex.sep.past.
  2. Another possibility is to generalize the part-of-speech tagger and the transfer so that they can detect and deal with lexical forms in which the lemma is split or appears in any position whatsoever. As before, the person writing the tagger definition or the structural transfer rules would be responsible for managing these correctly.
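
The first option above (normalization to a canonical order, followed by de-normalization to the target order) might be sketched as follows. This is only an illustration under stated assumptions: the two order lists stand in for the two files proposed above, and the names CANONICAL_ORDER, TARGET_ORDER, TAG_ROLES, normalize and denormalize are hypothetical, not existing Apertium components.

```python
# Assumed 'canonical order' used by tagger and transfer (stand-in for a file).
CANONICAL_ORDER = ["lemma", "pos", "number"]

# Assumed 'standard order' expected by the target-language generator.
TARGET_ORDER = ["number", "lemma", "pos"]

# Hypothetical inventory of symbols and the slot each one fills.
TAG_ROLES = {"n": "pos", "sg": "number", "pl": "number"}


def to_roles(lexical_form):
    """Treat a lexical form as a set: map each field to its slot.

    Any field that is not a known symbol is assumed to be the lemma, so a
    lemma spelled like a symbol (e.g. 'n') would be misclassified.
    """
    roles = {}
    for field in lexical_form.split("."):
        role = TAG_ROLES.get(field, "lemma")
        if role in roles:
            raise ValueError(f"two fields compete for slot {role!r}")
        roles[role] = field
    return roles


def reorder(lexical_form, order):
    """Emit the fields of a lexical form in the given slot order."""
    roles = to_roles(lexical_form)
    return ".".join(roles[slot] for slot in order if slot in roles)


def normalize(lexical_form):
    return reorder(lexical_form, CANONICAL_ORDER)


def denormalize(lexical_form):
    return reorder(lexical_form, TARGET_ORDER)


print(normalize("pl.n.kitabu"))     # kitabu.n.pl
print(normalize("pl.kitabu.n"))     # kitabu.n.pl
print(denormalize("kitabu.n.pl"))   # pl.kitabu.n
```

Both input orderings normalize to the same form, which is what treating lexical forms as sets amounts to in practice; the bilingual dictionary would then only need entries in the canonical order.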