Prefixes and infixes

From Apertium

Revision as of 12:31, 16 August 2007

Apertium was initially designed for languages in which word inflection manifests itself as changes in the suffix of words. For instance, in Spanish, cantar (to sing), cantarías (you would sing), cantábamos (we sang), etc., all share the initial stem cant-. Therefore, both Apertium's tagger and structural transfer assume that the lexical forms corresponding to these surface forms consist of a lemma (cantar) followed by a series of morphological symbols. For instance, cantábamos would be cantar.vblex.pii.p1.pl (cantar, lexical verb, imperfect indicative, 1st person, plural).

But in other languages inflection occurs as prefixes or infixes. For instance, in Swahili kitabu means book and vitabu means books, so a natural way to represent their lexical forms would be sg.kitabu.n and pl.kitabu.n, or perhaps sg.n.kitabu and pl.n.kitabu, where "natural" means that the morphemes in the lexical forms would appear in the same order as in the surface forms, and one could use this ordering to form paradigms (for instance, the same singular/plural alternation is found in many other Swahili nouns: kisu/visu (knife), kijiko/vijiko (spoon), etc.).

Such forms are difficult to treat in Apertium as it stands, so if we want Apertium to be usable for more languages, we need to modify the part-of-speech tagger and the structural transfer.

  1. One possible solution would be to treat lexical forms as sets rather than sequences, e.g. pl.n.kitabu and pl.kitabu.n would be the same (Swahili). A normalization would have to take place somewhere (for instance, to kitabu.n.pl), but then the structural transfer module would have to be able to reorder (de-normalize) these tags into the order expected by the morphological generator.
    A suitable way of normalizing and denormalizing would be to have a (source-language-dependent) file which specifies a 'canonical order' used by the tagger and transfer, and another which specifies the 'standard order' of morphemes in the target language. The bilingual dictionary would be in 'normalized form'. Something similar is actually performed by the pretransfer module, which normalizes split lemmas such as take.vblex.sep.past_off to take_off.vblex.sep.past.
  2. Another possibility is to generalize the part-of-speech tagger and the transfer to be able to detect and deal with lexical forms in which the lemma can be split or come in any position whatsoever. As before, the person writing the tagger definition or the structural transfer rules would be responsible for handling these correctly.
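The normalize/de-normalize idea in option 1 can be sketched in a few lines. This is a minimal illustration only: the tag inventory (TAG_ORDER) and the slot names used in the surface-order specification are hypothetical, not part of Apertium.

```python
# Hypothetical tag inventory mapping each tag to its canonical slot
# (0 = part of speech, 1 = number).  A real system would read this
# from a language-dependent specification file.
TAG_ORDER = {"n": 0, "vblex": 0, "sg": 1, "pl": 1}

def normalise(lexform):
    """Reorder e.g. 'pl.kitabu.n' into lemma-first canonical form 'kitabu.n.pl'."""
    parts = lexform.split(".")
    lemma = [p for p in parts if p not in TAG_ORDER]          # anything not a known tag
    tags = sorted((p for p in parts if p in TAG_ORDER), key=TAG_ORDER.get)
    return ".".join(lemma + tags)

def denormalise(lexform, surface_order):
    """Reorder a normalised form into the target language's morpheme order.

    surface_order: e.g. ['num', 'lemma', 'pos'] for a prefixing language.
    """
    parts = lexform.split(".")
    by_slot = {("pos" if TAG_ORDER[p] == 0 else "num"): p
               for p in parts if p in TAG_ORDER}
    by_slot["lemma"] = ".".join(p for p in parts if p not in TAG_ORDER)
    return ".".join(by_slot[s] for s in surface_order if s in by_slot)
```

For the Swahili example, normalise("pl.kitabu.n") gives the canonical kitabu.n.pl, and denormalise("kitabu.n.pl", ["num", "lemma", "pos"]) restores the prefix-first order pl.kitabu.n expected by a generator for such a language.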

Miscellaneous examples


  • to ask = kotúna (ko-tún-a) — infger.tún.vblex.infger
  • to wonder = komítúna (ko-mí-tún-a) — infger.ref.tún.vblex.infger
  • I asked = natúní (na-tún-í) — p1sg.tún.vblex.pastpres
  • You asked = otúní (o-tún-í) — p2sg.tún.vblex.pastpres

(pastpres is a recent past used as a present tense with some common verbs)

Bantu languages

The difficulties of applying the current Apertium model to tone languages have been raised in other exchanges. An important group of tone languages is the Bantu languages of sub-Saharan Africa, with which I am fairly familiar (though a bit rusty!). The Bantu nominal classes (e.g. Swahili mnazi - coconut palm, minazi - coconut palms) have also been mentioned previously. The structures in these languages can still be described as "inflection" (particularly when tone is involved), but with more synthetic structuring than in "classic" inflection languages - there are a lot more of what appear to be "infixes" (although these are not really infixes, but historical prefixes and suffixes).

For instance, in Swahili:

nilivunja kiti
{1sing+past+BREAK+particle} {class7+WOOD}
I broke a chair
I broke it
it is broken
mtu aliyesoma
{class1+BEING} {class1+past+class1-rel+READ+particle}
a person who read
mtu anayesoma
{class1+BEING} {class1+present+class1-rel+READ+particle}
a person who is reading
mtu asomaye
{class1+BEING} {class1+READ+particle+class1-rel}
a person who reads

It may be that Apertium as-is could handle these structures, but it seems to me that there is at least something to be said for the language plugin idea above. That is, create a database of words and relevant grammatical information about them, and produce a GBL of all possible forms in XML format. This is then compiled for Apertium as normal.

There is a risk of generating forms that are unlikely to occur in real life, and of course forms would have to be generated for every possible class marker used as object pronoun (example 3 above) and/or with various other infixes. But this is inherently no more difficult or redundant than trying to develop "paradigms" in the dictionary files themselves, and of course trying to deal with tones will add another layer of complexity (Swahili is almost unique among the Bantu languages in having lost its tones). Abstracting these issues into metadata layers, which is one way forward as I understand it, may lead to the dictionary files becoming concomitantly difficult to maintain.

Agglutinating languages

See also: Agglutination and compounds

However, both the current Apertium system and the suggested plugin system face another set of difficulties with agglutinative languages like Quechua. For instance:

wasi — house
wasikuna — houses
wasita — to the house
wasikunata — to the houses
wasiy — my house
wasiita — to my house
wasiikuna — my houses
wasiikunata — to my houses
wasinchik — our house
wasinchikta — to our house
wasinchikkunata — to our houses

This sort of complexity would actually fit quite well into the current Apertium model, although each paradigm would have a great number of possible members due to the large number of suffixes (and this is complicated by the fact that suffix order is variable). It could also be handled by form generation, again with the drawback that many thousands of possible forms would need to be generated.
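The form-generation approach can be illustrated by expanding a stem through a set of suffix slots. This is a deliberately simplified sketch: the slots below are hypothetical, real Quechua suffix order is partly variable, and alternations such as wasi+y appearing as wasii- before further suffixes are not modelled.

```python
from itertools import product

# Hypothetical, simplified suffix slots for a Quechua-like noun.
# Empty string = slot unfilled.
POSSESSIVE = ["", "nchik"]   # (none), our (inclusive)
PLURAL     = ["", "kuna"]    # (none), plural
CASE       = ["", "ta"]      # (none), 'to/at'

def generate_forms(stem):
    """Generate every stem + possessive + plural + case combination."""
    return {stem + p, for_ := None} if False else \
           {stem + p + n + c for p, n, c in product(POSSESSIVE, PLURAL, CASE)}
```

Applied to wasi, this yields forms such as wasi, wasikunata and wasinchikkunata from the list above; with more slots and suffixes the number of generated forms multiplies quickly, which is the drawback noted above.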

An alternative method might entail slightly adjusting the way the morphological analyser works. In this approach, the binary dictionaries would consist only of stems and affixes, and instead of having the morphological analyser read to the end of the orthographic word, it would read only to the end of possible morphological boundaries within the word. A naïve algorithm (because I am not fully au fait with either the maths or the programming!) for this might be:

  1. Start at the first letter of the word.
  2. Collect all matches in the stem dictionary where that letter is the first letter.
  3. Read the next letter.
  4. Discard all items in the matched set that do not have that letter as second letter.
  5. Repeat 3 and 4 until the shortest stem that is present in the stem dictionary is found.
  6. Put this in a stem array and start using the affix dictionary as well. Set a new morphological boundary after that letter.
  7. For each subsequently-read letter, add matching stems to the stem array (working from the word-beginning), and add matching affixes to a new affix array (working from the previous morphological boundary).
  8. Each time an affix match is found, set a new morphological boundary after that letter, and start a new affix array.
  9. Add matches to the stem and affix arrays as appropriate until the end of the word.
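The steps above can be sketched as follows. This version works recursively rather than letter by letter, but it produces the same set of segmentations; the dictionaries are represented as plain Python sets, standing in for the real binary dictionaries.

```python
def segmentations(word, stems, affixes):
    """All ways to split `word` into one stem followed by zero or more affixes."""
    results = []
    for i in range(1, len(word) + 1):
        if word[:i] in stems:                       # a stem match sets a boundary
            for tail in affix_splits(word[i:], affixes):
                results.append([word[:i]] + tail)
    return results

def affix_splits(rest, affixes):
    """All ways to split `rest` into a sequence of affixes ([] if rest is empty)."""
    if not rest:
        return [[]]
    splits = []
    for j in range(1, len(rest) + 1):
        if rest[:j] in affixes:                     # each affix match sets a new boundary
            for tail in affix_splits(rest[j:], affixes):
                splits.append([rest[:j]] + tail)
    return splits
```

With the imaginary stems and affixes introduced below, segmentations("kutimana", ...) returns the same five segmentations worked through in the tables.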

If we take an imaginary set of stems ku, kuti, kutima, and an imaginary set of affixes -ti, -m, -ana, -ma, -na, possible segmentations for the imaginary kutimana would be:
  • ku+ti+ma+na
  • ku+ti+m+ana
  • kuti+ma+na
  • kuti+m+ana
  • kutima+na


These segmentations could be generated by the process above as shown in Table 1 (where NM = no match, and M = match), with the output in Table 2. Of course, using "tries" or something similar may be a much more efficient way of doing this than the naïve process above.

k        NM   ×          ×        ×        ×   ×
ku       M ↓  ×          ×        ×        ×   ×
kut      NM   t NM       ×        ×        ×   ×
kuti     M    ti M ↓     ×        ×        ×   ×
kutim    NM   tim NM     m M ↓    ×        ×   ×
kutima   M    tima NM    ma M     a NM     ↓ (from ma)
kutiman  NM   timan NM   man NM   an NM    n NM
kutimana NM   timana NM  mana NM  ana M    na M
Example of stem/affix analysis

ku      -ti   -m   -ana  -na
kuti    -ma   ×    ×
kutima  ×     ×    ×     ×
Output from stem/affix analysis

Once a matrix of possible segmented forms has been generated for the word, it would then be necessary to choose which of these is the one intended.

One way of working towards this might be to have a table of possible affix combinations, with a likelihood assigned to each one. Something like the corpus generated by Kevin Scannell's Crubadán (http://borel.slu.edu/crubadan/index.html) might help here - a corpus is being collected for Bolivian Quechua and Ecuadorean Quichua (though not for Peruvian Quechua, which has more speakers).
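Such a table could be consulted like this. The function and the weight table are hypothetical; the likelihoods would in practice be estimated from a corpus such as the Crúbadán collections mentioned above.

```python
def best_segmentation(candidates, combo_weights):
    """Choose the candidate segmentation whose affix combination is most likely.

    candidates:    segmentations as [stem, affix, affix, ...] lists
    combo_weights: hypothetical corpus-derived likelihoods, keyed by the
                   tuple of affixes; unseen combinations score 0.0
    """
    return max(candidates,
               key=lambda seg: combo_weights.get(tuple(seg[1:]), 0.0))
```

For example, given the candidates kuti+ma+na and kutima+na and a table in which the combination (-ma, -na) is more likely than bare (-na), the first candidate is chosen.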

Indeed, another way of approaching the segmentation issue would be to use such a table directly, but working backwards from the end of the orthographical word - this would require the analyser to reverse each word before analysis, and then remove the segment which matched the longest affix sequence in the table.
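A sketch of this end-of-word approach, under the assumption that the table stores each attested affix combination as the concatenated string it forms (e.g. 'kunata' for -kuna + -ta). Matching word endings directly is equivalent to reversing the word and matching reversed affix strings, so the reversal step is left implicit here.

```python
def strip_longest_affix_seq(word, affix_table):
    """Split off the longest affix sequence from `affix_table` found at
    the end of `word`; returns (remainder, affix_sequence)."""
    best = max((seq for seq in affix_table if word.endswith(seq)),
               key=len, default="")
    return word[: len(word) - len(best)], best
```

For instance, with a table containing 'ta', 'kuna' and 'kunata', the word wasikunata is split into wasi plus the sequence kunata, leaving the stem for ordinary dictionary lookup.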

Either of these approaches (intra-word segmentation, affix table) would minimise the number of forms produced either by the current Apertium paradigm model, or by the suggested form generation model. It is likely that these techniques could also be used with other Native American languages.

See also