Prefixes and infixes
Latest revision as of 13:49, 21 August 2018
Apertium was initially designed for languages in which word inflection manifests itself as changes in the suffix of words. For instance, in Spanish, cantar (to sing), cantarías (you would sing), cantábamos (we sang), etc., all share the initial stem cant-. Therefore, both Apertium's tagger and structural transfer assume that the lexical forms corresponding to these surface forms consist of a lemma (cantar) followed by a series of morphological symbols. For instance, cantábamos would be cantar.vblex.pii.p1.pl (cantar, lexical verb, imperfect indicative, 1st person, plural).
But in other languages inflection occurs as prefixes or infixes. For instance, in Swahili kitabu means book and vitabu means books, so a natural way to represent their lexical forms would be sg.kitabu.n and pl.kitabu.n, or perhaps sg.n.kitabu and pl.n.kitabu. Natural here means that morphemes in lexical forms would be in the same order as in surface forms, and one could use this to form paradigms: the same singular/plural forms are found in many other Swahili nouns, such as kisu/visu (knife) and kijiko/vijiko (spoon).
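The ki-/vi- paradigm idea can be sketched in a few lines of Python. This is only an illustration of prefix-first lexical forms such as sg.kitabu.n, not how Apertium dictionaries work; the names KI_VI_PARADIGM, NOUN_STEMS, analyse and generate are hypothetical, and the lemma is assumed to be the singular form.

```python
# Hypothetical ki-/vi- noun paradigm: surface prefix -> number tag.
KI_VI_PARADIGM = {"ki": "sg", "vi": "pl"}

# Stems of the nouns mentioned in the text (kitabu, kisu, kijiko).
NOUN_STEMS = {"tabu", "su", "jiko"}


def analyse(surface):
    """Map a surface form to a prefix-first lexical form, or None if unknown."""
    for prefix, number in KI_VI_PARADIGM.items():
        stem = surface[len(prefix):]
        if surface.startswith(prefix) and stem in NOUN_STEMS:
            lemma = "ki" + stem  # assumption: the singular form is the lemma
            return f"{number}.{lemma}.n"
    return None


def generate(lexical_form):
    """Map a lexical form like 'pl.kitabu.n' back to its surface form."""
    number, lemma, _pos = lexical_form.split(".")
    prefix_for = {tag: prefix for prefix, tag in KI_VI_PARADIGM.items()}
    stem = lemma[len("ki"):]
    return prefix_for[number] + stem


print(analyse("vitabu"))        # pl.kitabu.n
print(generate("pl.kijiko.n"))  # vijiko
```

Because the morphemes of the lexical form appear in the same order as in the surface form, one small prefix table covers every noun in the class.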
Prefix and infix inflection of this kind is difficult to treat in Apertium as it is now, so if we want Apertium to be used for more languages, we need to modify the part-of-speech tagger and the structural transfer. Two approaches are possible:
- One possible solution would be to see lexical forms as sets and not as sequences, e.g. pl.n.kitabu or pl.kitabu.n would be the same (Swahili). A normalization would have to take place somewhere (for instance, to kitabu.n.pl), but then the structural transfer module would have to be able to reorder (de-normalize) these tags into the order expected by the morphological generator.
  - A suitable way of normalizing and denormalizing would be having a (source-language dependent) file which specifies a 'canonical order' used by tagger and transfer and another one which specifies the 'standard order' of morphemes in the target language. The bilingual dictionary would be in 'normalized form'. Something similar to this is actually performed by the pretransfer module, which normalizes split lemmas such as take.vblex.sep.past_off to take_off.vblex.sep.past.
- Another possibility is to generalize the part-of-speech tagger and the transfer to be able to detect and deal with lexical forms in which the lemma can be split or come in any position whatsoever. As before, the person writing the tagger definition or the structural transfer rules would be responsible for managing these correctly.
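The first proposal (treating a lexical form as a set of fields and reordering it on demand) can be sketched as follows. This is a minimal illustration under stated assumptions, not an existing Apertium module: the order tuples stand in for the proposed 'canonical order' and 'standard order' files, the tag inventories are hypothetical, and the dotted notation follows the examples in the text. The join_split_lemma helper only mimics the split-lemma example given for pretransfer.

```python
# Hypothetical stand-ins for the proposed order files.
CANONICAL_ORDER = ("lemma", "pos", "number")      # e.g. kitabu.n.pl
SWAHILI_MORPH_ORDER = ("number", "lemma", "pos")  # e.g. pl.kitabu.n

# Hypothetical tag inventories; any field not listed is taken to be the lemma.
POS_TAGS = {"n", "vblex"}
NUMBER_TAGS = {"sg", "pl"}


def classify(field):
    """Decide which slot a dotted field belongs to."""
    if field in POS_TAGS:
        return "pos"
    if field in NUMBER_TAGS:
        return "number"
    return "lemma"


def reorder(form, order):
    """Read a lexical form as an unordered set of fields and emit it in `order`."""
    slots = {}
    for field in form.split("."):
        kind = classify(field)
        if kind in slots:
            raise ValueError(f"two {kind} fields in {form!r}")
        slots[kind] = field
    return ".".join(slots[kind] for kind in order if kind in slots)


def join_split_lemma(form):
    """Move a trailing '_particle' next to the lemma, as in the take_off example."""
    head, sep, particle = form.rpartition("_")
    if not sep:
        return form
    lemma, dot, tags = head.partition(".")
    return f"{lemma}_{particle}{dot}{tags}"


print(reorder("pl.n.kitabu", CANONICAL_ORDER))      # kitabu.n.pl
print(reorder("kitabu.n.pl", SWAHILI_MORPH_ORDER))  # pl.kitabu.n
print(join_split_lemma("take.vblex.sep.past_off"))  # take_off.vblex.sep.past
```

Normalizing on the way in and denormalizing on the way out keeps the bilingual dictionary in a single order, at the cost of maintaining one order specification per language.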
Miscellaneous examples
Bantu languages
The Bantu nominal classes (e.g. Swahili mnazi, coconut palm, and minazi, coconut palms) have been mentioned above. The structures in Bantu languages can still be described as "inflection" (particularly when tone is involved), but with more synthetic structuring than in "classic" inflection languages: there are a lot more of what appear to be "infixes" (although these are not really infixes, but historical prefixes and suffixes).
For instance, in Swahili:
- nilivunja kiti
  - {1sing+past+BREAK+particle} {class7+WOOD}
  - I broke a chair
- nilikivunja
  - 1sing+past+class7+BREAK+particle
  - I broke it
- kimevunjika
  - class7+perfect+BREAK+stative+particle
  - it is broken
- mtu aliyesoma
  - {class1+BEING} {class1+past+class1-rel+READ+particle}
  - a person who read
- mtu anayesoma
  - {class1+BEING} {class1+present+class1-rel+READ+particle}
  - a person who is reading
- mtu asomaye
  - {class1+BEING} {class1+READ+particle+class1-rel}
  - a person who reads
It may be that Apertium as-is could handle these structures, but there is something to be said for an alternative "plugin" approach to different languages. This would involve creating a database of words and relevant grammatical information about them, and then using a set of rules to generate a Great Big List of all valid forms in XML format. This would then be compiled for Apertium as normal.
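A rough sketch of such a generator is given below. It combines the subject, tense and object markers seen in the Swahili examples above and writes one dictionary entry per form. The e/p/l/r/s element layout follows the usual Apertium monolingual dictionary entry shape, but the tag names (p1sg, cl1, obj_cl7, ...), the morpheme tables and the function names are assumptions made for illustration, and a real generator would need rules for relatives, statives and phonological changes.

```python
from itertools import product
import xml.etree.ElementTree as ET

# Morphemes taken from the Swahili examples; tag names are hypothetical.
SUBJECTS = {"ni": "p1sg", "a": "cl1", "ki": "cl7"}
TENSES = {"li": "past", "na": "pres", "me": "perf"}
OBJECTS = {"": None, "ki": "obj_cl7"}  # empty string means no object marker


def generate_entries(stem, lemma, pos="vblex"):
    """Return (surface, lemma, tags) for every subject/tense/object combination."""
    entries = []
    for (subj, subj_tag), (tense, tense_tag), (obj, obj_tag) in product(
        SUBJECTS.items(), TENSES.items(), OBJECTS.items()
    ):
        surface = subj + tense + obj + stem
        tags = [pos, subj_tag, tense_tag]
        if obj_tag:
            tags.append(obj_tag)
        entries.append((surface, lemma, tags))
    return entries


def to_dix_section(entries):
    """Serialize the entries as a dictionary <section> of fully expanded forms."""
    section = ET.Element("section", id="main", type="standard")
    for surface, lemma, tags in entries:
        e = ET.SubElement(section, "e")
        p = ET.SubElement(e, "p")
        ET.SubElement(p, "l").text = surface  # left side: surface form
        r = ET.SubElement(p, "r")             # right side: lemma plus tags
        r.text = lemma
        for tag in tags:
            ET.SubElement(r, "s", n=tag)
    return ET.tostring(section, encoding="unicode")


entries = generate_entries(stem="vunja", lemma="vunja")
print(len(entries))  # 18 forms, including nilivunja and nilikivunja
print(to_dix_section(entries[:2]))
```

The plain Cartesian product already overgenerates (it also emits forms such as kilikivunja), which is why the rule set, rather than the list itself, is where the linguistic work would go.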
There is a risk of generating forms that are unlikely to occur in real life, and forms would have to be generated for every possible class marker used as object pronoun (as in nilikivunja above) and/or with various other infixes. But this is inherently no more difficult or redundant than trying to develop "paradigms" in the dictionary files themselves. Dealing with tones will add another layer of complexity (see Lingala below; Swahili is almost unique among the Bantu languages in having lost its tones). Abstracting these issues into Metaparadigms, which may be the standard way forward, may make the dictionary files concomitantly difficult to maintain.
An example of verbal tones in Lingala is:
- to ask = kotúna (ko-tún-a): infger.tún.vblex.infger
- to wonder = komítúna (ko-mí-tún-a): infger.ref.tún.vblex.infger
- I asked = natúní (na-tún-í): p1sg.tún.vblex.pastpres
- You asked = otúní (o-tún-í): p2sg.tún.vblex.pastpres
(pastpres is recent past used as present tense with some common verbs)
Agglutinating languages
Both the current Apertium system and the suggested plugin system face another set of difficulties with agglutinative languages such as Quechua; see Agglutination.
See also
- Metaparadigms
- Talk:Metadix: proposal for handling complex combinations of prefixes/infixes using metadix (the current metadix XSLT stylesheets only handle one parameter)
- Partial hack for prefix inflection
- Lexc and flag diacritics for prefix tagging
- Simple prefixes with grammatical load in the Cookbook