Difference between revisions of "Agglutination"

From Apertium
(New page: Both the current Apertium system and the suggested plugin system face another set of difficulties with agglutinative languages like Quechua. For instance: :'''wasi''' — house :'''...)
 
 
(9 intermediate revisions by 4 users not shown)
Line 1: Line 1:
  +
An '''agglutinative''' language, in the strict definition, forms words by joining together unchangeable stems and affixes, which do not fuse or change form depending on other affixes. However, "agglutinative" is also often used incorrectly as a synonym for the wider category "synthetic", which includes fusional and inflected languages ('''synthetic''' languages have a high morpheme-to-word ratio).
   
Both the current Apertium system and the suggested plugin system face another set of difficulties with agglutinative languages like Quechua. For instance:
+
There are some difficulties with agglutinative languages like Quechua.
  +
  +
==Example==
  +
  +
For instance:
   
 
:'''wasi''' — house
 
Line 14: Line 19:
 
:'''wasinchikkunata''' — to our houses
 
   
  +
Or Basque:
This sort of complex would actually fit quite well into the current Apertium model, although each paradigm would have a great number of possible members due to the large numbers of suffixes (and this is complicated by the fact that suffix order is variable). It could also be handled by form generation, again with the drawback that many thousands of possible forms would need to be generated.
 
   
  +
:'''etxea''' the house
An alternative method might entail slightly adjusting the way the morphological analyser works. In this approach, the binary dictionaries would consist only of stems and affixes, and instead of having the morphological analyser read to the end of the orthographic word, it would read only to the end of possible morphological boundaries within the word. A naïve algorithm for this might be:
 
  +
:'''etxe gorria''' the red house
  +
:'''etxe gorri zaharra''' the old red house
  +
:'''etxe gorri zaharrarekin''' with the old red house
  +
:'''etxe gorri zaharrarentzat''' for the old red house
   
 
This sort of complex would actually fit quite well into the current Apertium model, although each paradigm would have a great number of possible members due to the large numbers of suffixes (and this is complicated by the fact that suffix order is variable). It could also be handled by form generation, again with the drawback that many thousands of possible forms would need to be generated.
# Start at the first letter of the word.
 
# Collect all matches in the stem dictionary where that letter is the first letter.
 
# Read the next letter.
 
# Discard all items in the matched set that do not have that letter as second letter.
 
# Repeat 3 and 4 until the shortest stem that is present in the stem dictionary is found.
 
# Put this in a stem array and start using the affix dictionary as well. Set a new morphological boundary after that letter.
 
# For each subsequently-read letter, add matching stems to the stem array (working from the word-beginning), and add matching affixes to a new affix array (working from the previous morphological boundary).
 
# Each time an affix match is found, set a new morphological boundary after that letter, and start a new affix array.
 
# Add matches to the stem and affix arrays as appropriate until the end of the word.
 
 
If we take an imaginary set of stems '''ku''', '''kuti''', '''kutima''', and an imaginary set of affixes '''-ti''', '''-m''', '''-ana''', '''-ma''', '''-na''', possible segmentations for the imaginary '''kutimana''' would be:
 
 
:'''ku-ti-m-ana'''
 
:'''ku-ti-ma-na'''
 
:'''kuti-m-ana'''
 
:'''kuti-ma-na'''
 
:'''kutima-na'''
 
 
These segmentations could be generated by the process above as shown in Table 1 (where '''NM''' = no match, '''M''' = match, and '''->Arr''' = start new array), with the output in Table 2. Of course, using [http://en.wikipedia.org/wiki/Trie tries] or something similar may be a much more efficient way of doing this than the naïve process above.
 
 
{|align=center
 
|k || NM || || || || || || || ||
 
|-
 
|ku || M || ->Arr|| || || || || || ||
 
|-
 
|kut || NM || t || NM || || || || || ||
 
|-
 
|kuti || M || ti || M || ->Arr|| || || || ||
 
|-
 
|kutim || NM || tim || NM || m || M || ->Arr|| || ||
 
|-
 
|kutima || M || tima || NM || ma || M || a || NM || ->Arr (from ma) ||
 
|-
 
|kutiman || NM || timan || NM || man || NM || an || NM || n || NM
 
|-
 
|kutimana || NM || timana || NM || mana || NM || ana || M || na || M
 
|}
 
<center>Table 1 - Example of stem/affix analysis</center>
 
 
 
 
{|align=center
 
|ku || -ti || -m || -ana || -na
 
|-
 
|kuti || || -ma || ||
 
|-
 
|kutima || || || ||
 
|}
 
<center>Table 2 - Output from stem/affix analysis</center>
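
As a rough illustration of the trie idea mentioned above, the following sketch finds every stem that is a prefix of a word in a single left-to-right pass. The class and function names are hypothetical, and only the stem lookup is shown; affix lookup from each morphological boundary could reuse the same structure.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a letter trie; `is_entry` marks the end of a dictionary entry."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_entry = False


def build_trie(entries: Iterable[str]) -> TrieNode:
    """Insert every entry into a fresh trie and return its root."""
    root = TrieNode()
    for entry in entries:
        node = root
        for letter in entry:
            node = node.children.setdefault(letter, TrieNode())
        node.is_entry = True
    return root


def prefixes_in_trie(root: TrieNode, word: str) -> List[str]:
    """Return all dictionary entries that are prefixes of `word`."""
    found = []
    node = root
    for position, letter in enumerate(word, start=1):
        node = node.children.get(letter)
        if node is None:
            break  # no entry continues with this letter
        if node.is_entry:
            found.append(word[:position])
    return found


stem_trie = build_trie(["ku", "kuti", "kutima"])
print(prefixes_in_trie(stem_trie, "kutimana"))
```

Each letter of the word is inspected once, so the candidate set never has to be filtered repeatedly as in the naïve process.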
 
   
  +
: Paradigms can refer to other paradigms, so this kind of thing should work just fine?
   
Once a matrix of possible segmented forms has been generated for the word, there would then be the need to choose which of these are the ones intended.
 
   
  +
==Alternatives to lttoolbox==
One way of working towards this might be to have a table of possible affix combinations, with a likelihood assigned to each one. Something like the corpus generated by Kevin Scannell's Crubadán (http://borel.slu.edu/crubadan/index.html) might help here - a corpus is being collected for Bolivian Quechua and Ecuadorean Quichua (though not for Peruvian Quechua, which has more speakers).
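
A minimal sketch of such a table, applied to the imaginary '''kutimana''' segmentations from earlier, might look as follows. The likelihood values are invented purely for illustration (they do not come from any corpus), and <code>rank_segmentations</code> is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Hypothetical likelihoods for affix combinations; the numbers are made up.
AFFIX_SEQUENCE_LIKELIHOOD: Dict[Tuple[str, ...], float] = {
    ("ti", "m", "ana"): 0.05,
    ("ti", "ma", "na"): 0.40,
    ("m", "ana"): 0.10,
    ("ma", "na"): 0.30,
    ("na",): 0.15,
}


def rank_segmentations(candidates: List[str]) -> List[Tuple[str, float]]:
    """Sort hyphen-separated segmentations by the likelihood of their affix sequence."""
    scored = []
    for candidate in candidates:
        _stem, *affixes = candidate.split("-")
        # Affix combinations missing from the table get a likelihood of zero.
        score = AFFIX_SEQUENCE_LIKELIHOOD.get(tuple(affixes), 0.0)
        scored.append((candidate, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = ["ku-ti-m-ana", "ku-ti-ma-na", "kuti-m-ana", "kuti-ma-na", "kutima-na"]
for segmentation, likelihood in rank_segmentations(candidates):
    print(segmentation, likelihood)
```

In practice the scores would have to be estimated from a corpus, and the stem itself would probably also contribute to the ranking.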
 
  +
Other systems popular for agglutinative languages:
   
  +
* [[HFST]]
Indeed, another way of approaching the segmentation issue would be to use such a table directly, but working backwards from the end of the orthographical word - this would require the analyser to reverse each word before analysis, and then remove the segment which matched the longest affix sequence in the table.
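
The backwards approach could be sketched roughly as below, using a few of the Quechua suffixes from the example. The table contents and the function name are assumptions made for illustration; a real table would be far larger and would carry the likelihoods discussed above.

```python
from typing import List, Tuple

# Hypothetical table of permitted affix sequences (a small subset, for illustration).
AFFIX_SEQUENCES: List[Tuple[str, ...]] = [
    ("ta",),
    ("kuna",),
    ("kuna", "ta"),
    ("nchik",),
    ("nchik", "ta"),
    ("nchik", "kuna", "ta"),
]


def strip_longest_affix_sequence(word: str) -> Tuple[str, Tuple[str, ...]]:
    """Split `word` into (stem, affixes) using the longest matching affix sequence."""
    reversed_word = word[::-1]
    best: Tuple[str, ...] = ()
    best_length = 0
    for sequence in AFFIX_SEQUENCES:
        surface = "".join(sequence)
        # Match on the reversed word, and always leave at least one stem letter.
        if (
            best_length < len(surface) < len(word)
            and reversed_word.startswith(surface[::-1])
        ):
            best, best_length = sequence, len(surface)
    return word[: len(word) - best_length], best


print(strip_longest_affix_sequence("wasinchikkunata"))
```

This sketch does not check that the remaining stem is in the stem dictionary; that check would be needed before accepting a segmentation.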
 
  +
* [[SFST]] (see also [[Omorfi]])
  +
* [[Hunmorph]]
   
  +
== See also ==
Either of these approaches (intra-word segmentation, affix table) would minimise the number of forms produced either by the current Apertium paradigm model, or by the suggested form generation model. It is likely that these techniques could also be used with other Native American languages.
 
  +
* [[Prefixes and infixes]]
   
 
[[Category:Development]]
 
  +
[[Category:Writing dictionaries]]
  +
[[Category:Documentation in English]]

Latest revision as of 06:30, 20 October 2014

An agglutinative language, in the strict definition, forms words by joining together unchangeable stems and affixes, which do not fuse or change form depending on other affixes. However, "agglutinative" is also often used incorrectly as a synonym for the wider category "synthetic", which includes fusional and inflected languages (synthetic languages have a high morpheme-to-word ratio).

There are some difficulties with agglutinative languages like Quechua.

Example

For instance:

wasi — house
wasikuna — houses
wasita — to the house
wasikunata — to the houses
wasiy — my house
wasiita — to my house
wasiikuna — my houses
wasiikunata — to my houses
wasinchik — our house
wasinchikta — to our house
wasinchikkunata — to our houses

Or Basque:

etxea the house
etxe gorria the red house
etxe gorri zaharra the old red house
etxe gorri zaharrarekin with the old red house
etxe gorri zaharrarentzat for the old red house

This sort of suffix complex would fit quite well into the current Apertium model, although each paradigm would have a great many members because of the large number of suffixes (complicated further by the fact that suffix order is variable). It could also be handled by form generation, again with the drawback that many thousands of possible forms would need to be generated.
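
To give a feel for how quickly generated forms multiply, here is a minimal sketch that combines a subset of the Quechua suffixes from the list above. The slot names and the fixed possessive, plural, case order are simplifying assumptions, and <code>generate_forms</code> is a hypothetical helper, not part of Apertium.

```python
from itertools import product
from typing import List, Sequence

STEM = "wasi"

# Each slot lists its options; the empty string means "no suffix in this slot".
POSSESSIVE = ["", "nchik"]
PLURAL = ["", "kuna"]
CASE = ["", "ta"]


def generate_forms(stem: str, *slots: Sequence[str]) -> List[str]:
    """Return the stem followed by every combination of one option per slot."""
    return [stem + "".join(choice) for choice in product(*slots)]


forms = generate_forms(STEM, POSSESSIVE, PLURAL, CASE)
print(len(forms))
print(forms)
```

The number of forms is the product of the slot sizes, so three two-way slots already give eight forms for a single stem; realistic inventories with many more options per slot soon reach the thousands.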

Paradigms can refer to other paradigms, so this kind of thing should work just fine?


Alternatives to lttoolbox

Other systems popular for agglutinative languages:

HFST
SFST (see also Omorfi)
Hunmorph

See also

Prefixes and infixes