Difference between revisions of "Курсы машинного перевода для языков России/Session 1"


Revision as of 17:04, 18 December 2011

This session has two objectives. The first is to give an overview of the theory of morphology: how words are inflected and how new words are formed. The second is to demonstrate how the analysis and generation of morphology are handled in Apertium.

Theory

The theory section is split into three subsections. The first deals with morphotactics, that is, how morphemes (parts of words) occur and are joined together. The second gives some details of morphophonology, or how morphemes change as a result of being joined together. The final subsection gives a theoretical description of how this is treated computationally.

Morphotactics

The morphotactics of a language is the way that morphemes in that language are joined together to form words. Morphemes are the smallest units of meaning. They can be free or bound: free if they can occur on their own, and bound if they must be attached to another morpheme. A single morpheme may have several allomorphs, which mean the same thing but are written or pronounced differently. For example, the dative case (used to indicate movement towards something) in Chuvash has several allomorphs, chosen according to the vowel quality of the stem to which the suffix attaches.

ачама ача·м·а "to my child"
ачамсене ача·м·сен·е "to my children"
уӗҫӗме уӗҫ·ӗм·е "to my street"

Morphemes can be further divided into two subtypes: inflectional and derivational. In the two examples above, signifies a derivational boundary, and + signifies an inflectional boundary.

TODO; something about derivation here

Inflection

Inflectional morphemes carry grammatical information, such as number, case and tense, but they change neither the word category (part of speech) nor the basic semantic meaning. For example, in Spanish, fácil and fáciles have the same basic semantic meaning, but adding the derivational affix -mente gives fácilmente, whose meaning changes to "in a manner which is fácil".

Examples of inflectional morphemes include the plurals -lar, -сен and -и (kitaplar (tr), ачасен (cv), книги (ru)) and the case endings -ран (ablative), -kor (temporal) and -eh (locative): уй+ран (cv), öt+kor (hu), vas+eh (sl).

In translation, inflection is very frequently treated as a productive process, meaning there are rules to determine how the different inflections of a word change in translation.

Derivation

Derivational morphemes change the basic semantic meaning of a word, and can also change its word category. Depending on the language pair involved, derivation is usually handled less thoroughly than inflection, as the semantic changes caused by derivational morphemes are less predictable.

Some examples of derivations might be -DAKi in Turkish and -ja in Finnish (Makedonya'+daki, kirjoa+ja).

TODO; more details

Compounding

Compounding is a process where two or more words are joined together to form one. Among the languages spoken in Europe, this happens most productively in the Germanic languages and in the non-Indo-European languages.

Examples of compound words might be:

  • Tietokoneanimaatioelokuva = Tietokone+animaatio+elo+kuva (fi)
  • Kontaktlinsenverträglichkeitstest = Kontakt+Linsen+Verträglichkeit(s)+Test (de)
  • Infrastruktuurontwikkelingsplan = Infrastruktuur+ontwikkelings+plan (nl)
  • Specialisthelsetjenestelov = Specialist+helse+tjeneste+lov (da)

TODO; examples

In languages where word compounding is very productive, it is desirable for compound words to be analysed and translated automatically.
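
As a rough illustration of what analysing a compound automatically can involve, the sketch below splits a word into every sequence of parts found in a small lexicon. The function name split_compound and the toy lexicon are assumptions made for this example, not part of Apertium or HFST. Real analysers also have to deal with linking elements such as the German (s) shown above, which this sketch ignores.

```python
def split_compound(word, lexicon):
    """Return every way of splitting `word` into parts that are all in `lexicon`."""
    word = word.lower()
    if not word:
        return [[]]  # one way to split the empty string: no parts
    splits = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            # Try to split the remainder as well; keep only complete splits.
            for rest in split_compound(word[i:], lexicon):
                splits.append([head] + rest)
    return splits


# Hypothetical toy lexicon for the Finnish example above.
LEXICON = {"tietokone", "animaatio", "elo", "kuva", "elokuva"}

for parts in split_compound("Tietokoneanimaatioelokuva", LEXICON):
    print("+".join(parts))
```

Because the toy lexicon contains both elo, kuva and elokuva, the word receives two segmentations. This kind of ambiguity is typical when compounds are split automatically.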

Morphophonology

Morphophonology studies the phonological changes that morphemes undergo when they are joined together. For example, when the -ksi suffix is added to the word elefantti in Finnish, the consonant tt is shortened to t to produce elefantiksi. The superessive suffix in Hungarian is -n, but when it is attached to a stem ending in a consonant it receives a linking vowel o, e or ö, depending on the vowels in the stem of the word. For example: asztalon, hegyen, könyvön.
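
The Hungarian rule can be sketched in a few lines of code. This is a simplification under stated assumptions: the function name superessive is made up for this example, the linking vowel is chosen from the last vowel of the stem only, and exceptions and any changes to the stem itself are ignored.

```python
BACK = set("aáoóuú")
FRONT_ROUNDED = set("öőüű")
FRONT_UNROUNDED = set("eéií")
VOWELS = BACK | FRONT_ROUNDED | FRONT_UNROUNDED


def superessive(stem):
    """Attach the superessive -n, inserting a linking vowel after a consonant.

    Assumption: the last vowel of the (lowercase) stem decides the linking vowel.
    """
    if stem[-1] in VOWELS:
        return stem + "n"
    # Find the last vowel of the stem, scanning from the end.
    last_vowel = next((ch for ch in reversed(stem) if ch in VOWELS), None)
    if last_vowel in FRONT_ROUNDED:
        link = "ö"
    elif last_vowel in FRONT_UNROUNDED:
        link = "e"
    else:
        link = "o"
    return stem + link + "n"


for stem in ["asztal", "hegy", "könyv"]:
    print(stem, "->", superessive(stem))
```

For the three stems in the text this gives asztalon, hegyen and könyvön.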

TODO; examples

Computational representations of morphology

TODO; examples

Computational models of morphology usually use tools called finite-state transducers to model both morphotactics and morphophonology. A finite-state transducer is a bit like a flowchart: depending on the part of the word you are reading, you make different decisions about which inflection or derivation it has. Unlike a typical flowchart, however, a decision may lead to more than one conclusion!

Consider the example of the word vas in Slovenian. It declines for number (singular, dual, plural) and case (nominative, genitive, dative, accusative, locative and instrumental). If we look at the transducer on the right, each arc in the graph has a label. The label has two parts: a left side (to the left of the :) and a right side (to the right of the :). If we read the left sides and write out the right sides as we go, we can analyse a word.

You can try doing this with one word from the declension table on the right, for example vasema. We should get two analyses: dual dative and dual instrumental. The process goes roughly as follows:

Note that reading or writing 0 is like reading or writing nothing.
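
The same idea can be tried out in code. The sketch below is a hypothetical toy transducer, not the one from the figure: the state numbers, the arcs and the tag names (<n>, <sg>, <du>, <nom>, <acc>, <dat>, <ins>) are assumptions made for illustration. Each arc is written as (left, right, next state), and 0 stands for reading or writing nothing, as described above.

```python
EPS = "0"  # reading or writing 0 means reading or writing nothing

# Hypothetical toy transducer: state -> list of (left, right, next_state).
ARCS = {
    0: [("v", "v", 1)],
    1: [("a", "a", 2)],
    2: [("s", "s", 3)],
    3: [(EPS, "<n>", 4)],
    4: [("e", EPS, 5), (EPS, "<sg>", 9)],
    5: [("m", EPS, 6)],
    6: [("a", "<du>", 7)],
    7: [(EPS, "<dat>", 8), (EPS, "<ins>", 8)],  # one form, two analyses
    9: [(EPS, "<nom>", 8), (EPS, "<acc>", 8)],
}
FINALS = {8}


def analyse(word, arcs=ARCS, start=0, finals=FINALS):
    """Follow every path whose left sides spell `word`; collect the right sides."""
    results = []
    stack = [(start, 0, "")]  # (state, position in word, output so far)
    while stack:
        state, pos, out = stack.pop()
        if state in finals and pos == len(word):
            results.append(out)
        for left, right, nxt in arcs.get(state, []):
            written = "" if right == EPS else right
            if left == EPS:
                # Nothing is read, so the position in the word stays the same.
                stack.append((nxt, pos, out + written))
            elif pos < len(word) and word[pos] == left:
                stack.append((nxt, pos + 1, out + written))
    return sorted(results)


print(analyse("vasema"))
```

The two arcs leaving state 7 both read nothing but write different case tags, which is how a single surface form ends up with two analyses.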

Practice

Main article: Как использовать HFST, чтобы разработать новый морфологический анализатор ("How to use HFST to develop a new morphological analyser")

See handout "Как использовать HFST, чтобы разработать новый морфологический анализатор".