Machine translation courses for the languages of Russia/Session 1

This session has two objectives. The first is to give an overview of the theory of morphology: how words are inflected and how new words are formed. The second is to demonstrate how Apertium handles the analysis and generation of morphology.

Theory

The theory section is split into three subsections. The first deals with morphotactics, that is, how morphemes (parts of words) occur and are joined together. The second gives some details of morphophonology, or how morphemes change as a result of being joined together. The third gives a theoretical description of how both are modelled computationally.

Morphotactics

The morphotactics of a language is the way that morphemes in that language are joined together to form words. Morphemes are the smallest units of meaning. Morphemes can be free or bound: free if they can occur on their own, and bound if they must be attached to another morpheme. A single morpheme may have several allomorphs, which mean the same thing but are written or pronounced differently. For example, the dative case (used to indicate movement towards something) in Chuvash has several allomorphs, and the choice between them depends on the vowel quality of the stem to which the suffix attaches.

ачама ача·м·а "to my child"
ачамсене ача·м·сен·е "to my children"
ӗҫӗме ӗҫ·ӗм·е "to my work"
каҫмана каҫма·на "to the crossing"

Morphemes can be further split into two subtypes, inflectional and derivational. In the examples, · signifies an inflectional boundary, and » signifies a derivational boundary.

ӗҫ ӗҫ "work"
ӗҫсем ӗҫ·сем "work·s"
ӗҫчен ӗҫ»чен "work»er"
ӗҫченсем ӗҫ»чен·сем "work»er·s"
ӗҫле ӗҫ»ле "to work"
ӗҫле ӗҫ»ле»тер "to make (someone) work"
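
The segmented forms above can also be read mechanically. The following is a minimal sketch, taking · as the boundary before an inflectional suffix and » as the boundary before a derivational one, as the examples use them. The function names are made up for this illustration and are not part of Apertium.

```python
import re

INFLECTIONAL = "·"  # boundary before an inflectional suffix
DERIVATIONAL = "»"  # boundary before a derivational suffix


def parse_segmentation(segmented):
    """Split a segmented form into its stem and a list of (kind, suffix)."""
    # The capturing group keeps the boundary symbols in the result list.
    parts = re.split(f"([{INFLECTIONAL}{DERIVATIONAL}])", segmented)
    stem, rest = parts[0], parts[1:]
    suffixes = []
    for boundary, morpheme in zip(rest[0::2], rest[1::2]):
        kind = "inflectional" if boundary == INFLECTIONAL else "derivational"
        suffixes.append((kind, morpheme))
    return stem, suffixes


def surface(segmented):
    """Remove the boundary symbols to recover the written form."""
    return segmented.replace(INFLECTIONAL, "").replace(DERIVATIONAL, "")


stem, suffixes = parse_segmentation("ӗҫ»чен·сем")
print(stem, suffixes)
print(surface("ача·м·сен·е"))
```

For ӗҫ»чен·сем this gives the stem ӗҫ followed by one derivational suffix (чен) and one inflectional suffix (сем).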

Inflection

Inflectional morphemes carry grammatical information, such as number, case, tense, etc., but do not change the word category (part of speech), nor do they change the basic semantic meaning. For example in Chuvash, ӗҫ and ӗҫсем have the same basic semantic meaning, but if you add the derivational affix -лЕ, ӗҫле, then the meaning changes to "do ӗҫ".

Examples of inflectional morphemes include the plural suffixes -lar, -сем and -и (kitap·lar (tr), ача·сем (cv), книг·и (ru)), and the case endings -ран (ablative), -ті (translative) and -де (locative) (уй·ран (cv), кань·ті (kv), үй·де (kk)).

In translation, inflection is very frequently treated as a productive process, meaning there are rules to determine how the different inflections of a word change in translation.

Derivation

Derivational morphemes change the basic semantic meaning of a word, and can also change its word category. Depending on the language pair involved, derivation is usually given less systematic treatment than inflection, as the semantic changes caused by derivational morphemes are less predictable.

Some examples of derivation are -LIK in Kyrgyz (ай "month" + LIK = айлык "monthly wage"), -LA in Kyrgyz (ай "month" + LA = айла- "for a month to go by"), and -ja in Finnish (kirjoitta "write" + ja "agent" = kirjoittaja "writer").

Compounding

Compounding is a process where two or more words are joined together to form one. Among the languages spoken in Europe, this happens most productively in the Germanic languages and in the non-Indo-European languages.

Examples of compound words might be:

  • Tietokoneanimaatioelokuva = Tietokone+animaatio+elo+kuva (fi)
  • Kontaktlinsenverträglichkeitstest = Kontakt+Linsen+Verträglichkeit(s)+Test (de)
  • Elmegyógyintézet = Elme+gyógy+intézet (hu)
  • Giellamovttidanplána = Giella+movttidan+plána (se)

In languages where word compounding is very productive, it is desirable for compound words to be analysed and translated automatically. This reduces the size of the lexicon and allows previously unencountered forms to be dealt with nicely.
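
As a rough illustration of automatic compound analysis, the sketch below tries every way of cutting a word into entries of a small lexicon. The lexicon is hypothetical and contains only parts taken from the examples above. Real compounds may also need linking elements (such as the German -s-), which this sketch ignores, and this is not how Apertium itself implements compounding.

```python
def split_compound(word, lexicon):
    """Return every way to write `word` as a sequence of lexicon entries."""
    word = word.lower()
    if not word:
        return [[]]  # one analysis of the empty string: no parts
    analyses = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            # Keep this head only if the remainder can be split as well.
            for tail in split_compound(word[i:], lexicon):
                analyses.append([head] + tail)
    return analyses


# Hypothetical lexicon, built from the parts shown in the examples.
LEXICON = {
    "tietokone", "animaatio", "elo", "kuva",
    "elme", "gyógy", "intézet",
    "giella", "movttidan", "plána",
}

print(split_compound("Giellamovttidanplána", LEXICON))
print(split_compound("Tietokoneanimaatioelokuva", LEXICON))
```

A word that cannot be covered completely by lexicon entries gets an empty list of analyses. A larger lexicon would often give several competing analyses, which a real system has to rank.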

Clitics

A clitic is a syntactically independent word that functions phonologically as an affix of another word. For the purposes of machine translation between written languages, we are particularly interested in clitics which are either written together with another word, or written separately but with a form that is conditioned by another word.

In the Turkic languages (and some Uralic languages) there is a question word (sometimes called a particle): Turkish mI, Kyrgyz -BI, Kazakh MA, Finnish -kO, North Sámi -go. Examples: келесің бе? (kk) келесиңби? (ky) tuletko? (fi) boađátgo? (se) "are you coming?". This morpheme has the status of a clitic because its phonological form depends on the previous word, while syntactically (and sometimes orthographically) it operates on its own.

In Tajik, there is a variant of the word for "and" which, even though it functions syntactically as a conjunction, attaches to the preceding word, whatever that may be. Its forms are -у (after consonants) and -ву (after vowels). An example would be the alternative to чой ва шароб "tea and wine": чою шароб.

Morphophonology

Morphophonology studies the phonological changes that morphemes undergo when they are joined together. Morphophonology can be seen well in any number of morphemes in any number of languages, but here it will be explained using the plural suffix in Tatar, -/LAr/.

This suffix has four forms, depending on the noun it attaches to: -лар, -ләр, -нар, -нәр. Some examples are алма·лар "apples", тел·ләр "languages/tongues", урам·нар "streets", көн·нәр "days". The first consonant alternates between /л/ and /н/ depending on the last sound of the word: it is /н/ immediately after a nasal consonant (м, н, ң), and /л/ after everything else. The vowel /A/ alternates depending on the last vowel of the word: it is /а/ after back ("hard") vowels (а, о, ы, у) and /ә/ after front ("soft") vowels (ә, э, ө, и, ү).
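
The two alternations can be sketched as a small function. This is only an illustration of the rule as stated here, not Apertium's implementation; the function name is made up, and the letter е is added to the front vowels as an assumption so that тел is covered (word-initial е, as in ел "year", is not handled).

```python
NASALS = set("мнң")
BACK_VOWELS = set("аоыу")
# "е" is an addition to the list in the text (assumption), needed for тел.
FRONT_VOWELS = set("әэөиүе")


def tatar_plural(noun):
    """Attach the plural suffix -/LAr/ in the form the noun requires."""
    # /L/: н immediately after a nasal consonant, л after everything else.
    consonant = "н" if noun[-1] in NASALS else "л"
    # /A/: decided by the last vowel of the noun.
    for letter in reversed(noun):
        if letter in BACK_VOWELS:
            vowel = "а"
            break
        if letter in FRONT_VOWELS:
            vowel = "ә"
            break
    else:
        raise ValueError(f"no known vowel in {noun!r}")
    return noun + consonant + vowel + "р"


for noun in ["алма", "тел", "урам", "көн"]:
    print(noun, tatar_plural(noun))
```

Run on the four example nouns, this reproduces алмалар, телләр, урамнар and көннәр.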

Computational representations

Computational models of morphology usually use tools called finite-state transducers to model both morphotactics and morphophonology. A finite-state transducer is a bit like a flowchart, where depending on the part of the word you are reading, you make different decisions as to what inflection or derivation it has. Unlike the typical flowchart however, a decision may lead to more than one conclusion!

Figure: A finite-state transducer modelling the basic nominal morphotactics (plural, possession, case) of three words in Bashkir. Note how archiphonemes (letters in { and }) are used to represent letters that can change according to the phonology.

The above transducer, once expanded, is too big to read through easily, but if we remove the possessives, we can take a closer look at how it works.

Figure: A finite-state transducer modelling the case and number inflection of the Bashkir word мәктәп "school".

Consider the Bashkir word мәктәп "school". It declines for number (singular, plural) and case (nominative, accusative, genitive, locative, ablative and dative). Each arc in the transducer above has a label with two parts: a left side (to the left of the :) and a right side (to the right of the :). To analyse a word, we follow the arcs from the start state, reading the left side of each label from the word and writing the right side to the output.

Case        Singular    Plural
Nominative  мәктәп      мәктәптәр
Accusative  мәктәпте    мәктәптәрҙе
Genitive    мәктәптең   мәктәптәрҙең
Locative    мәктәптә    мәктәптәрҙә
Ablative    мәктәптән   мәктәптәрҙән
Dative      мәктәпкә    мәктәптәргә

You can try doing this with one word from the declension table above, for example мәктәптәрҙән "from (the) schools". We should get the analysis мәктәп<n><pl><abl>. The process goes as follows:

  • read м, write м (input: м, output: м)
  • read ә, write ә (input: мә, output: мә)
  • read к, write к (input: мәк, output: мәк)
  • read т, write т (input: мәкт, output: мәкт)
  • read ә, write ә (input: мәктә, output: мәктә)
  • read п, write п (input: мәктәп, output: мәктәп)
  • read 0, write <n> (input: мәктәп0, output: мәктәп<n>)
  • read т, write <pl> (input: мәктәп0т, output: мәктәп<n><pl>)
  • read ә, write 0 (input: мәктәп0тә, output: мәктәп<n><pl>0)
  • read р, write 0 (input: мәктәп0тәр, output: мәктәп<n><pl>00)
  • read 0, write <abl> (input: мәктәп0тәр0, output: мәктәп<n><pl>00<abl>)
  • read ҙ, write 0 (input: мәктәп0тәр0ҙ, output: мәктәп<n><pl>00<abl>0)
  • read ә, write 0 (input: мәктәп0тәр0ҙә, output: мәктәп<n><pl>00<abl>00)
  • read н, write 0 (input: мәктәп0тәр0ҙән, output: мәктәп<n><pl>00<abl>000)

Note that reading or writing 0 is like reading or writing nothing.
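
The walkthrough above can be reproduced with a small program. What follows is a minimal sketch of a transducer for the single word мәктәп, not Apertium's own implementation. The class, the state numbering and the case tags other than <n>, <pl> and <abl> are assumptions made for this example, and the case endings are copied from the declension table. Because several arcs may match at the same point, the analyser keeps an agenda of all paths it is still following, so one word can receive more than one analysis.

```python
EPS = ""  # plays the role of 0: read or write nothing


class Transducer:
    def __init__(self):
        self.arcs = {}       # state -> list of (read, write, next state)
        self.finals = set()  # states in which an analysis may end
        self.n_states = 1    # state 0 is the start state

    def add_path(self, state, pairs):
        """Chain one new arc per (read, write) pair; return the last state."""
        for read, write in pairs:
            target = self.n_states
            self.n_states += 1
            self.arcs.setdefault(state, []).append((read, write, target))
            state = target
        return state

    def analyse(self, word):
        """Collect the output of every path that reads the whole word
        and stops in a final state."""
        results = []
        agenda = [(0, 0, "")]  # (state, letters read so far, output so far)
        while agenda:
            state, pos, out = agenda.pop()
            if pos == len(word) and state in self.finals:
                results.append(out)
            for read, write, target in self.arcs.get(state, []):
                # An EPS arc always matches and consumes no input.
                if word.startswith(read, pos):
                    agenda.append((target, pos + len(read), out + write))
        return results


# Endings after the singular stem and after the plural suffix.
# Tag names other than <abl> are assumed for illustration.
CASES = {
    "<nom>": ("", ""),
    "<acc>": ("те", "ҙе"),
    "<gen>": ("тең", "ҙең"),
    "<loc>": ("тә", "ҙә"),
    "<abl>": ("тән", "ҙән"),
    "<dat>": ("кә", "гә"),
}


def build_mektep():
    fst = Transducer()
    stem = fst.add_path(0, [(c, c) for c in "мәктәп"] + [(EPS, "<n>")])
    plural = fst.add_path(stem, [("т", "<pl>"), ("ә", EPS), ("р", EPS)])
    for tag, (singular_ending, plural_ending) in CASES.items():
        for state, ending in ((stem, singular_ending), (plural, plural_ending)):
            path = [(EPS, tag)] + [(c, EPS) for c in ending]
            fst.finals.add(fst.add_path(state, path))
    return fst


fst = build_mektep()
print(fst.analyse("мәктәптәрҙән"))  # ['мәктәп<n><pl><abl>']
```

Swapping the read and write sides of every arc would turn the same network into a generator, which is the reason transducers are convenient for both analysis and generation.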

Practice

There are two handouts for this practical,

Further reading
