Machine translation courses for the languages of Russia/Session 0

From Apertium
Revision as of 14:35, 18 December 2011 by Francis Tyers

Session 0: Overview

This session gives a short overview of the field of rule-based machine translation and introduces how the free/open-source rule-based machine translation platform Apertium is used.

Types of machine translation

There are two principal types of machine translation:

  • Rule-based machine translation (RBMT), also called symbolic MT; Apertium falls into this category and this session focusses on the sub-types of RBMT
  • Corpus-based machine translation; uses collections of previously translated sentences to propose translations of new sentences.

Corpus-based MT can be split into two main subgroups: statistical and example-based. In theory, the basic approach to statistical machine translation takes a collection of previously translated sentences (a parallel corpus) and calculates which tokens co-occur most frequently. Every pair of co-occurring tokens is assigned a probability. When translating a new sentence, its words are looked up, their probabilities are combined, many candidate translations are generated, and the candidate with the highest probability is selected. The first statistical machine translation systems used co-occurrences of words, but newer systems can use sequences of words (sometimes called phrases) and hierarchical trees.

By contrast, example-based machine translation can be thought of as translation by analogy. It still uses a parallel corpus, but instead of assigning probabilities to words, it tries to learn by example. For example, given the sentence pairs (A la chica le gustan los gatos (es) → Das Mädchen mag Katzen (de)) and (A la chica le gustan los elefantes → Das Mädchen mag Elefanten), it might produce the translation example (A la chica le gustan X → Das Mädchen mag X). When translating a new sentence, the parts are looked up and substituted.
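The co-occurrence idea behind the earliest statistical systems can be sketched in a few lines. This is a toy illustration only, not how any real statistical MT system is trained: the corpus is just the two Spanish-German sentence pairs from the example above (lowercased), the function names are made up for this sketch, and a simple relative frequency stands in for a proper alignment model.

```python
from collections import Counter, defaultdict

# Toy parallel corpus: the two sentence pairs from the text, lowercased.
CORPUS = [
    ("a la chica le gustan los gatos", "das mädchen mag katzen"),
    ("a la chica le gustan los elefantes", "das mädchen mag elefanten"),
]

def cooccurrence_counts(pairs):
    """Count how often each source token appears in the same
    sentence pair as each target token."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        for s in set(source.split()):
            for t in set(target.split()):
                counts[s][t] += 1
    return counts

def translation_probabilities(counts):
    """Turn raw counts into relative frequencies per source token."""
    probs = {}
    for s, target_counts in counts.items():
        total = sum(target_counts.values())
        probs[s] = {t: c / total for t, c in target_counts.items()}
    return probs

probs = translation_probabilities(cooccurrence_counts(CORPUS))
print(probs["gatos"]["katzen"])         # 0.25
print(probs["chica"]["mädchen"])        # 0.25
print("katzen" in probs["elefantes"])   # False
```

With so little data the counts cannot tell "katzen" apart from "das" as a translation of "gatos"; they only rule out pairs that never occur together. Larger corpora and better models are what make the probabilities informative.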

Automatically applying a large translation memory to a text may also be considered a form of corpus-based machine translation. In practice, the line between statistical and example-based MT is blurrier than this description suggests.

Both rule-based and corpus-based methods have advantages and disadvantages. Corpus-based methods may produce translations which sound more fluent, but the meaning may be less faithfully reproduced. Rule-based systems tend to produce translations which are less fluent, but which preserve more of the source language meaning.

Rule-based and corpus-based systems can also be combined in various ways as hybrid systems. For example, one might make a hybrid system that uses an example-based system to find equivalences, and then uses a rule-based system as a backoff when no pattern is found.
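A hybrid with backoff of this kind might look roughly like the sketch below. It assumes a single translation example of the form (A la chica le gustan X → Das Mädchen mag X), a small hypothetical lexicon for the slot X, and a placeholder function standing in for a complete rule-based system. None of these names come from Apertium.

```python
import re

# One translation example with a slot X.
TEMPLATE_SOURCE = re.compile(r"^A la chica le gustan (?P<X>.+)$")
TEMPLATE_TARGET = "Das Mädchen mag {X}"

# Hypothetical lexicon of known slot fillers.
SLOT_LEXICON = {"los gatos": "Katzen", "los elefantes": "Elefanten"}

def example_based(sentence):
    """Return a translation if the template and its slot filler
    are both known, otherwise None."""
    match = TEMPLATE_SOURCE.match(sentence)
    if match is None:
        return None
    filler = SLOT_LEXICON.get(match.group("X"))
    if filler is None:
        return None
    return TEMPLATE_TARGET.format(X=filler)

def rule_based(sentence):
    """Placeholder for a full rule-based system; it only tags the
    input so that the backoff is visible in the output."""
    return "[RBMT] " + sentence

def hybrid(sentence):
    """Try the example-based system first, back off to rules."""
    result = example_based(sentence)
    return result if result is not None else rule_based(sentence)

print(hybrid("A la chica le gustan los gatos"))  # Das Mädchen mag Katzen
print(hybrid("El perro duerme"))                 # [RBMT] El perro duerme
```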


Direct

Direct, or word-for-word, machine translation works by reading words in the source language one at a time and looking them up in a bilingual word list of surface forms. Words may also be deleted or left out, and may be translated as one or more words. No grammatical analysis is done, so even simple errors, such as a lack of agreement in gender and number between a determiner and its head noun, will remain in the target language output.


Heinrich köpeğine bir parça et verir. << TXUVAIX AQUÍ >> Генрих даёт кусок мяса своей собаке
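The lookup procedure described for direct translation can be sketched as follows. The Spanish-English word list is hypothetical and written only for this illustration, and passing unknown words through unchanged is a design choice of the sketch rather than something prescribed here.

```python
# Hypothetical bilingual word list of surface forms (Spanish to English).
# Each source word maps to zero, one or several target words.
WORD_LIST = {
    "veo": ["I", "see"],   # one word translated as two
    "a": [],               # word deleted in the output
    "la": ["the"],
    "casa": ["house"],
    "blanca": ["white"],
}

def direct_translate(sentence):
    """Translate word for word, with no grammatical analysis.
    Unknown words are copied to the output unchanged."""
    output = []
    for word in sentence.split():
        output.extend(WORD_LIST.get(word.lower(), [word]))
    return " ".join(output)

print(direct_translate("veo la casa blanca"))  # I see the house white
print(direct_translate("veo a María"))         # I see María
```

The first output keeps the Spanish noun-adjective order because nothing in the process looks beyond a single word, which is the same reason agreement errors survive.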


Transfer-based machine translation works by first converting the source language to a language-dependent intermediate representation. Rules are then applied to this intermediate representation in order to change the structure of the source language into the structure of the target language. The translation is generated from the resulting representation using both bilingual dictionaries and grammatical rules.

The intermediate representation can differ in its level of abstraction. We can distinguish two broad groups: shallow transfer and deep transfer. In shallow-transfer MT the intermediate representation is usually based on either morphology or shallow syntax. In deep-transfer MT the intermediate representation usually includes some kind of parse tree or graph structure (see image on the right).


Transfer-based MT usually works as follows: The original text is first analysed and disambiguated morphologically (and in the case of deep transfer, syntactically) in order to obtain the source language intermediate representation. The transfer process then converts this final representation (still in the source language) to a representation of the same level of abstraction in the target language. From the target language representation, the target language is generated.
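The three stages above (analysis, transfer, generation) can be illustrated with a very small shallow-transfer sketch. Everything in it is an assumption made for illustration: the Spanish-English dictionaries, the tag names, and the single reordering rule are invented here and do not reflect Apertium's actual data formats. Disambiguation is skipped because every surface form in the toy dictionary has exactly one analysis.

```python
# Hypothetical monolingual dictionary: surface form -> (lemma, tags).
ANALYSES = {
    "la": ("el", ("det", "f", "sg")),
    "las": ("el", ("det", "f", "pl")),
    "casa": ("casa", ("n", "f", "sg")),
    "casas": ("casa", ("n", "f", "pl")),
    "blanca": ("blanco", ("adj", "f", "sg")),
    "blancas": ("blanco", ("adj", "f", "pl")),
}

# Hypothetical bilingual dictionary: source lemma -> target lemma.
BILINGUAL = {"el": "the", "casa": "house", "blanco": "white"}

def analyse(text):
    """Source text -> source-language intermediate representation."""
    return [ANALYSES[word] for word in text.lower().split()]

def transfer(forms):
    """Source representation -> target representation."""
    # Lexical transfer: translate the lemma, keep part of speech and
    # number, drop gender (not needed for English).
    target = [(BILINGUAL[lemma], (tags[0], tags[-1])) for lemma, tags in forms]
    # Structural transfer: noun + adjective becomes adjective + noun.
    output, i = [], 0
    while i < len(target):
        is_noun_adj = (i + 1 < len(target)
                       and target[i][1][0] == "n"
                       and target[i + 1][1][0] == "adj")
        if is_noun_adj:
            output += [target[i + 1], target[i]]
            i += 2
        else:
            output.append(target[i])
            i += 1
    return output

def generate(forms):
    """Target representation -> target text (only nouns inflect)."""
    words = []
    for lemma, (pos, number) in forms:
        words.append((lemma + "s") if pos == "n" and number == "pl" else lemma)
    return " ".join(words)

print(generate(transfer(analyse("la casa blanca"))))     # the white house
print(generate(transfer(analyse("las casas blancas"))))  # the white houses
```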


In transfer-based machine translation, rules are written on a pair-by-pair basis, making them specific to a language pair. In the interlingua approach, by contrast, the intermediate representation is entirely language-independent. This approach has a number of benefits, but also disadvantages. The main benefit is that adding a new language to an existing MT system only requires writing an analyser and a generator for the new language, not transfer rules between the new language and all the existing languages. The drawbacks are that it is very hard to define an interlingua which can truly represent all nuances of all natural languages, and that in practice interlingua systems are only used for limited translation domains.
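The difference in effort can be made concrete with some simple counting. The sketch below assumes that a transfer system needs one set of transfer rules per translation direction, and that analysers and generators are shared across pairs; both are simplifications, and the function names are invented for this illustration.

```python
def transfer_components(n):
    """Components for n fully connected languages with transfer:
    n analysers, n generators, n * (n - 1) directed rule sets."""
    return 2 * n + n * (n - 1)

def interlingua_components(n):
    """Components for n languages with an interlingua:
    n analysers and n generators."""
    return 2 * n

def cost_of_adding_language(existing):
    """New components needed to add one language to a system that
    already covers `existing` languages: (transfer, interlingua)."""
    transfer = transfer_components(existing + 1) - transfer_components(existing)
    interlingua = interlingua_components(existing + 1) - interlingua_components(existing)
    return transfer, interlingua

print(transfer_components(5), interlingua_components(5))  # 30 10
print(cost_of_adding_language(5))                         # (12, 2)
```

Under these assumptions, adding a sixth language costs an analyser and a generator in both designs, plus ten directed sets of transfer rules in the transfer-based one.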

Problems in machine translation

Analysis

Form does not entirely determine content. This is also called the problem of ambiguity. The problem is that many sentences in natural language can have more than one interpretation, and these interpretations may be translated differently in different languages. Consider the following example from Spanish to Danish and German:


Synthesis

Content does not entirely determine form. The problem here is that a given language usually has more than one way to communicate the same meaning.


All of these questions demand the same answer, but they may be more or less frequently used, or emphasise different things. In Apertium, for a given input sentence, one output sentence is produced. It is up to the designer of the translation system to choose which translation they want the system to produce. Often we recommend the most literal translation possible, as this reduces the necessity of transfer rules.

Transfer

The same content is represented differently in different languages. Languages have different ways of expressing the same meaning. These ways are often incompatible between languages. Consider the following examples expressing the same content:


In Apertium, rules are applied which convert source language structure to target language structure using sequences of lexical forms as an intermediate representation. For further information see: Session 5: Structural transfer basics.