Difference between revisions of "Курсы машинного перевода для языков России/Session 3"

Revision as of 11:31, 24 December 2011

The aim of this session is to give an overview of the issue of morphological ambiguity, and describe how it is treated in Constraint Grammar. We will give some theory regarding different types of morphological ambiguity, and an overview of how they are dealt with using rules. The practice section will involve discovering tagging errors and trying to solve the errors with rules.

Theory

TODO; find new examples

The ambiguity that we are going to cover in this session is morphological ambiguity. This is the ambiguity that arises when a surface form has more than one possible morphological analysis (also referred to as homonymy, literally "same-nameness"). For example, the Italian word pubblico can be:

  • The verb "pubblicare" in the present indicative tense, first person singular.
  • The masculine noun "pubblico" in the singular number.
  • The adjective "pubblico" inflected for masculine, singular.

The fact that pubblico as a noun can be translated into German as Öffentlichkeit, Publikum, Zuschauer, etc. is translational ambiguity, not morphological ambiguity, and is therefore not treated in this session.

Morphological ambiguity

There are two principal types of morphological ambiguity: ambiguity between parts of speech (for example, a word that could be either a noun or a verb) and ambiguity within a part of speech (for example, a word form that can only be a noun, but may be nominative or genitive). Typically, the more complex the morphology of a language, the higher the ratio of within part-of-speech ambiguity to between part-of-speech ambiguity.

Between parts-of-speech

TODO; find new examples

An example of ambiguity between parts-of-speech is given above: the word pubblico can be a noun, a verb or an adjective. Consider also the frequent ambiguity between adjectives denoting ethnic groups (e.g. in Russian русский, xxx, yyy) and the nouns denoting the languages or people (русский, xxx, yyy).

Within parts-of-speech

TODO; find new examples

For examples of ambiguity within parts-of-speech we can look at the Slavic languages, where there is a frequent syncretism between nominative, accusative and genitive.

  • Примером может быть конденсация воды.(ru)
  • Příkladem může být kondenzace vody.(cs)
  • Príkladom môže byť kondenzácia vody.(sk)
  • Przykładem może być kondensacja wody.(pl)
  • Primer je lahko kondenzacija vode.(sl)

This is not, however, exclusive to Slavic languages. Other languages also exhibit limited within part-of-speech ambiguity: consider the French temps (ambiguous between singular and plural) and the Irish fear (ambiguous between nominative singular and genitive plural).

Syntactic ambiguity

TODO; write something about syntactic ambiguity

Another form of ambiguity is syntactic ambiguity, that is, where a sentence or phrase has more than one possible interpretation and the ambiguity cannot be resolved morphologically. An example is prepositional phrase attachment, where the morphology does not necessarily make clear which constituent the prepositional phrase attaches to. Consider the following example:

  • The boy saw the girl with the telescope (possession)
  • The boy saw the girl with the telescope (instrument)

BETTER EXAMPLE

A problem arises for machine translation when a sentence or phrase is ambiguous in the source language, but not in the target language. If the ambiguity can be preserved in the translation, it is less of a problem.


Rule-based disambiguation

TODO; find new examples

There are many ways of writing disambiguation rules, but the most important thing is to be able to express the rule in terms of the context that provides the disambiguation. Individual words can be very ambiguous, but they are often disambiguated by their context. Take the Czech phrase našim starým přátelům: both našim and starým are quite ambiguous (four analyses for našim and seven for starým), but the head of the phrase, přátelům, has only one analysis and disambiguates them.

  • Náš<det><pos><ma><pl><dat> starý<adj><ma><sg><ins> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><ma><pl><dat> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><mi><sg><ins> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><mi><pl><dat> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><f><pl><dat> přítel<n><ma><pl><dat>
  • ...

We could thus conceive of writing a rule which removes analyses that do not agree with the head of the phrase.
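The idea of removing analyses that disagree with the head can be sketched in a few lines of Python. This is only an illustration, not how Apertium or Constraint Grammar implements it: the function names `agreement` and `filter_by_head` are made up for this example, and only the five readings of starým listed above are used.

```python
# Tag inventories assumed for this sketch (taken from the analyses listed above).
GENDERS = {"ma", "mi", "f", "nt"}
NUMBERS = {"sg", "pl"}
CASES = {"nom", "gen", "dat", "acc", "voc", "loc", "ins"}

def agreement(tags):
    """Return the (gender, number, case) triple found in a list of tags."""
    def pick(options):
        return next((t for t in tags if t in options), None)
    return pick(GENDERS), pick(NUMBERS), pick(CASES)

def filter_by_head(readings, head_tags):
    """Keep the readings that agree with the head; never discard every reading."""
    target = agreement(head_tags)
    kept = [r for r in readings if agreement(r[1]) == target]
    return kept or readings

head = ["n", "ma", "pl", "dat"]  # přítel<n><ma><pl><dat>
stary = [
    ("starý", ["adj", "ma", "sg", "ins"]),
    ("starý", ["adj", "ma", "pl", "dat"]),
    ("starý", ["adj", "mi", "sg", "ins"]),
    ("starý", ["adj", "mi", "pl", "dat"]),
    ("starý", ["adj", "f", "pl", "dat"]),
]
print(filter_by_head(stary, head))
```

Only the masculine animate, plural, dative reading of starý survives. Falling back to the original list when nothing agrees is a deliberate choice here, so that a word is never left without any analysis.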

Constraint grammar

TODO; find new examples

One way of writing rules is with a formalism called constraint grammar. Constraint grammar rules consist of two parts, an operation on a pattern and a context. For example:

  • select: Given a context, remove all the readings apart from the one(s) matched by the pattern.
  • remove: Given a context, remove the reading(s) that match the pattern.
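To make the difference between the two operations concrete, here is a minimal Python model of what select and remove do to the readings of one word once the context has already matched. It is a sketch under assumptions: readings are plain tuples of tags, the tag names (`gen`, `prp`, `acc`) are illustrative rather than taken from a real tagset, and the context test itself is left out.

```python
def matches(reading, pattern):
    """A reading matches when it carries every tag in the pattern."""
    return set(pattern) <= set(reading)

def select(readings, pattern):
    """SELECT: keep only the matching readings (no change if none match)."""
    kept = [r for r in readings if matches(r, pattern)]
    return kept or readings

def remove(readings, pattern):
    """REMOVE: drop the matching readings, but never the last remaining one."""
    kept = [r for r in readings if not matches(r, pattern)]
    return kept or readings

# Hypothetical readings for one ambiguous adjective form.
word = [("adj", "pl", "gen"), ("adj", "pl", "prp"), ("adj", "pl", "acc")]

print(select(word, ["gen"]))   # one reading left
print(remove(word, ["acc"]))   # two readings left
```

select is the more aggressive operation, since it discards everything that does not match, whereas remove only discards what it names and may leave the word ambiguous.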

A context can be any combination of words or tags in a given sentence. To get an idea of the kind of contexts that can be used, let's look at some real disambiguation rules. We're going to use the Russian phrase:

« Услугами детских садов пользуются 135 тысяч работающих матерей. »
Ambiguity #1

The word детских is ambiguous between genitive, prepositional and accusative. We want to select the genitive reading.

  • SELECT Gen IF (0C A) (*1C GEN BARRIER NPNHA);
    • SELECT Gen: Select genitive, IF
    • (0C A): The current word only has adjective readings.
    • (*1C GEN BARRIER NPNHA): Somewhere after the current word there is a word that can only be genitive. The search moves rightwards one word at a time and stops at the barrier, that is, at the first word that is neither a noun modifier nor an adverb.
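The scanning behaviour of a test such as (*1C GEN BARRIER NPNHA) can be modelled roughly as follows. This is a simplified sketch, not the CG implementation: it assumes the target is checked before the barrier at each position, it approximates NPNHA as "no adjective or adverb tag", and the tag names and readings are illustrative.

```python
def scan_right(cohorts, start, is_target, is_barrier):
    """Simplified model of (*1C target BARRIER barrier), evaluated at index `start`."""
    for cohort in cohorts[start + 1:]:
        if all(is_target(r) for r in cohort):   # C: every reading must match
            return True
        if any(is_barrier(r) for r in cohort):  # barrier reached, give up
            return False
    return False

def is_gen(reading):
    return "gen" in reading

def is_npnha(reading):
    # Assumed approximation of NPNHA: neither a noun modifier nor an adverb.
    return not ({"adj", "adv"} & set(reading))

phrase = [
    [("n", "pl", "ins")],                                                # Услугами
    [("adj", "pl", "gen"), ("adj", "pl", "prp"), ("adj", "pl", "acc")],  # детских
    [("n", "pl", "gen")],                                                # садов
]
print(scan_right(phrase, 1, is_gen, is_npnha))
```

Starting from детских (index 1), the scan immediately finds садов, which has only a genitive reading, so the context holds. If a word that counts as a barrier, such as a verb, stood between the two, the scan would stop and the rule would not apply.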
Ambiguity #2

The word работающих is ambiguous between genitive plural, prepositional plural, accusative plural. We want to select the genitive plural reading.

  • SELECT Gen IF (0C ACC-GEN-LOC) (*-1C Num LINK 1C Gen BARRIER NOTGEN);
    • SELECT Gen: Select genitive, IF
    • (0C ACC-GEN-LOC): The current word can only be accusative, genitive or locative.
    • (*-1C Num LINK 1C Gen BARRIER NOTGEN): Somewhere before the current word there is a word which can only be a numeral, and the word immediately after that numeral can only be genitive. The search moves towards the beginning of the sentence and stops at the first word that has no genitive reading.
Ambiguity #3

TODO; check, is this tagging right?

The word матерей is ambiguous between genitive plural and accusative plural. We want to select the genitive plural reading.

  • REMOVE Acc IF (0C ACC-OR-GEN) (*-1C Num LINK 1C Gen BARRIER NOTGEN);

Practice

In this practice session, we are going to run the tagger, discover some tagging errors and finally propose some disambiguation rules for solving these tagging errors.


Finding ambiguities

TODO; write about how to run the morphological analyser, and how to find frequent ambiguities

It is fairly easy to find the errors in the tagging. There are two:

  • The personal pronoun reading for je is selected instead of the verb.
  • The ordinal sedmý should agree with the noun planeta in number and case.

The rest of the words have been disambiguated correctly.


Conceiving rules

When thinking about rules, it is important to take into account the following:

  • The scope of application of the rule. Should it work just on the current word, on the current word plus one or two words of context on either side, at the level of the clause, or on the whole sentence?
  • Should the rule only apply to one lemma, or should it apply to any word that has the same part of speech?
  • The kind of ambiguity we want the rule to work on. For example, should it work on any ambiguity between nominative and accusative, or only on case ambiguities involving nouns?
  • There are often several rules which could disambiguate a sentence equally well, so it is important not to get caught up in finding the "perfect" one.

In the above example, we could consider the following rules:

  • If a word can only be an ordinal determiner and the following word is a noun with a single possible analysis, then make the determiner agree in gender, number and case with the noun.
  • If a word is je as an accusative pronoun or present tense third person singular form of the verb být, then remove the pronoun reading if there is a word in the sentence that is only nominative and agrees with the verb, and there is no other finite verb in the sentence.

For your chosen language pair, describe some rules that solve specific disambiguation problems you have found.

Constraint grammar

TODO; find new examples

If you have finished describing the rules, you can try coding them in constraint grammar, as in the Russian examples above. Below is a skeleton constraint grammar file that encodes the two rules from the previous example, followed by instructions on how to run it.

DELIMITERS = "<.>" "<!>" "<?>" ;
SOFT-DELIMITERS = "<,>" ;

LIST BOS = (>>>) ; # Beginning of sentence
LIST EOS = (<<<) ; # End of sentence

LIST GENDER = m f nt mf mi ma ;
LIST NUMBER = sg pl sp ;
LIST CASE = nom gen dat acc ins loc ;

SET V-FIN = (pres) | (past) ;

SECTION

# Rule 1
SELECT $$GENDER + $$NUMBER + $$CASE IF                  # choose a given gender/number/case combination if
              (0C (det ord))                            #   the current word is an ordinal
              (1C (n) + $$GENDER + $$NUMBER + $$CASE);  #   the following word is a noun in the same gender/number/case

# Rule 2
REMOVE (prn) IF                                         # remove a pronoun reading of je if
              (0 ("<je>"))                              #   the current surface form is 'je'
              (0 (prn acc) OR ("být" pres p3 sg))       #   the current word can be either an accusative pronoun or copula 
              ((-1C* (nom sg)) OR (1C* (nom sg)))       #   there is a nominative singular before or after
              (NOT -1* V-FIN) (NOT 1* V-FIN);           #   there is no other finite verb in the sentence

Copy this file into a text editor, and save it as rules.rlx. First we need to compile the rules:

$ cg-comp rules.rlx rules.bin
Sections: 1, Rules: 2, Sets: 29, Tags: 36

And now run them:

$ echo "Uran je sedmá planeta od Slunce." | lt-proc -a cs-pl.automorf.bin | cg-proc rules.bin
^Uran/Uran<n><mi><sg><nom>/Uran<n><mi><sg><acc>$ ^je/být<vbser><pres><p3><sg>$ ^sedmá/sedmý<det><ord><f><sg><nom>$ 
^planeta/planeta<n><f><sg><nom>$ ^od/od<pr>$ 
^Slunce/Slunce<n><nt><sg><voc>/Slunce<n><nt><sg><nom>/Slunce<n><nt><sg><acc>/Slunce<n><nt><sg><gen>/Slunce<n><nt><pl><voc>
       /Slunce<n><nt><pl><nom>/Slunce<n><nt><pl><acc>$^./.<sent>$
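Output in this format can also be inspected with a short script, which is one way to look for frequent ambiguities. The following Python sketch assumes that each lexical unit is written as ^surface/analysis/analysis$ and that there are no escaped slashes inside a unit; the names `parse_units` and `ambiguous` are made up for this example.

```python
import re

# One lexical unit: everything between ^ and $ (escaped characters are not handled).
UNIT = re.compile(r"\^(.*?)\$", re.S)

def parse_units(stream):
    """Yield (surface form, list of analyses) for each ^...$ unit."""
    for body in UNIT.findall(stream):
        surface, *analyses = body.split("/")
        yield surface, analyses

def ambiguous(stream):
    """Return (surface form, number of analyses) for every ambiguous unit."""
    return [(s, len(a)) for s, a in parse_units(stream) if len(a) > 1]

sample = ("^Uran/Uran<n><mi><sg><nom>/Uran<n><mi><sg><acc>$ "
          "^je/být<vbser><pres><p3><sg>$")
print(ambiguous(sample))  # [('Uran', 2)]
```

Running this over the output for a larger text and counting the surface forms that come up most often gives a rough list of candidates for new disambiguation rules.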

And in conjunction with the apertium-tagger:

$ echo "Uran je sedmá planeta od Slunce." | lt-proc -a cs-pl.automorf.bin | cg-proc rules.bin |\
   apertium-tagger -g cs-pl.prob
^Uran<n><mi><sg><nom>$ ^být<vbser><pres><p3><sg>$ ^sedmý<det><ord><f><sg><nom>$ 
^planeta<n><f><sg><nom>$ ^od<pr>$ ^Slunce<n><nt><sg><gen>$^.<sent>$

to get a fully disambiguated sentence.

Further reading

  • van Halteren, H. (1999) Syntactic wordclass tagging (Dordrecht: Kluwer)