Difference between revisions of "Курсы машинного перевода для языков России/Session 3"

Revision as of 17:51, 18 December 2011

The aim of this session is to give an overview of the issue of morphological ambiguity, and describe how it is treated in Constraint Grammar. We will give some theory regarding different types of morphological ambiguity, and an overview of how they are dealt with using rules. The practice section will involve discovering tagging errors and trying to solve the errors with rules.

Theory

TODO; find new examples

The ambiguity that we are going to cover in this session is morphological ambiguity. This is the ambiguity that arises when a surface form has more than one possible morphological analysis (also referred to as homonymy, literally "same-name-ness"). For example, the Italian word pubblico can be:

  • The verb "pubblicare" in the present indicative tense, first person singular.
  • The masculine noun "pubblico" in the singular number.
  • The adjective "pubblico" inflected for masculine, singular.

The fact that pubblico as a noun can be translated into German as Öffentlichkeit, Publikum, Zuschauer, etc. is translational ambiguity, not morphological ambiguity, and is therefore not treated in this session.

Morphological ambiguity

There are two principal types of morphological ambiguity: ambiguity between parts of speech (for example, a word that could be either a noun or a verb) and ambiguity within a part of speech (for example, a word form that can only be a noun, but may be nominative or genitive). Typically, the more complex the morphology of a language, the higher the ratio of within part-of-speech ambiguity to between part-of-speech ambiguity.
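The distinction between the two types can be made concrete with a minimal Python sketch. It assumes each analysis is a lemma plus a tuple of tags whose first tag is the part of speech; the tag names are illustrative, Apertium-style labels rather than the contents of a real dictionary, and `ambiguity_type` is a hypothetical helper.

```python
# Each analysis is (lemma, tags); the first tag is the part of speech.
# Tag names are illustrative assumptions, not taken from a real dictionary.
ANALYSES = {
    "pubblico": [
        ("pubblicare", ("vblex", "pri", "p1", "sg")),  # verb, present indicative, 1st sg
        ("pubblico", ("n", "m", "sg")),                # masculine singular noun
        ("pubblico", ("adj", "m", "sg")),              # masculine singular adjective
    ],
    "temps": [
        ("temps", ("n", "m", "sg")),
        ("temps", ("n", "m", "pl")),
    ],
}


def ambiguity_type(analyses):
    """Classify a list of analyses as unambiguous, between-POS or within-POS."""
    if len(analyses) < 2:
        return "unambiguous"
    parts_of_speech = {tags[0] for _lemma, tags in analyses}
    return "between" if len(parts_of_speech) > 1 else "within"


for form, analyses in ANALYSES.items():
    print(form, ambiguity_type(analyses))
```

A form that shows both kinds of ambiguity at once is reported as "between" by this sketch; a fuller version would report both.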

Between parts-of-speech

TODO; find new examples

An example of ambiguity between parts of speech is given above: the word pubblico can be a noun, a verb or an adjective. Consider also the frequent ambiguity between adjectives denoting ethnic groups (e.g. in Spanish francés, búlgaro, italiano) and the nouns denoting the corresponding languages (el francés, el búlgaro, el italiano).

Within parts-of-speech

TODO; find new examples

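For examples of ambiguity within parts of speech we can look at the Slavic languages, where there is frequent syncretism between nominative, accusative and genitive. Each sentence below means roughly "An example may be the condensation of water"; the final word is the ambiguous form.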

  • Příkladem může být kondenzace vody.(cs)
  • Príkladom môže byť kondenzácia vody.(sk)
  • Przykładem może być kondensacja wody.(pl)
  • Primer je lahko kondenzacija vode.(sl)

This is not, however, exclusive to Slavic languages. Other languages also exhibit limited within part-of-speech ambiguity: consider the French temps (ambiguous between singular and plural) and the Irish fear (ambiguous between nominative singular and genitive plural).

Rule-based disambiguation

TODO; find new examples

There are many ways of writing disambiguation rules, but the most important thing is to be able to express the rule in terms of the context that provides the disambiguation. While individual words can be very ambiguous, they are often disambiguated by their context. Take the Czech phrase našim starým přátelům: both našim and starým are quite ambiguous (four analyses for našim and seven for starým), but the head of the phrase, přátelům, has only one analysis and disambiguates them.

  • Náš<det><pos><ma><pl><dat> starý<adj><ma><sg><ins> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><ma><pl><dat> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><mi><sg><ins> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><mi><pl><dat> přítel<n><ma><pl><dat>
  • Náš<det><pos><ma><pl><dat> starý<adj><f><pl><dat> přítel<n><ma><pl><dat>
  • ...

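We could thus conceive of writing a rule which removes analyses that do not agree with the head of the phrase. In the list above, only the second analysis of starý (<adj><ma><pl><dat>) agrees with the head in gender, number and case.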
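The idea of such an agreement rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: analyses are written as tuples whose last three tags are gender, number and case, only the five analyses of starým shown above are used (the remaining ones are elided in the list), and `remove_disagreeing` is a hypothetical helper, not part of any Apertium tool.

```python
# The analyses of "starým" shown in the list above (the rest are elided there).
STARYM = [
    ("adj", "ma", "sg", "ins"),
    ("adj", "ma", "pl", "dat"),
    ("adj", "mi", "sg", "ins"),
    ("adj", "mi", "pl", "dat"),
    ("adj", "f", "pl", "dat"),
]
# The single analysis of the head noun "přátelům".
PRATELUM = ("n", "ma", "pl", "dat")


def agreement(tags):
    """Return the (gender, number, case) part of an analysis."""
    return tags[-3:]


def remove_disagreeing(modifier_analyses, head):
    """Keep only the modifier analyses that agree with the head."""
    kept = [a for a in modifier_analyses if agreement(a) == agreement(head)]
    # Assumption: if nothing agrees, leave the word untouched rather than
    # deleting every analysis.
    return kept or modifier_analyses


print(remove_disagreeing(STARYM, PRATELUM))
# prints [('adj', 'ma', 'pl', 'dat')]
```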

Constraint grammar

TODO; find new examples

One way of writing rules is with a formalism called constraint grammar. A constraint grammar rule consists of two parts: an operation on a pattern, and a context in which the operation applies. The two operations used in this session are:

  • select: Given a context, remove all the readings apart from the one(s) matched by the pattern.
  • remove: Given a context, remove the reading(s) that match the pattern.
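The effect of the two operations on the readings of a single word can be sketched in Python as follows. The sketch leaves out the context test entirely, uses hypothetical readings, and assumes the usual safeguard that a rule never leaves a word with no readings at all; `matches`, `select` and `remove` are illustrative names only.

```python
def matches(reading, pattern):
    """A reading matches when it contains every tag in the pattern."""
    return set(pattern) <= set(reading)


def select(readings, pattern):
    """SELECT: keep only the readings that match the pattern."""
    kept = [r for r in readings if matches(r, pattern)]
    return kept or readings  # nothing matches: leave the word unchanged


def remove(readings, pattern):
    """REMOVE: drop the readings that match the pattern."""
    kept = [r for r in readings if not matches(r, pattern)]
    return kept or readings  # assumption: never remove the last reading


# Hypothetical readings of a noun that is ambiguous between two cases.
readings = [("n", "mi", "pl", "gen"), ("n", "mi", "pl", "acc")]
print(select(readings, ("gen",)))
print(remove(readings, ("acc",)))
```

With only two readings the two calls give the same result; they differ as soon as a word has three or more readings, because select discards everything that does not match while remove discards only what does.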

A context can be any combination of words or tags in a given sentence. To get an idea of the kind of contexts that can be used, let's look at some real disambiguation rules. We're going to use the Polish sentence Z usług żłobków korzysta 135 tysięcy pracujących matek (roughly, "135 thousand working mothers use the services of nurseries").

Ambiguity #1

The word żłobków is ambiguous between genitive and accusative. We want to select the genitive reading.

  • SELECT Gen IF (*-1 GENPREP BARRIER NOTGEN OR V OR Pr) (NOT 0 V-FIN);
    • SELECT Gen: Select the genitive reading, IF
    • (*-1 GENPREP BARRIER NOTGEN OR V OR Pr): Somewhere before the current word there is a preposition which governs the genitive. The search moves towards the beginning of the sentence and is abandoned when it reaches a word that does not contain the genitive tag, or that is a verb or a preposition.
    • (NOT 0 V-FIN): The current word cannot be a finite verb.
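
The leftward search with a barrier can be sketched in Python. Everything here is a simplification made for illustration: the readings are hypothetical, the tag names ("pr", "govgen", "vfin" and so on) stand in for the sets GENPREP, NOTGEN, V, Pr and V-FIN, which are defined elsewhere in the real grammar, and the sketch assumes that a word is tested as the target before it is tested as a barrier (otherwise the preposition itself would block the search).

```python
def has(cohort, tag):
    """True if at least one reading of the word carries the tag."""
    return any(tag in reading for reading in cohort)


def is_genprep(cohort):
    """Stand-in for GENPREP: a preposition reading that governs the genitive."""
    return any("pr" in r and "govgen" in r for r in cohort)


def is_barrier(cohort):
    """Stand-in for NOTGEN OR V OR Pr."""
    return not has(cohort, "gen") or has(cohort, "v") or has(cohort, "pr")


def scan_left(sentence, position, is_target, barrier):
    """Sketch of (*-1 TARGET BARRIER B): search leftwards from position - 1."""
    for i in range(position - 1, -1, -1):
        if is_target(sentence[i]):
            return True   # found what we were looking for
        if barrier(sentence[i]):
            return False  # the barrier stops the search
    return False


def select_genitive(sentence, position):
    """Apply the rule to one word and return its remaining readings."""
    cohort = sentence[position]
    if not has(cohort, "gen"):
        return cohort
    if scan_left(sentence, position, is_genprep, is_barrier) and not has(cohort, "vfin"):
        return [r for r in cohort if "gen" in r]
    return cohort


# Hypothetical, simplified readings for "Z usług żłobków".
sentence = [
    [("pr", "govgen")],                                    # Z
    [("n", "f", "pl", "gen")],                             # usług
    [("n", "mi", "pl", "gen"), ("n", "mi", "pl", "acc")],  # żłobków
]
print(select_genitive(sentence, 2))
```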
Ambiguity #2

The word pracujących is ambiguous between masculine (genitive plural, locative plural, accusative plural), feminine (genitive plural, locative plural) and neuter (genitive plural, locative plural). We want to select the feminine (genitive plural) reading.

  • SELECT Gen IF (0C ACC-GEN-LOC) (*-1C Num LINK 1C Gen BARRIER NOTGEN);
    • SELECT Gen: Select the genitive reading, IF
    • (0C ACC-GEN-LOC): The current word can only be accusative, genitive or locative.
    • (*-1C Num LINK 1C Gen BARRIER NOTGEN): Somewhere before the current word there is a word which can only be a numeral, and it is followed by a word which can only be genitive. The search moves towards the beginning of the sentence and is abandoned when it reaches a word that does not contain the genitive tag.
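
The difference between a plain test and a careful (C) test, and the way LINK chains a second test onto the word that was found, can be sketched in Python. The sketch follows the prose explanation of the rule given above rather than the exact semantics of any particular constraint grammar implementation, and the readings are hypothetical: in particular it assumes that tysięcy is analysed as an unambiguously genitive noun and 135 as an unambiguous numeral.

```python
CASES = ("acc", "gen", "loc")


def may_be(cohort, tag):
    """Plain test (e.g. 0 Gen): some reading carries the tag."""
    return any(tag in reading for reading in cohort)


def must_be(cohort, tag):
    """Careful test (e.g. 0C Gen): every reading carries the tag."""
    return all(tag in reading for reading in cohort)


def only_acc_gen_loc(cohort):
    """Stand-in for (0C ACC-GEN-LOC): every reading is one of the three cases."""
    return all(any(case in reading for case in CASES) for reading in cohort)


def numeral_then_genitive(sentence, position):
    """Sketch of (*-1C Num LINK 1C Gen BARRIER NOTGEN)."""
    for i in range(position - 1, -1, -1):
        if must_be(sentence[i], "num"):
            # LINK: test the word immediately to the right of the numeral.
            return must_be(sentence[i + 1], "gen")
        if not may_be(sentence[i], "gen"):
            return False  # NOTGEN barrier stops the search
    return False


def select_genitive(sentence, position):
    """Apply the rule to one word and return its remaining readings."""
    cohort = sentence[position]
    if only_acc_gen_loc(cohort) and numeral_then_genitive(sentence, position):
        return [r for r in cohort if "gen" in r]
    return cohort


# Hypothetical, simplified readings for "135 tysięcy pracujących".
sentence = [
    [("num",)],                 # 135
    [("n", "mi", "pl", "gen")], # tysięcy
    [                           # pracujących
        ("adj", "ma", "pl", "gen"), ("adj", "ma", "pl", "loc"), ("adj", "ma", "pl", "acc"),
        ("adj", "f", "pl", "gen"), ("adj", "f", "pl", "loc"),
        ("adj", "nt", "pl", "gen"), ("adj", "nt", "pl", "loc"),
    ],
]
print(select_genitive(sentence, 2))
```

In this sketch the rule settles the case of pracujących but leaves its gender ambiguous; choosing the feminine reading would need a further rule, for example one based on agreement with matek.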

Practice