Курсы машинного перевода для языков России/Session 3
Latest revision as of 19:45, 18 November 2013
The aim of this session is to give an overview of the issue of morphological ambiguity, and describe how it is treated in Constraint Grammar. We will give some theory regarding different types of morphological ambiguity, and an overview of how they are dealt with using rules. The practice section will involve discovering tagging errors and trying to solve the errors with rules.
Theory
The ambiguity that we are going to cover in this session is morphological ambiguity. This is the ambiguity that arises when a surface form has more than one possible morphological analysis (also referred to as homonymy, literally "same-name-ness"). For example, the Chuvash word чалка can be:
- The noun "чалка" in the singular number and nominative case.
- The adjective "чалка".
- The verb "чалка" in imperative, 2nd person, singular.
The translational ambiguity of чалка, which as a verb can be translated into Russian as кричать, стрекотать, верещать, галдеть, etc., is not a matter of morphological ambiguity and is therefore not treated in this session.
Morphological ambiguity
There are two principal types of morphological ambiguity: ambiguity between parts of speech (for example, a word that could be either a noun or a verb) and ambiguity within a part of speech (for example, a word form that can only be a noun, but may be nominative or genitive). Typically, the more complex the morphology of a language, the higher the ratio of within part-of-speech ambiguity to between part-of-speech ambiguity.
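The distinction between the two types can be sketched in a few lines of Python. This is only an illustration: the representation of a reading as a `(lemma, tags)` pair with the part of speech as the first tag, the function name `ambiguity_type` and the tag names used for чалка are all assumptions made for this sketch, not part of any Apertium tool.

```python
# Assumed representation: a reading is (lemma, tags) and tags[0] is the part of speech.
def ambiguity_type(readings):
    """Classify a list of readings as 'none', 'between' or 'within' parts of speech."""
    if len(readings) < 2:
        return "none"
    parts_of_speech = {tags[0] for _, tags in readings}
    # More than one part of speech counts as 'between', even if case also varies.
    return "between" if len(parts_of_speech) > 1 else "within"

# Chuvash чалка: noun, adjective or verb (tag names are hypothetical).
chalka = [
    ("чалка", ("n", "sg", "nom")),
    ("чалка", ("adj",)),
    ("чалка", ("v", "imp", "p2", "sg")),
]

# Russian Солнца: always a noun, but ambiguous in number and case.
solntsa = [
    ("Солнце", ("n", "nt", "nn", "sg", "gen")),
    ("Солнце", ("n", "nt", "nn", "pl", "acc")),
    ("Солнце", ("n", "nt", "nn", "pl", "nom")),
]

print(ambiguity_type(chalka))   # between
print(ambiguity_type(solntsa))  # within
```

A word form can show both kinds of ambiguity at once; this sketch simply reports "between" in that case.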
Between parts-of-speech
An example of ambiguity between parts-of-speech is given above: the word чалка can be a noun, a verb or an adjective. Consider also the frequent ambiguity between adjectives denoting ethnic groups (e.g. in Russian русский) or professions (e.g. in Russian военный) and the nouns denoting the languages or people (русский) or professions (военный).
In Turkic and Uralic languages, nouns and verbs often share the same suffixes (e.g. Finnish -n, -t, -i). Similar stems (accidental homonymy) thus show up in several parts of the paradigm, e.g. Finnish tule- "to come" and tuli "fire": tuli = "fire" Nom Sg or "s/he came"; tulet = "the fires" or "you come"; tulen = "of the fire" or "I come".
Derivational processes may also overlap with inflection. Thus, in Finnish, the plural of the present participle (hence adjective) and the 3rd person plural present are always identical: laulavat = "they sing" / "the singing (ones)", tulevat = "they come" / "the coming (ones)", cf. Laulavat baritonit laulavat usein ("the singing baritones often sing").
In Kyrgyz, some examples are ак "white" and ак "to flow/run (of liquids)", with ambiguous forms like акты "the white one (acc.)" and акты "it flowed". Another example is Kazakh жай, with the following meanings: as an adjective "slow, simple, quiet, late"; as an adverb "slowly, simply, quietly"; as a noun "lightning", "reason, condition", "habitation", "bow (weapon)"; as a verb "hang out to dry / lay out / spread out (развешивать, расстилать)", "bring animals to pasture".
Within parts-of-speech
For examples of ambiguity within parts-of-speech we can look at the Slavic languages, where there is frequent syncretism between nominative, accusative and genitive. In each of the sentences below ("An example might be the condensation of water"), the final word is a genitive singular, but the same form could also be a nominative or accusative plural.
- Примером может быть конденсация воды. (ru)
- Príkladom môže byť kondenzácia vody. (sk)
- Przykładem może być kondensacja wody. (pl)
- Primer je lahko kondenzacija vode. (sl)
This is not, however, exclusive to Slavic languages. Turkic and Finno-Ugric languages also exhibit limited within part-of-speech ambiguity: consider the Chuvash итлĕр (ambiguous between future and imperative) and the Finnish voivat ("they can, they could"; this present/past ambiguity holds for all monosyllabic i-final verbs, for example voida "can", soida "ring", naida "fuck", etc.).
In Komi, there is a systematic ambiguity between 1st and 3rd person singular in the past tense. In the sentence below, the form кывлі can be 1st or 3rd person.
- Ме кывлі, тэ пӧ уджалан вузасянінын. (kv)
- "I heard that you work in a shop." (Russian: Я слышала, что ты работаешь в магазине.)
In Kazakh, there is also a systematic ambiguity between singular and plural in all 3rd person verb forms. For example, "барады" could mean "s/he/it goes" or "they go".
Syntactic ambiguity
Another form of ambiguity is syntactic ambiguity, where a sentence or phrase has more than one possible interpretation that cannot be resolved morphologically. An example might be prepositional phrase attachment, where it is not necessarily clear from the morphology to which constituent the prepositional phrase attaches. The same problem arises with relative clauses. Consider the following example, where the relative clause can attach to either noun:
- Сегодня я говорил с подругой Анны, с которой я познакомился вчера. (met Anna yesterday)
- Сегодня я говорил с подругой Анны, с которой я познакомился вчера. (met the friend of Anna yesterday)
A problem arises for machine translation when a sentence or phrase is ambiguous in the source language, but not in the target language. If the ambiguity is preserved in the target language, it is less of a problem.
For example for the above sentence, the ambiguity is retained in the majority of western-European languages (e.g. Indo-European), but for XXX the two different readings require different translations:
- xxx
- yyy
Rule-based disambiguation
There are many ways of writing disambiguation rules, but the most important thing is to be able to express the rule in terms of the context that provides the disambiguation. While individual words can be very ambiguous, they are often disambiguated by context. Take the Russian phrase « нашим старым преподавателям »: both нашим and старым are quite ambiguous (three analyses each), but the head of the phrase, преподавателям, which has only one analysis, disambiguates them.
- наш<det><pos><m><sg><ins> старый<adj><m><sg><ins> преподаватель<n><m><aa><pl><dat>
- наш<det><pos><nt><sg><ins> старый<adj><m><sg><ins> преподаватель<n><m><aa><pl><dat>
- наш<det><pos><mfn><pl><dat> старый<adj><m><sg><ins> преподаватель<n><m><aa><pl><dat>
- наш<det><pos><m><sg><ins> старый<adj><mfn><pl><dat> преподаватель<n><m><aa><pl><dat>
- наш<det><pos><nt><sg><ins> старый<adj><mfn><pl><dat> преподаватель<n><m><aa><pl><dat>
- наш<det><pos><mfn><pl><dat> старый<adj><mfn><pl><dat> преподаватель<n><m><aa><pl><dat>
- ...
Only the last combination listed, in which all three words are dative plural, agrees with the head. We could thus conceive of writing a rule which removes the analyses that do not agree with the head of the phrase.
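The idea of such an agreement rule can be sketched in Python. This is a minimal illustration, not how any Apertium module implements it: the helper names `feature` and `filter_by_head` are hypothetical, readings are assumed to be plain tuples of tags, and the third reading of старым (neuter singular instrumental) is filled in from the statement above that each modifier has three analyses.

```python
NUMBER = {"sg", "pl"}
CASE = {"nom", "gen", "dat", "acc", "ins", "prp"}

def feature(tags, values):
    """Return the first tag in `tags` that belongs to `values`, or None."""
    return next((tag for tag in tags if tag in values), None)

def filter_by_head(modifier_readings, head_tags):
    """Keep the modifier readings that agree with the head in number and case."""
    kept = [
        reading for reading in modifier_readings
        if feature(reading, NUMBER) == feature(head_tags, NUMBER)
        and feature(reading, CASE) == feature(head_tags, CASE)
    ]
    # Never leave a word without any reading at all.
    return kept or modifier_readings

nashim = [
    ("det", "pos", "m", "sg", "ins"),
    ("det", "pos", "nt", "sg", "ins"),
    ("det", "pos", "mfn", "pl", "dat"),
]
starym = [
    ("adj", "m", "sg", "ins"),
    ("adj", "nt", "sg", "ins"),   # assumed third analysis
    ("adj", "mfn", "pl", "dat"),
]
prepodavatelyam = ("n", "m", "aa", "pl", "dat")  # the unambiguous head

print(filter_by_head(nashim, prepodavatelyam))  # [('det', 'pos', 'mfn', 'pl', 'dat')]
print(filter_by_head(starym, prepodavatelyam))  # [('adj', 'mfn', 'pl', 'dat')]
```

Gender is deliberately ignored here because the plural readings carry the underspecified tag mfn; a fuller sketch would have to treat mfn as compatible with any gender.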
Constraint grammar
One way of writing rules is with a formalism called constraint grammar. Constraint grammar rules consist of two parts: an operation on a pattern, and a context. The following are examples of operations:
- select: Given a context, remove all the readings apart from the one(s) matched by the pattern.
- remove: Given a context, remove the reading(s) that match the pattern.
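The effect of the two operations on the readings of a single word can be sketched in Python as follows. The functions `select` and `remove` are hypothetical helpers written for this illustration; they ignore the context part of a rule entirely and assume that a rule never deletes the last remaining reading of a word.

```python
def select(readings, matches):
    """SELECT: keep only the readings matching the pattern.

    If no reading matches, the word is left unchanged.
    """
    kept = [reading for reading in readings if matches(reading)]
    return kept if kept else readings

def remove(readings, matches):
    """REMOVE: drop the readings matching the pattern.

    The last remaining reading is never removed (an assumption of this sketch).
    """
    kept = [reading for reading in readings if not matches(reading)]
    return kept if kept else readings

def is_gen(reading):
    return "gen" in reading

def is_acc(reading):
    return "acc" in reading

# Three readings of an adjective form that is ambiguous in case (assumed tags).
detskikh = [("adj", "pl", "gen"), ("adj", "pl", "prp"), ("adj", "pl", "acc")]

print(select(detskikh, is_gen))  # [('adj', 'pl', 'gen')]
print(remove(detskikh, is_acc))  # [('adj', 'pl', 'gen'), ('adj', 'pl', 'prp')]
```

SELECT is the stronger operation: one successful rule fully disambiguates the word with respect to the pattern, whereas REMOVE only narrows the set of readings and is therefore the safer choice when the context is less reliable.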
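A context can be any combination of words or tags in a given sentence. To get an idea of the kind of contexts that can be used, let's look at some real disambiguation rules. We're going to use the Russian phrase:
« Услугами детских садов пользуются 135 тысяч работающих матерей. » ("135 thousand working mothers use the services of kindergartens.")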
- Ambiguity #1
The word детских is ambiguous between genitive, prepositional and accusative. We want to select the genitive reading.
SELECT Gen IF (0C A) (*1C GEN BARRIER NPNHA);
- SELECT Gen: Select genitive, IF
- (0C A): The current word only has adjective readings. (C means "careful", 0 means the same position as the target word, and A refers to a set, which should be defined earlier in the file.)
- (*1C GEN BARRIER NPNHA): Somewhere after the current word there is a word which can only be genitive. The search moves rightwards through the following words and stops at the barrier, that is, at the first word that is neither an adverb nor a word which can modify a noun.
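A scanning context such as `(*1C GEN BARRIER NPNHA)` can be sketched in Python. This is an approximation of the behaviour described above, not the actual CG implementation: the function name `scan_right_careful`, the simplified tag sets for the example words and the definition of the barrier as "no adj, det or adv tag" are all assumptions of this sketch.

```python
def scan_right_careful(sentence, position, is_target, is_barrier):
    """Approximate a '*1C TARGET BARRIER B' test starting after `position`.

    `sentence` is a list of words, each word a list of readings (sets of tags).
    """
    for readings in sentence[position + 1:]:
        # Careful match: every reading of the word must match the target.
        if all(is_target(reading) for reading in readings):
            return True
        # A barrier word ends the search without success.
        if any(is_barrier(reading) for reading in readings):
            return False
    return False

def is_gen(reading):
    return "gen" in reading

def is_not_modifier(reading):
    """Stand-in for NPNHA: anything that is not a noun modifier or an adverb."""
    return not ({"adj", "det", "adv"} & reading)

# Simplified readings for the start of the example sentence (assumed tags).
sentence = [
    [{"n", "f", "pl", "ins"}],                                           # Услугами
    [{"adj", "pl", "gen"}, {"adj", "pl", "prp"}, {"adj", "pl", "acc"}],  # детских
    [{"n", "m", "pl", "gen"}],                                           # садов
    [{"v", "pres", "p3", "pl"}],                                         # пользуются
]

print(scan_right_careful(sentence, 1, is_gen, is_not_modifier))  # True
print(scan_right_careful(sentence, 2, is_gen, is_not_modifier))  # False
```

From детских the scan immediately finds the unambiguous genitive садов, so the context holds. From садов the next word is a verb, which acts as a barrier before any genitive is found.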
- Ambiguity #2
The word работающих is ambiguous between genitive plural, prepositional plural, accusative plural. We want to select the genitive plural reading.
SELECT Gen IF (0C ACC-GEN-PRP) (*-1C Num LINK 1C Gen BARRIER NOTGEN);
- SELECT Gen: Select genitive, IF
- (0C ACC-GEN-PRP): The current word can only be accusative, genitive or prepositional.
- (*-1C Num LINK 1C Gen BARRIER NOTGEN): Before the current word there is a word which can only be a numeral, directly followed by a word which can only be genitive. The search moves towards the beginning of the sentence and stops at the first word that does not contain the genitive tag.
- Ambiguity #3
The word матерей is ambiguous between genitive plural and accusative plural. We want to select the genitive plural reading (or better, remove the accusative).
REMOVE Acc IF (0C ACC-OR-GEN) (*-1C Num LINK 1C Gen BARRIER NOTGEN);
- REMOVE Acc: Remove accusative, IF
- (0C ACC-OR-GEN): The current word can only be accusative or genitive.
- (*-1C Num LINK 1C Gen BARRIER NOTGEN): There is a previous numeral which has a word which can only be genitive after it (e.g. тысяч), and we stop searching when we find a word that is not in the genitive (BARRIER NOTGEN).
Practice
In this practice session, we are going to run the tagger, discover some tagging errors and finally propose some disambiguation rules for solving these tagging errors.
Running the morphological analyser
Let's try running the morphological analyser on what seems like a fairly simple phrase:
$ echo "Уран — седьмая по удалённости от Солнца." | lt-proc ru-kv.automorf.bin
^Уран/Уран<np><top><m><sg><acc>/Уран<np><top><m><sg><nom>$
^—/—<guio>$ ^седьмая/седьмой<det><ord><f><sg><nom>$
^по/по<pr>$
^удалённости/удалённость<n><f><nn><sg><gen>/удалённость<n><f><nn><sg><dat>/удалённость<n><f><nn><sg><prp>/удалённость<n><f><nn><pl><acc>/удалённость<n><f><nn><pl><nom>$
^от/от<pr>$
^Солнца/Солнце<n><nt><nn><sg><gen>/Солнце<n><nt><nn><pl><acc>/Солнце<n><nt><nn><pl><nom>$
^./.<sent>$
What seemed like a fairly simple phrase is not: three of its words have more than one analysis.
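To find the ambiguous words in a longer text, the analyser output can be processed with a short script. The following Python sketch is one possible approach under simplifying assumptions: `parse_stream` is a hypothetical helper, not part of Apertium, and it does not handle escaped characters such as `\/` or `\$` that may appear in real output.

```python
import re

def parse_stream(text):
    """Split analyser output into (surface form, list of readings) pairs.

    Simplification: escaped special characters are not handled.
    """
    words = []
    for unit in re.findall(r"\^(.*?)\$", text):
        surface, *readings = unit.split("/")
        words.append((surface, readings))
    return words

output = (
    "^Уран/Уран<np><top><m><sg><acc>/Уран<np><top><m><sg><nom>$ "
    "^—/—<guio>$ ^седьмая/седьмой<det><ord><f><sg><nom>$ ^по/по<pr>$ "
    "^удалённости/удалённость<n><f><nn><sg><gen>/удалённость<n><f><nn><sg><dat>"
    "/удалённость<n><f><nn><sg><prp>/удалённость<n><f><nn><pl><acc>"
    "/удалённость<n><f><nn><pl><nom>$ ^от/от<pr>$ "
    "^Солнца/Солнце<n><nt><nn><sg><gen>/Солнце<n><nt><nn><pl><acc>"
    "/Солнце<n><nt><nn><pl><nom>$ ^./.<sent>$"
)

# List every word with more than one reading, together with the number of readings.
ambiguous = [(surface, len(readings))
             for surface, readings in parse_stream(output)
             if len(readings) > 1]
print(ambiguous)  # [('Уран', 2), ('удалённости', 5), ('Солнца', 3)]
```

Run over a whole corpus, the same counts could be aggregated per surface form to find the most frequent ambiguities, which are usually the most rewarding ones to write rules for.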
Finding errors
Apertium comes with a statistical tagger, which can be run as follows:
$ echo "Уран — седьмая по удалённости от Солнца." | lt-proc ru-kv.automorf.bin | apertium-tagger -p -g ru-kv.prob
^Уран/Уран<np><top><m><sg><acc>$
^—/—<guio>$
^седьмая/седьмой<det><ord><f><sg><nom>$
^по/по<pr>$ ^удалённости/удалённость<n><f><nn><sg><gen>$
^от/от<pr>$ ^Солнца/Солнце<n><nt><nn><sg><gen>$^./.<sent>$
It is fairly easy to find the errors in the tagging. There are two:
- "Уран" should be nominative, not accusative.
- After the preposition "по", the case of "удалённость" should be dative, not genitive.
The rest of the words have been disambiguated correctly.
Conceiving rules
When thinking about rules, it is important to take into account the following:
- The scope of application of the rule. Should it work just on the current word, on the current word and one or two words of context either side, at the level of the clause, or on the whole sentence?
- Should the rule only apply to one lemma, or should it apply to any word that has the same part of speech?
- The kind of ambiguity we want the rule to work on. For example, should it work on any ambiguity between nominative and accusative, or only on case ambiguities involving nouns?
- There are often several rules which could equally well disambiguate a sentence. It is important not to get caught up in finding the "perfect" one.
In the above example, we could consider the following rules:
- If a word can only be nominative or accusative, and it is followed by a dash (—) and then a word that can only be nominative, then remove the accusative reading, leaving the nominative.
- After the preposition "по" remove any genitive readings.
For your chosen language pair, describe some rules that solve specific disambiguation problems you have found.
Constraint grammar
If you have finished with describing the rules, you can try coding them in constraint grammar, as in the Russian example above. Below is a skeleton constraint grammar file to encode the two rules from the previous example, and instructions on how to run it.
DELIMITERS = "<.>" "<!>" "<?>" ;
SOFT-DELIMITERS = "<,>" ;

LIST BOS = (>>>) ;   # Beginning of sentence
LIST EOS = (<<<) ;   # End of sentence

LIST Hyphen = guio ;
LIST Nom = nom ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Not-Gen-Prep = "по" ;

SET Acc-Or-Nom = Acc | Nom ;

SECTION

# Rule 1
REMOVE Acc IF                  # Remove accusative reading if,
    (0C Acc-Or-Nom)            # the current word is only accusative or nominative
    (1C Hyphen LINK 1 Nom) ;   # there is a hyphen directly to the right, with a nominative following

# Rule 2
REMOVE Gen IF                  # Remove a genitive reading if,
    (-1C Not-Gen-Prep) ;       # it is preceded by a preposition which never governs the genitive
Copy this file into a text editor, and save it as rules.rlx. First we need to compile the rules:
$ cg-comp rules.rlx rules.bin
Sections: 1, Rules: 2, Sets: 17, Tags: 18
And now run them:
$ echo "Уран — седьмая по удалённости от Солнца." | lt-proc ru-kv.automorf.bin | cg-proc rules.bin
^Уран/Уран<np><top><m><sg><nom>$ ^—/—<guio>$ ^седьмая/седьмой<det><ord><f><sg><nom>$
^по/по<pr>$
^удалённости/удалённость<n><f><nn><pl><nom>/удалённость<n><f><nn><sg><dat>/удалённость<n><f><nn><sg><prp>/удалённость<n><f><nn><pl><acc>$
^от/от<pr>$ ^Солнца/Солнце<n><nt><nn><sg><gen>/Солнце<n><nt><nn><pl><acc>/Солнце<n><nt><nn><pl><nom>$
^./.<sent>$
And in conjunction with the apertium-tagger:
$ echo "Уран — седьмая по удалённости от Солнца." | lt-proc ru-kv.automorf.bin | cg-proc rules.bin |\
  apertium-tagger -g ru-kv.prob
^Уран<np><top><m><sg><nom>$ ^—<guio>$ ^седьмой<det><ord><f><sg><nom>$ ^по<pr>$
^удалённость<n><f><nn><pl><nom>$ ^от<pr>$ ^Солнце<n><nt><nn><sg><gen>$^.<sent>$
We can see that although we've removed the incorrect genitive, now we get an incorrect plural nominative reading. As "по" never governs the nominative, we can remove that too, using a similar rule to the second one.
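The combined effect of the second rule and the additional nominative rule can be simulated in Python before writing the rule in constraint grammar. This is only a sketch of the logic: `remove_after` is a hypothetical helper working on plain reading strings, and it is not a substitute for running cg-proc.

```python
def remove_after(words, lemma, tag):
    """Drop readings containing `tag` from any word directly preceded by a word
    whose readings all have the given lemma (a careful -1C style test)."""
    result = []
    for index, readings in enumerate(words):
        previous = words[index - 1] if index > 0 else []
        preceded = bool(previous) and all(
            reading.split("<")[0] == lemma for reading in previous
        )
        if preceded:
            kept = [reading for reading in readings if f"<{tag}>" not in reading]
            readings = kept or readings  # never remove the last reading
        result.append(readings)
    return result

words = [
    ["по<pr>"],
    [
        "удалённость<n><f><nn><sg><gen>",
        "удалённость<n><f><nn><sg><dat>",
        "удалённость<n><f><nn><sg><prp>",
        "удалённость<n><f><nn><pl><acc>",
        "удалённость<n><f><nn><pl><nom>",
    ],
]

without_gen = remove_after(words, "по", "gen")          # Rule 2
without_nom = remove_after(without_gen, "по", "nom")    # the additional rule
print(without_nom[1])
```

After both removals the word still has three readings (dative singular, prepositional singular and accusative plural), so the final choice between them is left to the statistical tagger or to further rules.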
Further reading
- van Halteren, H. (1999) Syntactic wordclass tagging (Dordrecht: Kluwer)