Difference between revisions of "Marathi-Hindi Developer Documentation"

From Apertium

Revision as of 05:06, 28 January 2018

This is documentation of significant changes made to apertium-mar-hin (and the individual language modules apertium-mar and apertium-hin) since 20171128.

apertium-mar

Genitives

Consider the Marathi phrases:

  • त्याचा घोडा (tyacha ghoda) = his horse
  • त्याची गाय (tyachi gaay) = his cow
  • तिचा घोडा (ticha ghoda) = her horse
  • तिची गाय (tichi gaay) = her cow

The possessive determiners are affected by the gender of the possessor ('his' versus 'her') and also by the gender of the possessed: घोडा (ghoda) is grammatically masculine and गाय (gaay) is feminine. The analysis of each determiner must therefore have separate lemmas for the possessor part and the possessed part, so that both genders can be specified. Thus these are analyzed as:

  • ^त्याचा/तो<det><p3><dist><m><sg><obl>+च<gen><m><sg><nom>$ ^घोडा/घोडा<n><m><sg><nom>$
  • ^त्याची/तो<det><p3><dist><m><sg><obl>+च<gen><f><sg><nom>$ ^गाय/गाय<n><f><sg><nom>$
  • ^तिचा/तो<det><p3><dist><f><sg><obl>+च<gen><m><sg><nom>$ ^घोडा/घोडा<n><m><sg><nom>$
  • ^तिची/तो<det><p3><dist><f><sg><obl>+च<gen><f><sg><nom>$ ^गाय/गाय<n><f><sg><nom>$

The reason for the <obl> is explained in a section below.
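As an illustration of how the two joined lemmas carry the two genders, the following is a minimal Python sketch that splits one analyzed unit into its parts. The helper name `parse_analysis` is hypothetical and not part of Apertium, and the sketch assumes the unit has exactly one analysis (no remaining ambiguity).

```python
import re

# Matches one <tag> in the Apertium stream format.
TAG_RE = re.compile(r"<([^<>]+)>")

def parse_analysis(unit):
    """Split '^surface/analysis$' into (surface, [(lemma, tags), ...]).

    Hypothetical helper; assumes a single analysis per unit.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    surface, analysis = unit[1:-1].split("/", 1)
    parts = []
    # Joined lemmas are separated by '+'.
    for piece in analysis.split("+"):
        lemma = piece.split("<", 1)[0]
        parts.append((lemma, TAG_RE.findall(piece)))
    return surface, parts

surface, parts = parse_analysis(
    "^त्याची/तो<det><p3><dist><m><sg><obl>+च<gen><f><sg><nom>$"
)
possessor, marker = parts
print(possessor)  # lemma and tags of the possessor (masculine: 'his')
print(marker)     # lemma and tags of the genitive marker (feminine: agrees with गाय)
```

The possessor's gender sits on the first part and the gender of the possessed noun on the च<gen> part, which is why a single lemma could not express both.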

Note: The च<gen> lemma is the genitive marker for almost all words which can take a genitive: determiners like those above, nouns, postpositions, etc. The only exceptions are some determiners such as माझा (mazha = my (masc. sg.)), आपला (aapla = our (incl., masc. sg.)), etc. These are nevertheless analyzed as ^माझा/मी<det><p1><mf><sg><obl>+च<gen><m><sg><nom>$ etc. using the same lemma.

Another note: Marathi has three grammatical genders and two grammatical numbers, and the pattern above applies to all of them.

Analysis of pronouns and determiners

The distinction between pronouns and determiners is made according to the Universal Dependencies guidelines for pronouns and determiners. Roughly, words that could be replaced by nouns are considered pronouns, and words that could be replaced by adjectives are considered determiners. For example, this means that त्याचा (tyacha = his (masc. sg.)) is a determiner but त्याला (tyala = to him) is a pronoun.

Since the genitive stem च is analyzed as a separate lemma but joined to the previous word, for consistency other stems like ला above are also analyzed the same way. E.g. ^त्याला/तो<prn><p3><dist><m><sg><obl>+ला<dat>$. (Otherwise they would be analyzed as just regular inflections.) Another justification for this is that these case markers are occasionally present on their own, not attached to the previous word.

When there is a joined lemma, the pronoun/determiner itself is marked in the oblique (<obl>) case, because the oblique form can be followed by case markers, postpositions, nouns in a non-nominative case, etc.

Clitics

There is essentially only one clitic: च्या (chya). Most of the complexity in the monodix paradigms comes from breaking them up to cover all the use cases of the clitic.

Sometimes the clitic is compulsory. For example, the pronoun ती (ti = she) requires the clitic to make तिच्यात (tichyat = in her); तित is ungrammatical. The analyses are of the form ^तिच्यात/तो<prn><p3><dist><f><sg><obl>+च्या<clit>+त<loc>$. The clitic becomes झ्या (jhya) for some words, e.g. the pronoun मी (mi = I), but it is still analyzed as च्या (chya) for simplicity. Compulsory clitics need to be explicitly specified when translating to Marathi.

Sometimes the clitic is optional. For example, त्यात (tyat = in him) is just as valid as त्याच्यात (tyachyat = in him). The analyses are ^त्यात/तो<prn><p3><dist><m><sg><obl>+त<loc>$ and ^त्याच्यात/तो<prn><p3><dist><m><sg><obl>+च्या<clit>+त<loc>$.
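The two cases can be sketched in Python as follows. This is only an illustration of how the analyses are put together, not how the monodix paradigms are implemented; `build_unit` and the constant names are hypothetical.

```python
def build_unit(surface, base, clitic, case_marker):
    """Assemble one '^surface/analysis$' unit.

    Hypothetical helper: `clitic` is True when the surface form
    contains the clitic, which is then analyzed as च्या<clit>.
    """
    pieces = [base]
    if clitic:
        pieces.append("च्या<clit>")
    pieces.append(case_marker)
    return "^" + surface + "/" + "+".join(pieces) + "$"

HE = "तो<prn><p3><dist><m><sg><obl>"
SHE = "तो<prn><p3><dist><f><sg><obl>"
LOC = "त<loc>"

# Optional clitic: both surface forms are valid for 'in him'.
optional = [
    build_unit("त्यात", HE, False, LOC),
    build_unit("त्याच्यात", HE, True, LOC),
]

# Compulsory clitic: only the form with the clitic exists for 'in her'.
compulsory = [build_unit("तिच्यात", SHE, True, LOC)]

for unit in optional + compulsory:
    print(unit)
```

When translating to Marathi, the compulsory case is the one where the transfer output must include the clitic explicitly, since there is no clitic-less form to fall back on.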

Disambiguation

Constraint Grammar (CG3) disambiguation rules currently exist for differentiating between:

  • pronouns and determiners: e.g. तो आला (to aala = he came) where तो (to = he) is a pronoun, versus तो मुलगा आला (to mulga aala = that boy came) where तो (to = that (masc. sg.)) is a determiner
  • feminine singular and neuter plural: e.g. ती मुलगी (ti mulgi = that girl (fem. sg.)) versus ती झाडे (ti zhade = those trees (nt. pl.))
  • neuter singular and masculine plural: e.g. ते झाड (te zhad = that tree (nt. sg.)) versus ते घोडे (te ghode = those horses (masc. pl.))
  • feminine plural and masc/fem/nt oblique: e.g. त्या मुली (tya muli = those girls (fem. pl.)) versus त्या घोड्याला (tya ghodyala = to that horse (masc. sg. obl.))

The examples above are all determiners, but the same rules currently work for adjectives and genitives too.
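The idea behind the first two kinds of rule can be sketched in Python. This is a toy model and not the actual CG3 rules: the function `disambiguate`, the reading sets, and the tag names used here (including `nt` for neuter and `vblex` for the verb) are assumptions made for the illustration, and the oblique case is not handled.

```python
GENDERS = {"m", "f", "nt"}
NUMBERS = {"sg", "pl"}

def disambiguate(readings, next_readings):
    """Choose readings of an ambiguous word from the following word.

    Each reading is a set of tags, e.g. {"det", "f", "sg"}.
    Toy model only; the real rules are written in CG3.
    """
    nouns = [r for r in next_readings if "n" in r]
    if not nouns:
        # No noun follows: keep the pronoun reading (as in तो आला).
        return [r for r in readings if "prn" in r]
    kept = []
    for r in readings:
        if "det" not in r:
            continue
        # Keep determiner readings that agree with a following noun
        # in both gender and number.
        for noun in nouns:
            if (r & GENDERS) == (noun & GENDERS) and (r & NUMBERS) == (noun & NUMBERS):
                kept.append(r)
                break
    return kept

# Assumed readings for ती: she / that (fem. sg.) / those (nt. pl.)
TI = [{"prn", "f", "sg"}, {"det", "f", "sg"}, {"det", "nt", "pl"}]
GIRL = [{"n", "f", "sg"}]    # मुलगी
TREES = [{"n", "nt", "pl"}]  # झाडे
CAME = [{"vblex"}]           # a verb such as आली, no noun reading

print(disambiguate(TI, GIRL))
print(disambiguate(TI, TREES))
print(disambiguate(TI, CAME))
```

Because the check compares only gender and number against the following noun, the same logic would apply unchanged to adjective or genitive readings, which matches the observation that the rules carry over to those categories.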

apertium-mar-hin

Gender and number agreement