Difference between revisions of "Bosnian-Croatian-Montenegrin-Serbian and Macedonian"

From Apertium
{{TOCD}}


==Source==
=Progress of the work in the bonding period=
So far, a new dictionary has been started from scratch, and some paradigms from the grammar of Croatian have been added, along with some closed word categories. For details see the Todo list.


<pre>
==Todo==
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk
</pre>


==Notes==
;Testing framework


;Bosnian-Croatian-Montenegrin-Serbian morphological lexicon
* <s>Set up pending/regression tests framework</s>
* Set testvoc
* Set up corpus/generation-test

;Serbo-Croatian dictionary
The reflex of yat:
* Adding two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
* Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
* Updated the makefile and the xslt machinery so that this works
Modal verbs:

*<s>Verb to be (biti)</s>
Verbs(Marked for aspect, transitivity and reflexivity)
*<s>Clitic verb htjeti, to mark future,</s>
* Most of them are from a list extracted from the verbs in the mk monodix
Verbs(Marked for aspect and transitivity)
* Verb paradigm names are fashioned to contain information about the verb ("aspect_transitivity_paradig/m__vblex"), makes input more convenient
*<s>suffixes for present, imperfect and aorist</s>

*the l-participle needs more detailed marking, behaves differently in respect to number
Adjectives:
* marked for definiteness
* <s>One paradigm added, with quite extensive marking</s>
* entered as quadruples (positive, comparative, superlative, absolute superlative)


Other:
* Add the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix [in progress]
* Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix
** <s>prepositions (including the ones of type s/sa and k/ka, which need to be postprocessed in generation)</s>
** <s>conjunctions</s>
** <s>interjections</s>
** particles
** nouns (masculine, feminine, neuter)
** adjectives (the definite and indefinite form paradigms)
** verbs
* <s>Add the personal clitic and non-clitic pronouns</s>, <s>add the reflexive clitic and non-clitic pronouns</s>, possessive, interrogative, relational, demonstrative (pronoun, and demonstrative adjective), indefinite, negative ...
* <s>Add the clitic form of the verb to be</s>, <s>the long present form</s>, other tenses, auxiliary verbs
* Obtain a grammar of Serbian, for reference on differences


;Macedonian dictionary


* Add determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
* Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
* Added some words from closed categories (adverbs, ...)


;Bilingual dictionary
* Update the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix
* Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix


;Transfer rules

==Transfer rules==
Three stage transfer is used.

Some problems:

Genitive constructions, where it is hard to distinguish partitive from possessive:
* Čaša vode == Чаша вода [partitive]
* Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
* TODO: Once the lexicon in the analyser becomes sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus (e.g. the Wikipedia corpus), and find pairs which:
** '''More frequently appear as partitive'''
** '''More frequently appear as possessive'''
** Can appear either way
Afterwards, ignore the last case (pairs that can appear either way) and build two categories of noun pairs.

Tricky sequences (instrumental adjectives):
* "...upravljanje '''velikom''', '''jakom''' ''''pticom'''' " == "...управување '''со''' '''голема''', '''силна''' ''''птица'''' ".
Only one preposition needed for the whole chain, which can be of any length.
*:What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
*:These sequences will probably be fairly infrequent though... - [[User:Francis Tyers|Francis Tyers]] 09:05, 20 July 2011 (UTC)

Change of gender in translation:
...vožnja'''<f>''' zrakoplovom bila je odlučujuća'''<f>'''... == ...возење'''<nt>''' со авион беше решавачко'''<f>'''.
Potentially too far apart to be matched.

Obligatory clitic with definite object
* I saw the man == Го видов човекот
The clitics must precede the finite verb, in this order:
* subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
* ...да не ќе сум си му го дал... == that I won't have given it to him...
Possessive genitive to на constructions:
* Anina čaša == чашата на Ана (Ana's glass)
* Anina ruka == раката на Ана (Ana's hand)
* (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)
Preposition 's' when standing with genitive does not translate as 'со':
* ...sam poletjela s avionske piste...==...полетав '''од''' авионската писта... and not ...'''со''' авионската писта..., which would have an instrumental meaning

==Disambiguation rules==
In all cases, modifiers (adjectives, numbers, pronouns) carry number, case and gender, so sequences like these were used to disambiguate:
* ...sa njenim toplim nježnim rukama - A sequence in instrumental
* ...sa svojih toplih nježnih ruku - A sequence in genitive

Nominative or accusative:
* Select Accusative if there's a preceding Accusative preposition
* Select Accusative if there's a preceding transitive verb (direct object rule)

Genitive or accusative:
* Disambiguate based on prepositions (the intersection is only "u")
* Select accusative if there's a preceding transitive verb (direct object rule)

Dative or locative:
* Select dative if there's no preceding preposition or modifier in dative
* Select dative if there's a preceding dative preposition
* Select locative if there's a preceding locative preposition

Instrumental (unambiguous in singular):
* Select instrumental if there's a preceding instrumental preposition
* In plural identical to Dative/Locative
** (easily disambiguated from Locative, since the latter is entirely prepositional)
** Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
*** "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
*** "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')

Numbers:
* Numbers 2-4 govern noun phrases differently (remnants of dual), two variants:
** "s trima lijepim ženama" (with three beautiful women). The number and the rest of the phrase take the dual forms (Nom,Acc,Voc==Gen.Sg, Gen=>Gen.Pl (triju lijepih žena), Dat,Loc,Ins=>Ins.Pl (trima lijepim ženama) ). This variant is more literary.
** "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive, the actual meaning is determined from context or prepositions). This variant is closer to actual speech.


==See also==
* [[/Pending tests|Pending tests]]
* [[/Regression tests|Regression tests]]
* [[/Final_report|Final report]]

;TODOs
* Set up corpus/generation-test


==External links==
* [http://en.wikipedia.org/wiki/Differences_in_standard_Serbian%2C_Croatian_and_Bosnian Wikipedia: Differences in standard Bosnian, Croatian and Serbian]
* [http://hjp.srce.hr/ Hrvatski jezični portal] &mdash; Croatian language portal, word definitions with inflection (find the definition and click on {{sc|izvedeni oblici}} )
* [http://rechnik.on.net.mk/ Macedonian<->Serbian online dictionary]
* [http://www.makedonski.info/ Word definitions for Macedonian]
* [http://www.mling.ru/iazik/makedonski/history_makedonski.pdf Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic)]

==Further reading==

* Ivan Todorović "Disambiguation of Serbian sentences with Unitex".


==References==
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_bcs.pdf Bosnian / Croatian / Serbian Reference Grammar]
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_macedonian.pdf Macedonian Grammar]
[[Category:Serbo-Croatian and Macedonian|*]]
[[Category:Bosnian-Croatian-Montenegrin-Serbian and Macedonian|*]]
[[Category:Bosnian-Croatian-Montenegrin-Serbian]]
[[Category:Macedonian]]

Latest revision as of 05:34, 22 August 2017

Source

https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk

Notes

Bosnian-Croatian-Montenegrin-Serbian morphological lexicon

The reflex of yat:

  • Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
  • Updated the makefile and the xslt machinery so that this works

Verbs (marked for aspect, transitivity and reflexivity):

  • Most of them are from a list extracted from the verbs in the mk monodix
  • Verb paradigm names are built to carry information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes adding entries more convenient

Adjectives:

  • marked for definiteness
  • entered as quadruples (positive, comparative, superlative, absolute superlative)

Other:

  • Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix

Macedonian dictionary

  • Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
  • Added some words from closed categories (adverbs, ...)

Bilingual dictionary

  • Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix

Transfer rules

Three stage transfer is used.

Some problems:

Genitive constructions, where it is hard to distinguish partitive from possessive:

  • Čaša vode == Чаша вода [partitive]
  • Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
  • TODO: Once the lexicon in the analyser becomes sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus (e.g. the Wikipedia corpus), and find pairs which:
    • More frequently appear as partitive
    • More frequently appear as possessive
    • Can appear either way

Afterwards, ignore the last case (pairs that can appear either way) and build two categories of noun pairs.
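
The TODO above could be approached roughly as follows. This is only a sketch: it assumes each corpus occurrence of a "(Noun) (Noun + Genitive)" pair has already been labelled as partitive or possessive by some other means (for example by hand), and the function name `classify_pairs` and the threshold `min_ratio` are hypothetical, not part of the language pair.

```python
from collections import Counter


def classify_pairs(observations, min_ratio=0.8):
    """Split noun pairs into 'mostly partitive' and 'mostly possessive'.

    observations: iterable of ((head_lemma, genitive_lemma), label),
    where label is "partitive" or "possessive" (assumed to be given).
    Pairs that do not reach min_ratio for either reading are ignored.
    """
    counts = {}
    for pair, label in observations:
        counts.setdefault(pair, Counter())[label] += 1

    partitive, possessive = set(), set()
    for pair, c in counts.items():
        total = c["partitive"] + c["possessive"]
        if total == 0:
            continue
        if c["partitive"] / total >= min_ratio:
            partitive.add(pair)
        elif c["possessive"] / total >= min_ratio:
            possessive.add(pair)
        # otherwise the pair can appear either way and is dropped
    return partitive, possessive
```

The two resulting sets correspond to the two categories of noun pairs mentioned above; the threshold would need tuning on real data.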

Tricky sequences (instrumental adjectives):

  • "...upravljanje velikom, jakom 'pticom' " == "...управување со голема, силна 'птица' ".

Only one preposition needed for the whole chain, which can be of any length.

  • What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
    These sequences will probably be fairly infrequent though... - Francis Tyers 09:05, 20 July 2011 (UTC)
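
As an illustration of the "one preposition for the whole chain" idea, here is a minimal sketch outside of Apertium's own transfer format. It assumes tokens are already lexically transferred and still carry the source case tag; the token layout (dicts with `form`, `pos`, `case`) and the function name are hypothetical.

```python
def add_instrumental_preposition(tokens, prep="со"):
    """Insert one preposition before each instrumental adjective/noun chain.

    tokens: list of dicts with keys 'form', 'pos' ('adj', 'n', 'cm', 'pr', ...)
    and 'case' (source-side case tag, or None).
    A chain is any number of instrumental adjectives, optionally separated
    by commas ('cm'), followed by an instrumental noun.
    """
    out = []
    i, n = 0, len(tokens)
    while i < n:
        j = i
        # Walk over instrumental adjectives and chain-internal commas.
        while j < n and (
            (tokens[j]["pos"] == "adj" and tokens[j]["case"] == "ins")
            or (tokens[j]["pos"] == "cm" and j > i)
        ):
            j += 1
        is_chain = j < n and tokens[j]["pos"] == "n" and tokens[j]["case"] == "ins"
        if is_chain:
            # Do not add a second preposition if one is already there.
            if not (out and out[-1]["pos"] == "pr"):
                out.append({"form": prep, "pos": "pr", "case": None})
            out.extend(tokens[i:j + 1])
            i = j + 1
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Because the loop accepts any number of adjectives before the noun, the chain length does not need to be fixed in advance, unlike a pattern such as ADJ ADJ CM ADJ NOM.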

Change of gender in translation:

  • ...vožnja<f> zrakoplovom bila je odlučujuća<f>... == ...возење<nt> со авион беше решавачко<f>.

Potentially too far apart to be matched.

Obligatory clitic with definite object

  • I saw the man == Го видов човекот

The clitics must precede the finite verb, in this order:

  • subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
  • ...да не ќе сум си му го дал... == that I won't have given it to him...
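
The slot order above can be expressed as a simple ranking. The following sketch assumes each element of the cluster has already been assigned one of the slot names listed; the slot identifiers and the function name are made up for the illustration.

```python
# Slot order as given above, from leftmost to rightmost.
CLITIC_ORDER = [
    "subjunctive",
    "negative",
    "mood",
    "aux",
    "ethical_dative",
    "dative_object",
    "accusative_object",
    "verb",
]


def order_clitic_cluster(items):
    """Return the forms of a clitic cluster sorted into slot order.

    items: list of (form, slot) pairs, where slot is a name from CLITIC_ORDER.
    """
    rank = {slot: i for i, slot in enumerate(CLITIC_ORDER)}
    return [form for form, slot in sorted(items, key=lambda item: rank[item[1]])]
```

For the example above, passing the eight elements in any order gives back "да не ќе сум си му го дал".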

Possessive genitive to на constructions:

  • Anina čaša == чашата на Ана (Ana's glass)
  • Anina ruka == раката на Ана (Ana's hand)
  • (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)

Preposition 's' when standing with genitive does not translate as 'со':

  • ...sam poletjela s avionske piste... == ...полетав од авионската писта... and not ...со авионската писта..., which would have an instrumental meaning

Disambiguation rules

In all cases, modifiers (adjectives, numbers, pronouns) carry number, case and gender, so sequences like these were used to disambiguate:

  • ...sa njenim toplim nježnim rukama - A sequence in instrumental
  • ...sa svojih toplih nježnih ruku - A sequence in genitive

Nominative or accusative:

  • Select Accusative if there's a preceding Accusative preposition
  • Select Accusative if there's a preceding transitive verb (direct object rule)

Genitive or accusative:

  • Disambiguate based on prepositions (the intersection is only "u")
  • Select accusative if there's a preceding transitive verb (direct object rule)

Dative or locative:

  • Select dative if there's no preceding preposition or modifier in dative
  • Select dative if there's a preceding dative preposition
  • Select locative if there's a preceding locative preposition
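
The three rule groups above (nominative/accusative, genitive/accusative, dative/locative) can be sketched as a small decision function. This is not the actual rule file, only an illustration under assumptions: the context is reduced to the set of cases governed by the preceding preposition, a flag for a preceding transitive verb, and a flag for a dative modifier, and the first dative rule is read as "no preceding preposition, or a modifier in dative". All names are hypothetical.

```python
def select_case(candidates, prep_cases=frozenset(),
                after_transitive_verb=False, dative_modifier=False):
    """Pick a case for an ambiguous word, or return None if undecided.

    candidates: set of remaining case tags, e.g. {"nom", "acc"}.
    prep_cases: cases governed by the preceding preposition (empty if none).
    """
    candidates = set(candidates)
    prep_cases = set(prep_cases)

    if candidates == {"nom", "acc"}:
        if "acc" in prep_cases or after_transitive_verb:
            return "acc"
        return None

    if candidates == {"gen", "acc"}:
        narrowed = candidates & prep_cases
        if len(narrowed) == 1:
            return next(iter(narrowed))
        # A preposition governing both cases (the "u" situation) does not help.
        if after_transitive_verb:
            return "acc"
        return None

    if candidates == {"dat", "loc"}:
        if not prep_cases or dative_modifier:
            return "dat"
        if "dat" in prep_cases:
            return "dat"
        if "loc" in prep_cases:
            return "loc"
        return None

    return None
```

Returning `None` leaves the word ambiguous so that later rules, or a statistical tagger, can still decide.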

Instrumental (unambiguous in singular):

  • Select instrumental if there's a preceding instrumental preposition
  • In plural identical to Dative/Locative
    • (easily disambiguated from Locative, since the latter is entirely prepositional)
    • Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
      • "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
      • "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')

Numbers:

  • Numbers 2-4 govern noun phrases differently (remnants of dual), two variants:
    • "s trima lijepim ženama" (with three beautiful women). The number and the rest of the phrase take the dual forms (Nom,Acc,Voc==Gen.Sg, Gen=>Gen.Pl (triju lijepih žena), Dat,Loc,Ins=>Ins.Pl (trima lijepim ženama) ). This variant is more literary.
    • "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive; the actual meaning is determined from context or prepositions). This variant is closer to actual speech.
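
The case mapping described for numbers 2-4 can be written down as a lookup table. This is a sketch only: the tag names are illustrative, and treating the colloquial variant as genitive singular is an assumption based on the example form "lijepe žene", since the text above only says "genitive".

```python
# Literary variant: the syntactic case maps to the form the phrase takes.
LITERARY_2_TO_4 = {
    "nom": ("gen", "sg"),
    "acc": ("gen", "sg"),
    "voc": ("gen", "sg"),
    "gen": ("gen", "pl"),
    "dat": ("ins", "pl"),
    "loc": ("ins", "pl"),
    "ins": ("ins", "pl"),
}


def phrase_form_after_2_to_4(case, variant="literary"):
    """Return (case, number) of the form used after the numbers 2-4.

    variant: "literary" (number and phrase decline) or
             "colloquial" (frozen number, phrase in genitive).
    """
    if variant == "literary":
        return LITERARY_2_TO_4[case]
    if variant == "colloquial":
        # Assumption: genitive singular, as in "s tri lijepe žene".
        return ("gen", "sg")
    raise ValueError(f"unknown variant: {variant}")
```

In the colloquial variant the returned form is the same for every syntactic case, which is why the meaning has to come from context or from the preposition.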

See also

TODOs

  • Set up corpus/generation-test

External links

Further reading

  • Ivan Todorović "Disambiguation of Serbian sentences with Unitex".

References