Difference between revisions of "Bosnian-Croatian-Montenegrin-Serbian and Macedonian"

 
(37 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{TOCD}}
 
{{TOCD}}
   
  +
==Source==
=Progress of the work in the bonding period=
 
Insofar, a new dictionary has been started from scratch, some paradigms added from the grammar of croatian, along with some closed word categories. For details see the Todo list.
 
   
  +
<pre>
==Todo==
 
  +
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk
  +
</pre>
   
  +
==Notes==
;Testing framework
 
   
  +
;Bosnian-Croatian-Montenegrin-Serbian morphological lexicon
* <s>Set up pending/regression tests framework</s>
 
  +
The reflex of yat:
* Set testvoc
 
  +
* Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
* Set up corpus/generation-test
 
  +
* Updated the makefile and the xslt machinery so that this works
  +
  +
Verbs(Marked for aspect, transitivity and reflexivity)
  +
* Most of them are from a list extracted from the verbs in the mk monodix
  +
* Verb paradigm names are fashioned to contain information about the verb ("aspect_transitivity_paradig/m__vblex"), makes input more convenient
   
;Serbo-Croatian dictionary
 
Modal verbs:
 
*Verb biti
 
*Clitic verb htjeti, to mark future, needs some fine tuning
 
Verbs(Marked for aspect and transitivity)
 
*<s>suffixes for present, imperfect and aorist</s>
 
*the l-participle needs more detailed marking, behaves differently in respect to number
 
 
Adjectives:
 
Adjectives:
  +
* marked for definiteness
* <s>One paradigm added, with quite extensive marking</s>
 
  +
* entered as quadruples (positive, comparative, superlative, absolute superlative)
   
  +
Other:
* Add the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix [in progress]
 
  +
* Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix
** <s>prepositions (including the ones of type s/sa and k/ka, which need to be postprocessed in generation)</s>
 
** <s>conjunctions</s>
 
** <s>interjections</s>
 
** particles
 
** nouns (masculine, feminine, neuter)
 
** adjectives (the definite and indefinite form paradigms)
 
** verbs
 
* <s>Add the personal clitic and non-clitic pronouns</s>, <s>add the reflexive clitic and non-clitic pronouns</s>, possesive, interrogative, relational, demonstrative (pronoun, and demonstrative adjective), indefinite, negative ...
 
* <s>Add the clitic form of the verb to be</s>, <s>the long present form</s>, other tenses auxilliary verbs
 
* Obtain a grammar of Serbian, for reference on differences
 
   
 
;Macedonian dictionary
 
;Macedonian dictionary
   
* Add determiner forms for some pronouns (e.g demonstratives, possessives, etc.) -- things that can modify nouns
+
* Added determiner forms for some pronouns (e.g demonstratives, possessives, etc.) -- things that can modify nouns
  +
* Added some words from closed categories (adverbs, ...)
   
 
;Bilingual dictionary
 
;Bilingual dictionary
* Update the pronoun entries, the symbols in the monodix have been adjusted to correspond more closely to the analysis in the macedonian monodix
+
* Updated the pronoun entries, the symbols in the monodix have been adjusted to correspond more closely to the analysis in the macedonian monodix
   
 
;Transfer rules
 
;Transfer rules
  +
  +
==Transfer rules==
  +
Three stage transfer is used.
  +
  +
Some problems:
  +
  +
Genitive constructions, problem in distinguishing partitive from possessive:
  +
* Čaša vode = Чаша вода [partitive]
  +
* Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
  +
* TODO: After the lexicon in the analyser becomes sufficiently large grep out all "(Noun) (Noun + Genitive)" occurrences in e.g. the Wikipedia corpus, and find pairs which:
  +
** '''More frequently appear as partitive'''
  +
** '''More frequently appear as possessive'''
  +
** Can appear either way
  +
and afterwards ignore the latter case, and make two categories of noun pairs.
  +
  +
Tricky sequences (instrumental adjectives):
  +
* "...upravljanje '''velikom''', '''jakom''' ''''pticom'''' " == "...управување '''со''' '''голема''', '''силна''' ''''птица'''' ".
  +
Only one preposition needed for the whole chain, which can be of any length.
  +
*:What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
  +
*:These sequences will probably be fairly infrequent though... - [[User:Francis Tyers|Francis Tyers]] 09:05, 20 July 2011 (UTC)
  +
  +
Change of gender in translation:
  +
...vožnja'''<f>''' zrakoplovom bila je odlučujuća'''<f>'''... == ...возење'''<nt>''' со авион беше решавачко'''<f>'''.
  +
Potentially too far to be matched.
  +
  +
Obligatory clitic with definite object
  +
* I saw the man == Го видов човекот
  +
The clitics must precede the finite verb, in this order:
  +
* subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
  +
* ...да не ќе сум си му го дал... == that I won't have given it to him...
  +
Possessive genitive to на constructions:
  +
* Anina čaša == чашата на Ана (Ana's glass)
  +
* Anina ruka == раката на Ана (Ana's hand)
  +
* (possible TODO in the mk analyser: add an analysis for possessive adjectives, currently doesn't exist)
  +
Preposition 's' when standing with genitive does not translate as 'со':
  +
* ...sam poletjela s avionske piste...==...полетав '''од''' авионската писта... and not ...'''со''' авионската писта..., which would have an instrumental meaning
  +
  +
==Disambiguation rules==
  +
For all cases modifiers (adjectives, numbers, pronouns) transfer number, case and gender, so instances like these were used to disambiguate
  +
* ...sa njenim toplim nježnim rukama - A sequence in instrumental
  +
* ...sa svojih toplih nježnih ruku - A sequence in genitive
  +
  +
Nominative or accusative:
  +
* Select Accusative if there's a preceding Accusative preposition
  +
* Select Accusative if there's a preceding transitive verb (direct object rule)
  +
  +
Genitive or accusative:
  +
* Disambiguate based on prepositions (the intersection is only "u")
  +
* Select accusative if there's a preceding transitive verb (direct object rule)
  +
  +
Dative or locative:
  +
* Select dative if there's no preceding preposition or modifier in dative
  +
* Select dative if there's a preceding dative preposition
  +
* Select locative if there's a preceding locative preposition
  +
  +
Instrumental (unambiguous in singular):
  +
* Select instrumental if there's a preceding instrumental preposition
  +
* In plural identical to Dative/Locative
  +
** (easily disambiguated from Locative, since the latter is entirely prepositional)
  +
** Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
  +
*** "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
  +
*** "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')
  +
  +
Numbers:
  +
* Numbers 2-4 govern noun phrases differently (remnants of dual), two variants:
  +
** "s trima lijepim ženama" (with three women). The number and the rest of the phrase take the dual forms (Nom,Acc,Voc==Gen.Sg, Gen=>Gen.Pl (triju lijepih žena), Dat,Loc,Ins=>Ins.Pl (trima lijepim ženama) ). This variant is more literary.
  +
** "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive, the actual meaning is determined from context or prepositions). This variant is closer to actual speech.
   
 
==See also==
 
==See also==
Line 47: Line 106:
 
* [[/Pending tests|Pending tests]]
 
* [[/Pending tests|Pending tests]]
 
* [[/Regression tests|Regression tests]]
 
* [[/Regression tests|Regression tests]]
  +
* [[/Final_report|Final report]]
  +
  +
;TODOs
  +
* Set up corpus/generation-test
   
 
==External links==
 
==External links==
Line 52: Line 115:
 
* [http://en.wikipedia.org/wiki/Differences_in_standard_Serbian%2C_Croatian_and_Bosnian Wikipedia: Differences in standard Bosnian, Croatian and Serbian]
 
* [http://en.wikipedia.org/wiki/Differences_in_standard_Serbian%2C_Croatian_and_Bosnian Wikipedia: Differences in standard Bosnian, Croatian and Serbian]
 
* [http://hjp.srce.hr/ Hrvatski jezični portal] &mdash; Croatian language portal, word definitions with inflection (find the definition and click on {{sc|izvedeni oblici}} )
 
* [http://hjp.srce.hr/ Hrvatski jezični portal] &mdash; Croatian language portal, word definitions with inflection (find the definition and click on {{sc|izvedeni oblici}} )
  +
* [http://rechnik.on.net.mk/ Macedonian<->Serbian online dictionary]
  +
* [http://www.makedonski.info/ Word definitions for Macedonian]
  +
* [http://www.mling.ru/iazik/makedonski/history_makedonski.pdf Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic)]
  +
  +
==Further reading==
  +
  +
* Ivan Todorović "Disambiguation of Serbian sentences with Unitex".
   
 
==References==
 
==References==
Line 57: Line 127:
 
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_bcs.pdf Bosnian / Croatian / Serbian Reference Grammar]
 
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_bcs.pdf Bosnian / Croatian / Serbian Reference Grammar]
 
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_macedonian.pdf Macedonian Grammar]
 
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_macedonian.pdf Macedonian Grammar]
[[Category:Serbo-Croatian and Macedonian|*]]
+
[[Category:Bosnian-Croatian-Montenegrin-Serbian and Macedonian|*]]
  +
[[Category:Bosnian-Croatian-Montenegrin-Serbian]]
  +
[[Category:Macedonian]]

Latest revision as of 05:34, 22 August 2017

Source

https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk

Notes

Bosnian-Croatian-Montenegrin-Serbian morphological lexicon

The reflex of yat:

  • Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
  • Updated the makefile and the xslt machinery so that this works

Verbs (marked for aspect, transitivity and reflexivity):

  • Most of them come from a list extracted from the verbs in the mk monodix
  • Verb paradigm names encode information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes input more convenient

Adjectives:

  • marked for definiteness
  • entered as quadruples (positive, comparative, superlative, absolute superlative)

Other:

  • Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix

Macedonian dictionary

  • Added determiner forms for some pronouns (e.g. demonstratives, possessives), i.e. things that can modify nouns
  • Added some words from closed categories (adverbs, ...)

Bilingual dictionary

  • Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix

Transfer rules

Three stage transfer is used.

Some problems:

Genitive constructions (the problem is distinguishing partitive from possessive):

  • Čaša vode == Чаша вода [partitive]
  • Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
  • TODO: Once the lexicon in the analyser becomes sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus (e.g. the Wikipedia corpus), and find pairs which:
    • More frequently appear as partitive
    • More frequently appear as possessive
    • Can appear either way

Afterwards, ignore the last case and make two categories of noun pairs.
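
The candidate extraction step of the TODO above could be sketched roughly as follows. This is a minimal illustration, not the project's tooling: the input layout (sentences as lists of `(surface, lemma, tags)` triples), the tag names `n` and `gen`, and the function names are assumptions made for the example. Deciding whether a pair is partitive or possessive would still need another signal, such as aligned Macedonian text.

```python
from collections import Counter

# Assumed input: each sentence is a list of (surface, lemma, tags) triples,
# where tags is a set of analysis symbols such as {"n", "f", "sg", "gen"}.
def noun_genitive_pairs(sentence):
    """Yield (head_lemma, dependent_lemma) for each Noun followed by Noun+Genitive."""
    for (_, head, head_tags), (_, dep, dep_tags) in zip(sentence, sentence[1:]):
        if "n" in head_tags and "n" in dep_tags and "gen" in dep_tags:
            yield head, dep

def count_pairs(corpus):
    """Count how often each (head, genitive dependent) lemma pair occurs."""
    counts = Counter()
    for sentence in corpus:
        counts.update(noun_genitive_pairs(sentence))
    return counts

# Toy corpus, analysed by hand for illustration only.
corpus = [
    [("Čaša", "čaša", {"n", "f", "sg", "nom"}), ("vode", "voda", {"n", "f", "sg", "gen"})],
    [("čaša", "čaša", {"n", "f", "sg", "nom"}), ("vode", "voda", {"n", "f", "sg", "gen"})],
]
pair_counts = count_pairs(corpus)
```

The resulting counter gives the raw frequency of each candidate pair, which is the starting point for sorting pairs into the two categories.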

Tricky sequences (instrumental adjectives):

  • "...upravljanje velikom, jakom 'pticom'" == "...управување со голема, силна 'птица'".

Only one preposition is needed for the whole chain, which can be of any length.

  • What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
    These sequences will probably be fairly infrequent though... - Francis Tyers 09:05, 20 July 2011 (UTC)
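
As an illustration of why a single preposition suffices, the start of such a chain can be found by walking left from the instrumental noun. The sketch below is hypothetical and is not how the t1x rules are written; the token layout `(surface, pos, case)` and the labels `adj`, `cm` and `ins` are assumptions for the example.

```python
def instrumental_chain_start(tokens, noun_index):
    """Return the index where the instrumental chain ending at noun_index begins.

    tokens is a list of (surface, pos, case) triples. The walk moves left over
    instrumental adjectives and commas; the chain starts at the leftmost adjective.
    """
    start = noun_index
    i = noun_index - 1
    while i >= 0:
        pos, case = tokens[i][1], tokens[i][2]
        if pos == "adj" and case == "ins":
            start = i          # extend the chain to this adjective
        elif pos != "cm":
            break              # anything other than a comma ends the chain
        i -= 1
    return start

tokens = [
    ("upravljanje", "n", "nom"),
    ("velikom", "adj", "ins"),
    (",", "cm", None),
    ("jakom", "adj", "ins"),
    ("pticom", "n", "ins"),
]
chain_start = instrumental_chain_start(tokens, 4)
```

The single preposition 'со' would then be emitted once, before the token at `chain_start`, regardless of how many adjectives the chain contains.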

Change of gender in translation:

...vožnja<f> zrakoplovom bila je odlučujuća<f>... == ...возење<nt> со авион беше решавачко<f>.

Potentially too far to be matched.

Obligatory clitic with definite object:

  • I saw the man == Го видов човекот

The clitics must precede the finite verb, in this order:

  • subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
  • ...да не ќе сум си му го дал... == that I won't have given it to him...
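
The fixed ordering above can be expressed as a rank table. The following is a minimal sketch under the assumption that each clitic has already been labelled with its slot; the slot labels and function names are invented for the example and do not come from the transfer rules.

```python
# Slot order copied from the list above; the label spellings are illustrative.
CLITIC_SLOTS = [
    "subjunctive", "negative", "mood", "aux",
    "ethical_dative", "dative_object", "accusative_object",
]
SLOT_RANK = {slot: rank for rank, slot in enumerate(CLITIC_SLOTS)}

def order_clitics(clitics):
    """Sort (form, slot) pairs into the fixed pre-verbal order and return the forms."""
    return [form for form, slot in sorted(clitics, key=lambda c: SLOT_RANK[c[1]])]

def build_verb_group(clitics, verb):
    """Place the ordered clitics in front of the verb form."""
    return " ".join(order_clitics(clitics) + [verb])

example = build_verb_group(
    [("го", "accusative_object"), ("не", "negative"), ("да", "subjunctive"),
     ("му", "dative_object"), ("сум", "aux"), ("ќе", "mood"), ("си", "ethical_dative")],
    "дал",
)
```

Given the clitics of the example sentence in arbitrary order, `example` reproduces the sequence shown in the bullet above.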

Possessive genitive to на constructions:

  • Anina čaša == чашата на Ана (Ana's glass)
  • Anina ruka == раката на Ана (Ana's hand)
  • (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)

The preposition 's', when it governs the genitive, does not translate as 'со':

  • ...sam poletjela s avionske piste... == ...полетав од авионската писта... and not ...со авионската писта..., which would have an instrumental meaning
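
The choice described above depends only on the case governed by the preposition, so it can be sketched as a small lookup. The function name and the case labels `gen` and `ins` are assumptions for the example; the actual pair may handle this in the bilingual dictionary or in transfer.

```python
def translate_s(governed_case):
    """Pick the Macedonian preposition for 's'/'sa' from the case of its noun phrase."""
    if governed_case == "gen":
        return "од"   # source or origin reading: "s avionske piste"
    if governed_case == "ins":
        return "со"   # instrumental or comitative reading
    raise ValueError(f"unexpected case for 's': {governed_case}")

from_runway = translate_s("gen")
```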

Disambiguation rules

In all cases, modifiers (adjectives, numerals, pronouns) share number, case and gender with the noun, so sequences like these were used to disambiguate:

  • ...sa njenim toplim nježnim rukama - A sequence in instrumental
  • ...sa svojih toplih nježnih ruku - A sequence in genitive

Nominative or accusative:

  • Select accusative if there's a preceding accusative preposition
  • Select accusative if there's a preceding transitive verb (direct object rule)

Genitive or accusative:

  • Disambiguate based on prepositions (the intersection is only "u")
  • Select accusative if there's a preceding transitive verb (direct object rule)

Dative or locative:

  • Select dative if there's no preceding preposition or modifier in dative
  • Select dative if there's a preceding dative preposition
  • Select locative if there's a preceding locative preposition
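
The dative/locative rules above can be read as a simple decision on the preceding preposition. The sketch below is an illustration in Python rather than the pair's actual disambiguation rules; it covers only the preposition conditions (not the modifier condition), and the two preposition sets are small, non-exhaustive examples chosen for the demonstration.

```python
# Illustrative, non-exhaustive preposition sets (assumed for this example).
DATIVE_PREPOSITIONS = {"k", "ka"}
LOCATIVE_PREPOSITIONS = {"pri", "o"}

def select_dat_or_loc(preceding_preposition=None):
    """Choose between the dative and locative readings of an ambiguous form.

    Returns "dat", "loc", or None when the preposition is not in either set
    and the ambiguity is left unresolved.
    """
    if preceding_preposition is None:
        return "dat"   # no preposition: the locative is always prepositional
    prep = preceding_preposition.lower()
    if prep in DATIVE_PREPOSITIONS:
        return "dat"
    if prep in LOCATIVE_PREPOSITIONS:
        return "loc"
    return None

bare_reading = select_dat_or_loc()
```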

Instrumental (unambiguous in singular):

  • Select instrumental if there's a preceding instrumental preposition
  • In plural identical to Dative/Locative
    • (easily disambiguated from Locative, since the latter is entirely prepositional)
    • Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
      • "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
      • "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')

Numbers:

  • Numbers 2-4 govern noun phrases differently (remnants of the dual). There are two variants:
    • "s trima lijepim ženama" (with three beautiful women). The number and the rest of the phrase take the dual forms: Nom, Acc, Voc => Gen.Sg; Gen => Gen.Pl (triju lijepih žena); Dat, Loc, Ins => Ins.Pl (trima lijepim ženama). This variant is more literary.
    • "s tri lijepe žene". The number is in a frozen form and the rest of the phrase takes the genitive; the actual meaning is determined from context or prepositions. This variant is closer to actual speech.
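
The case mapping for the literary variant can be written down as a table. This is only a restatement of the mapping given above in Python form; the dictionary and function names, and the lowercase case labels, are assumptions for the example.

```python
# Literary variant: the form taken by the phrase after 2-4, keyed by the case
# the whole phrase stands in. Values are (case, number) of the surface form.
PAUCAL_GOVERNMENT = {
    "nom": ("gen", "sg"),
    "acc": ("gen", "sg"),
    "voc": ("gen", "sg"),
    "gen": ("gen", "pl"),
    "dat": ("ins", "pl"),
    "loc": ("ins", "pl"),
    "ins": ("ins", "pl"),
}

def governed_form(number, phrase_case):
    """Return the (case, number) form used after the numbers 2, 3 and 4."""
    if number not in (2, 3, 4):
        raise ValueError("this mapping only applies to the numbers 2-4")
    return PAUCAL_GOVERNMENT[phrase_case]

with_three_women = governed_form(3, "ins")   # "s trima lijepim ženama"
```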

See also

TODOs
  • Set up corpus/generation-test

External links

Further reading

  • Ivan Todorović "Disambiguation of Serbian sentences with Unitex".

References