Difference between revisions of "Bosnian-Croatian-Montenegrin-Serbian and Macedonian"
Firespeaker (talk | contribs) |
{{TOCD}} |
==Source== |
|||
=Progress of the work in the bonding period= |
|||
So far, a new dictionary has been started from scratch, and some paradigms from the grammar of Croatian have been added, along with some closed word categories. For details see the Todo list.
|||
<pre> |
|||
==Todo== |
|||
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk |
|||
</pre> |
|||
==Notes== |
|||
;Testing framework |
|||
;Bosnian-Croatian-Montenegrin-Serbian morphological lexicon |
|||
* <s>Set up pending/regression tests framework</s> |
|||
The reflex of yat: |
|||
* Set testvoc |
|||
* Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian |
|||
* Set up corpus/generation-test |
|||
* Updated the makefile and the xslt machinery so that this works |
|||
Verbs (marked for aspect, transitivity and reflexivity):
* Most of them are from a list extracted from the verbs in the mk monodix
* Verb paradigm names are fashioned to contain information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes input more convenient
|||
;Serbo-Croatian dictionary |
|||
Adjectives: |
* marked for definiteness |
|||
Crossing animacy with definiteness produces many duplicate entries. Since some of the cases (for instance G, D/L and I singular, and D/L/I plural) do not mark gender specifically, I have removed animacy in those cases and accordingly marked them "mn" or "mfn".
|||
* entered as quadruples (positive, comparative, superlative, absolute superlative) |
|||
<s>*Idea: unify the D/L/I plural into one case, and D/L singular into one case, since they are always morphologically identical. |
|||
** The paradigms entered would be more concise |
|||
Other: |
|||
** Would complicate matters with future translation pairs with other Slavic languages, e.g. Slovene
|||
* Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix |
|||
** If incorporating dialect words into the dictionary (e.g. Kajkavian or Čakavian), the separate markers for cases would have to be used</s>
|||
* In the Macedonian monodix no adjective is marked positive, only comparative and superlative, so I'm taking the same approach.
|||
* Add the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix [in progress] |
|||
** <s>prepositions (including the ones of type s/sa and k/ka, which need to be postprocessed in generation)</s> |
|||
** <s>conjunctions</s> |
|||
** <s>interjections</s> |
|||
** particles |
|||
** nouns (masculine, feminine, neuter) |
|||
** adjectives (the definite and indefinite form paradigms) |
|||
** verbs |
|||
* <s>Add the personal clitic and non-clitic pronouns</s>, <s>add the reflexive clitic and non-clitic pronouns</s>, possessive, interrogative, relational, demonstrative (pronoun, and demonstrative adjective), indefinite, negative ...
|||
* <s>Add the clitic form of the verb to be</s>, <s>the long present form</s>, auxiliary verbs for other tenses
|||
* Obtain a grammar of Serbian, for reference on differences |
|||
;Macedonian dictionary |
* Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
||
* Added some words from closed categories (adverbs, ...) |
|||
;Bilingual dictionary |
* Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix
||
;Transfer rules |
==Transfer rules== |
|||
Three stage transfer is used. |
|||
Some problems: |
|||
Genitive constructions: the problem is distinguishing partitive from possessive:
* Čaša vode == Чаша вода [partitive]
* Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
* TODO: Once the lexicon in the analyser becomes sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus, e.g. the Wikipedia corpus, and find pairs which:
** '''More frequently appear as partitive'''
** '''More frequently appear as possessive'''
** Can appear either way
and afterwards ignore the last group, and make two categories of noun pairs.
|||
Tricky sequences (instrumental adjectives): |
|||
* "...upravljanje '''velikom''', '''jakom''' ''''pticom'''' " == "...управување '''со''' '''голема''', '''силна''' ''''птица'''' ". |
|||
Only one preposition needed for the whole chain, which can be of any length. |
|||
*:What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM |
|||
*:These sequences will probably be fairly infrequent though... - [[User:Francis Tyers|Francis Tyers]] 09:05, 20 July 2011 (UTC) |
|||
Change of gender in translation: |
|||
...vožnja'''<f>''' zrakoplovom bila je odlučujuća'''<f>'''... == ...возење'''<nt>''' со авион беше решавачко'''<nt>'''.

Potentially too far apart to be matched.
|||
Obligatory clitic with definite object |
|||
* I saw the man == Го видов човекот |
|||
The clitics must precede the finite verb, in this order:
|||
* subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb |
|||
* ...да не ќе сум си му го дал... == that I won't have given it to him... |
|||
Possessive genitive to на constructions:
|||
* Anina čaša == чашата на Ана (Ana's glass) |
|||
* Anina ruka == раката на Ана (Ana's hand) |
|||
* (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)
|||
Preposition 's' when standing with genitive does not translate as 'со': |
|||
* ...sam poletjela s avionske piste...==...полетав '''од''' авионската писта... and not ...'''со''' авионската писта..., which would have an instrumental meaning |
|||
==Disambiguation rules== |
|||
In all cases, modifiers (adjectives, numerals, pronouns) carry the number, case and gender of the noun, so sequences like these were used to disambiguate:
|||
* ...sa njenim toplim nježnim rukama - A sequence in instrumental |
|||
* ...sa svojih toplih nježnih ruku - A sequence in genitive |
|||
Nominative or accusative: |
|||
* Select accusative if there's a preceding accusative preposition
* Select accusative if there's a preceding transitive verb (direct object rule)

Genitive or accusative:
* Disambiguate based on prepositions (the intersection is only "u")
* Select accusative if there's a preceding transitive verb (direct object rule)

Dative or locative:
* Select dative if there's no preceding preposition or modifier in dative
* Select dative if there's a preceding dative preposition
* Select locative if there's a preceding locative preposition

Instrumental (unambiguous in singular):
* Select instrumental if there's a preceding instrumental preposition
|||
* In plural identical to Dative/Locative |
|||
** (easily disambiguated from Locative, since the latter is entirely prepositional) |
|||
** Possible problems in the plural with instrumental/non-instrumental, though not yet encountered |
|||
*** "Ljudima sam pomeo pod" - 'To the people' or 'using the people' |
|||
*** "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas') |
|||
Numbers: |
|||
* Numbers 2-4 govern noun phrases differently (remnants of dual), two variants: |
|||
** "s trima lijepim ženama" (with three women). The number and the rest of the phrase take the dual forms (Nom,Acc,Voc==Gen.Sg, Gen=>Gen.Pl (triju lijepih žena), Dat,Loc,Ins=>Ins.Pl (trima lijepim ženama) ). This variant is more literary. |
|||
** "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive, the actual meaning is determined from context or prepositions). This variant is closer to actual speech. |
|||
==See also== |
* [[/Pending tests|Pending tests]] |
* [[/Regression tests|Regression tests]] |
* [[/Final_report|Final report]] |
|||
;TODOs |
|||
* Set up corpus/generation-test |
|||
==External links== |
* [http://en.wikipedia.org/wiki/Differences_in_standard_Serbian%2C_Croatian_and_Bosnian Wikipedia: Differences in standard Bosnian, Croatian and Serbian] |
* [http://hjp.srce.hr/ Hrvatski jezični portal] — Croatian language portal, word definitions with inflection (find the definition and click on {{sc|izvedeni oblici}} ) |
* [http://rechnik.on.net.mk/ Macedonian<->Serbian online dictionary] |
|||
* [http://www.makedonski.info/ Word definitions for Macedonian] |
|||
* [http://www.mling.ru/iazik/makedonski/history_makedonski.pdf Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic)]
|||
==Further reading== |
|||
* Ivan Todorović "Disambiguation of Serbian sentences with Unitex". |
|||
==References== |
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_bcs.pdf Bosnian / Croatian / Serbian Reference Grammar] |
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_macedonian.pdf Macedonian Grammar] |
[[Category:Bosnian-Croatian-Montenegrin-Serbian and Macedonian|*]] |
||
[[Category:Bosnian-Croatian-Montenegrin-Serbian]] |
|||
[[Category:Macedonian]] |
Latest revision as of 05:34, 22 August 2017
Source
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk
Notes
- Bosnian-Croatian-Montenegrin-Serbian morphological lexicon
The reflex of yat:
- Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
- Updated the makefile and the xslt machinery so that this works
Verbs (marked for aspect, transitivity and reflexivity):
- Most of them are from a list extracted from the verbs in the mk monodix
- Verb paradigm names are fashioned to contain information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes input more convenient
Adjectives:
- marked for definiteness
- entered as quadruples (positive, comparative, superlative, absolute superlative)
Other:
- Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix
- Macedonian dictionary
- Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
- Added some words from closed categories (adverbs, ...)
- Bilingual dictionary
- Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix
- Transfer rules
Transfer rules
Three stage transfer is used.
Some problems:
Genitive constructions: the problem is distinguishing partitive from possessive:
- Čaša vode == Чаша вода [partitive]
- Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
- TODO: Once the lexicon in the analyser becomes sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus, e.g. the Wikipedia corpus, and find pairs which:
  - More frequently appear as partitive
  - More frequently appear as possessive
  - Can appear either way
and afterwards ignore the last group, and make two categories of noun pairs.
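The TODO above could be sketched roughly as follows. This is only an illustration in Python, not part of the language pair: it assumes that each "(Noun) (Noun + Genitive)" occurrence has already been extracted and labelled as partitive or possessive by some other means (for example by hand), and the function name `classify_pairs` and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def classify_pairs(observations, threshold=0.75):
    """Split noun pairs into mostly-partitive and mostly-possessive sets.

    observations: iterable of ((head_lemma, genitive_lemma), label) where
    label is "partitive" or "possessive" (the labelling itself is assumed
    to be done elsewhere). Pairs that fall below the threshold either way
    are treated as "can appear either way" and ignored.
    """
    counts = {}
    for pair, label in observations:
        counts.setdefault(pair, Counter())[label] += 1

    partitive, possessive = set(), set()
    for pair, c in counts.items():
        total = c["partitive"] + c["possessive"]
        if total == 0:
            continue
        if c["partitive"] / total >= threshold:
            partitive.add(pair)
        elif c["possessive"] / total >= threshold:
            possessive.add(pair)
    return partitive, possessive
```

The two returned sets correspond to the two categories of noun pairs mentioned above; how high the threshold should be is an open choice.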
Tricky sequences (instrumental adjectives):
- "...upravljanje velikom, jakom 'pticom' " == "...управување со голема, силна 'птица' ".
Only one preposition needed for the whole chain, which can be of any length.
- What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
- These sequences will probably be fairly infrequent though... - Francis Tyers 09:05, 20 July 2011 (UTC)
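To make the "one preposition per chain" idea concrete, here is a small Python sketch. It is not Apertium transfer-rule syntax and does not reflect the actual t1x rules; the token layout `(word, pos, case)` and the function name `insert_so` are assumptions made for illustration. It inserts a single со in front of a run of instrumental adjectives (possibly separated by commas) that ends in an instrumental noun, unless a preposition is already there.

```python
def insert_so(tokens):
    """Insert one preposition 'со' before each bare instrumental chain.

    tokens: list of (word, pos, case) tuples, with pos in
    {"adj", "n", "cm", "pr", ...} and case such as "ins" or None.
    """
    out = []
    i = 0
    while i < len(tokens):
        j = i
        # Scan instrumental adjectives; commas may appear inside the run.
        while j < len(tokens) and (
            (tokens[j][1] == "adj" and tokens[j][2] == "ins")
            or (tokens[j][1] == "cm" and j > i)
        ):
            j += 1
        # The run only counts as a chain if an instrumental noun closes it.
        is_chain = (
            j < len(tokens) and tokens[j][1] == "n" and tokens[j][2] == "ins"
        )
        if is_chain:
            if not (out and out[-1][1] == "pr"):
                out.append(("со", "pr", None))
            out.extend(tokens[i:j + 1])
            i = j + 1
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Because the scan is a loop rather than a fixed pattern, the chain can be of any length, which is the case the fixed-length chunks described above can only approximate.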
Change of gender in translation:
...vožnja<f> zrakoplovom bila je odlučujuća<f>... == ...возење<nt> со авион беше решавачко<nt>.
Potentially too far apart to be matched.
Obligatory clitic with definite object
- I saw the man == Го видов човекот
The clitics must precede the finite verb, in this order:
- subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
- ...да не ќе сум си му го дал... == that I won't have given it to him...
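The ordering above can be expressed as a simple slot ranking. The following Python sketch is illustrative only; the slot labels and the function name `order_clitics` are hypothetical and are not tags used in the dictionaries.

```python
# Slot names follow the order given above; the labels themselves are made up.
CLITIC_SLOTS = [
    "subjunctive",
    "negative",
    "mood",
    "aux",
    "ethical_dative",
    "dative_object",
    "accusative_object",
]


def order_clitics(clitics, verb):
    """Return the clitic forms in slot order, followed by the verb.

    clitics: list of (form, slot) pairs in any order.
    """
    rank = {slot: i for i, slot in enumerate(CLITIC_SLOTS)}
    ordered = sorted(clitics, key=lambda c: rank[c[1]])
    return [form for form, _ in ordered] + [verb]
```

Applied to the clitics of the example sentence in shuffled order, this reproduces the sequence "да не ќе сум си му го дал".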
Possessive genitive to на constructions:
- Anina čaša == чашата на Ана (Ana's glass)
- Anina ruka == раката на Ана (Ana's hand)
- (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)
Preposition 's' when standing with genitive does not translate as 'со':
- ...sam poletjela s avionske piste...==...полетав од авионската писта... and not ...со авионската писта..., which would have an instrumental meaning
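The choice described above depends only on the case that 's' governs. A minimal Python sketch of that decision, with a hypothetical function name and case labels, might look like this:

```python
def translate_s(governed_case):
    """Pick the Macedonian preposition for 's'/'sa' from the governed case.

    governed_case: "gen" or "ins" (labels assumed for this sketch).
    """
    if governed_case == "gen":
        return "од"  # source/origin reading, e.g. "s avionske piste"
    if governed_case == "ins":
        return "со"  # instrumental/comitative reading
    raise ValueError(f"unexpected case for 's': {governed_case!r}")
```

This presupposes that the case of the following noun phrase has already been disambiguated.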
Disambiguation rules
In all cases, modifiers (adjectives, numerals, pronouns) carry the number, case and gender of the noun, so sequences like these were used to disambiguate:
- ...sa njenim toplim nježnim rukama - A sequence in instrumental
- ...sa svojih toplih nježnih ruku - A sequence in genitive
Nominative or accusative:
- Select accusative if there's a preceding accusative preposition
- Select accusative if there's a preceding transitive verb (direct object rule)
Genitive or accusative:
- Disambiguate based on prepositions (the intersection is only "u")
- Select accusative if there's a preceding transitive verb (direct object rule)
Dative or locative:
- Select dative if there's no preceding preposition or modifier in dative
- Select dative if there's a preceding dative preposition
- Select locative if there's a preceding locative preposition
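As an illustration of the dative/locative rules just listed, here is a Python sketch of the decision. It is not the actual rule file; the function name `select_dat_or_loc` and the representation of a preposition as a set of governed cases are assumptions.

```python
def select_dat_or_loc(prep_cases):
    """Choose between dative and locative for an ambiguous form.

    prep_cases: set of cases the preceding preposition can govern,
    or None when no preposition precedes the phrase.
    Returns "dat", "loc", or None when the rules do not decide.
    """
    if prep_cases is None:
        # Locative is entirely prepositional, so a bare form is dative.
        return "dat"
    if "dat" in prep_cases and "loc" not in prep_cases:
        return "dat"
    if "loc" in prep_cases and "dat" not in prep_cases:
        return "loc"
    return None  # leave the ambiguity for other rules
```

Returning None rather than guessing keeps the sketch conservative when the preposition governs neither case or both.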
Instrumental (unambiguous in singular):
- Select instrumental if there's a preceding instrumental preposition
- In plural identical to Dative/Locative
- (easily disambiguated from Locative, since the latter is entirely prepositional)
- Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
- "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
- "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')
Numbers:
- Numbers 2-4 govern noun phrases differently (remnants of dual), two variants:
- "s trima lijepim ženama" (with three women). The number and the rest of the phrase take the dual forms (Nom,Acc,Voc==Gen.Sg, Gen=>Gen.Pl (triju lijepih žena), Dat,Loc,Ins=>Ins.Pl (trima lijepim ženama) ). This variant is more literary.
- "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive, the actual meaning is determined from context or prepositions). This variant is closer to actual speech.
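The case mapping for the literary variant can be written down as a lookup table. The Python sketch below simply restates the mapping given above (Nom/Acc/Voc to Gen.Sg, Gen to Gen.Pl, Dat/Loc/Ins to Ins.Pl); the names `PAUCAL_FORM` and `paucal_form` and the tag strings are hypothetical.

```python
# Syntactic case of the phrase -> (case, number) of the form actually used
# after the numbers 2-4 in the literary variant described above.
PAUCAL_FORM = {
    "nom": ("gen", "sg"),
    "acc": ("gen", "sg"),
    "voc": ("gen", "sg"),
    "gen": ("gen", "pl"),
    "dat": ("ins", "pl"),
    "loc": ("ins", "pl"),
    "ins": ("ins", "pl"),
}


def paucal_form(case):
    """Return the (case, number) form used for a phrase governed by 2-4."""
    return PAUCAL_FORM[case]
```

For example, an instrumental phrase such as "s trima lijepim ženama" maps to the instrumental plural forms, while a genitive phrase ("triju lijepih žena") maps to the genitive plural.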
See also
- TODOs
- Set up corpus/generation-test
External links
- Wikipedia: Differences in standard Bosnian, Croatian and Serbian
- Hrvatski jezični portal — Croatian language portal, word definitions with inflection (find the definition and click on izvedeni oblici )
- Macedonian<->Serbian online dictionary
- Word definitions for Macedonian
- Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic)
Further reading
- Ivan Todorović "Disambiguation of Serbian sentences with Unitex".