Difference between revisions of "Bosnian-Croatian-Montenegrin-Serbian and Macedonian"
Firespeaker (talk | contribs) |
|||
{{TOCD}} |
||
==Source== |
|||
<pre> |
|||
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk |
|||
</pre> |
|||
==Notes== |
|||
==Serbo-Croatian to Macedonian== |
|||
;Bosnian-Croatian-Montenegrin-Serbian morphological lexicon |
|||
===Prepositions=== |
|||
The reflex of yat: |
|||
* Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian |
|||
* Updated the makefile and the xslt machinery so that this works |
|||
Verbs (marked for aspect, transitivity and reflexivity): |
|||
;'na' and 'u' |
|||
* Most of them are from a list extracted from the verbs in the mk monodix |
|||
* Verb paradigm names are built to carry information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes entering new verbs more convenient |
|||
Adjectives: |
|||
Deciding whether the noun phrase after 'na' should be in the accusative or the locative is a problem. |
|||
* marked for definiteness |
|||
* entered as quadruples (positive, comparative, superlative, absolute superlative) |
|||
Other: |
|||
:"The prepositions na, u have locative case for position, accusative case for motion" |
|||
* Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix |
|||
;Macedonian dictionary |
|||
So one way of fixing this would be to have a variable for "position/motion" in the transfer and set it to "position" after a noun, and "motion" after a verb. For example: |
|||
* Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns |
|||
<pre> |
|||
* Added some words from closed categories (adverbs, ...) |
|||
Šumski požar na hrvatskom otoku |
|||
adj n pr adj n |
|||
Forest fire on croatian-LOC island-LOC |
|||
;Bilingual dictionary |
|||
Idem na hrvatski otok |
|||
* Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix |
|||
vblex pr adj n |
|||
I go to croatian-ACC island-ACC |
|||
</pre> |
|||
;Transfer rules |
|||
Of course there are many counterexamples, but choosing at random gives about 50% accuracy, so this heuristic should improve on that a little. Another option would be to add information about the verb being used, e.g. to distinguish a small set of "movement verbs" from the rest. This could also increase accuracy. |
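The position/motion heuristic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the tag strings (<code>n</code>, <code>vblex</code>, <code>loc</code>, <code>acc</code>) and the optional set of movement verbs are hypothetical and are not taken from the actual transfer rules of the pair.

```python
def guess_case(prev_pos, prev_lemma=None, motion_verbs=None):
    """Guess the case governed by 'na'/'u' from the word before the preposition.

    prev_pos     -- part-of-speech tag of the preceding word ("n", "vblex", ...)
    prev_lemma   -- lemma of the preceding word (only used for verbs)
    motion_verbs -- optional set of lemmas treated as "movement verbs" (assumed list)
    Returns "loc" (position), "acc" (motion) or None when the heuristic is silent.
    """
    if prev_pos == "n":
        # After a noun: treat the preposition as expressing position.
        return "loc"
    if prev_pos == "vblex":
        # After a verb: motion, unless a movement-verb list says otherwise.
        if motion_verbs is None or prev_lemma in motion_verbs:
            return "acc"
        return "loc"
    return None


# "požar na hrvatskom otoku": the word before 'na' is a noun
print(guess_case("n", "požar"))
# "Idem na hrvatski otok": the word before 'na' is a verb
print(guess_case("vblex", "ići"))
```

Returning <code>None</code> for other contexts leaves the choice to whatever default the caller already uses, so the heuristic only overrides the default where the text above says it should.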
|||
==Transfer rules== |
||
Three stage transfer is used. |
|||
Some problems: |
|||
====Future==== |
|||
Genitive constructions: the problem is distinguishing partitive from possessive: |
|||
The future tense in Serbo-Croatian is analytic, formed with the modal ''hoću'' ("will"). This verb has clitic forms, which inflect: ću, ćeš, će, ćemo, ćete, će. To complicate matters, the orthography differs between the East and West standards: East joins the clitic to the verb when it follows the verb, although the clitic can also appear detached; West always writes the clitic detached. |
|||
* Čaša vode = Чаша вода [partitive] |
|||
* Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive] |
|||
* TODO: Once the lexicon in the analyser is sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus, e.g. the Wikipedia corpus, and find pairs which: |
|||
** '''More frequently appear as partitive''' |
|||
** '''More frequently appear as possessive''' |
|||
** Can appear either way |
|||
and afterwards ignore the last case, and make two categories of noun pairs. |
|||
Tricky sequences (instrumental adjectives): |
|||
;Examples |
|||
* "...upravljanje '''velikom''', '''jakom''' ''''pticom'''' " == "...управување '''со''' '''голема''', '''силна''' ''''птица'''' ". |
|||
Only one preposition needed for the whole chain, which can be of any length. |
|||
*:What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM |
|||
*:These sequences will probably be fairly infrequent though... - [[User:Francis Tyers|Francis Tyers]] 09:05, 20 July 2011 (UTC) |
|||
Change of gender in translation: |
|||
{|class=wikitable |
|||
...vožnja'''<f>''' zrakoplovom bila je odlučujuća'''<f>'''... == ...возење'''<nt>''' со авион беше решавачко'''<f>'''. |
|||
! East !! West !! Gloss |
|||
Potentially too far apart to be matched. |
|||
|- |
|||
| čitaću || čitat ću || (I) ''will read'' |
|||
|- |
|||
| ću čitati || ću čitati || (I) ''will read'' |
|||
|- |
|||
| videćemo || vidjet ćemo || (We) ''will see'' |
|||
|- |
|||
| ćemo videti || ćemo vidjeti || (We) ''will see'' |
|||
|} |
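The spelling difference in the table can be sketched as a small function. This is a minimal illustration, assuming infinitives in ''-ti'' or ''-ći'' and a clitic that follows the verb; the function name and the <code>"east"</code>/<code>"west"</code> labels are hypothetical. It does not handle the clitic-first order (''ću čitati''), the ekavian/ijekavian difference in the stem, or any sound changes that some ''-sti'' verbs may need in the joined form.

```python
def future_after_verb(infinitive, clitic, standard):
    """Spell 'infinitive + future clitic' for the East or West standard.

    Only covers the order verb + clitic; names and labels are illustrative.
    """
    if standard not in ("east", "west"):
        raise ValueError("standard must be 'east' or 'west'")
    if not infinitive.endswith("ti"):
        # Infinitives in -ći keep their full form and the clitic stays detached.
        return f"{infinitive} {clitic}"
    if standard == "east":
        # East: drop -ti and attach the clitic to the verb.
        return infinitive[:-2] + clitic
    # West: drop only the final -i and write the clitic as a separate word.
    return infinitive[:-1] + " " + clitic


print(future_after_verb("čitati", "ću", "east"))     # čitaću
print(future_after_verb("čitati", "ću", "west"))     # čitat ću
print(future_after_verb("vidjeti", "ćemo", "west"))  # vidjet ćemo
```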
|||
Obligatory clitic with definite object |
|||
==Macedonian to Serbo-Croatian== |
|||
* I saw the man == Го видов човекот |
|||
The clitics must precede the finite verb, in this order: |
|||
* subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb |
|||
* ...да не ќе сум си му го дал... == that I won't have given it to him... |
|||
Possessive genitive to на constructions: |
|||
* Anina čaša == чашата на Ана (Ana's glass) |
|||
* Anina ruka == раката на Ана (Ana's hand) |
|||
* (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist) |
|||
Preposition 's' when standing with genitive does not translate as 'со': |
|||
* ...sam poletjela s avionske piste...==...полетав '''од''' авионската писта... and not ...'''со''' авионската писта..., which would have an instrumental meaning |
|||
==Disambiguation rules== |
|||
===Verbs=== |
|||
For all cases modifiers (adjectives, numbers, pronouns) transfer number, case and gender, so instances like these were used to disambiguate |
|||
* ...sa njenim toplim nježnim rukama - A sequence in instrumental |
|||
* ...sa svojih toplih nježnih ruku - A sequence in genitive |
|||
Nominative or accusative: |
|||
====Non-finite==== |
|||
* Select Accusative if there's a preceding Accusative preposition |
|||
* Select Accusative if there's a preceding transitive verb (direct object rule) |
|||
Genitive or accusative: |
|||
There are various ways in which a verb can be changed into another part of speech that are regular, take for example the verb "gali" (''to caress''): |
|||
* Disambiguate based on prepositions (the intersection is only "u") |
|||
* Select accusative if there's a preceding transitive verb (direct object rule) |
|||
Dative or locative: |
|||
{|class=wikitable |
|||
* Select dative if there's no preceding preposition or modifier in dative |
|||
|- |
|||
* Select dative if there's a preceding dative preposition |
|||
|Verbal adverb || galejĸi || <code>adv</code> |
|||
* Select locative if there's a preceding locative preposition |
|||
|- |
|||
|Verbal noun || galenje || <code>n.nt.sp</code> |
|||
|- |
|||
|rowspan=4|L-participle (Verbal adjective) || galen || <code>vblex.lp.m.sg</code> |
|||
|- |
|||
| galena || <code>vblex.lp.f.sg</code> |
|||
|- |
|||
| galeno || <code>vblex.lp.nt.sg</code> |
|||
|- |
|||
| galeni || <code>vblex.lp.mf.pl</code> |
|||
|- |
|||
|} |
|||
Instrumental (unambiguous in singular): |
|||
<blockquote> |
|||
* Select instrumental if there's a preceding instrumental preposition |
|||
"The present active participle survives as the verbal adverb. The past passive participle survives as the verbal adjective, which inflects and behaves like any other adjective and can be formed from any verb, including intransitives. The resultative participle survives as the verbal l-form, which is limited to the sum series, the imal perfect, and the hypothetical conditional." |
|||
* In plural identical to Dative/Locative |
|||
</blockquote> |
|||
** (easily disambiguated from Locative, since the latter is entirely prepositional) |
|||
** Possible problems in the plural with instrumental/non-instrumental, though not yet encountered |
|||
*** "Ljudima sam pomeo pod" - 'To the people' or 'using the people' |
|||
*** "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas') |
|||
Numbers: |
|||
===Tenses=== |
|||
* Numbers 2-4 govern noun phrases differently (remnants of dual), two variants: |
|||
** "s trima lijepim ženama" (with three beautiful women). The number and the rest of the phrase take the dual forms (Nom, Acc, Voc => Gen.Sg; Gen => Gen.Pl (triju lijepih žena); Dat, Loc, Ins => Ins.Pl (trima lijepim ženama)). This variant is more literary. |
|||
====Imperfect==== |
|||
** "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive, the actual meaning is determined from context or prepositions). This variant is closer to actual speech. |
|||
The imperfect tense in Macedonian should be translated as the conjugated form of ''biti'' plus the L-participle in Serbo-Croatian, e.g. |
|||
:беше |
|||
:е+{{sc|pii.p3.sg}} |
|||
:je bio |
|||
:biti+{{sc|pri.p3.sg}} biti+{{sc|lp.m.sg}} |
|||
Note the problem of gender of the L-participle. |
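The mapping above can be written out as a small sketch. It only illustrates the tag rewriting, using the dotted tag notation from the example; the function name and the default of masculine gender are assumptions, and the default is exactly where the gender problem noted above shows up, since the Macedonian imperfect carries no gender.

```python
def imperfect_to_sh(sh_lemma, person, number, gender="m"):
    """Rewrite a Macedonian imperfect as 'biti' (present) + L-participle.

    sh_lemma -- Serbo-Croatian lemma of the main verb
    person   -- person tag, e.g. "p3"
    number   -- number tag, e.g. "sg"
    gender   -- gender of the L-participle; "m" is an assumed fallback because
                the source form gives no gender information
    """
    auxiliary = f"biti+pri.{person}.{number}"
    participle = f"{sh_lemma}+lp.{gender}.{number}"
    return [auxiliary, participle]


# беше (е+pii.p3.sg) -> je bio
print(imperfect_to_sh("biti", "p3", "sg"))
```

A fuller treatment would take the gender from the subject when one can be found, and only fall back to a default otherwise.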
|||
==Known bugs== |
|||
* vreme/vrijeme: hyperijekavian forms are produced when generating West forms. |
|||
==See also== |
||
* [[/Pending tests|Pending tests]] |
||
* [[/Regression tests|Regression tests]] |
||
* [[/Final_report|Final report]] |
|||
;TODOs |
|||
* Set up corpus/generation-test |
|||
==External links== |
||
* [http://en.wikipedia.org/wiki/Differences_in_standard_Serbian%2C_Croatian_and_Bosnian Wikipedia: Differences in standard Bosnian, Croatian and Serbian] |
||
* [http://hjp.srce.hr/ Hrvatski jezični portal] — Croatian language portal, word definitions with inflection (find the definition and click on {{sc|izvedeni oblici}} ) |
||
* [http://rechnik.on.net.mk/ Macedonian<->Serbian online dictionary] |
|||
* [http://www.makedonski.info/ Word definitions for Macedonian] |
|||
* [http://www.mling.ru/iazik/makedonski/history_makedonski.pdf Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic)] |
|||
==Further reading== |
|||
* Ivan Todorović "Disambiguation of Serbian sentences with Unitex". |
|||
==References== |
||
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_bcs.pdf Bosnian / Croatian / Serbian Reference Grammar] |
||
* SEELRC [http://www.seelrc.org:8080/grammar/pdf/stand_alone_macedonian.pdf Macedonian Grammar] |
|||
[[Category:Bosnian-Croatian-Montenegrin-Serbian and Macedonian|*]] |
||
[[Category:Bosnian-Croatian-Montenegrin-Serbian]] |
|||
[[Category:Macedonian]] |
Latest revision as of 05:34, 22 August 2017
Source
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk
Notes
- Bosnian-Croatian-Montenegrin-Serbian morphological lexicon
The reflex of yat:
- Added two additional modes to the monodix (ek/ijek) so that lemmas containing yat can be analysed both as ekavian and ijekavian
- Updated the makefile and the xslt machinery so that this works
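The idea of analysing one lemma under both reflexes can be illustrated with a short sketch. It does not reproduce the monodix modes or the XSLT machinery; the placeholder character "ě", the function name and the explicit long/short flag are all assumptions made for the example.

```python
YAT = "ě"  # hypothetical placeholder marking yat in a lemma


def expand_yat(lemma, long_vowel):
    """Return the ekavian and ijekavian spellings of a lemma containing yat.

    long_vowel -- True if the yat is long (ijekavian 'ije'), False if short ('je').
    """
    ijekavian_reflex = "ije" if long_vowel else "je"
    return {
        "ek": lemma.replace(YAT, "e"),
        "ijek": lemma.replace(YAT, ijekavian_reflex),
    }


print(expand_yat("vrěme", long_vowel=True))   # vreme / vrijeme
print(expand_yat("město", long_vowel=False))  # mesto / mjesto
```

The length of the vowel can differ between forms of the same word, so a single flag per lemma is a simplification.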
Verbs (marked for aspect, transitivity and reflexivity):
- Most of them are from a list extracted from the verbs in the mk monodix
- Verb paradigm names are built to carry information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes entering new verbs more convenient
Adjectives:
- marked for definiteness
- entered as quadruples (positive, comparative, superlative, absolute superlative)
Other:
- Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix
- Macedonian dictionary
- Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
- Added some words from closed categories (adverbs, ...)
- Bilingual dictionary
- Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix
- Transfer rules
Transfer rules
Three stage transfer is used.
Some problems:
Genitive constructions: the problem is distinguishing partitive from possessive:
- Čaša vode == Чаша вода [partitive]
- Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
- TODO: Once the lexicon in the analyser is sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus, e.g. the Wikipedia corpus, and find pairs which:
- More frequently appear as partitive
- More frequently appear as possessive
- Can appear either way
and afterwards ignore the last case, and make two categories of noun pairs.
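The bucketing step of this TODO can be sketched as follows. The sketch assumes that each extracted "(Noun) (Noun + Genitive)" occurrence has already been labelled as partitive or possessive by some other means (for example by hand); the function name, the label strings and the decision to treat ties as "either way" are assumptions, not part of the original plan.

```python
from collections import Counter


def categorise_pairs(observations):
    """Split noun pairs into mostly-partitive and mostly-possessive sets.

    observations -- iterable of ((head_lemma, genitive_lemma), label), where
                    label is "partitive" or "possessive" (labels are assumed
                    to come from a separate annotation step)
    Pairs with equal counts are treated as "either way" and left out.
    """
    counts = {}
    for pair, label in observations:
        counts.setdefault(pair, Counter())[label] += 1

    partitive, possessive = set(), set()
    for pair, by_label in counts.items():
        if by_label["partitive"] > by_label["possessive"]:
            partitive.add(pair)
        elif by_label["possessive"] > by_label["partitive"]:
            possessive.add(pair)
    return partitive, possessive


sample = [
    (("čaša", "voda"), "partitive"),
    (("čaša", "voda"), "partitive"),
    (("pilotkinja", "vojska"), "possessive"),
]
print(categorise_pairs(sample))
```

A frequency threshold could replace the simple majority if the corpus turns out to be noisy.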
Tricky sequences (instrumental adjectives):
- "...upravljanje velikom, jakom 'pticom' " == "...управување со голема, силна 'птица' ".
Only one preposition needed for the whole chain, which can be of any length.
- What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
- These sequences will probably be fairly infrequent though... - Francis Tyers 09:05, 20 July 2011 (UTC)
Change of gender in translation:
- ...vožnja<f> zrakoplovom bila je odlučujuća<f>... == ...возење<nt> со авион беше решавачко<f>.
The noun and the adjective are potentially too far apart to be matched.
Obligatory clitic with definite object
- I saw the man == Го видов човекот
The clitics must precede the finite verb, in this order:
- subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
- ...да не ќе сум си му го дал... == that I won't have given it to him...
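The clitic order above can be expressed as a sort over slot ranks. This is a minimal sketch: the slot labels are shorthand invented for the example, and the function assumes that each clitic has already been assigned to its slot.

```python
# Slot order taken from the list above; the short labels are illustrative.
SLOT_ORDER = ["subj", "neg", "mood", "aux", "eth_dat", "dat", "acc"]
SLOT_RANK = {slot: rank for rank, slot in enumerate(SLOT_ORDER)}


def order_clitics(clitics, verb):
    """Place clitics before the verb in the order given by SLOT_ORDER.

    clitics -- list of (form, slot) pairs in any order
    verb    -- the verb form that the clitics precede
    """
    ordered = sorted(clitics, key=lambda item: SLOT_RANK[item[1]])
    return " ".join([form for form, _ in ordered] + [verb])


shuffled = [("го", "acc"), ("сум", "aux"), ("да", "subj"), ("му", "dat"),
            ("не", "neg"), ("си", "eth_dat"), ("ќе", "mood")]
print(order_clitics(shuffled, "дал"))  # да не ќе сум си му го дал
```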
Possessive genitive to на constructions:
- Anina čaša == чашата на Ана (Ana's glass)
- Anina ruka == раката на Ана (Ana's hand)
- (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)
Preposition 's' when standing with genitive does not translate as 'со':
- ...sam poletjela s avionske piste...==...полетав од авионската писта... and not ...со авионската писта..., which would have an instrumental meaning
Disambiguation rules
In all cases, modifiers (adjectives, numerals, pronouns) agree with the noun in number, case and gender, so sequences like these were used to disambiguate:
- ...sa njenim toplim nježnim rukama - A sequence in instrumental
- ...sa svojih toplih nježnih ruku - A sequence in genitive
Nominative or accusative:
- Select Accusative if there's a preceding Accusative preposition
- Select Accusative if there's a preceding transitive verb (direct object rule)
Genitive or accusative:
- Disambiguate based on prepositions (the intersection is only "u")
- Select accusative if there's a preceding transitive verb (direct object rule)
Dative or locative:
- Select dative if there's no preceding preposition or modifier in dative
- Select dative if there's a preceding dative preposition
- Select locative if there's a preceding locative preposition
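The dative/locative rules above can be sketched as a lookup over the left context. The sketch is independent of whatever rule formalism the pair actually uses; the function name, the tag strings and the representation of the context as (part of speech, case) pairs are assumptions, and the caller is expected to pass only the words of the current noun phrase.

```python
def dative_or_locative(left_context):
    """Choose between dative and locative for an ambiguous word.

    left_context -- (pos, case) pairs for the words to the left of the
                    ambiguous word, in sentence order; prepositions carry
                    the case they govern
    Returns "dat", "loc", or None if the nearest preposition governs
    neither case.
    """
    for pos, case in reversed(left_context):
        if pos == "pr":
            # The nearest preceding preposition decides.
            return case if case in ("dat", "loc") else None
    # No preposition at all: locative is always prepositional, so pick dative.
    return "dat"


print(dative_or_locative([("pr", "loc"), ("adj", "loc")]))
print(dative_or_locative([("vblex", None)]))
```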
Instrumental (unambiguous in singular):
- Select instrumental if there's a preceding instrumental preposition
- In plural identical to Dative/Locative
- (easily disambiguated from Locative, since the latter is entirely prepositional)
- Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
- "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
- "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')
Numbers:
- Numbers 2-4 govern noun phrases differently (remnants of dual), two variants:
- "s trima lijepim ženama" (with three beautiful women). The number and the rest of the phrase take the dual forms (Nom, Acc, Voc => Gen.Sg; Gen => Gen.Pl (triju lijepih žena); Dat, Loc, Ins => Ins.Pl (trima lijepim ženama)). This variant is more literary.
- "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive, the actual meaning is determined from context or prepositions). This variant is closer to actual speech.
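The two variants can be summarised as a mapping from the case required by the context to the form the rest of the phrase takes. This is a sketch only: the function name and tag strings are hypothetical, and the colloquial variant is rendered as genitive singular on the basis of the example "lijepe žene".

```python
# Literary variant: syntactic case -> (case, number) of the governed phrase.
LITERARY_FORM = {
    "nom": ("gen", "sg"), "acc": ("gen", "sg"), "voc": ("gen", "sg"),
    "gen": ("gen", "pl"),
    "dat": ("ins", "pl"), "loc": ("ins", "pl"), "ins": ("ins", "pl"),
}


def phrase_form_after_2_to_4(case, literary=True):
    """Return the (case, number) taken by a phrase governed by 2, 3 or 4."""
    if literary:
        return LITERARY_FORM[case]
    # Colloquial variant: frozen numeral, phrase in the genitive whatever the case.
    return ("gen", "sg")


print(phrase_form_after_2_to_4("ins"))                  # s trima lijepim ženama
print(phrase_form_after_2_to_4("ins", literary=False))  # s tri lijepe žene
```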
See also
- TODOs
- Set up corpus/generation-test
External links
- Wikipedia: Differences in standard Bosnian, Croatian and Serbian
- Hrvatski jezični portal — Croatian language portal, word definitions with inflection (find the definition and click on izvedeni oblici )
- Macedonian<->Serbian online dictionary
- Word definitions for Macedonian
- Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic)
Further reading
- Ivan Todorović "Disambiguation of Serbian sentences with Unitex".