Bosnian-Croatian-Montenegrin-Serbian and Macedonian
Latest revision as of 05:34, 22 August 2017
Source
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sh-mk
Notes
Bosnian-Croatian-Montenegrin-Serbian morphological lexicon
The reflex of yat:
- Added two additional modes (ek/ijek) to the monodix, so that lemmas containing yat can be analysed as both Ekavian and Ijekavian
- Updated the makefile and the XSLT machinery so that this works
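The idea behind the two modes can be sketched in a few lines of Python. This is only an illustration, not the project's XSLT machinery: the `<yat:long>` and `<yat:short>` markers and the `expand_yat` helper are hypothetical, and the reflex table is simplified (long yat gives e/ije, short yat gives e/je), ignoring the exceptions a real lexicon has to handle.

```python
# Hypothetical placeholders for the yat vowel inside a lemma.
# They are NOT the notation used in the apertium-sh-mk monodix.
LONG_YAT = "<yat:long>"
SHORT_YAT = "<yat:short>"

# Simplified reflex table: one entry per mode (ek = Ekavian, ijek = Ijekavian).
REFLEXES = {
    "ek": {LONG_YAT: "e", SHORT_YAT: "e"},
    "ijek": {LONG_YAT: "ije", SHORT_YAT: "je"},
}


def expand_yat(lemma: str, mode: str) -> str:
    """Replace every yat placeholder with its reflex for the given mode."""
    for marker, reflex in REFLEXES[mode].items():
        lemma = lemma.replace(marker, reflex)
    return lemma


for mode in ("ek", "ijek"):
    print(mode, expand_yat("ml<yat:long>ko", mode), expand_yat("p<yat:short>sma", mode))
```

Generating one lexicon per mode from a single marked-up source keeps the two variants from drifting apart.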
Verbs (marked for aspect, transitivity and reflexivity):
- Most of them come from a list extracted from the verbs in the mk monodix
- Verb paradigm names are built to carry information about the verb ("aspect_transitivity_paradig/m__vblex"), which makes data entry more convenient
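As an illustration of how such a name can be read back, the following Python sketch splits a paradigm name of the shape "aspect_transitivity_paradig/m__vblex" into its parts. The example name `imperf_tv_gleda/ti__vblex` and the tag values `imperf` and `tv` are assumptions made up for this sketch; the actual values used in the monodix may differ.

```python
from typing import NamedTuple


class VerbParadigm(NamedTuple):
    aspect: str
    transitivity: str
    stem: str      # part of the model verb before the slash
    ending: str    # part of the model verb after the slash
    pos: str


def parse_paradigm_name(name: str) -> VerbParadigm:
    """Split 'aspect_transitivity_stem/ending__pos' into its fields."""
    head, pos = name.rsplit("__", 1)
    aspect, transitivity, model = head.split("_", 2)
    stem, _, ending = model.partition("/")
    return VerbParadigm(aspect, transitivity, stem, ending, pos)


# Hypothetical paradigm name, for illustration only.
print(parse_paradigm_name("imperf_tv_gleda/ti__vblex"))
```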
Adjectives:
- Marked for definiteness
- Entered as quadruples (positive, comparative, superlative, absolute superlative)
Other:
- Added the paradigms from the grammar of Croatian (the one by Barić, Lončarić, Malić, Pavešić, Peti, Zečević, Znika) to the sh monodix
Macedonian dictionary
- Added determiner forms for some pronouns (e.g. demonstratives, possessives, etc.), i.e. things that can modify nouns
- Added some words from closed categories (adverbs, ...)
Bilingual dictionary
- Updated the pronoun entries; the symbols in the monodix have been adjusted to correspond more closely to the analysis in the Macedonian monodix
Transfer rules
Three-stage transfer is used.
Some problems:
Genitive constructions, where it is hard to distinguish partitive from possessive:
- Čaša vode == Чаша вода [partitive]
- Pilotkinje Vojske Srbije == Пилотките на Српската војска [possessive]
- TODO: Once the lexicon in the analyser becomes sufficiently large, grep out all "(Noun) (Noun + Genitive)" occurrences in a corpus (e.g. the Wikipedia corpus), and find pairs which:
  - More frequently appear as partitive
  - More frequently appear as possessive
  - Can appear either way
Afterwards, ignore the last case and make two categories of noun pairs.
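The TODO above could be prototyped along the following lines. This is a sketch under stated assumptions: it presumes that each "(Noun) (Noun + Genitive)" occurrence has already been labelled as partitive or possessive (for example by hand, or from aligned translations), and the `classify_pairs` helper and the 0.8 threshold are hypothetical choices, not part of the project.

```python
from collections import Counter


def classify_pairs(observations, threshold=0.8):
    """Split noun pairs into mostly-partitive and mostly-possessive sets.

    observations: iterable of (head_noun, genitive_noun, reading), where
    reading is "partitive" or "possessive". Pairs whose dominant reading
    stays below the threshold are treated as ambiguous and ignored.
    """
    counts = {}
    for head, dependent, reading in observations:
        counts.setdefault((head, dependent), Counter())[reading] += 1

    partitive, possessive = set(), set()
    for pair, readings in counts.items():
        total = readings["partitive"] + readings["possessive"]
        if total == 0:
            continue
        if readings["partitive"] / total >= threshold:
            partitive.add(pair)
        elif readings["possessive"] / total >= threshold:
            possessive.add(pair)
    return partitive, possessive


# Toy, hand-labelled observations (invented for the example).
sample = [
    ("čaša", "voda", "partitive"),
    ("čaša", "voda", "partitive"),
    ("pilotkinja", "vojska", "possessive"),
    ("dio", "grad", "partitive"),
    ("dio", "grad", "possessive"),
]
print(classify_pairs(sample))
```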
Tricky sequences (instrumental adjectives):
- "...upravljanje velikom, jakom 'pticom'" == "...управување со голема, силна 'птица'".
Only one preposition is needed for the whole chain, which can be of any length.
  - What we do here is just make long chunks in the t1x, e.g. in this case: ADJ ADJ CM ADJ NOM
  - These sequences will probably be fairly infrequent though... - Francis Tyers 09:05, 20 July 2011 (UTC)
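The "one preposition per chain" behaviour can be sketched outside of the t1x rules as follows. This is only an illustration in Python, assuming each token arrives as a (target surface form, part of speech, source case) triple; the tag names `adj`, `n`, `cm`, `pr` and `ins` and the `insert_so` helper are assumptions of the sketch.

```python
def insert_so(tokens):
    """Emit 'со' once before each prepositionless instrumental chain.

    tokens: list of (surface, pos, case) triples. A chain is a run of
    instrumental adjectives, optionally separated by commas ("cm"),
    closed by an instrumental noun.
    """
    out = []
    in_chain = False
    prev_pos = None
    for surface, pos, case in tokens:
        instrumental = case == "ins" and pos in ("adj", "n")
        # Start of a new chain that is not already governed by a preposition.
        if instrumental and not in_chain and prev_pos != "pr":
            out.append("со")
        out.append(surface)
        if instrumental:
            in_chain = pos == "adj"   # a noun closes the chain
        elif not (pos == "cm" and in_chain):
            in_chain = False          # anything but a chain-internal comma ends it
        prev_pos = pos
    return out


phrase = [
    ("управување", "n", "nom"),
    ("голема", "adj", "ins"),
    (",", "cm", None),
    ("силна", "adj", "ins"),
    ("птица", "n", "ins"),
]
print(insert_so(phrase))
```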
 
Change of gender in translation: ...vožnja<f> zrakoplovom bila je odlučujuća<f>... == ...возење<nt> со авион беше решавачко<f>. Potentialy too far to be matched.
Obligatory clitic with a definite object:
- I saw the man == Го видов човекот
The clitics must precede the finite verb, in this order:
- subjunctive-negative-mood-aux-ethical dative-dative object-accusative object-verb
- ...да не ќе сум си му го дал... == that I won't have given it to him...
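The slot order above lends itself to a simple sort. The following Python sketch assumes that each clitic has already been assigned to one of the slots listed above; the slot labels and the `order_cluster` helper are names invented for the illustration.

```python
# Slot order as listed above, from leftmost to rightmost.
SLOT_ORDER = [
    "subjunctive",
    "negative",
    "mood",
    "aux",
    "ethical_dative",
    "dative_object",
    "accusative_object",
    "verb",
]
RANK = {slot: position for position, slot in enumerate(SLOT_ORDER)}


def order_cluster(items):
    """Sort (form, slot) pairs into the required clitic order."""
    return [form for form, _slot in sorted(items, key=lambda item: RANK[item[1]])]


# The example from the text, deliberately shuffled.
cluster = [
    ("дал", "verb"),
    ("го", "accusative_object"),
    ("не", "negative"),
    ("му", "dative_object"),
    ("да", "subjunctive"),
    ("си", "ethical_dative"),
    ("сум", "aux"),
    ("ќе", "mood"),
]
print(" ".join(order_cluster(cluster)))
```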
Possessive genitive to на constructions:
- Anina čaša == чашата на Ана (Ana's glass)
- Anina ruka == раката на Ана (Ana's hand)
- (possible TODO in the mk analyser: add an analysis for possessive adjectives, which currently doesn't exist)
Preposition 's' when standing with genitive does not translate as 'со':
- ...sam poletjela s avionske piste...==...полетав од авионската писта... and not ...со авионската писта..., which would have an instrumental meaning
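In code form, the choice depends only on the case that the preposition governs in the source sentence. The sketch below is a minimal Python illustration, assuming the case of the governed noun phrase is already disambiguated; the `translate_s` helper and the tag names are hypothetical, and the instrumental branch is inferred from the remark above rather than stated as a separate rule.

```python
def translate_s(governed_case: str) -> str:
    """Pick the Macedonian preposition for s/sa from the governed case."""
    if governed_case == "gen":
        return "од"   # source of motion: "s avionske piste" == "од авионската писта"
    if governed_case == "ins":
        return "со"   # accompaniment or instrument
    raise ValueError(f"s/sa is not expected with case {governed_case!r}")


print(translate_s("gen"), translate_s("ins"))
```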
Disambiguation rules
In all cases, modifiers (adjectives, numerals, pronouns) share number, case and gender with the noun, so sequences like these were used to disambiguate:
- ...sa njenim toplim nježnim rukama - A sequence in instrumental
- ...sa svojih toplih nježnih ruku - A sequence in genitive
Nominative or accusative:
- Select accusative if there's a preceding accusative preposition
- Select accusative if there's a preceding transitive verb (direct object rule)
Genitive or accusative:
- Disambiguate based on prepositions (the only preposition that governs both cases is "u")
- Select accusative if there's a preceding transitive verb (direct object rule)
Dative or locative:
- Select dative if there's no preceding preposition or modifier in dative
- Select dative if there's a preceding dative preposition
- Select locative if there's a preceding locative preposition
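The three rule groups above can be read as one small decision procedure. The Python sketch below is only an approximation of that logic, not the project's actual disambiguation rules: it looks at a single preceding token, and the dictionary keys `pos`, `governs` and `transitive` as well as the tag names are assumptions of the example.

```python
def pick_case(candidates, prev=None):
    """Choose one case from an ambiguous set, or return None if undecided.

    candidates: set of case tags, e.g. {"nom", "acc"}.
    prev: description of the preceding token, e.g.
          {"pos": "pr", "governs": {"acc"}} or
          {"pos": "vblex", "transitive": True}.
    """
    prev = prev or {}
    is_prep = prev.get("pos") == "pr"
    governed = set(prev.get("governs", ())) if is_prep else set()
    transitive_verb = prev.get("pos") == "vblex" and prev.get("transitive", False)

    # A preposition that allows exactly one of the candidate cases decides.
    narrowed = candidates & governed
    if len(narrowed) == 1:
        return next(iter(narrowed))
    # Direct object rule.
    if "acc" in candidates and transitive_verb:
        return "acc"
    # Locative is always prepositional, so a bare form is dative.
    if candidates == {"dat", "loc"} and not is_prep:
        return "dat"
    return None


print(pick_case({"nom", "acc"}, {"pos": "pr", "governs": {"acc"}}))
print(pick_case({"gen", "acc"}, {"pos": "pr", "governs": {"gen", "acc"}}))
print(pick_case({"dat", "loc"}))
```

A preposition such as "u", which governs both genitive and accusative, leaves the reading undecided here, which matches the remark above that it is the one overlapping case.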
Instrumental (unambiguous in singular):
- Select instrumental if there's a preceding instrumental preposition
- In the plural, identical to dative/locative
  - (easily disambiguated from locative, since the latter is entirely prepositional)
  - Possible problems in the plural with instrumental/non-instrumental, though not yet encountered
    - "Ljudima sam pomeo pod" - 'To the people' or 'using the people'
    - "Ploviti morima" - To sail 'to the seas' or 'across seas' (or 'using the seas')
Numbers:
- Numbers 2-4 govern noun phrases differently (remnants of the dual), with two variants:
  - "s trima lijepim ženama" (with three beautiful women). The number and the rest of the phrase take the dual forms (Nom, Acc, Voc => Gen.Sg; Gen => Gen.Pl (triju lijepih žena); Dat, Loc, Ins => Ins.Pl (trima lijepim ženama)). This variant is more literary.
  - "s tri lijepe žene" (the number is in a frozen form, and the rest of the phrase gets genitive; the actual meaning is determined from context or prepositions). This variant is closer to actual speech.
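The case mapping of the literary variant can be written down as a lookup table. The Python sketch below simply encodes the mapping given above; the tag names and the `paucal_form` helper are assumptions, and numerals outside 2-4 are deliberately rejected because the text says nothing about them.

```python
# Syntactic case -> (case, number) of the form the phrase actually takes
# after the numerals 2-4 in the literary variant described above.
PAUCAL_FORM = {
    "nom": ("gen", "sg"),
    "acc": ("gen", "sg"),
    "voc": ("gen", "sg"),
    "gen": ("gen", "pl"),
    "dat": ("ins", "pl"),
    "loc": ("ins", "pl"),
    "ins": ("ins", "pl"),
}


def paucal_form(numeral: int, case: str):
    """Return the (case, number) tags of the surface form after 2, 3 or 4."""
    if numeral not in (2, 3, 4):
        raise ValueError("only the numerals 2-4 are covered by this sketch")
    return PAUCAL_FORM[case]


# "s trima lijepim ženama": instrumental after "s" surfaces as Ins.Pl.
print(paucal_form(3, "ins"))
```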
 
See also
- Pending tests
- Regression tests
- Final report
TODOs
- Set up corpus/generation-test
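External links
- Wikipedia: Differences in standard Bosnian, Croatian and Serbian <http://en.wikipedia.org/wiki/Differences_in_standard_Serbian%2C_Croatian_and_Bosnian>
- Hrvatski jezični portal <http://hjp.srce.hr/>: Croatian language portal, word definitions with inflection (find the definition and click on "izvedeni oblici")
- Macedonian<->Serbian online dictionary <http://rechnik.on.net.mk/>
- Word definitions for Macedonian <http://www.makedonski.info/>
- Блаже Конески - Историја на македонскиот јазик (Blaže Koneski - History of the Macedonian Language; in Macedonian, Cyrillic) <http://www.mling.ru/iazik/makedonski/history_makedonski.pdf>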
Further reading
- Ivan Todorović "Disambiguation of Serbian sentences with Unitex".

