Difference between revisions of "Belarusian and Russian/Final report"
Revision as of 09:37, 22 August 2016
Description
The goal of this project was to implement a Belarusian-Russian language pair for Apertium. During the project, a Belarusian monolingual dictionary and a bilingual dictionary were created from scratch. Some basic transfer rules were added; the rest is left for future work.
Belarusian
As mentioned above, the Belarusian morphological analyser was created from scratch. The main source for importing words was a dump of bnkorpus[1]. Unfortunately, only an outdated dump was available, and it lacks some frequently used words. These words were added semi-manually.
The final version of the Belarusian morphology contains 105213 lemmas.
Adjectives
The first problem is the comparative and superlative forms of comparable adjectives. They are missing from the bnkorpus dump, and we did not find any other source to import them from. This was left for the future; for now, a transfer rule handles this case.
Participles
The other, and main, problem is participles. In Belarusian, only past passive participles are frequently used. For the other participle forms, rules were created to replace them with "pronoun" + "verb".
Also, due to bugs in the Russian monolingual dictionary, we were forced to exclude some verbs from analysis to get a clean testvoc.
Post-generation
To support the "short U rule"[2] (the rules for в, у, and ў), an additional two-stage post-generation step was added. Example:
$ echo "~уа ~уаа ~ба ~уа ~у ~уаа ~а" | lt-proc -p bel.pgen-short-u-1.bin | lt-proc -p bel.pgen-short-u-2.bin
уа ўаа ба ўа ў ўаа а
It works in basic cases; complicated cases (e.g. a word after a quote or hyphen) were left for the future.
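The effect of the rule on the example above can be sketched in a few lines of Python. This is only an illustration of the basic case, not the actual post-generation dictionaries used by lt-proc: the function name short_u, the vowel set, and the choice to look at the previous word in its original (unchanged) form are all assumptions made for this sketch. Capital letters, quotes and hyphens are not handled.

```python
# Illustrative sketch only; not the real lt-proc post-generation logic.
# Assumption: a word-initial "у" in a "~"-marked word becomes "ў" when the
# previous word (taken in its original form) ends in a vowel.

VOWELS = set("аеёіоуыэюя")  # assumed lowercase Belarusian vowel letters


def short_u(text):
    """Apply a simplified "short U rule" to "~"-marked, space-separated words."""
    result = []
    previous = ""  # previous word in its original form, without the "~" mark
    for word in text.split():
        marked = word.startswith("~")
        bare = word[1:] if marked else word
        if marked and bare.startswith("у") and previous and previous[-1] in VOWELS:
            result.append("ў" + bare[1:])
        else:
            result.append(bare)
        previous = bare
    return " ".join(result)


if __name__ == "__main__":
    print(short_u("~уа ~уаа ~ба ~уа ~у ~уаа ~а"))
```

On the input from the example this prints уа ўаа ба ўа ў ўаа а, the same output as the two lt-proc stages. Deciding on the original form of the previous word is what keeps the word after a lone "~у" short as well; whether the two real stages work this way internally is not stated here.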
Russian
The Russian morphology was taken from languages/apertium-rus. During the project, additional lemmas were added by me and others, and some errors (such as duplicate lemmas) were fixed. The main problem is that some verbs and adjectives in the Russian monolingual dictionary have incorrect paradigms.
There were no other major changes to the Russian morphology.
The final version of the Russian morphology contains 84485 lemmas.
Bilingual dictionary
The bilingual dictionary was made from scratch. Different approaches were taken to generate bilingual entries, but the main one was translation with a Belarusian-Russian dictionary[3].
The final version contains 50,488 entries, but some of them were commented out to get a clean testvoc in rus→bel.
Transfer rules
Transfer rules were one of the last parts of the project. We wrote a few transfer rules when we recognized obvious transfer errors. We also added a fourth transfer stage in rus→bel that adds "~" to each word (see Belarusian post-generation).
Belarusian and Russian grammars are similar, so even these few transfer rules cover part of the existing problems, but there is still work to be done.