Belarusian and Russian/Final report


Description

The goal of this project was to implement a language pair between Belarusian and Russian for Apertium. During the project, the Belarusian monolingual dictionary and the bilingual dictionary were created from scratch. Some basic transfer rules were added; the rest is left for future work.

Belarusian

As mentioned above, the Belarusian morphological analyser was created from scratch. The main source of words was a dump of bnkorpus[1]. Unfortunately, only an outdated dump was available, and it lacks some frequently used words. These words were added semi-manually.

The final version of the Belarusian morphology contains 105213 lemmas.

Adjectives

The first problem is the comparative and superlative forms of comparable adjectives. They are missing from the bnkorpus dump, and we did not find any other source to import them from. Adding them was left for the future; for now, a transfer rule handles this case.

Participles

The other, and main, problem is participles. In Belarusian, only past passive participles are frequently used. For the other participle forms, rules were created to replace them with "pronoun" + "verb".
Also, due to bugs in the Russian monolingual dictionary, we were forced to exclude some verbs from analysis to get a clean testvoc.

Post-generation

To support the "short U rule"[2], an additional two-stage post-generation step was added. Example:

$ echo "~уа ~уаа ~ба ~уа ~у ~уаа ~а" | lt-proc -p bel.pgen-short-u-1.bin | lt-proc -p bel.pgen-short-u-2.bin
уа ўаа ба ўа ў ўаа а

It works in the basic cases; more complicated cases (e.g. a word after a quote or hyphen) were left for the future.
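The basic case can be illustrated with a minimal Python sketch. It is not the actual lt-proc post-generation dictionaries; it only assumes that every word carries the "~" marker and that a word-initial "у" becomes "ў" when the preceding word, as written before any replacement, ends in a vowel. The names VOWELS, MARK and apply_short_u are hypothetical, and stress, capital letters, punctuation, quotes and hyphens are deliberately ignored.

```python
# Belarusian vowel letters; a word-initial "у" after one of them becomes "ў".
VOWELS = set("аеёіоуыэюя")
MARK = "~"  # marker assumed to be prepended to each word before post-generation


def apply_short_u(text: str) -> str:
    """Strip the marker from each word and apply the basic short U rule."""
    out = []
    prev = ""  # previous word as written before any replacement
    for token in text.split():
        marked = token.startswith(MARK)
        word = token[len(MARK):] if marked else token
        fixed = word
        # Only marked words are rewritten; the first word has no predecessor.
        if marked and word.startswith("у") and prev and prev[-1] in VOWELS:
            fixed = "ў" + word[1:]
        out.append(fixed)
        prev = word
    return " ".join(out)


print(apply_short_u("~уа ~уаа ~ба ~уа ~у ~уаа ~а"))
# prints: уа ўаа ба ўа ў ўаа а
```

The check uses the unmodified previous word, so a one-letter "у" that has itself become "ў" still counts as a vowel for the word that follows it, which matches the example output above.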

Russian

The Russian morphology was taken from languages/apertium-rus. During the project, additional lemmas were added by me and others, and some errors (such as duplicate lemmas) were fixed. The main remaining problem is that some verbs and adjectives in the Russian monolingual dictionary have incorrect paradigms. There were no other major changes to the Russian morphology.

The final version of the Russian morphology contains 84485 lemmas.

Bilingual dictionary

The bilingual dictionary was made from scratch. Different approaches were used to generate bilingual entries, but the main one was translation with a Belarusian-Russian dictionary[3].

The final version contains 50,488 entries, but some of them were commented out to get a clean testvoc in rus→bel.

Transfer rules

Transfer rules were one of the last parts of the project. We wrote a few transfer rules when we noticed obvious transfer errors. We also added a fourth transfer stage in rus→bel that adds "~" to each word (see Belarusian post-generation).
Belarusian and Russian grammars are similar, so our transfer rules already cover some of the existing problems, but there is still work to be done.

Statistics

Commits

Dictionaries

Coverage

Rules

Testvoc

Evaluation

Future work

References