Belarusian and Russian/Final report
Description
The goal of this project was to implement a Belarusian-Russian language pair for Apertium. During the project, a Belarusian monolingual dictionary and a bilingual dictionary were created from scratch. Some basic transfer rules were added; the rest is left for future work.
Belarusian
As mentioned above, the Belarusian morphological analyser was created from scratch. The main source of words was a dump of bnkorpus[1]. Unfortunately, only an outdated dump was available, and it lacks some frequently used words. These words were added semi-manually.
The final version of the Belarusian morphology contains 105215 lemmas.
Adjectives
The first problem is the comparative and superlative forms of comparable adjectives. They are missing from the bnkorpus dump, and we did not find any other source to import them from. Adding them was left for future work; for now, a transfer rule handles this case.
Participles
The other, and main, problem is participles. In Belarusian, only past passive participles are frequently used. For the other participle forms, rules were created that replace them with "pronoun" + "verb".
Also, due to bugs in the Russian monolingual dictionary, we were forced to exclude some verbs from analysis to get a clean testvoc.
Post-generation
To support the "short U rule"[2], an additional two-stage post-generation step was added. Example:
$ echo "~уа ~уаа ~ба ~уа ~у ~уаа ~а" | lt-proc -p bel.pgen-short-u-1.bin | lt-proc -p bel.pgen-short-u-2.bin
уа ўаа ба ўа ў ўаа а
It works in the basic cases; more complicated cases (e.g. a word after a quote or hyphen) were left for future work.
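The behaviour shown in the example can be approximated outside Apertium with a short Python sketch. It does not reproduce the contents of the bel.pgen-short-u-*.bin dictionaries. It only assumes that "~" marks a word to be checked and that a word-initial "у" becomes "ў" when the preceding word, in its original spelling, ends in a vowel. The function name apply_short_u is hypothetical, and punctuation and capitalisation are deliberately ignored, as in the basic cases above.

```python
# Belarusian vowel letters; a preceding word ending in one of these
# triggers the change of a word-initial "у" to "ў" in this sketch.
VOWELS = set("аеёіоуыэюя")


def apply_short_u(text: str) -> str:
    """Approximate the short U rule on space-separated, "~"-marked words."""
    result = []
    previous = ""  # previous word in its original spelling, without "~"
    for token in text.split():
        marked = token.startswith("~")
        word = token[1:] if marked else token
        if (
            marked
            and word.startswith("у")
            and previous
            and previous[-1].lower() in VOWELS
        ):
            result.append("ў" + word[1:])
        else:
            result.append(word)
        previous = word
    return " ".join(result)


print(apply_short_u("~уа ~уаа ~ба ~уа ~у ~уаа ~а"))
# prints: уа ўаа ба ўа ў ўаа а
```

Checking the previous word in its original spelling is a design choice of this sketch: it lets a single pass produce the same output as the two-stage lt-proc pipeline on the example input.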
Russian
The Russian morphology was taken from languages/apertium-rus. During the project, additional lemmas were added by me and others, and some errors (such as duplicate lemmas) were fixed. The main remaining problem is that some verbs and adjectives in the Russian monodictionary have incorrect paradigms.
There were no other major changes to the Russian morphology.
The final version of the Russian morphology contains 84472 lemmas.
Bilingual dictionary
The bilingual dictionary was made from scratch. Different approaches were taken to generate bilingual entries, but the main one was translation using a Belarusian-Russian dictionary[3].
The final version contains 50,488 entries, but some of them were commented out to get a clean testvoc in rus→bel.
Transfer rules
Transfer rules were one of the last parts of the project. We wrote a few transfer rules when we recognized obvious transfer errors. We also added a fourth transfer stage in rus→bel that adds "~" to each word (see Belarusian post-generation).
Belarusian and Russian grammars are similar, so our transfer rules already cover some of the existing problems, but there is still work to be done.
Statistics
Commits
Dictionaries
apertium-bel.bel.dix: 105215 lemmata, 1807128 surface forms
apertium-rus.rus.dix: 84472 lemmata, 1770142 surface forms
apertium-bel-rus.bel-rus.dix: 48617 entries
Coverage
Coverage for Belarusian and Belarusian→Russian was tested on a Belarusian Wikipedia dump.
Coverage for Russian and Russian→Belarusian was tested on a Russian Wikipedia dump that was randomly shuffled and split into 10 parts.
Coverage | Belarusian→Russian (%) | Russian→Belarusian (%) |
---|---|---|
Trimmed coverage (calculated in apertium-bel-rus) | 84.3% | 83.6% |

Coverage | Belarusian (%) | Russian (%) |
---|---|---|
Raw coverage (calculated in apertium-bel, apertium-rus) | 90.7% | 91.1% |
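For readers unfamiliar with the metric, naive coverage is the share of tokens that receive at least one analysis. The Python sketch below is not the script used for the figures above. It assumes analyser output in the Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word is analysed as the surface form prefixed with "*". The function name naive_coverage and the tags in the sample are illustrative only, and escaped "^" or "$" characters are not handled.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analyses$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed: str) -> float:
    """Return the percentage of lexical units with at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words look like ^word/*word$, so "/*" marks a missing analysis.
    known = sum(1 for unit in units if "/*" not in unit)
    return 100.0 * known / len(units)


# Hypothetical analyser output: three known words and one unknown word.
sample = "^я/я<prn>$ ^абвгд/*абвгд$ ^і/і<cnjcoo>$ ^дом/дом<n>$"
print(naive_coverage(sample))
# prints: 75.0
```

Under these assumptions, trimmed coverage would be the same calculation run on the output of the analyser restricted to the words present in the bilingual dictionary.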
Rules
File | Belarusian→Russian | Russian→Belarusian |
---|---|---|
.t1x | 8 | 15 |
.t2x | 2 | 3 |
.t3x | 1 | 4 |
.t4x | –– | 1 |
Testvoc
Direction | Errors |
---|---|
Belarusian→Russian | 6.394 |
Russian→Belarusian | 0 |
Evaluation
The following evaluations were completed during the project:
No. | Size | bel→rus WER (%) | rus→bel WER (%) |
---|---|---|---|
1 | 500 words | 59.19% | 55.63% |
2 | 500 words | 26.79% | 27.12% |
3 | 2000 words | 25.72% | 23.93% |
All texts and results (the texts after MT and after post-editing) can be found in apertium-bel-rus/texts.
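As a reminder of what the WER figures measure, word error rate is the word-level edit distance between the MT output and the post-edited text, divided by the length of the post-edited text. The Python sketch below is a minimal illustration under that assumption, not the tool used to produce the table. The function name word_error_rate is hypothetical, and tokenisation is reduced to whitespace splitting.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Return WER in percent: word-level Levenshtein distance / reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a x c d", "a b c d"))
# prints: 25.0
```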