Belarusian and Russian/Final report

Description

The goal of this project was to implement a language pair between Belarusian and Russian for Apertium. During the project, the Belarusian monolingual dictionary and the bilingual dictionary were created from scratch. Some basic transfer rules were added; the rest is left for future work.

Belarusian

As mentioned above, the Belarusian morphological analyser was created from scratch. The main resource for importing words was a dump of bnkorpus[1]. Unfortunately, only an outdated dump was available, and it lacks some frequently used words. These words were added semi-manually.

The final version of the Belarusian morphology contains 105215 lemmas.

Adjectives

The first problem is the comparative and superlative forms of comparable adjectives. They are missing from the bnkorpus dump, and we did not find any other source to import them from. Adding them was left for the future; for now, a transfer rule handles this case.

Participles

The other, and main, problem is participles. In Belarusian, only past passive participles are frequently used. For the other participle forms, rules were created to replace them with "pronoun" + "verb".
Also, due to bugs in the Russian monolingual dictionary, we were forced to exclude some verbs from analysis to get a clean testvoc.

Post-generation

To support the "short U rule"[2], an additional two-stage post-generation step was added. Example:

$ echo "~уа ~уаа ~ба ~уа ~у ~уаа ~а" | lt-proc -p bel.pgen-short-u-1.bin | lt-proc -p bel.pgen-short-u-2.bin
уа ўаа ба ўа ў ўаа а

It works in the basic cases; more complicated cases (e.g. a word after a quote or hyphen) were left for the future.
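
The effect of the post-generation step on the example above can be sketched in Python. This is only an illustration under stated assumptions, not the actual lt-proc post-generation dictionaries: the name apply_short_u and the vowel list are made up for this sketch, and it ignores capitalisation, stress, punctuation and the split into two stages. It assumes that a word marked with "~" and starting with "у" gets "ў" when the previous word, as it appears in the input, ends in a vowel.

```python
# Hypothetical sketch of the "short U rule"; not the real post-generation code.
VOWELS = set("аеёіоуыэюя")  # assumed list of Belarusian vowel letters


def apply_short_u(text: str) -> str:
    """Replace word-initial "у" with "ў" after a word ending in a vowel.

    Only words carrying the "~" marker are candidates; the marker itself
    is always removed from the output.
    """
    out = []
    prev = ""  # previous word as seen in the input, without the marker
    for token in text.split(" "):
        marked = token.startswith("~")
        original = token[1:] if marked else token
        word = original
        if marked and word.startswith("у") and prev and prev[-1] in VOWELS:
            word = "ў" + word[1:]
        out.append(word)
        # Decide on the *input* form, so "~у" still counts as a vowel ending
        # for the next word even if it is itself rewritten to "ў".
        prev = original
    return " ".join(out)


print(apply_short_u("~уа ~уаа ~ба ~уа ~у ~уаа ~а"))
```

On the input from the example this prints "уа ўаа ба ўа ў ўаа а", the same string as the lt-proc pipeline.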

Russian

The Russian morphology was taken from languages/apertium-rus. During the project, additional lemmas were added by me and others, and some errors (such as duplicate lemmas) were fixed. The main remaining problem is that some verbs and adjectives in the Russian monolingual dictionary have incorrect paradigms. There were no other major changes to the Russian morphology.

The final version of the Russian morphology contains 84472 lemmas.

Bilingual dictionary

The bilingual dictionary was made from scratch. Different approaches were used to generate bilingual entries, but the main one was translation with a Belarusian-Russian dictionary[3].

The final version contains 50,488 entries, but some of them were commented out to get a clean testvoc in rus→bel.

Transfer rules

Transfer rules were one of the last parts of the project. We wrote a few transfer rules when we noticed obvious transfer errors. We also added a fourth transfer stage in rus→bel that adds "~" to each word (see Belarusian post-generation).
Belarusian and Russian grammars are similar, so our transfer rules already cover part of the existing problems, but there is still work to be done.

Statistics

Commits

Full list of my contributions (commits, diffs) extracted from SVN: https://apertium.projectjj.com/gsoc2016/kvld.html

Dictionaries

  • apertium-bel.bel.dix: 105215 lemmata, 1807128 surface forms
  • apertium-rus.rus.dix: 84472 lemmata, 1770142 surface forms
  • apertium-bel-rus.bel-rus.dix: 48617 entries

Coverage

Coverage for Belarusian and Belarusian→Russian was tested on a Belarusian Wikipedia dump.
Coverage for Russian and Russian→Belarusian was tested on a Russian Wikipedia dump that was randomly shuffled and split into 10 parts.

Coverage                                                  Belarusian→Russian (%)   Russian→Belarusian (%)
Trimmed coverage (calculated in apertium-bel-rus)         84.3                     83.6

Coverage                                                  Belarusian (%)           Russian (%)
Raw coverage (calculated in apertium-bel, apertium-rus)   90.7                     91.1
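
Naive coverage is the share of tokens that receive at least one analysis. A minimal sketch of that calculation, assuming the analyser output is in the Apertium stream format, where an unknown word is printed as ^word/*word$. The function name naive_coverage and the tags in the sample are hypothetical, escaped characters are not handled, and this is not the script that produced the figures in this report.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped "^", "/" and "$" inside a unit are not handled).
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Return the percentage of lexical units that are not marked unknown."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed):
        total += 1
        # The analyser marks an unknown word by prefixing its analysis with "*".
        if not match.group(2).startswith("*"):
            known += 1
    return 100.0 * known / total if total else 0.0


# Made-up sample; the tags are illustrative, not taken from apertium-bel.
sample = "^кот/кот<n><m><sg><nom>$ ^фывап/*фывап$ ^і/і<cnjcoo>$ ^у/у<pr>$"
print(naive_coverage(sample))
```

In the sample, three of the four units have an analysis, so the function returns 75.0.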

Rules

File   Belarusian→Russian   Russian→Belarusian
.t1x   8                    15
.t2x   2                    3
.t3x   1                    4
.t4x   ––                   1

Testvoc

Direction            Errors
Belarusian→Russian   6,394
Russian→Belarusian   0

Evaluation

During the project, the following evaluations were completed:

#   Size         bel→rus WER (%)   rus→bel WER (%)
1   500 words    59.19             55.63
2   500 words    26.79             27.12
3   2000 words   25.72             23.93

All texts and results (the texts after MT and after post-editing) can be found in apertium-bel-rus/texts.
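
For reference, WER (word error rate) is the word-level edit distance between two texts divided by the length of the reference text. A minimal sketch, assuming whitespace tokenisation and assuming that the post-edited text is the reference and the MT output is the hypothesis. The function name word_error_rate is hypothetical, and this is not necessarily the tool or the exact settings behind the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

With one substitution against a four-word reference, the result is 25.0.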

Future work

I am going to continue working on this project in the future. The main goals are:

  1. Finish the bilingual and monolingual dictionaries and get a clean testvoc for bel→rus.
  2. Fix the existing errors in the Russian dictionary.
  3. Add more transfer rules.
  4. Add the missing adjective and participle forms to the Belarusian dictionary.

References