Belarusian and Russian/Final report
Latest revision as of 21:42, 22 August 2016
Description
The goal of this project was to implement a language pair between Belarusian and Russian for Apertium. During the project, a Belarusian monolingual dictionary and a bilingual dictionary were created from scratch. Some basic transfer rules were added; the rest is left for future work.
Belarusian
As mentioned above, the Belarusian morphological analyser was created from scratch. The main resource for importing words was a dump of bnkorpus[1]. Unfortunately, only an outdated dump was available, and it lacks some frequently used words. These words were added semi-manually.
The final version of the Belarusian morphology contains 105215 lemmas.
Adjectives
The first problem is the comparative and superlative forms of comparable adjectives. They are missing from the bnkorpus dump, and we did not find any other source to import them from. Adding them was left for the future; for now, a transfer rule handles this case.
Participles
The other, and main, problem is participles. In Belarusian, only past passive participles are frequently used. For the other participle forms, rules were created that replace them with "pronoun" + "verb".
Also, due to bugs in the Russian monolingual dictionary, we were forced to exclude some verbs from analysis to get a clean testvoc.
Post-generation
To support the "short U rule"[2], an additional two-stage post-generation step was added. Example:
$ echo "~уа ~уаа ~ба ~уа ~у ~уаа ~а" | lt-proc -p bel.pgen-short-u-1.bin | lt-proc -p bel.pgen-short-u-2.bin
уа ўаа ба ўа ў ўаа а
It works in the basic cases; more complicated cases (e.g. a word after a quotation mark or hyphen) were left for the future.
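The effect of the rule on the example above can be illustrated with a small Python sketch. This is not how the lt-proc post-generation dictionaries work internally; it only mimics the observable behaviour under some assumptions: words are separated by whitespace, "~" marks a word that may need the alternation, and the decision is based on the last letter of the previous word as written in the input. The function name `apply_short_u` and the vowel list are illustrative, and punctuation (quotes, hyphens) is ignored, just like the complicated cases mentioned above.

```python
# Belarusian vowel letters (lower and upper case); assumed list for this sketch.
VOWELS = set("аеёіоуыэюяАЕЁІОУЫЭЮЯ")


def apply_short_u(text: str) -> str:
    """Drop the '~' marks; turn a marked word-initial 'у' into 'ў' after a vowel."""
    out = []
    prev_last = ""  # last character of the previous token, as written in the input
    for token in text.split():
        marked = token.startswith("~")
        word = token[1:] if marked else token
        if marked and word[:1] in ("у", "У") and prev_last in VOWELS:
            # Keep the case of the original letter.
            word = ("ў" if word[0] == "у" else "Ў") + word[1:]
        out.append(word)
        prev_last = token[-1:]
    return " ".join(out)


print(apply_short_u("~уа ~уаа ~ба ~уа ~у ~уаа ~а"))
```

With the input from the lt-proc example, the sketch prints the same line: уа ўаа ба ўа ў ўаа а.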
Russian
The Russian morphology was taken from languages/apertium-rus. During the project, additional lemmas were added by me and others, and some errors (such as duplicate lemmas) were fixed. The main remaining problem is that some verbs and adjectives in the Russian monolingual dictionary have incorrect paradigms.
There were no other major changes to the Russian morphology.
The final version of the Russian morphology contains 84472 lemmas.
Bilingual dictionary
The bilingual dictionary was made from scratch. Several approaches were used to generate bilingual entries, but the main one was translation with a Belarusian-Russian dictionary[3].
The final version contains 50,488 entries, but some of them were commented out to get a clean testvoc in rus→bel.
Transfer rules
Transfer rules were one of the last parts of the project. We wrote a few transfer rules as we spotted obvious transfer errors. We also added a fourth transfer stage in rus→bel that adds "~" to each word (see Belarusian post-generation).
Belarusian and Russian grammars are similar, so our transfer rules already cover some of the existing problems, but there is still work to be done.
Statistics
Commits
A full list of my contributions (commits, diffs), extracted from SVN: https://apertium.projectjj.com/gsoc2016/kvld.html
Dictionaries
- apertium-bel.bel.dix: 105215 lemmata, 1807128 surface forms
- apertium-rus.rus.dix: 84472 lemmata, 1770142 surface forms
- apertium-bel-rus.bel-rus.dix: 48617 entries
Coverage
Coverage for Belarusian and Belarusian→Russian was tested on a Belarusian Wikipedia dump.
Coverage for Russian and Russian→Belarusian was tested on a Russian Wikipedia dump that was randomly shuffled and split into 10 parts.
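The shuffle-and-split step can be sketched as follows. This is only an illustration, not the script used in the project: it assumes the dump has already been converted to plain text with one unit (for example, a sentence or paragraph) per line, and the function name `shuffle_and_split` and the fixed seed are assumptions made here.

```python
import random


def shuffle_and_split(lines, parts=10, seed=0):
    """Shuffle lines reproducibly and deal them into `parts` near-equal chunks."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Taking every `parts`-th line keeps chunk sizes within one line of each other.
    return [shuffled[i::parts] for i in range(parts)]


chunks = shuffle_and_split([f"line {n}" for n in range(25)])
print([len(chunk) for chunk in chunks])
```

A fixed seed makes the split repeatable, so coverage numbers measured on a given part can be compared between dictionary revisions.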
Coverage | Belarusian→Russian (%) | Russian→Belarusian (%) |
---|---|---|
Trimmed coverage (calculated in apertium-bel-rus) | 84.3% | 83.6% |

Coverage | Belarusian (%) | Russian (%) |
---|---|---|
Raw coverage (calculated in apertium-bel, apertium-rus) | 90.7% | 91.1% |
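For readers unfamiliar with the metric, here is a minimal sketch of how naive coverage can be computed from analyser output in the Apertium stream format, where each lexical unit looks like ^surface/analysis$ and an unknown word is marked with an asterisk (^surface/*surface$). This is not the script that produced the numbers above: the function name `coverage` is an assumption, the sample line is made up, escaped characters inside lexical units are ignored, and punctuation units are counted like any other token.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed: str) -> float:
    """Return the percentage of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Made-up sample: one analysed word and one unknown word.
sample = "^дом/дом<n><m><nn><sg><nom>$ ^абвгд/*абвгд$"
print(coverage(sample))  # 50.0
```

Trimmed coverage would be measured the same way, but on the output of the analyser trimmed to the bilingual dictionary in apertium-bel-rus.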
Rules
File | Belarusian→Russian | Russian→Belarusian |
---|---|---|
.t1x | 8 | 15 |
.t2x | 2 | 3 |
.t3x | 1 | 4 |
.t4x | –– | 1 |
Testvoc
Direction | Errors |
---|---|
Belarusian→Russian | 6,394 |
Russian→Belarusian | 0 |
Evaluation
The following evaluations were completed during the project:
# | Size | bel→rus WER (%) | rus→bel WER (%) |
---|---|---|---|
1 | 500 words | 59.19% | 55.63% |
2 | 500 words | 26.79% | 27.12% |
3 | 2000 words | 25.72% | 23.93% |
All texts and results (texts after MT and after post-editing) can be found in apertium-bel-rus/texts.
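The report does not describe how WER was computed, so the following is only a sketch of the usual definition: the word-level edit distance between the MT output and the post-edited text, divided by the number of words in the post-edited text. The function name `word_error_rate` and the whitespace tokenisation are assumptions of this sketch.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance as a percentage of the reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # reference word missing from the MT output
                cur[j - 1] + 1,                        # extra word in the MT output
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))  # 25.0
```

Here the hypothesis is the raw MT output and the reference is the same text after post-editing, so one substituted word out of four gives 25%.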
Future work
I am going to continue working on this project in the future. The main goals are:
- Finish working on the bilingual and monolingual dictionaries and get a clean testvoc for bel→rus.
- Fix existing errors in the Russian dictionary.
- Add more transfer rules.
- Add missing forms for adjectives and participles to the Belarusian dictionary.