User:Ifeanyi/GSoC2021 Final Report
Summary
This project started with a proposal originally titled "State-of-the-art Morphological Analyser for Uzbek language and improved language pairs uz-kk, uz-ky, uz-tr". After discussing with the mentors how to make the most of Summer of Code, we decided to cover the Uzbek monolingual package as thoroughly as possible, together with the Turkish-Uzbek translation pair.
To calculate the coverage of the Uzbek (apertium-uzb) analyser, Uzbek Wikipedia data dated 20.05.2020, with 136K articles (around 13M tokens), was chosen. For the trimmed coverage (the coverage of a pair limited to the words in its dictionary) of the Turkish-Uzbek (apertium-tur-uzb) translation pair, the Turkish part of the Southeast European Times (SETimes) website data collection was used (around 3.7M tokens). To calculate the word error rate (WER) and position-independent word error rate (PER) of the tur-uzb pair, a parallel corpus was created, for which the "James and Mary Story" (~40 sentences) was chosen.
There are still many tasks to finish, such as creating vocabulary tests (testvoc) and writing more lexical selection rules (see #Future Work).
Overall, a lot of work has gone into both the Uzbek monolingual and the Turkish-Uzbek translation packages. The results indicate that the coverage goals set initially have been met, while the WER/PER results still need to improve.
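As an illustration of what a naïve coverage figure measures, here is a minimal sketch that counts the share of tokens receiving at least one analysis. It assumes the analyser output is in the Apertium stream format, where an unknown token is echoed back with a leading asterisk (`^word/*word$`). The function name, the sample words, and their tags are made up for the example, and escaped characters in the stream are not handled; this is not the script used in the project.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped characters such as \/ or \$ are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An unknown token is echoed back with a leading asterisk: ^foo/*foo$
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known tokens and one unknown token.
sample = "^kitob/kitob<n><nom>$ ^xyzq/*xyzq$ ^va/va<cnjcoo>$ ^uy/uy<n><nom>$"
print(f"{naive_coverage(sample):.2%}")
```

Trimmed coverage would be measured the same way, but on the output of an analyser restricted to the words present in the bilingual dictionary.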
Repos
All contributions can be found in the following repositories:
- Apertium Turkish monolingual package:
- Apertium Uzbek monolingual package:
- Apertium Turkish-Uzbek translation package:
Most of the work collected at the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2020/elmurod1202.html
Note that some pull requests have not been merged yet:
- https://github.com/apertium/apertium-tur/pull/5
- https://github.com/apertium/apertium-uzb/pull/11
- https://github.com/apertium/apertium-tur-uzb/pull/4
Main Work
Most of the work on the Uzbek language went into its monolingual dictionary (monodix), which now has more than 55K stems and above 90% coverage on Uzbek Wikipedia. Besides adding new entries, entries with wrong tags were fixed. The Uzbek monodix still needs some work: it has to be reorganized and cleaned. There were also new additions and some fixes to the Turkish monodix.
Another major part of the work accomplished during this project is the bilingual dictionary (bidix) of the tur-uzb pair, which now has more than 12K translations and has passed 85% trimmed coverage on the SETimes corpus. Many of the newly added bidix entries are the most frequent words of the same corpus on which the trimmed coverage is calculated. The remaining words are less frequent, but are still planned to be added in the future.
Progress Table
Week № | Dates | Stems (uzb) | Stems (tur-uzb) | Tur-Uzb WER | Tur-Uzb PER | Naïve coverage (uzb) | Naïve coverage (tur-uzb) | Evaluation | Notes |
---|---|---|---|---|---|---|---|---|---|
0 | May 4 - May 31 | 34375 | 2412 | 90.80 % | 81.60 % | 89.57 % | 72.14 % | Initial evaluation | As of the end of May |
5 | June 29 - July 5 | 34373 | 2445 | 84.45 % | 76.80 % | 90.23 % | 72.14 % | First Evaluation | End of June - ~July 3 |
9 | July 27 - Aug 2 | 34424 | 4191 | 78.70 % | 68.34 % | 90.23 % | 72.74 % | Second Evaluation | As of July 31 - Aug 1 |
10 | Aug 3 - Aug 9 | 35621 | 5639 | 78.70 % | 68.64 % | 90.28 % | 80.14 % | Weekly evaluation | Week #10 |
11 | Aug 10 - Aug 16 | 37649 | 8154 | 78.70 % | 68.64 % | 90.46 % | 83.08 % | Weekly evaluation | Week #11 |
12 | Aug 17 - Aug 23 | 57406 | 13023 | 78.70 % | 68.64 % | 90.91 % | 86.02 % | Weekly evaluation | Week #12 |
13 | Aug 24 - Aug 30 | 58757 | 12861 | 78.70 % | 68.64 % | 90.94 % | 86.03 % | Final evaluation | As of Aug 31 |
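For readers unfamiliar with the WER and PER columns, the following is a rough sketch of how such scores can be computed for one sentence pair. WER is taken here as the word-level edit distance divided by the reference length, and PER uses one common bag-of-words formulation. Both function names are hypothetical, and the evaluation tool actually used for the table may define PER slightly differently.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words (rolling single-row table).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """Bag-of-words error rate: word order is ignored (assumed formulation)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts the words the two sentences share.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Swapping two words costs two edits for WER but nothing for PER.
print(word_error_rate("a b c d", "b a c d"))
print(position_independent_error_rate("a b c d", "b a c d"))
```

Because PER ignores word order, it is never higher than WER under this formulation, which matches the pattern in the table.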
Future Work
- TESTVOC. Due to a lack of time at the end of the project, vocabulary testing was left unfinished.
- LEXICON-OV-ICH, a proper lexical rule for Uzbek cognomens and patronyms, where a cognomen is formed as Anthroponym+[o/e]v(a) and a patronym as Anthroponym+[o/e]v[ich/na].
- Apertium-Separable. Separable/discontiguous multiword elements (MWEs) have to be reordered by moving all MWEs to an lsx file.
- Reordering and cleaning the Uzbek monodix. It has some entries with wrong tags and lots of duplicate entries.
- Lexical selection rules. These would also help a lot in reducing WER.
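To make the cognomen and patronym pattern from the LEXICON-OV-ICH item concrete, here is a small sketch that expands one anthroponym into the four forms the pattern describes. It is only an illustration of the string pattern, not the planned lexicon: the function name and dictionary keys are hypothetical, and the choice between the linking vowels o and e is left to the caller because the rule for it is not specified here.

```python
def surname_forms(anthroponym: str, linking_vowel: str = "o") -> dict:
    """Expand a given name into cognomen and patronym forms.

    Pattern: Anthroponym+[o/e]v(a) and Anthroponym+[o/e]v[ich/na].
    The caller picks the linking vowel (assumption of this sketch).
    """
    if linking_vowel not in ("o", "e"):
        raise ValueError("linking_vowel must be 'o' or 'e'")
    stem = anthroponym + linking_vowel + "v"
    return {
        "cognomen_masculine": stem,          # e.g. Karimov
        "cognomen_feminine": stem + "a",     # e.g. Karimova
        "patronym_masculine": stem + "ich",  # e.g. Karimovich
        "patronym_feminine": stem + "na",    # e.g. Karimovna
    }


for label, form in surname_forms("Karim").items():
    print(label, form)
```

In the actual package this would more naturally live as a continuation lexicon in the morphological source, so that every anthroponym entry gets these forms without listing them one by one.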
Conclusion
Working with Apertium over the past three months has been a great experience for me. The community offered a solution or an explanation for every obstacle I faced; special thanks to @Firespeaker and @Piraye for always fixing my issues and pointing me in the right direction. I hope to finish everything that remains and see this pair released soon. I plan to work with Apertium on more projects in the future.