Difference between revisions of "User:Ifeanyi/GSoC2021 Final Report"

From Apertium

Revision as of 12:59, 20 August 2021

Summary

This project started with a proposal initially titled "State-of-the-art Morphological Analyser for Uzbek language and improved language pairs uz-kk, uz-ky, uz-tr". After discussing with the mentors how to make the best use of Summer of Code, we decided to cover the Uzbek monolingual package as thoroughly as possible, together with the Turkish-Uzbek translation pair.

To calculate the coverage of the Uzbek (apertium-uzb) analyser, Uzbek Wikipedia data dated 20.05.2020, with 136K articles (around 13M tokens), was chosen. For the trimmed coverage (coverage of a pair limited to the words in the dictionary) of the Turkish-Uzbek (apertium-tur-uzb) translation pair, the Turkish part of the Southeast European Times (SETimes) website data collection was used (around 3.7M tokens). To calculate the word error rate (WER) and position-independent word error rate (PER) of the tur-uzb pair, a parallel text corpus was created, for which the "James and Mary Story" (~40 sentences) was chosen.
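As an illustration of how a naïve coverage figure of this kind can be obtained, here is a minimal sketch. It assumes the corpus has already been run through the analyser and is in the usual Apertium stream format, where each token appears as ^surface/analysis$ and unknown words carry an asterisk (^word/*word$). The function name, the sample tokens and their tags are made up for the example, and escaped characters inside tokens are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped '/', '^' and '$' inside tokens are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that received at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # Unknown words are emitted as ^word/*word$, so the analysis starts with '*'.
        if not analyses.startswith("*"):
            known += 1
    return 100.0 * known / total if total else 0.0


# Hypothetical analyser output: three known tokens and one unknown token.
sample = "^kitob/kitob<n><nom>$ ^xyzzy/*xyzzy$ ^va/va<cnjcoo>$ ^uy/uy<n><nom>$"
print(f"{naive_coverage(sample):.2f} %")
```

Trimmed coverage can be measured the same way, with the only difference that the analyser used is the one restricted to words present in the bilingual dictionary.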

There are still many tasks to finish, such as creating vocabulary tests (aka Testvoc) and writing more lexical selection rules (see #Future Work).

Overall, a lot of work has gone into both the Uzbek monolingual and the Turkish-Uzbek translation packages. The results indicate that the coverage goals set initially have been met, while the WER/PER results still need to improve.
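For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a single sentence pair. WER is the word-level edit distance divided by the reference length; PER here uses a bag-of-words definition that ignores word order. This is an assumption for illustration only: the evaluation tool actually used for the figures in this report may define PER slightly differently, and the function names are made up for the example.

```python
from collections import Counter


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, relative to the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * word_edit_distance(ref, hyp) / len(ref)


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate in percent (bag-of-words variant)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Words shared by both sentences, counted with multiplicity.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


# Same words in a different order: WER penalises the reordering, PER does not.
print(wer("a b c d", "b a c d"), per("a b c d", "b a c d"))
```

Because PER ignores word order, it is never higher than WER for the same sentence pair under this definition, which matches the pattern in the progress table.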

Repos

All contributions can be found in the following repositories:

Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2020/elmurod1202.html .

I have to point out that some pull requests have not been merged yet, such as these PRs:

Main Work

Most of the work on the Uzbek language went into its monodix, which reached more than 55K stems and above 90% coverage on Uzbek Wikipedia. In addition to the newly added entries, entries with wrong tags have been fixed. There is still some work to do on the Uzbek monodix: it has to be reorganized and cleaned. Furthermore, there were new additions and some fixes to the Turkish monodix as well.

Another major part of the work accomplished during this project is the bilingual dictionary (bidix) of the tur-uzb pair, which now has more than 12K translations and has passed 85% trimmed coverage on the SETimes corpus. Many of the newly added bidix entries are the most frequent words in the same corpus on which the trimmed coverage is calculated. The remaining words are less frequent, but are planned to be added in the future.

Progress Table

Week | Dates | Stems (uzb) | Stems (tur-uzb) | WER (tur-uzb) | PER (tur-uzb) | Naïve coverage (uzb) | Naïve coverage (tur-uzb) | Evaluation | Notes
0 | May 4 - May 31 | 34375 | 2412 | 90.80 % | 81.60 % | 89.57 % | 72.14 % | Initial evaluation | As of the end of May
5 | June 29 - July 5 | 34373 | 2445 | 84.45 % | 76.80 % | 90.23 % | 72.14 % | First evaluation | End of June - ~July 3
9 | July 27 - Aug 2 | 34424 | 4191 | 78.70 % | 68.34 % | 90.23 % | 72.74 % | Second evaluation | As of July 31 - Aug 1
10 | Aug 3 - Aug 9 | 35621 | 5639 | 78.70 % | 68.64 % | 90.28 % | 80.14 % | Weekly evaluation | Week #10
11 | Aug 10 - Aug 16 | 37649 | 8154 | 78.70 % | 68.64 % | 90.46 % | 83.08 % | Weekly evaluation | Week #11
12 | Aug 17 - Aug 23 | 57406 | 13023 | 78.70 % | 68.64 % | 90.91 % | 86.02 % | Weekly evaluation | Week #12
13 | Aug 24 - Aug 30 | 58757 | 12861 | 78.70 % | 68.64 % | 90.94 % | 86.03 % | Final evaluation | As of Aug 31

Future Work

  • TESTVOC. Due to a lack of time at the end of the project, vocabulary testing was left unfinished.
  • LEXICON-OV-ICH, the proper lexical rule for Uzbek cognomens and patronyms, where a cognomen is formed as Anthroponym+[o/e]v(a) and a patronym as Anthroponym+[o/e]v[ich/na].
  • Apertium-Separable. Reordering of separable/discontiguous multiword expressions (MWEs) has to be done by moving all MWEs to the lsx file.
  • Reordering and cleaning Uzbek monodix. It has some entries with wrong tags and lots of duplicate entries.
  • Lexical selection rules. More of these would also help a lot to reduce WER.
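To make the LEXICON-OV-ICH pattern from the list above concrete, here is a small sketch that enumerates the candidate forms the rule describes for a given anthroponym. It is only an illustration in Python, not the lexc rule itself. The function names are made up, and since the report does not say when the linking vowel is "o" and when it is "e", the sketch simply generates both variants.

```python
def cognomen_forms(anthroponym: str) -> list:
    """Candidate cognomens: Anthroponym + [o/e]v(a)."""
    return [anthroponym + vowel + "v" + ending
            for vowel in ("o", "e")
            for ending in ("", "a")]


def patronym_forms(anthroponym: str) -> list:
    """Candidate patronyms: Anthroponym + [o/e]v[ich/na]."""
    return [anthroponym + vowel + "v" + ending
            for vowel in ("o", "e")
            for ending in ("ich", "na")]


# Example with the given name "Karim"; the list deliberately overgenerates.
print(cognomen_forms("Karim"))
print(patronym_forms("Karim"))
```

A real lexicon would restrict the vowel choice per stem, so that only the attested forms are analysed and generated.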

Conclusion

It has been a great experience for me working with Apertium over the past three months. The community offered a solution or an explanation for every obstacle I faced. Special thanks to @Firespeaker and @Piraye for always fixing my issues and pointing me in the right direction. I hope to finish everything that remains and see this pair released soon. I plan to work with Apertium on more projects in the future.