User:Elmurod1202/GSoC2020 Final Report

Summary

This project started from a proposal initially titled "State-of-the-art Morphological Analyser for Uzbek language and improved language pairs uz-kk, uz-ky, uz-tr". After discussing with the mentors how to make the best of Summer of Code, we decided to cover the Uzbek monolingual package as much as possible, together with the Turkish-Uzbek translation pair.

To calculate the coverage of the Uzbek (apertium-uzb) analyser, Uzbek Wikipedia data dated 20.05.2020, with 136K articles (around 13M tokens), was chosen. For the trimmed coverage (coverage of a pair limited to the words in the dictionary) of the Turkish-Uzbek (apertium-tur-uzb) translation pair, the Turkish part of the Southeast European Times (SETimes) website data collection was used (around 3.7M tokens). To calculate the word error rate (WER) and position-independent word error rate (PER) of the tur-uzb pair, a parallel text corpus was created; the "James and Mary Story" (~40 sentences) was chosen in our case.
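As a rough illustration of what the coverage figure measures, the sketch below counts the share of tokens that received at least one analysis. It assumes the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token is echoed back with a leading asterisk. The function name `naive_coverage` and the sample tags are made up for the example, and escaped characters in the stream are ignored for simplicity; this is not the script that produced the numbers in this report.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An unknown word looks like ^foo/*foo$, so its analysis starts with "*".
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Illustrative input only: three analysed tokens and one unknown token.
sample = "^kitob/kitob<n><nom>$ ^xyzzy/*xyzzy$ ^va/va<cnjcoo>$ ^uy/uy<n><nom>$"
print(f"{naive_coverage(sample):.2%}")  # prints 75.00%
```

For trimmed coverage the same count applies, except that the analyser is first restricted to the words that also exist in the bilingual dictionary.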

There are still many tasks to be finished, such as creating vocabulary tests (aka Testvoc) and writing more lexical selection rules (see #Future Work).

Overall, a lot of work has been done on both the Uzbek monolingual and the Turkish-Uzbek translation packages. The results obtained indicate that the goals set initially for coverage have been met, yet the WER/PER results still have to be improved.

Repos

All the contributions can be found in the following repositories:

Apertium Turkish monolingual package:
- https://github.com/apertium/apertium-tur
Apertium Uzbek monolingual package:
- https://github.com/apertium/apertium-uzb
Apertium Turkish-Uzbek translation package:
- https://github.com/apertium/apertium-tur-uzb

Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2020/elmurod1202.html

I have to point out that there are still some pull requests that have not been merged yet, such as these PRs:

Main Work

Most of the work done on the Uzbek language went into its monodix, which reached more than 55K stems and above 90% coverage on Uzbek Wikipedia. In addition to the newly added entries, entries with wrong tags have been fixed too. There is still some work to do on the Uzbek monodix: it has to be reorganized and cleaned. Furthermore, there were new additions and some fixes to the Turkish monodix as well.

Another major part of the work accomplished during this project is the bilingual dictionary (bidix) of the tur-uzb pair, which now has more than 12K translations and has passed 85% trimmed coverage on the SETimes corpus. Many of the newly added bidix entries are the most frequent words of the same corpus on which its trimmed coverage is calculated. The remaining words are less frequent, but are still planned to be added in the future.

Progress Table

№	Dates	Stems (uzb)	Stems (tur-uzb)	WER (tur-uzb)	PER (tur-uzb)	Naïve coverage (uzb)	Naïve coverage (tur-uzb)	Evaluation	Notes
0	May 4 - May 31	34375	2412	90.80 %	81.60 %	89.57 %	72.14 %	Initial evaluation	As of the end of May
5	June 29 - July 5	34373	2445	84.45 %	76.80 %	90.23 %	72.14 %	First Evaluation	End of June - ~July 3
9	July 27 - Aug 2	34424	4191	78.70 %	68.34 %	90.23 %	72.74 %	Second Evaluation	As of July 31 - Aug 1
10	Aug 3 - Aug 9	35621	5639	78.70 %	68.64 %	90.28 %	80.14 %	Weekly evaluation	Week #10
11	Aug 10 - Aug 16	37649	8154	78.70 %	68.64 %	90.46 %	83.08 %	Weekly evaluation	Week #11
12	Aug 17 - Aug 23	57406	13023	78.70 %	68.64 %	90.91 %	86.02 %	Weekly evaluation	Week #12
13	Aug 24 - Aug 30	58757	12861	78.70 %	68.64 %	90.94 %	86.03 %	Final evaluation	As of Aug 31
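
For readers unfamiliar with the WER and PER columns, here is a minimal sketch of how the two metrics are commonly defined for a single reference/hypothesis sentence pair. WER is the word-level edit distance divided by the reference length; PER ignores word order and compares the two sentences as bags of words. This is one common formulation, assumed here for illustration; the evaluation tool used for this report may differ in details such as tokenisation. The function names are hypothetical and the reference is assumed to be non-empty.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Two swapped words count as two errors for WER but none for PER.
print(wer("the cat sat down", "cat the sat down"))  # prints 0.5
print(per("the cat sat down", "cat the sat down"))  # prints 0.0
```

Because PER does not penalise reordering, it is never higher than WER under these definitions, which matches the pattern in the table.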

Future Work

TESTVOC. Due to a lack of time at the end of the project, vocabulary testing was left unfinished.
LEXICON-OV-ICH: a proper lexical rule for Uzbek cognomens and patronyms, where a cognomen is formed as anthroponym + [o/e]v(a) and a patronym as anthroponym + [o/e]v[ich/na].
Apertium-separable: reordering of separable/discontiguous multiword elements (MWEs) has to be done by moving all MWEs to the lsx file.
Reordering and cleaning the Uzbek monodix. It still has some entries with wrong tags and lots of duplicate entries.
Lexical selection rules. Adding more of these should also help a lot to reduce WER.
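
To make the pattern in the LEXICON-OV-ICH item concrete, the sketch below simply enumerates the surface forms it describes for a given first name. It is not how the rule would be written in the monodix, and it does not decide between the "o" and "e" variants, since the report does not state that condition. The function names and the example name are illustrative only.

```python
def cognomen_candidates(anthroponym):
    """Anthroponym + [o/e]v(a): all candidate surname forms."""
    return [anthroponym + vowel + "v" + ending
            for vowel in ("o", "e")
            for ending in ("", "a")]


def patronym_candidates(anthroponym):
    """Anthroponym + [o/e]v[ich/na]: all candidate patronym forms."""
    return [anthroponym + vowel + "v" + ending
            for vowel in ("o", "e")
            for ending in ("ich", "na")]


print(cognomen_candidates("Karim"))
# ['Karimov', 'Karimova', 'Karimev', 'Karimeva']
print(patronym_candidates("Karim"))
# ['Karimovich', 'Karimovna', 'Karimevich', 'Karimevna']
```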

Conclusion

It has been a great experience for me working with Apertium over the past three months. I could get a solution or an explanation from the community for any obstacle I faced. Special thanks to @Firespeaker and @Piraye for always fixing my issues and pointing me in the right direction. I hope to finish all the necessary work and see this pair released soon. I am planning to work with Apertium on more projects in the future.
