User:Zigfruid/GSoC Final Report
This project began with a proposal originally titled "Develop a prototype machine translation system for the uzb-> kaa strategic language pair." After discussing with the mentors the best way to get the most out of Summer of Code, we decided to focus on improving the translation accuracy by correcting the Uzbek-Qaraqalpaq bilingual dictionary and the Uzbek and Qaraqalpaq monolingual dictionaries by analysing large corpora and identifying the most common errors.
Tasks included expanding and correcting the contents of the various dictionary files, and also identifying errors in other parts of the translation pipeline which required simple changes.
In general, a lot of work has been done both on the packages for translating from Turkic languages into Uzbek and Karakalpak and on the Uzbek-Karakalpak translation packages. The results show that the coverage targets originally set have almost been met, but the WER/PER results still need to be improved.
All contributions can be found in the following repositories:
Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/Zigfruid.html
Note that some pull requests have not been merged yet, including the following:
- Add more words to apertium-uzb.dix file
- Add more words to apertium-kaa.dix file
- Find words that do not match the lexical rules
- Try to achieve WER < 40% on large articles (e.g. from Wikipedia) by fixing structural transfer and lexical selection errors
- Identify and fix additional transfer errors using testvoc
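For readers unfamiliar with the WER figure mentioned in this report, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between a reference translation and the system output, divided by the number of reference words. This is only an illustration and not the evaluation tool used for this project; the function name `word_error_rate` and the placeholder tokens are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not part of Apertium.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens stand in for a post-edited reference and MT output.
    reference = "w1 w2 w3 w4 w5"
    hypothesis = "w1 x3 w4 w5"
    wer = word_error_rate(reference, hypothesis)
    print(f"WER: {wer:.0%}, below 40% target: {wer < 0.40}")
```

Because the score is normalised by the reference length, it can exceed 100% when the output contains many extra words, so a target such as "WER < 40%" is best read as an average over a whole article rather than over single sentences.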
Working with Apertium over the past three months has been a great experience for me. I learned a lot and gained a lot of experience. Thanks to my mentor @jonorthwash for his constant responsiveness: whenever I had a question, he answered it and helped me with the project.