Difference between revisions of "User:Zigfruid/GSoC Final Report"

Line 1: Line 1:
==Description==
This project began with a proposal originally titled "Develop a prototype machine translation system for the uzb-> kaa strategic language pair." After discussing with the mentors the best way to get the most out of Summer of Code, we decided to cover the Uzbek monolingual package as much as possible along with a couple of Uzbek-Karakalpak translations.
This project began with a proposal originally titled "Develop a prototype machine translation system for the uzb-> kaa strategic language pair." After discussing with the mentors the best way to get the most out of Summer of Code, we decided to focus on improving the translation accuracy by correcting the Uzbek-Qaraqalpaq bilingual dictionary and the Uzbek and Qaraqalpaq monolingual dictionaries by analysing large corpora and identifying the most common errors.


Tasks included expanding and correcting the contents of the various dictionary files, and also identifying errors in other parts of the translation pipeline which required simple changes.
Several tasks remain to be completed. For example, my analysis found words that do not correspond to the lexical rules, and words still need to be added to the uzb-kaa.dix file.


In general, a lot of work has been done both on packages of Turkic translations into Uzbek and Karakalpak, as well as on packages of Uzbek-Karakalpak translations. The results show that the targets originally set for coverage have almost been met, but the WER / PER results need to be improved.
In general, a lot of work has been done both on packages of Turkic translations into Uzbek and Karakalpak, as well as on packages of Uzbek-Karakalpak translations. The results show that the targets originally set for coverage have almost been met, but the WER / PER results still need to be improved.


==Repositories==
All the contributions can be found at the following repositories: https://github.com/apertium/apertium-uzb-kaa
All the contributions can be found at the following repositories:
* https://github.com/apertium/apertium-uzb-kaa
* https://github.com/apertium/apertium-uzb
* https://github.com/apertium/apertium-kaa


Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/Zigfruid.html
Line 15: Line 18:
* https://github.com/apertium/apertium-uzb/pull/17
* https://github.com/apertium/apertium-kaa/pull/10

==Main Work==
Furthermore, there were new additions and some fixes to the Karakalpak monodix, and to the Uzbek monodix as well.
The quality of uzb-kaa was improved by focusing on lexical and other translation errors in example texts.
Coverage was expanded by adding common stems that were found to be missing when analysing large corpora.


==Future Work==
Line 25: Line 23:
* Add more words to the apertium-kaa.dix file
* Find words that do not match the lexical rules
* Try to achieve WER < 40% on large wiki articles
* Try to achieve WER < 40% on large articles, e.g. from Wikipedia, by fixing structural transfer and lexical selection errors
* Identify and fix additional transfer errors using testvoc


==Conclusion==

Revision as of 17:00, 23 August 2021

Description

This project began with a proposal originally titled "Develop a prototype machine translation system for the uzb-> kaa strategic language pair." After discussing with the mentors the best way to get the most out of Summer of Code, we decided to focus on improving the translation accuracy by correcting the Uzbek-Qaraqalpaq bilingual dictionary and the Uzbek and Qaraqalpaq monolingual dictionaries by analysing large corpora and identifying the most common errors.

Tasks included expanding and correcting the contents of the various dictionary files, and also identifying errors in other parts of the translation pipeline which required simple changes.

In general, a lot of work has been done both on packages of Turkic translations into Uzbek and Karakalpak, as well as on packages of Uzbek-Karakalpak translations. The results show that the targets originally set for coverage have almost been met, but the WER / PER results still need to be improved.
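As a rough illustration of what "coverage" means here, the sketch below computes naive coverage (the share of tokens that receive at least one analysis) from analyser output in the Apertium stream format, where an unknown word is marked with a leading `*` in its analysis. This is not the script used in the project: the function name `naive_coverage` and the sample sentence with its tags are made up for illustration, and escaped characters in the stream format are ignored for simplicity.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \^ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)((?:/[^/$^]*)*)\$")


def naive_coverage(analysed: str) -> float:
    """Return the fraction of lexical units with at least one known analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed):
        analyses = [a for a in match.group(2).split("/") if a]
        total += 1
        # An unknown word is output as ^word/*word$ by the analyser.
        if analyses and not analyses[0].startswith("*"):
            known += 1
    if total == 0:
        raise ValueError("no lexical units found in input")
    return known / total


# Hypothetical analyser output: two known words and one unknown word.
sample = "^kitob/kitob<n><nom>$ ^xyzzy/*xyzzy$ ^va/va<cnjcoo>$"
coverage = naive_coverage(sample)
```

Running the same count over a large corpus before and after adding stems to a dictionary gives a simple way to see whether the additions helped.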

Repositories

All the contributions can be found at the following repositories:
  • https://github.com/apertium/apertium-uzb-kaa
  • https://github.com/apertium/apertium-uzb
  • https://github.com/apertium/apertium-kaa

Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/Zigfruid.html

I should point out that some pull requests have not been merged yet, such as these:
  • https://github.com/apertium/apertium-uzb/pull/17
  • https://github.com/apertium/apertium-kaa/pull/10

Future Work

  • Add more words to the apertium-uzb.dix file
  • Add more words to the apertium-kaa.dix file
  • Find words that do not match the lexical rules
  • Try to achieve WER < 40% on large articles, e.g. from Wikipedia, by fixing structural transfer and lexical selection errors
  • Identify and fix additional transfer errors using testvoc
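To make the WER target concrete, here is a minimal sketch of how WER and PER can be computed for one sentence pair. It assumes whitespace tokenisation and uses one common formulation of each metric (word-level edit distance over reference length for WER, bag-of-words mismatch over reference length for PER). The function names are hypothetical, and the evaluation tool actually used for the reported figures may differ in detail.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """Like WER, but word order is ignored (bag-of-words comparison)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared by both sentences.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Placeholder tokens: the hypothesis has the right words in the wrong order.
wer = word_error_rate("a b c d", "a c b d")
per = position_independent_error_rate("a b c d", "a c b d")
```

A large gap between WER and PER on the same text suggests that word order (structural transfer) is the main problem, while a high PER points to lexical errors.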

Conclusion

It has been a great experience for me working with Apertium over the past three months. I learned a lot and gained a lot of experience. Thanks to my mentor @jonorthwash for his constant responsiveness: every time I had a question, he answered it and helped me with the project.