User:Zigfruid/GSoC Final Report
Latest revision as of 18:10, 23 August 2021
Description
This project began with a proposal originally titled "Develop a prototype machine translation system for the uzb-> kaa strategic language pair." After discussing with the mentors how to get the most out of Summer of Code, we decided to focus on improving translation accuracy by correcting the Uzbek-Karakalpak bilingual dictionary and the Uzbek and Karakalpak monolingual dictionaries, analysing large corpora to identify the most common errors.
Tasks included expanding and correcting the contents of the various dictionary files, and identifying errors in other parts of the translation pipeline that required only simple changes.
In general, a lot of work has been done on the Uzbek and Karakalpak language packages as well as on the Uzbek-Karakalpak translation pair. The results show that the coverage targets originally set have almost been met, but the WER/PER results still need to be improved.
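For readers unfamiliar with the two metrics mentioned above, here is a minimal Python sketch of how word error rate (WER) and position-independent error rate (PER) can be computed for one sentence. It is illustrative only: the function names are hypothetical, tokenisation is plain whitespace splitting, and the PER formula is one common formulation, so the numbers may differ from those produced by the evaluation tools actually used in the project.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate (assumed formulation):
    words are compared as bags, so word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

Because PER ignores word order, it is never higher than WER for the same sentence pair. A large gap between the two suggests reordering problems, while a high PER points at lexical errors.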
Repositories
All the contributions can be found in the following repositories:
- https://github.com/apertium/apertium-uzb-kaa
- https://github.com/apertium/apertium-uzb
- https://github.com/apertium/apertium-kaa
Most of the work, as collected at the end of the GSoC program, can be found here: https://apertium.projectjj.com/gsoc2021/Zigfruid.html
I should point out that some pull requests have not been merged yet, such as these:
- https://github.com/apertium/apertium-uzb/pull/17
- https://github.com/apertium/apertium-kaa/pull/10
Future Work
- Add more words to the apertium-uzb.dix file
- Add more words to the apertium-kaa.dix file
- Find words that do not match the lexical rules
- Try to achieve WER < 40% on large articles, e.g. from Wikipedia, by fixing structural transfer and lexical selection errors
- Identify and fix additional transfer errors using testvoc
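As a starting point for the dictionary items above, the following Python sketch shows one way to find the most frequent missing words in a corpus. It assumes the corpus has already been run through a morphological analyser that emits the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown word has an analysis beginning with an asterisk. The function name and the sample string are hypothetical, and escaped characters in the stream are not handled.

```python
import re
from collections import Counter

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage_report(analysed_text):
    """Return (coverage, unknown_counts) for analyser output.

    coverage is the fraction of tokens with at least one analysis;
    unknown_counts maps each unanalysed surface form to its frequency.
    """
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):  # the analyser marks unknown words with '*'
            unknown[surface.lower()] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return coverage, unknown


if __name__ == "__main__":
    # Placeholder tokens and tags, not real Uzbek or Karakalpak data.
    sample = "^word/word<n>$ ^xyz/*xyz$ ^xyz/*xyz$ ^and/and<cnjcoo>$"
    coverage, unknown = coverage_report(sample)
    print(f"coverage: {coverage:.1%}")
    for form, count in unknown.most_common(10):
        print(count, form)
```

Sorting unknown forms by frequency puts the stems that would improve coverage the most at the top of the list, so they can be added to the dictionaries first.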
Conclusion
It has been a great experience working with Apertium over the past three months. I learned a lot and gained valuable experience. Thanks to my mentor @jonorthwash for being constantly responsive: whenever I had a question, he answered it and helped me with the project.