Difference between revisions of "User:Ifeanyi/GSoC2021 Final Report"
(Created page with "==Summary== This project started with a proposal initially named as "State-of-the-art Morphological Analayser for Uzbek language and improved language pairs uz-kk, uz-ky, uz-t...") |
|||
(4 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
==Summary== |
==Summary== |
||
The goal of this project was to develop a morphological analyser for the English-Igbo language pair and to produce a usable version that gives intelligible output. After discussing with the mentors how to make the most of Summer of Code, we decided to improve the coverage of the Igbo (ibo) monolingual package as much as possible. |
|||
==Main Work== |
|||
To calculate the coverage of the Uzbek (apertium-uzb) analyser, Uzbek Wikipedia data dated 20.05.2020, with 136K articles (around 13M tokens), was chosen. To calculate the trimmed coverage (coverage of a pair limited to the words in the dictionary) of the Turkish-Uzbek (apertium-tur-uzb) translation pair, the Turkish data collection from the Southeast European Times (SETimes) website was used (around 3.7M tokens). |
|||
To calculate the word error rate (WER) and position-independent word error rate (PER) of the tur-uzb pair, a parallel text corpus was created, for which the "James and Mary Story" (~40 sentences) was chosen. |
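The two metrics can be sketched as follows. This is a minimal illustration and not the evaluation tool used in the project: the function names are hypothetical, the sentences are made up, and PER is computed here as (max(N_ref, N_hyp) - matches) / N_ref, which is one common definition. Real evaluation tools may tokenise and normalise differently.

```python
from collections import Counter


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def per(reference, hypothesis):
    """Position-independent error rate: compares bags of words, ignoring order."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matches = sum((Counter(ref_words) & Counter(hyp_words)).values())
    errors = max(len(ref_words), len(hyp_words)) - matches
    return errors / len(ref_words)


# Made-up example: the hypothesis has the right words in a different order.
reference = "the cat sat on the mat"
hypothesis = "the cat on the mat sat"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"PER: {per(reference, hypothesis):.2%}")
```

Because PER ignores word order, a translation with correct words in the wrong positions is penalised by WER but not by PER, which is why the two are usually reported together.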
|||
Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/ifeanyijasper.html . |
|||
There are still many tasks to finish, such as creating vocabulary tests (aka Testvoc) and adding more lexical selection rules (see #Future Work). |
|||
Most of the work on Igbo (ibo) went into its monolingual dictionary (monodix). This consisted of adding stems to the dictionaries, which let me expand the Igbo analyser from a prototype to one with wide coverage (although it is still not production-ready). |
|||
Overall, a lot of work has gone into both the Uzbek monolingual package and the Turkish-Uzbek translation package. The results indicate that the goals initially set for coverage have been met, but the WER/PER results still have to be improved. |
|||
===ibo morphological analyser coverage=== |
|||
==Repos== |
|||
All the contributions can be found at following repositories: |
|||
* Apertium Turkish monolingual package: |
|||
** https://github.com/apertium/apertium-tur |
|||
* Apertium Uzbek monolingual package: |
|||
** https://github.com/apertium/apertium-uzb |
|||
* Apertium Turkish-Uzbek translation package: |
|||
** https://github.com/apertium/apertium-tur-uzb |
|||
{| class="wikitable" |
|||
Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2020/elmurod1202.html . |
|||
|- |
|||
! Corpus |
|||
! Words |
|||
! Coverage before |
|||
! Coverage after |
|||
|- |
|||
| Wikipedia |
|||
| 511550 |
|||
| 19.09% |
|||
| 68.52% |
|||
|} |
|||
I should point out that some pull requests have not been merged yet. |
|||
Such as these PRs: |
|||
* https://github.com/apertium/apertium-tur/pull/5 |
|||
* https://github.com/apertium/apertium-uzb/pull/11 |
|||
* https://github.com/apertium/apertium-tur-uzb/pull/4 |
|||
===ibo lexicon sizes Before=== |
|||
==Main Work== |
|||
Most of the work on Uzbek went into its monodix, which reached more than 55K stems and above 90% coverage on Uzbek Wikipedia. In addition to the newly added entries, entries with wrong tags have been fixed. There is still some work to do on the Uzbek monodix: it has to be reorganized and cleaned. |
|||
Furthermore, there were new additions and some fixes to the Turkish monodix as well. |
|||
Another major part of the work accomplished during this project is the bilingual dictionary (bidix) of the tur-uzb pair, which now has more than 12K translations and has passed 85% trimmed coverage on the SETimes corpus. Many of the newly added bidix entries are the most frequent words in the same corpus on which its trimmed coverage is calculated. The remaining words are less frequent, but the plan is still to add them in the future. |
|||
==Progress Table== |
|||
{|class=wikitable |
{| class="wikitable" |
||
|- |
|- |
||
! Lexicons |
|||
!colspan="2"|Week |
|||
! Lexicon entries |
|||
!colspan="2"|Stems |
|||
! Patterns |
|||
!colspan="2"|Tur-Uzb |
|||
! Pattern entries |
|||
!colspan="2"|Naïve Coverage |
|||
!colspan="2"|Progress |
|||
|- |
|||
! № |
|||
! Dates |
|||
! uzb |
|||
! tur-uzb |
|||
! WER |
|||
! PER |
|||
! uzb |
|||
! tur-uzb |
|||
!Evaluation |
|||
!Notes |
|||
|- |
|||
| 0 |
|||
| May 4 - May 31 |
|||
| 34375 |
|||
| 2412 |
|||
| 90.80 % |
|||
| 81.60 % |
|||
| 89.57 % |
|||
| 72.14 % |
|||
|Initial evaluation |
|||
| As of the end of May |
|||
|- |
|||
| 5 |
|||
| June 29 - July 5 |
|||
| 34373 |
|||
| 2445 |
|||
| 84.45 % |
|||
| 76.80 % |
|||
| 90.23 % |
|||
| 72.14 % |
|||
| First Evaluation |
|||
| End of June - ~July 3 |
|||
|- |
|||
| 9 |
|||
| July 27 - Aug 2 |
|||
| 34424 |
|||
| 4191 |
|||
| 78.70 % |
|||
| 68.34 % |
|||
| 90.23 % |
|||
| 72.74 % |
|||
| Second Evaluation |
|||
| As of July 31 - Aug 1 |
|||
|- |
|- |
||
| 20 |
|||
| 326 |
|||
| 1 |
|||
| 10 |
| 10 |
||
|} |
|||
| July 3 - Aug 9 |
|||
| 35621 |
|||
===ibo lexicon sizes After=== |
|||
| 5639 |
|||
| 78.70 % |
|||
{| class="wikitable" |
|||
| 68.64 % |
|||
| 90.28 % |
|||
| 80.14 % |
|||
| Weekly evaluation |
|||
| Week #10 |
|||
|- |
|- |
||
! Lexicons |
|||
| 11 |
|||
! Lexicon entries |
|||
| Aug 10 - Aug 16 |
|||
! Patterns |
|||
| 37649 |
|||
! Pattern entries |
|||
| 8154 |
|||
| 78.70 % |
|||
| 68.64 % |
|||
| 90.46 % |
|||
| 83.08 % |
|||
| Weekly evaluation |
|||
| Week #11 |
|||
|- |
|||
| 12 |
|||
| Aug 17 - Aug 23 |
|||
| 57406 |
|||
| 13023 |
|||
| 78.70 % |
|||
| 68.64 % |
|||
| 90.91 % |
|||
| 86.02 % |
|||
| Weekly evaluation |
|||
| Week #12 |
|||
|- |
|||
| 13 |
|||
| Aug 24 - Aug 30 |
|||
| 58757 |
|||
| 12861 |
|||
| 78.70 % |
|||
| 68.64 % |
|||
| 90.94 % |
|||
| 86.03 % |
|||
| Final evaluation |
|||
| As of Aug 31 |
|||
|- |
|- |
||
| 31 |
|||
| 949 |
|||
| 4 |
|||
| 20 |
|||
|} |
|} |
||
==Future Work== |
==Future Work== |
||
* Add more stems to the ibo monolingual dictionary. |
|||
* TESTVOC. Due to a lack of time at the end of the project, vocabulary testing was left unfinished. |
|||
* Add transfer rules, etc. |
|||
* LEXICON-OV-ICH, the proper lexical rule for Uzbek cognomens and patronyms, where a cognomen is formed as Anthroponym+[o/e]v(a) and a patronym is formed as Anthroponym+[o/e]v[ich/na]. |
|||
* Improve the eng-ibo bidix. |
|||
* Apertium-Separable: reordering of separable/discontiguous multiword elements (MWEs) has to be done by moving all MWEs to the lsx file. |
|||
* Reordering and cleaning Uzbek monodix. It has some entries with wrong tags and lots of duplicate entries. |
|||
* Lexical selection rules. This also helps a lot to reduce WER. |
|||
==Conclusion== |
==Conclusion== |
||
It has been a great experience for me working with Apertium over the past |
It has been a great experience for me working with Apertium over the past ten weeks. Whenever I faced an obstacle, I could get a solution or an explanation from the community. I would like to thank the whole Apertium community, and in particular my mentors, Jonathan Washington, Mikel L. Forcada, and Nick Howell, for their support and mentorship and for pointing me in the right direction.
Latest revision as of 15:47, 20 August 2021
Summary
The goal of this project was to develop a morphological analyser for the English-Igbo language pair and to produce a usable version that gives intelligible output. After discussing with the mentors how to make the most of Summer of Code, we decided to improve the coverage of the Igbo (ibo) monolingual package as much as possible.
Main Work
Most of the work collected by the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/ifeanyijasper.html .
Most of the work on Igbo (ibo) went into its monolingual dictionary (monodix). This consisted of adding stems to the dictionaries, which let me expand the Igbo analyser from a prototype to one with wide coverage (although it is still not production-ready).
ibo morphological analyser coverage
Corpus | Words | Coverage before | Coverage after |
---|---|---|---|
Wikipedia | 511550 | 19.09% | 68.52% |
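A coverage figure of this kind is typically the share of corpus tokens that receive at least one analysis (naïve coverage). A minimal sketch of that calculation follows, assuming the analyser output is in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown token has an analysis beginning with *. The function name and the sample string are hypothetical, the sample analyses are made up for illustration, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that have at least one analysis."""
    total = known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # Unknown words are marked with a leading asterisk in the analysis.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Made-up analyser output: three known tokens and one unknown token.
sample = "^nwoke/nwoke<n>$ ^xyzzy/*xyzzy$ ^na/na<pr>/na<cnjcoo>$ ^ulo/ulo<n>$"
print(f"{naive_coverage(sample):.2%}")
```

Note that a token counts as covered as soon as it has any analysis, whether or not that analysis is the correct one, which is why the measure is called naïve.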
ibo lexicon sizes Before
Lexicons | Lexicon entries | Patterns | Pattern entries |
---|---|---|---|
20 | 326 | 1 | 10 |
ibo lexicon sizes After
Lexicons | Lexicon entries | Patterns | Pattern entries |
---|---|---|---|
31 | 949 | 4 | 20 |
Future Work
- Add more stems to the ibo monolingual dictionary.
- Add transfer rules, etc.
- Improve the eng-ibo bidix.
Conclusion
It has been a great experience for me working with Apertium over the past ten weeks. Whenever I faced an obstacle, I could get a solution or an explanation from the community. I would like to thank the whole Apertium community, and in particular my mentors, Jonathan Washington, Mikel L. Forcada, and Nick Howell, for their support and mentorship and for pointing me in the right direction.