Difference between revisions of "User:Ifeanyi/GSoC2021 Final Report"

From Apertium
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
==Summary==
 
This project started with a proposal initially named "State-of-the-art Morphological Analyser for Uzbek language and improved language pairs uz-kk, uz-ky, uz-tr". After discussing with mentors how to make the best of Summer of Code, we decided to cover the Uzbek monolingual package as much as possible, together with the Turkish-Uzbek translation pair.
+
The goal of this project was to develop a morphological analyser for the English-Igbo language pair and produce a usable version that gives intelligible output. After discussing with mentors how to make the best of Summer of Code, we decided to improve the coverage of the ibo monolingual package as much as possible.
   
  +
==Main Work==
To calculate the coverage of the Uzbek (apertium-uzb) analyser, Uzbek Wikipedia data dated 20.05.2020 with 136K articles (around 13M tokens) was chosen. To calculate the trimmed coverage (coverage of a pair limited to the words in the dictionary) of the Turkish-Uzbek (apertium-tur-uzb) translation pair, a collection of Turkish data from the Southeast European Times (SETimes) website was used (around 3.7M tokens).
 
To calculate the word error rate (WER) and position-independent word error rate (PER) of the tur-uzb pair, a parallel corpus was created; in our case, "James and Mary Story" (~40 sentences) was chosen.
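
The two metrics named above can be illustrated with a minimal sketch. It assumes whitespace-tokenised sentences and uses one common formulation of PER (bag-of-words matching); it is not necessarily the exact computation done by the evaluation tool used for this report, and the function names and sample sentences are hypothetical.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate (one common formulation):
    words are matched as multisets, so word order is ignored."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Hypothetical example: same words, two of them swapped.
reference = "James went home early".split()
hypothesis = "went James home early".split()
print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"PER = {per(reference, hypothesis):.2%}")
```

Because PER ignores word order, a translation with the right words in the wrong positions scores well on PER but poorly on WER, which is why the two are usually reported together.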
 
 
There are still many tasks to finish, such as creating vocabulary tests (aka Testvoc) and adding more lexical selection rules (see #Future Work).
 
 
Overall, a lot of work has gone into both the Uzbek monolingual and the Turkish-Uzbek translation packages. The results indicate that the goals initially set for coverage have been met, but the WER/PER results still have to be improved.
 
 
==Repos==
 
All the contributions can be found at the following repositories:
 
* Apertium Turkish monolingual package:
 
** https://github.com/apertium/apertium-tur
 
* Apertium Uzbek monolingual package:
 
** https://github.com/apertium/apertium-uzb
 
* Apertium Turkish-Uzbek translation package:
 
** https://github.com/apertium/apertium-tur-uzb
 
   
 
Most of the work collected at the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/ifeanyijasper.html
   
  +
Most of the work on the ibo language went into its monodix. By adding stems to the dictionary, I was able to expand the coverage of the Igbo analyser from a prototype to one with wide coverage (although it is still not production-ready).
Note that some pull requests have not been merged yet, such as these:
 
* https://github.com/apertium/apertium-tur/pull/5
 
* https://github.com/apertium/apertium-uzb/pull/11
 
* https://github.com/apertium/apertium-tur-uzb/pull/4
 
   
  +
===ibo morphological analyser coverage===
==Main Work==
 
   
  +
{| class="wikitable"
Most of the work on the Uzbek language went into its monodix, which reached more than 55K stems and above 90% coverage on Uzbek Wikipedia. In addition to the newly added entries, entries with wrong tags have been fixed too. There is still some work to do on the Uzbek monodix: it has to be reorganized and cleaned.
 
  +
|-
Furthermore, there were new additions and some fixes to the Turkish monodix as well.
 
  +
! Corpus
  +
! Words
  +
! Coverage before
  +
! Coverage after
  +
|-
  +
| Wikipedia
  +
| 511550
  +
| 19.09%
  +
| 68.52%
  +
|}
   
Another major part of the work accomplished during this project is the bilingual dictionary (bidix) of the tur-uzb pair, which now has more than 12K translations and has passed 85% trimmed coverage on the SETimes corpus. Many of the newly added bidix entries are the most frequent words in the same corpus on which trimmed coverage is calculated. The remaining words are less frequent, but are still planned to be added in the future.
 
   
  +
===ibo lexicon sizes Before===
== Progress Table==
 
   
{|class=wikitable style="text-align: center;"
+
{| class="wikitable"
 
|-
 
|-
  +
! Lexicons
!colspan="2"|Week
 
  +
! Lexicon entries
!colspan="2"|Stems
 
  +
! Patterns
!colspan="2"|Tur-Uzb
 
  +
! Pattern entries
!colspan="2"|Naïve Coverage
 
!colspan="2"|Progress
 
|-
 
! №
 
! Dates
 
! uzb
 
! tur-uzb
 
! WER
 
! PER
 
! uzb
 
! tur-uzb
 
!Evaluation
 
!Notes
 
|-
 
| 0
 
| May 4 - May 31
 
| 34375
 
| 2412
 
| 90.80 %
 
| 81.60 %
 
| 89.57 %
 
| 72.14 %
 
|Initial evaluation
 
| As of the end of May
 
|-
 
| 5
 
| June 29 - July 5
 
| 34373
 
| 2445
 
| 84.45 %
 
| 76.80 %
 
| 90.23 %
 
| 72.14 %
 
| First Evaluation
 
| End of June - ~July 3
 
|-
 
| 9
 
| July 27 - Aug 2
 
| 34424
 
| 4191
 
| 78.70 %
 
| 68.34 %
 
| 90.23 %
 
| 72.74 %
 
| Second Evaluation
 
| As of July 31 - Aug 1
 
 
|-
 
|-
  +
| 20
  +
| 326
  +
| 1
 
| 10
 
| 10
  +
|}
| July 3 - Aug 9
 
  +
| 35621
 
  +
===ibo lexicon sizes After===
| 5639
 
  +
| 78.70 %
 
  +
{| class="wikitable"
| 68.64 %
 
| 90.28 %
 
| 80.14 %
 
| Weekly evaluation
 
| Week #10
 
 
|-
 
|-
  +
! Lexicons
| 11
 
  +
! Lexicon entries
| Aug 10 - Aug 16
 
  +
! Patterns
| 37649
 
  +
! Pattern entries
| 8154
 
| 78.70 %
 
| 68.64 %
 
| 90.46 %
 
| 83.08 %
 
| Weekly evaluation
 
| Week #11
 
|-
 
| 12
 
| Aug 17 - Aug 23
 
| 57406
 
| 13023
 
| 78.70 %
 
| 68.64 %
 
| 90.91 %
 
| 86.02 %
 
| Weekly evaluation
 
| Week #12
 
|-
 
| 13
 
| Aug 24 - Aug 30
 
| 58757
 
| 12861
 
| 78.70 %
 
| 68.64 %
 
| 90.94 %
 
| 86.03 %
 
| Final evaluation
 
| As of Aug 31
 
 
|-
 
|-
  +
| 31
  +
| 949
  +
| 4
  +
| 20
 
|}
 
|}
   
 
==Future Work==
 
  +
* Add more stems to ibo monolingual dictionary
* TESTVOC. Due to a lack of time at the end of the project, vocabulary testing was left unfinished.
 
  +
* Add transfer rules, etc.
* LEXICON-OV-ICH, the proper lexical rule for Uzbek cognomens and patronyms, where a cognomen is formed as Anthroponym+[o/e]v(a) and a patronym as Anthroponym+[o/e]v[ich/na].
 
  +
* Improve work in eng-ibo bidix.
* Apertium-Separable, reordering separable/discontiguous multiword elements (MWEs) has to be done by moving all MWEs to the lsx file.
 
  +
* Reordering and cleaning Uzbek monodix. It has some entries with wrong tags and lots of duplicate entries.
 
* Lexical selection rules. This also helps a lot to reduce WER.
 
   
 
==Conclusion==
 
It has been a great experience for me working with Apertium over the past three months. Whenever I faced an obstacle, I could get a solution or an explanation from the community. Special thanks to @Firespeaker and @Piraye for always fixing my issues and pointing me in the right direction. I hope to finish everything that remains and see this pair released soon. I plan to work with Apertium on more projects in the future.
+
It has been a great experience for me working with Apertium over the past ten weeks. Whenever I faced an obstacle, I could get a solution or an explanation from the community. I would like to thank the whole Apertium community, and specifically my mentors, Jonathan Washington, Mikel L. Forcada, and Nick Howell, for their support, mentorship, and for pointing me in the right direction.

Latest revision as of 15:47, 20 August 2021

==Summary==

The goal of this project was to develop a morphological analyser for the English-Igbo language pair and produce a usable version that gives intelligible output. After discussing with mentors how to make the best of Summer of Code, we decided to improve the coverage of the ibo monolingual package as much as possible.

==Main Work==

Most of the work collected at the end of the GSoC program can be found here: https://apertium.projectjj.com/gsoc2021/ifeanyijasper.html

Most of the work on the ibo language went into its monodix. By adding stems to the dictionary, I was able to expand the coverage of the Igbo analyser from a prototype to one with wide coverage (although it is still not production-ready).

===ibo morphological analyser coverage===

{| class="wikitable"
|-
! Corpus
! Words
! Coverage before
! Coverage after
|-
| Wikipedia
| 511550
| 19.09%
| 68.52%
|}
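
Coverage figures like these are typically the share of corpus tokens that receive at least one analysis. The following is a minimal sketch of that calculation, assuming analyser output in the Apertium stream format (`^surface/analysis$`), where an unknown token has an analysis starting with `*`. The function name and the sample string are hypothetical, escaped characters in the stream are not handled, and this is not the script used to produce the numbers in this report.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that have at least one known analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        first_analysis = match.group(2).split("/")[0]
        # Unknown words are echoed back with a leading asterisk.
        if not first_analysis.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: two known tokens, two unknown tokens.
sample = "^ụlọ/ụlọ<n>$ ^xyz/*xyz$ ^na/na<pr>/na<cnjcoo>$ ^abc/*abc$"
print(f"{naive_coverage(sample):.2%}")  # prints 50.00%
```

A token with several analyses still counts once, so this measures how many tokens the analyser recognises, not whether the analyses are correct.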


===ibo lexicon sizes Before===

{| class="wikitable"
|-
! Lexicons
! Lexicon entries
! Patterns
! Pattern entries
|-
| 20
| 326
| 1
| 10
|}

===ibo lexicon sizes After===

{| class="wikitable"
|-
! Lexicons
! Lexicon entries
! Patterns
! Pattern entries
|-
| 31
| 949
| 4
| 20
|}

==Future Work==

  • Add more stems to ibo monolingual dictionary
  • Add transfer rules, etc.
  • Improve work in eng-ibo bidix.


==Conclusion==

It has been a great experience for me working with Apertium over the past ten weeks. Whenever I faced an obstacle, I could get a solution or an explanation from the community. I would like to thank the whole Apertium community, and specifically my mentors, Jonathan Washington, Mikel L. Forcada, and Nick Howell, for their support, mentorship, and for pointing me in the right direction.