User:Prondubuisi/GSOC 2020 proposal:English-Igbo pair

From Apertium


Contact Information

Name: Ndubuisi Onyemenam
Location: Owerri, Nigeria
University: Imo State University, Owerri
E-mail: onyemenamndu@gmail.com
IRC: Prondubuisi
Timezone: UTC +1
Github: https://github.com/prondubuisi
Linkedin: https://www.linkedin.com/in/onyemenamndu/

My goal

My goal is to work on the existing Apertium English-Igbo pair and bring it to a standard where it is usable. This work will be bidirectional. I also hope to generate interest in the Apertium project for Nigerian and other African languages during the course of my work.


Why am I interested in Apertium?

Open source projects have always been my go-to place for building skills and for accessing curated, crowdsourced information. Through its English-Igbo translation project, Apertium offers me a chance to:

- improve my language skills (Igbo and English)

- understand how machine translation works

- contribute my quota to the sustenance of my native language (Igbo)

- lead an initiative to get other Nigerian/African language and open source enthusiasts involved in immortalizing our native languages

Who will benefit and why should it get sponsored

Most available English-to-Igbo translators do a poor job, as there is little data available for training machine learning models for this translation (access to data is a huge problem in most parts of Africa). Apertium's rule-based machine translation model is a handy alternative in this setting. Being free and open source also makes it easy for native speakers and other language users to help improve translations.

The benefit is that native Igbo speakers looking to communicate in English will be able to do so, as will English speakers who want to communicate in Igbo. This project will also go a long way toward documenting the Igbo language, which has been named an endangered language by UNESCO. It is worth noting that Igbo is spoken by over 27 million people, and there is a strong Igbo presence in different parts of Europe and America. This project will help children born to Igbos in the diaspora learn their native tongue.

The ripple effects will be increased cultural exchange as well as technological and economic improvements. It will also make it easier for people to translate information and learning materials from English to Igbo and vice versa.



Lingala resources

Here is a list of open and public domain resources (dictionaries, grammar books, texts, etc.) for the Lingala language:
- Crubadan text corpus: a text corpus sorted by word frequency
- The excellent Grammar and dictionary of Bangala
- Universal Declaration of Human Rights - Lingala (tones)
- Lingala. Livre du formatteur: Lingala teacher's manual (I will have to confirm whether this book is in the public domain)
- The Bible and the Quran can be used as parallel texts.
- Notions de Lingala: another dictionary plus common Lingala sentences

Coding challenge

All my work is in my repo: https://github.com/thefreezer/GSOC-apertium-eng-lin
Update 1: Apr/1/19
1. Added ~95% of all words from this story.
2. From the 493-word story, my final translation has 74 unknown words (*) and 63 words with the wrong final form (#). Most of them are verbs, adjectives and adverbs. Original story is here and here is the final output.
3. (eng-lin) Added 8 rules which give me correct translations for:

  • prn/np vblex/vbhaver/vbser det n (e.g. I see a house) with correct present and past (saw) verb tenses
  • prn/np vblex/vbhaver/vbser pr det adj n (e.g. Mary eats in the beautiful garden)
  • and other rules for dealing with the infinitive form of a verb and handling the pro-drop behavior of the language.

I will try to implement a rule for dealing with the future tense (e.g. I will play ...)
Note: a lot of these rules are inspired by the eng-fra pair
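
The counts of unknown words (*) and wrong final forms (#) reported above can be reproduced from the translator's output with a short script. The following is only a minimal sketch: the function name `count_markers` is hypothetical, the sample sentence is made up for illustration, and it assumes the markers appear at the start of whitespace-separated tokens.

```python
def count_markers(translation: str) -> dict:
    """Count tokens flagged as unknown (*) or wrongly generated (#).

    Assumes each marker is the first character of a whitespace-separated
    token, as in the output described in this update.
    """
    tokens = translation.split()
    unknown = sum(1 for token in tokens if token.startswith("*"))
    wrong_form = sum(1 for token in tokens if token.startswith("#"))
    return {"total": len(tokens), "unknown": unknown, "wrong_form": wrong_form}


# Made-up example output, not taken from the actual story.
sample = "the *kitten is #sleep in the garden"
print(count_markers(sample))
```

Tracking these two numbers after each dictionary or rule change gives a quick signal of progress before a full reference translation is available.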

It would be good to be able to evaluate WER, so a correct Lingala version of the story would be very useful —Firespeaker (talk) 03:50, 8 April 2019 (CEST).

Update 2: Apr/7/19
1. Added the full Lingala translation here.
2. lin-eng: 75.27% WER
3. eng-lin: 85.65% WER (I mostly focused on lin-eng, which explains why the WER is higher in this direction)
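
For reference, WER (word error rate) is the word-level edit distance between the machine output and a correct reference translation, divided by the number of words in the reference. The sketch below shows one common way to compute it; the function name `word_error_rate` is hypothetical, it tokenizes on whitespace only, and it is not necessarily the tool used to obtain the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one missing word against a 4-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 100% when the output contains many extra words.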

Work plan

Community bonding period:
- reading more about transfer rules and creating a doc for eng-lin and lin-eng rules
- build a better frequency list of Lingala words
- reading more about HFST
- continue work and achieve a WER < 50% in at least one direction
Week 1: 
- adding nouns (from frequency list) in the lin transducer
- adding verbs with correct tenses in the lin transducer
- constraint grammar
Week 2:
- adding pronouns and adjectives in the lin transducer 
- also adding adverbs, conjunctions, prepositions, etc
- constraint grammar for prn and adj
Week 3:  
- polishing the transducer to give better analyses
- filling nouns and adjectives in bilingual dictionary
- regression testing
Week 4:  
- transfer rules for nouns and adjectives (both directions)
- disambiguation rules
  • Deliverable #1 Advanced Lingala transducer with basic bilingual dictionary
Week 5:  
- continue work on bilingual dictionary
- main work will be on verbs
- transfer rules for verbs in both directions
Week 6:  
- filling pronouns, adverbs, and others in the bidix
- work on compound Lingala words
- transfer rules for pronouns, adverbs and compound nouns (both directions)
Week 7: 
- adding determiners and more adjectives in the bidix
- WER < 35% on a 500 word story
- add/polish rules for concordance between verbs and pronouns
Week 8: 
- continue work on transfer rules in .t2x and .t3x files
- work on disambiguation (eng-lin, lin-eng)
- lots of testing and improvement of bilingual dictionary
- WER < 30% in both directions on a 1000 word story
  • Deliverable #2 Advanced bilingual dictionary (~5,000 words) and transfer rules
Week 9:
- continue work on disambiguation (both directions)
- testvoc and improvements
- filling bidix (common nouns)
Week 10:
- work on transfer rules
- goal is WER < 30% on a story longer than 1000 words (is this achievable?)
Week 11:
- continue work on transfer rules and testing
- Wikipedia article translations
- continue filling bidix
Week 12:
- filling bidix with miscellaneous words 
- detailed analysis of work completed (wiki)
- evaluation of results and documentation
  • Project completed: WER ~30% (with ~7,000 words in bidix) in both directions on most texts
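
The frequency list planned for the community bonding period could be built from any plain-text corpus with a few lines of Python. This is only a sketch under stated assumptions: the function name `frequency_list` is hypothetical, the text is normalized to NFC so that precomposed accented letters stay inside a word, and tone marks that have no precomposed form would still need extra handling.

```python
import re
import unicodedata
from collections import Counter


def frequency_list(text: str, top_n: int = 10) -> list:
    """Return the top_n (word, count) pairs, most frequent first."""
    # NFC merges a base letter and a combining accent into a single
    # character whenever a precomposed form exists.
    normalized = unicodedata.normalize("NFC", text).lower()
    # A word is a run of letters, optionally joined by ' or - inside.
    words = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", normalized)
    return Counter(words).most_common(top_n)


# Made-up English sample, used only to show the output shape.
print(frequency_list("the cat and the dog and the bird", top_n=2))
```

Words near the top of such a list are the ones worth adding to the transducer and the bidix first, since they contribute most to coverage.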

Skills and qualifications

Ongoing major: first-year Computer Science student with a minor in Statistics
Relevant technical skills: Python (online data mining, inferential statistics, numpy, pandas, matplotlib), C++ (elementary), SQL (intermediate), Git (intermediate), Bash (intermediate), HTML5/CSS3 (advanced)
Work experience: created static and dynamic websites as an intern
Languages: French (native), Lingala (native), English (fluent), Swahili (proficient), Tshiluba (proficient), Twi (elementary)

Non-Summer-of-Code plans

I will be traveling to Ontario for 5 days from June 29, but that will not affect my work. I have no other commitments, which will allow me to put in 40+ hours a week for the duration of the project.