Difference between revisions of "User:Eden"

African languages are poorly represented in Apertium, and even commercially available options are usually quite lacking. Because Lingala, like most African languages, has little digitized content available, it is hard to build translators with machine learning or other data-driven NLP tools, which need large amounts of data that do not exist for these languages. In such cases, a rule-based MT tool like Apertium becomes the most viable option.


Lingala is a Bantu language, used mainly as a lingua franca in central Africa (mainly in the Democratic Republic of Congo and to some extent in Angola and the Republic of Congo), with over 70 million speakers (https://en.wikipedia.org/wiki/Lingua_franca). I believe that developing an English-Lingala pair will contribute positively to the technological and economic development of these underserved places. Hopefully this translator will serve many people and organizations, from Wikipedia contributors to casual users to other open source software that might need a Lingala translator.

== Lingala resources ==
Given that Lingala is mostly a spoken language, there isn't al
Here is a list of ''open'' and ''public domain'' resources (dictionaries, grammar books, texts, etc.) for the Lingala language:<br/>
- [http://crubadan.org/languages/ln Crubadan text corpus] A text corpus sorted by word frequency<br/>
- The excellent [https://archive.org/details/suggestionsforgr00stap Grammar and dictionary of Bangala] <br/>
- [http://unicode.org/udhr/d/udhr_lin_tones.html Universal Declaration of Human Rights - Lingala (tones)] <br/>
- [https://archive.org/details/ERIC_ED294440/page/n189 Lingala. Livre du formatteur] Lingala teacher's manual (I will have to confirm if this book is in the public domain)<br/>
- [https://archive.org/details/rosettaproject_lin_gen-1 Bible] and [https://archive.org/details/TranslationOfTheMeaningOfTheNobleQuranInTheLINGALABANTULanguageHQJUZZAMMA/page/n25 Quran] can be used as parallel texts.<br/>
- [https://archive.org/details/NotionsDeLingala/page/n29 Notions de Lingala] - Another dictionary plus common Lingala sentences <br/>


== Coding challenge ==
All my work is in my repo: https://github.com/thefreezer/GSOC-apertium-eng-lin <br />
1. Installed Apertium tools<br />
'''Update 1'''<br/>
2. All my work is in my repo: https://github.com/thefreezer/GSOC-apertium-eng-lin <br />
1. Added ~95% of all words from this [https://sourceforge.net/p/apertium/svn/HEAD/tree/branches/xupaixkar/rasskaz/ story]. <br/>
I will add a couple more rules and macros.
2. From the ''493''-word story, my final translation has ''74'' unknown words (*) and ''63'' words with the wrong final form (#). Most of them are verbs, adjectives and adverbs. The original story is [https://github.com/thefreezer/GSOC-apertium-eng-lin/blob/master/story.txt here] and the final output is [https://github.com/thefreezer/GSOC-apertium-eng-lin/blob/master/output.txt here].<br/>
3. Added 8 rules which give me correct translations for:<br/>
* prn/np vblex/vbhaver/vbser det n (e.g. I see a house) with correct present and past (saw) verb tenses
* prn/np vblex/vbhaver/vbser pr det adj n (e.g. Mary eats in the beautiful garden)
* and other rules for dealing with the infinitive form of a verb, and handling the [https://en.wikipedia.org/wiki/Pro-drop_language pro-drop] behavior of the language.
I will try to implement a rule for dealing with the future tense (e.g. I will play ...)<br/>
''Note: a lot of these rules are inspired by the eng-fra pair''


== Work plan ==
(this page will frequently change as I get more familiar with Apertium)
* community bonding period : reading more about transfer-rules and creating a doc for Lingala rules
community bonding period
- reading more about transfer-rules and creating a doc for Lingala rules
* Week 1: adding stems to transducer
- build a better frequency list of Lingala words
* Week 2: work on pronouns and adding adjectives
- reading more about the HFST
* Week 3: filling nouns and adjectives in bilingual dictionary, regression testing

* Week 4: transfer rules for nouns and adjectives
Week 1:
- adding nouns(from frequency list) in the lin transducer
- adding verbs in the lin transducer
- constraint grammar

Week 2:
- adding pronouns and adjectives in the lin transducer
- adding adverbs, conjunctions, prepositions, etc
- constraint grammar for prn and adj

Week 3:
- polishing the transducer to give better analyses(eg.
- filling nouns and adjectives in bilingual dictionary,
- regression testing

Week 4:
- transfer rules for nouns and adjectives
- disambiguation rules


* '''Deliverable #1''' Advanced Lingala transducer with basic bilingual dictionary


Week 5:
* Week 5: continue work on bilingual dictionary, filling verbs
- continue work on bilingual dictionary,
* Week 6: filling pronouns, adverbs, and others
- main work will be on verbs
* Week 7: transfer rules for verbs, pronouns, determiners, adverbs, and others
- transfer rules for verbs
* Week 8: work on disambiguation, lots of testing and improvement of bilingual dictionary(WER < 50%)

Week 6:
- filling pronouns, adverbs, and others in the bidix
- work on compound Lingala words

Week 7:
- adding pronouns, determiners, adverbs, and others in the bidix
- WER < 25% on a 500 word story
- add/polish rules for concordance between verbs and pronouns

Week 8:
- continue work on transfer rules in .t2x and .t3x files
- work on disambiguation,
- lots of testing and improvement of bilingual dictionary
- WER < 25% on a 1000 word story


* '''Deliverable #2''' Advanced bilingual dictionary and transfer rules


Week 9 :
* Week 9 : continue work on disambiguation
- continue work on disambiguation
* Week 10: work on transfer rules, testvoc. goal is WER < 40%(is this achievable?)
- testvoc and improvements
* Week 11: continue work on transfer rules and testing, Wikipedia translations

* Week 12: detailed analysis of work completed(wiki), evaluation of results and documentation
Week 10:
- work on transfer rules,
- goal is WER < 20% on a story longer than 1000 words (is this achievable?)

Week 11:
- continue work on transfer rules and testing,
- Wikipedia article translations

Week 12:
- detailed analysis of work completed(wiki),
- evaluation of results and documentation


* '''Project completed''' Goal is to have a WER < 35%
* '''Project completed''' Goal is to have a WER < ~25% on most texts


== Skills and qualifications ==
Relevant technical skills: Python (online data mining, inferential statistics, numpy, pandas, matplotlib), C++ (proficient), SQL (elementary), git (proficient), bash (proficient), HTML5/CSS3 (advanced)<br />
Work experience: as an intern created static and dynamic websites<br />
Languages: French(native), English(native), Lingala(Fluent), Swahili(proficient), Tshiluba(proficient), Twi(elementary)<br />
Languages: French(native), Lingala(native), English(fluent), Swahili(proficient), Tshiluba(proficient), Twi(elementary)<br />


== Non-Summer-of-Code plans ==

Revision as of 03:03, 4 April 2019


Contact Information

Name: Eden-Grace Muamba
Location: Alberta, Canada
University: University of Alberta
E-mail: nzambieden@gmail.com
IRC: eden__
Timezone: UTC -7
Github: https://github.com/thefreezer

My goal

I’m planning to start the ‘English-Lingala’ language pair.

Why am I interested in Apertium?

Apertium is at the intersection of computers and languages, which are two of my passions. This will be my first ever contribution to an open source project. In the short time I have been on IRC and the mailing list, the Apertium community has made it a fun and enjoyable experience for me. I hope not only to develop an English-Lingala pair but also to become a long-time contributor to Apertium, mainly by creating new English/French-African language pairs.

Who will benefit and why should it get sponsored

African languages are poorly represented in Apertium, and even commercially available options are usually quite lacking. Because Lingala, like most African languages, has little digitized content available, it is hard to build translators with machine learning or other data-driven NLP tools, which need large amounts of data that do not exist for these languages. In such cases, a rule-based MT tool like Apertium becomes the most viable option.

Lingala is a Bantu language, used mainly as a lingua franca in central Africa (mainly in the Democratic Republic of Congo and to some extent in Angola and the Republic of Congo), with over 70 million speakers (https://en.wikipedia.org/wiki/Lingua_franca). I believe that developing an English-Lingala pair will contribute positively to the technological and economic development of these underserved places. Hopefully this translator will serve many people and organizations, from Wikipedia contributors to casual users to other open source software that might need a Lingala translator.

Lingala resources

Given that Lingala is mostly a spoken language, there isn't al
Here is a list of open and public domain resources (dictionaries, grammar books, texts, etc.) for the Lingala language:
- Crubadan text corpus A text corpus sorted by word frequency
- The excellent Grammar and dictionary of Bangala
- Universal Declaration of Human Rights - Lingala (tones)
- Lingala. Livre du formatteur Lingala teacher's manual (I will have to confirm if this book is in the public domain)
- Bible and Quran can be used as parallel texts.
- Notions de Lingala - Another dictionary plus common Lingala sentences
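
A frequency list like the one the Crubadan corpus provides could also be built from any of the plain-text resources above. The following is only a minimal sketch in Python: the function name frequency_list and the tokenization rule are assumptions made for illustration, not part of the proposal or of any Apertium tool. It treats a word as a run of letters, optionally carrying combining tone marks, and ignores everything else.

```python
import re
from collections import Counter

# A word is a letter followed by more letters or combining diacritics
# (U+0300-U+036F), so tone-marked vowels stay attached to their word.
# This tokenization rule is an assumption made for illustration.
WORD_RE = re.compile(r"[^\W\d_](?:[^\W\d_]|[\u0300-\u036f])*")


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    words = WORD_RE.findall(text.lower())
    return Counter(words).most_common()


if __name__ == "__main__":
    # Placeholder input; a real run would read one of the corpora instead.
    sample = "na na ye. Ye na!"
    for word, count in frequency_list(sample):
        print(count, word)
```

Sorting by frequency first means the most common stems get added to the dictionaries first, which is where coverage improves fastest.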

Coding challenge

All my work is in my repo: https://github.com/thefreezer/GSOC-apertium-eng-lin
Update 1
1. Added ~95% of all words from this story.
2. From the 493-word story, my final translation has 74 unknown words (*) and 63 words with the wrong final form (#). Most of them are verbs, adjectives and adverbs. Original story is here and here is the final output.
3. Added 8 rules which give me correct translations for:

  • prn/np vblex/vbhaver/vbser det n (e.g. I see a house) with correct present and past (saw) verb tenses
  • prn/np vblex/vbhaver/vbser pr det adj n (e.g. Mary eats in the beautiful garden)
  • and other rules for dealing with the infinitive form of a verb, and handling the pro-drop behavior of the language.

I will try to implement a rule for dealing with the future tense (e.g. I will play ...)
Note: a lot of these rules are inspired by the eng-fra pair
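
The counts reported in Update 1 can be reproduced by scanning the translator output for tokens that start with * (unknown word) or # (wrong final form). The sketch below is a rough illustration in Python, not the script actually used: the function names count_marks and percent are hypothetical, and it assumes whitespace-separated tokens with the mark as the first character.

```python
def count_marks(text):
    """Count tokens flagged with * (unknown) or # (wrong final form)."""
    tokens = text.split()
    unknown = sum(1 for t in tokens if t.startswith("*"))
    wrong_form = sum(1 for t in tokens if t.startswith("#"))
    return {"tokens": len(tokens), "unknown": unknown, "wrong_form": wrong_form}


def percent(count, total):
    """Share of count in total as a percentage (0.0 for an empty text)."""
    return 100.0 * count / total if total else 0.0


if __name__ == "__main__":
    # Figures taken from Update 1: 74 unknown and 63 wrong-form words
    # in a 493-word story.
    print(round(percent(74, 493), 1))
    print(round(percent(63, 493), 1))
```

With the figures above, roughly 15% of the story's words are unknown and roughly 13% come out in the wrong form.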

Work plan

(this page will frequently change as I get more familiar with Apertium)

community bonding period 
- reading more about transfer-rules and creating a doc for Lingala rules
- build a better frequency list of Lingala words
- reading more about the HFST
Week 1: 
- adding nouns(from frequency list) in the lin transducer
- adding verbs in the lin transducer
- constraint grammar
Week 2:
- adding pronouns and adjectives in the lin transducer 
- adding adverbs, conjunctions, prepositions, etc
- constraint grammar for prn and adj
Week 3:  
- polishing the transducer to give better analyses(eg. 
- filling nouns and adjectives in bilingual dictionary, 
- regression testing
Week 4:  
- transfer rules for nouns and adjectives
- disambiguation rules
  • Deliverable #1 Advanced Lingala transducer with basic bilingual dictionary
Week 5:  
- continue work on bilingual dictionary,
- main work will be on verbs
- transfer rules for verbs
Week 6:  
- filling pronouns, adverbs, and others in the bidix
- work on compound Lingala words
Week 7: 
- adding pronouns, determiners, adverbs, and others in the bidix
- WER < 25% on a 500 word story
- add/polish rules for concordance between verbs and pronouns
Week 8: 
- continue work on transfer rules in .t2x and .t3x files
- work on disambiguation, 
- lots of testing and improvement of bilingual dictionary
- WER < 25% on a 1000 word story
  • Deliverable #2 Advanced bilingual dictionary and transfer rules
Week 9 :
- continue work on disambiguation
- testvoc and improvements
Week 10:
- work on transfer rules, 
- goal is WER < 20% on a story longer than 1000 words (is this achievable?)
Week 11:
- continue work on transfer rules and testing, 
- Wikipedia article translations
Week 12:
- detailed analysis of work completed(wiki),
- evaluation of results and documentation
  • Project completed Goal is to have a WER < ~25% on most texts
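
The WER (word error rate) targets in this plan compare the translator output against a reference translation. A minimal sketch of the usual definition, in Python, is the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference. The function name word_error_rate is hypothetical, and splitting on whitespace is an assumption; a real evaluation would likely normalize punctuation and case first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER below 25% on a 1000-word story means fewer than 250 word edits are needed to turn the output into the reference.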

Skills and qualifications

Ongoing major: first-year Computer Science student with a minor in Statistics
Relevant technical skills: Python (online data mining, inferential statistics, numpy, pandas, matplotlib), C++ (proficient), SQL (elementary), git (proficient), bash (proficient), HTML5/CSS3 (advanced)
Work experience: as an intern created static and dynamic websites
Languages: French(native), Lingala(native), English(fluent), Swahili(proficient), Tshiluba(proficient), Twi(elementary)

Non-Summer-of-Code plans

Traveling to Ontario for 5 days from June 29, but that will not affect my work. I’m committed to putting in at least 40 hours a week for the duration of the project.