Difference between revisions of "User:Eden/GSOC2020Proposal English-Swahili"


Latest revision as of 13:21, 31 March 2020

My goal

Create a usable ‘English-Swahili’ language pair.
Building on last year's work on the eng-lin pair, there are two main areas I will improve on this year: daily communication with my mentors and having enough Swahili language data.

Why am I interested in Apertium?

Apertium sits at the intersection of computers and languages, which are two of my passions. I believe Apertium is the perfect platform for building translation tools for under-resourced languages. My primary focus is on Bantu languages, which can all be classified as under-resourced. Apertium allows me to create translation tools and dictionaries (in practice, digitizing paper dictionaries) for these languages.


Who will benefit and why should it get sponsored

African languages are poorly represented in Apertium, and even the commercially available options are usually quite lacking. Swahili, like most African languages, has little accessible digitized content, so it is hard to use machine learning or other data-driven NLP methods to build tools for it: the massive amounts of data those methods need simply do not exist. In such cases, a rule-based MT tool like Apertium becomes the most viable option.
Swahili is a Bantu language spoken mainly in Tanzania, Kenya, Uganda, DRC, Burundi, and Mozambique by well over 100 million people.
Various translation tools exist for Swahili, but they are mostly proprietary. This project will be the first of its kind: an open source translation tool that gives the public an open solution for working with Swahili.
In short, the project will result in the biggest, first-of-its-kind open source toolset for working with Swahili (morphological analyzer, English-Swahili dictionary, ...).

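Swahili resources

Here is a list of open and public domain resources (dictionaries, grammar books, texts, etc.) for Swahili:

  • Corpus/frequency list/bigram

- ~7m word corpus (needs a little more work): https://github.com/thefreezer/apertium-swa/blob/master/dev/wikipedia_corpus.txt
- An Crúbadán: http://crubadan.org/writingsystems

  • Dictionary

- Swa-Eng: https://kamusi.org/swahili-english-wordlist-2008
- Eng-Swa: https://kamusi.org/english-swahili-wordlist-2008
- Madan A. C., 1846: https://archive.org/details/swahilienglishdi00mada/page/n15/mode/2up
- Madan A. C., 1902: https://archive.org/details/englishswahilid00madagoog/page/n13/mode/2up
- Charles, W. R.: https://archive.org/details/Swahili-englishDictionary/mode/2up
- Freedict: https://github.com/freedict/fd-dictionaries/tree/master/swh-eng

  • Grammar rules

- Wikipedia's Grammar Rules: https://en.wikipedia.org/wiki/Swahili_grammar
- Burt, A. E., 1910: https://archive.org/details/swahiligrammarvo00burtiala/page/96/mode/2up
- Follome: https://archive.org/details/ERIC_ED012888/page/n123/mode/2up
- Steerie, Edward: https://archive.org/details/ERIC_ED012888/page/n123/mode/2up

  • Other

- Language Archive: http://www.language-archives.org/language/swh
- WALS: https://wals.info/languoid/lect/wals_code_swa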
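
As an illustration of how a frequency list and bigram counts could be derived from a plain-text corpus such as the Wikipedia one listed above, here is a minimal Python sketch. The tokenizer, the function names and the sample sentences are assumptions made for this example; they are not taken from the apertium-swa repository, and a real run would read the corpus file instead of a short string.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return word tokens.

    Assumption: a word is a run of letters, optionally joined by
    apostrophes so that forms like "ng'ombe" stay in one piece.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def bigram_counts(tokens):
    """Return ((word1, word2), count) pairs for adjacent words."""
    return Counter(zip(tokens, tokens[1:])).most_common()


# Hypothetical sample text standing in for the real corpus.
sample = "Mtoto anasoma kitabu. Mtoto anapenda kitabu kizuri."
tokens = tokenize(sample)

for word, count in frequency_list(tokens)[:3]:
    print(word, count)

for pair, count in bigram_counts(tokens)[:3]:
    print(pair, count)
```

The most frequent words from such a list are natural candidates for the first entries added to the transducer and bidix.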

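Coding challenge

- All my work is in two main repos: apertium-swa (https://github.com/thefreezer/apertium-swa) and apertium-swa-eng (https://github.com/thefreezer/apertium-swa-eng)
- PR on apertium-swa-eng (total rewrite): https://github.com/apertium/apertium-swa-eng/pull/1
- All noun classes have already been set up in the transducer
- A couple of nouns are already in the transducer and bidix
- The goal is to start writing transfer rules from April 1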

Work plan

Community bonding period (May 4-June 1)

- Clean the Wikipedia corpus
- Continue work on transfer rules; target WER < 50% on a short story in the swa-eng direction
- Extract data from dictionaries
Week 1 (June 1-7):
- Add nouns (from frequency list) in the swa transducer
- Work on vowels
- Constraint grammar for nouns
- Add verbs
Week 2 (June 8-14):
- Add pronouns and adjectives in the swa transducer
- Continue work on verbs
- Reference: kaz and lin transducers
- Add prepositions, pronouns, and conjunctions
- Work on numerals
- CG for all the above
Week 3 (June 15-21):
- Regression testing
- Test and polish transducer (work on bi-grams)
- Finish adding adverbs, conjunctions, prepositions, etc.
- Start work on bilingual dictionary
Week 4 (June 22-28):
- Add nouns and adjectives in bidix
- Transfer rules for nouns and adjectives (both directions)
- Disambiguation rules
  • Deliverable #1 (June 29): Advanced Swahili transducer (>10k entries) with basic bilingual dictionary
Week 5 (June 29-July 5):
- Continue work on bidix: add nouns and verbs 
- Focus on verbs
- Transfer rules from eng-lin, kaz-eng, and eng-fre
- Transfer rules for verbs in both directions
Week 6 (July 6-12):
- Add pronouns and transfer rules for them
- Add adverbs
- Work on compound Swahili words
- Transfer rules for pronouns, adverbs and compound nouns (both directions)
Week 7 (July 13-19):
- Goal: well-defined macros for verbs and pronouns
- WER < 35% on a 500-word story
- Add/polish rules for concordance between verbs and pronouns
Week 8 (July 20-26):
- Continue work on transfer rules
- Work on disambiguation rules
- Lots of testing and improvements
- WER < 30% in both directions on a 1,000-word story
  • Deliverable #2 (July 3): Advanced bilingual dictionary (~15,000 words) and transfer rules ...
Week 9 (July 27-August 2):
- Continue work on disambiguation (both directions)
- Testvoc and improvements
- Filling bidix
Week 10 (August 3-9):
- Work on transfer rules
- Goal: WER ~30% on a story longer than 1,000 words
Week 11 (August 10-16):
- Continue work on transfer rules and testing
- Wikipedia article translations
- Continue filling bidix
Week 12 (August 17-23):
- Continue filling bidix with miscellaneous words
- Detailed analysis of work completed (wiki)
- (If the work goes well, start working on new pairs)
- Evaluation of results and documentation
  • Submit Code and Final Evaluations (August 24-31): WER < 30% (with ~20,000 words in bidix) in both directions on most texts
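
The WER (word error rate) targets in the plan above could be checked with a small script along the following lines. This is only a sketch under stated assumptions: it uses the standard definition of WER as the word-level edit distance between a reference translation and the system output, divided by the number of reference words, with plain whitespace tokenization. It is not the evaluation tool Apertium itself provides, and the function name and the example sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumption: both inputs are plain strings and words are separated
    by whitespace; no normalization of case or punctuation is done.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from output
                curr[j - 1] + 1,     # extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one missing word.
reference = "the child reads a book"
hypothesis = "the child read book"
print(word_error_rate(reference, hypothesis))  # 2 edits / 5 reference words = 0.4
```

A target such as "WER < 30%" then corresponds to this function returning a value below 0.30 for the post-edited story against the raw system output.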

Skills and qualifications

Ongoing major: second-year Computer Science student with a minor in Math
Relevant technical skills: Python, C/C++, SQL (intermediate), Git (intermediate), Bash (intermediate), HTML5/CSS3 (advanced)
Languages: French (native), Lingala (native), English (fluent), Swahili (proficient), Tshiluba (proficient), Twi (elementary)

Non-Summer-of-Code plans

None.