Difference between revisions of "User:Eden/GSOC2020Proposal English-Swahili"
Latest revision as of 13:21, 31 March 2020
== My goal ==
Create a usable ‘English-Swahili’ language pair.
Based on last year's work on the eng-lin pair, there are two main areas I will improve on this year: communicating daily with my mentors and having enough Swahili language data.
== Why am I interested in Apertium? ==
Apertium sits at the intersection of computers and languages, which are two of my passions. I believe Apertium is the perfect platform for building translation tools for under-resourced languages. My primary focus is on Bantu languages, which can all be correctly classified as under-resourced. Using Apertium allows me to create translation tools and dictionaries (in practice, digitizing paper dictionaries) for these languages.
== Who will benefit and why should it get sponsored ==
African languages are poorly represented in Apertium, and even the commercially available options are usually quite lacking. Swahili, like most African languages, has little accessible digitized content, so it is hard to build tools for it with machine learning or data-driven NLP methods: the massive amounts of data those methods need simply do not exist. In such cases, a rule-based MT tool like Apertium becomes the most viable option.
Swahili is a Bantu language spoken mainly in Tanzania, Kenya, Uganda, DRC, Burundi, and Mozambique by well over 100 million people.
Various translation tools exist for Swahili, but they are mostly proprietary. This will be the first open source translation tool of its kind for Swahili, giving the public an open source solution for working with the language.
In short, the project will result in the biggest open source toolset for working with Swahili, and the first of its kind (morphological analyzer, English-Swahili dictionary, etc.).
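== Swahili resources ==
Here is a list of open and public domain resources (dictionaries, grammar books, texts, etc.) for Swahili:
- Corpus/frequency list/bigram
  - [https://github.com/thefreezer/apertium-swa/blob/master/dev/wikipedia_corpus.txt ~7m word corpus] (needs a little more work)
  - [http://crubadan.org/writingsystems An Crúbadán]
- Dictionary
  - [https://kamusi.org/swahili-english-wordlist-2008 Swa-Eng] and [https://kamusi.org/english-swahili-wordlist-2008 Eng-Swa]
  - [https://archive.org/details/swahilienglishdi00mada/page/n15/mode/2up Madan A.C.,1846], [https://archive.org/details/englishswahilid00madagoog/page/n13/mode/2up Madan A. C.,1902], [https://archive.org/details/Swahili-englishDictionary/mode/2up Charles, W. R.]
  - [https://github.com/freedict/fd-dictionaries/tree/master/swh-eng Freedict]
- Grammar rules
  - [https://en.wikipedia.org/wiki/Swahili_grammar Wikipedia's Grammar Rules]
  - [https://archive.org/details/swahiligrammarvo00burtiala/page/96/mode/2up Burt, A. E,1910]
  - [https://archive.org/details/ERIC_ED012888/page/n123/mode/2up Follome]
  - [https://archive.org/details/ERIC_ED012888/page/n123/mode/2up Steerie, Edward]
- Other
  - [http://www.language-archives.org/language/swh Language Archive]
  - [https://wals.info/languoid/lect/wals_code_swa WALS]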
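As a rough illustration of how a plain-text corpus such as the one listed above could be turned into a frequency list and bigram counts, here is a minimal Python sketch. It assumes simple lowercase tokenization on Latin letters and apostrophes; the function names (`tokenize`, `frequency_list`, `bigram_counts`) and the sample sentence are made up for this example and are not taken from the project repositories.

```python
import re
from collections import Counter

# Assumption: tokens are runs of Latin letters and apostrophes.
# Apostrophes are kept because Swahili spelling uses them (e.g. "ng'ombe").
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common(top)


def bigram_counts(text):
    """Count adjacent word pairs over the whole token stream."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Hypothetical two-sentence sample, standing in for the real corpus.
    sample = "Mtoto anasoma kitabu. Mtoto anapenda kitabu."
    for word, count in frequency_list(sample):
        print(count, word)
```

Sorting candidate nouns and verbs by frequency is one way to decide which entries to add to the transducer first, so that coverage on real text grows quickly.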
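== Coding challenge ==
- All my work is in two main repos: [https://github.com/thefreezer/apertium-swa apertium-swa] and [https://github.com/thefreezer/apertium-swa-eng apertium-swa-eng]
- [https://github.com/apertium/apertium-swa-eng/pull/1 PR] on apertium-swa-eng (total rewrite)
- All noun classes have already been correctly set up in the transducer
- A few nouns are already in the transducer and bidix
- The goal is to start writing transfer rules on April 1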
== Work plan ==
Community bonding period (May 4-June 1):
- Clean the Wikipedia corpus
- Continue work on transfer rules, with WER < 50% (short story) in the swa-eng direction
- Extract data from dictionaries

Week 1 (June 1-7):
- Add nouns (from frequency list) in the lin transducer
- Add nouns (from frequency list) in the swa transducer
- Work on vowels
- Constraint grammar for nouns
- Add verbs

Week 2 (June 8-14):
- Add pronouns and adjectives in the swa transducer
- Continue work on verbs
- Reference: kaz and lin transducers
- Add prepositions, pronouns, and conjunctions
- Work on numerals
- CG for all the above

Week 3 (June 15-21):
- Regression testing
- Test and polish the transducer (work on bigrams)
- Finish adding adverbs, conjunctions, prepositions, etc.
- Start work on the bilingual dictionary

Week 4 (June 22-28):
- Add nouns and adjectives in bidix
- Transfer rules for nouns and adjectives (both directions)
- Disambiguation rules

Deliverable #1 (June 29): Advanced Swahili transducer (>10k entries) with a basic bilingual dictionary

Week 5 (June 29-July 5):
- Continue work on bidix: add nouns and verbs
- Focus on verbs
- Reference: transfer rules from eng-lin, kaz-eng, and eng-fre
- Transfer rules for verbs in both directions

Week 6 (July 6-12):
- Add pronouns and transfer rules for them
- Add adverbs
- Work on compound Swahili words
- Transfer rules for pronouns, adverbs, and compound nouns (both directions)

Week 7 (July 13-19):
- Goal: well-defined macros for verbs and pronouns
- WER < 35% on a 500-word story
- Add/polish rules for concordance between verbs and pronouns

Week 8 (July 20-26):
- Continue work on transfer rules
- Work on disambiguation rules
- Lots of testing and improvements
- WER < 30% in both directions on a 1,000-word story

Deliverable #2 (July 3): Advanced bilingual dictionary (~15,000 words) and transfer rules ...

Week 9 (July 27-August 2):
- Continue work on disambiguation (both directions)
- Testvoc and improvements
- Fill bidix

Week 10 (August 3-9):
- Work on transfer rules
- Goal: WER ~30% on a story longer than 1,000 words

Week 11 (August 10-16):
- Continue work on transfer rules and testing
- Wikipedia article translations
- Continue filling bidix

Week 12 (August 17-23):
- Continue filling bidix with miscellaneous words
- Detailed analysis of work completed (wiki)
- (If the work goes well, start working on new pairs)
- Evaluation of results and documentation

Submit Code and Final Evaluations (August 24-31): WER < 30% (with ~20,000 words in bidix) in both directions on most texts
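
The WER (word error rate) targets in the plan can be read as the word-level edit distance between the machine output and a reference translation, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, assuming whitespace tokenization and equal cost for substitutions, insertions, and deletions. The function name `word_error_rate` and the example sentences are hypothetical, and this is not the evaluation tooling used by the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over 5 words.
    wer = word_error_rate("the child reads a book", "the child read book")
    print(f"WER = {wer:.0%}")
```

Because the distance is normalized by the reference length, WER can exceed 100% when the output contains many extra words, which is worth keeping in mind when comparing results across stories of different lengths.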
== Skills and qualifications ==
Ongoing major: second-year Computer Science student with a minor in Math
Relevant technical skills: Python, C/C++, SQL (intermediate), Git (intermediate), Bash (intermediate), HTML5/CSS3 (advanced)
Languages: French (native), Lingala (native), English (fluent), Swahili (proficient), Tshiluba (proficient), Twi (elementary)
== Non-Summer-of-Code plans ==
None.