User:Eden/GSOC2020Proposal English-Swahili

My goal

Create a usable ‘English-Swahili’ language pair.
Drawing on last year's work on the eng-lin pair, there are two main areas I will improve on this year: daily communication with my mentors and having enough Swahili language data.

Why am I interested in Apertium?

Apertium sits at the intersection of computers and languages, which are two of my passions. Apertium, I believe, is the perfect platform to build translation tools for under-resourced languages. My primary focus is on Bantu languages, which can all be correctly classified as under-resourced. Using Apertium allows me to create translation tools and dictionaries (more like digitizing paper dictionaries) for these languages.


Who will benefit and why should it get sponsored

African languages are poorly represented in Apertium, and even commercially available options are usually quite lacking. Because Swahili, like most African languages, has little accessible digitized content, it is hard to apply machine learning or other data-driven NLP methods: the massive amounts of data they require simply do not exist for these languages. In such cases, a rule-based MT tool like Apertium becomes the most viable option.
Swahili is a Bantu language spoken mainly in Tanzania, Kenya, Uganda, DRC, Burundi, and Mozambique by well over 100 million people.
Various translation tools exist for Swahili, but they are mostly proprietary. This will be the first open source translation tool of its kind for Swahili, providing the public with an open source solution for working with the language.
In short, the project will result in the biggest, first-of-its-kind open source toolset for working with Swahili (morphological analyzer, English-Swahili dictionary, ...).

Swahili resources

Here is a list of open and public domain resources (dictionaries, grammar books, texts, etc.) for Swahili:

  • Corpus/frequency list/bigram

- ~7M-word corpus (needs a little more work)
- An Crúbadán

  • Dictionary

- Swa-Eng and Eng-Swa
- Madan, A. C., 1846; Madan, A. C., 1902; Charles, W. R.
- Freedict

  • Grammar rules

- Wikipedia's Grammar Rules
- Burt, A. E., 1910
- Follome
- Steere, Edward

  • Other

- Language Archive
- WALS
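As an illustration of how the frequency list mentioned under the corpus resources could be produced, here is a minimal Python sketch. It is not the script used for this project: the function name `frequency_list`, the tokenization rule (lowercased runs of letters, with word-internal apostrophes kept so that forms like "ng'ombe" stay whole), and the assumption of a cleaned corpus with one sentence per line are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by apostrophes
# (e.g. "ng'ombe"). Digits, underscores and punctuation are dropped.
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def frequency_list(lines, top_n=None):
    """Return (word, count) pairs sorted by descending count.

    `lines` is any iterable of strings, for example an open text file
    holding a cleaned corpus with one sentence per line (an assumption).
    """
    counts = Counter()
    for line in lines:
        counts.update(TOKEN.findall(line.lower()))
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["Mtoto anasoma kitabu.", "Mtoto analala."]
    for word, count in frequency_list(sample):
        print(count, word)
```

The most frequent words from such a list are natural candidates to add to the transducer first, since they give the largest gain in coverage per entry.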

Coding challenge

- All my work is in 2 main repos: apertium-swa and apertium-swa-eng
- PR on apertium-swa-eng (total rewrite)
- All noun classes have already been correctly set up in the transducer
- A couple of nouns are in the transducer and bidix
- Goal is to start writing transfer rules from April 1

Work plan

Community bonding period(May 4-June 1)

- Clean wikipedia corpus
- Continue work on transfer rules and reach WER < 50% (short story) in the swa-eng direction
- Extract data from dictionaries
Week 1(June 1-7): 
- Add nouns (from frequency list) to the swa transducer
- Work on vowels
- Constraint grammar for nouns
- Add verbs
Week 2(June 8-14):
- Add pronouns and adjectives to the swa transducer
- Continue work on verbs
- Reference: kaz and lin transducers
- Add prepositions, pronouns, and conjunctions
- Work on numerals
- CG for all the above
Week 3(June 15-21):  
- Regression testing
- Test and polish transducer (work on bigrams)
- Finish adding adverbs, conjunctions, prepositions, etc.
- Start work on bilingual dictionary
Week 4(June 22-28):  
- Add nouns and adjectives to bidix
- Transfer rules for nouns and adjectives (both directions)
- Disambiguation rules
  • Deliverable #1 (June 29): Advanced Swahili transducer (>10k entries) with basic bilingual dictionary
Week 5(June 29-July 5):  
- Continue work on bidix: add nouns and verbs 
- Focus on verbs
- Transfer rules from eng-lin, kaz-eng, and eng-fre
- Transfer rules for verbs in both directions
Week 6(July 6-12):  
- Add pronouns and transfer rules for them
- Add adverbs
- Work on compound Swahili words
- Transfer rules for pronouns, adverbs and compound nouns (both directions)
Week 7(July 13-19): 
- Goal: well-defined macros for verbs and pronouns
- WER < 35% on a 500-word story
- Add/polish rules for concordance between verbs and pronouns
Week 8(July 20-26): 
- Continue work on transfer rules
- Work on disambiguation rules
- Lots of testing and improvements
- WER < 30% in both directions on a 1,000-word story
  • Deliverable #2 (July 3): Advanced bilingual dictionary (~15,000 words) and transfer rules ...
Week 9(July 27-August 2):
- Continue work on disambiguation(both directions)
- Testvoc and improvements
- Filling bidix
Week 10(August 3-9):
- Work on transfer rules
- Goal: WER ~30% on a story longer than 1,000 words
Week 11(August 10-16):
- Continue work on transfer rules and testing
- Wikipedia article translations
- Continue filling bidix
Week 12(August 17-23):
- Continue filling bidix with miscellaneous words
- Detailed analysis of work completed (wiki)
- (If the work goes well, start working on new pairs)
- Evaluation of results and documentation
  • Submit Code and Final Evaluations (August 24-31): WER < 30% (with ~20,000 words in bidix) in both directions on most texts
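
The WER targets in the plan above refer to word error rate: the word-level edit distance (substitutions, deletions and insertions) between the machine translation and a reference translation, divided by the number of words in the reference. The Python sketch below is only meant to make the metric concrete. The function name `word_error_rate` and whitespace tokenization are assumptions for the example, and it is not the evaluation tool that would actually be used for the milestones.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Both arguments are plain strings; tokens are split on whitespace,
    which is a simplifying assumption for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("the child reads a book", "the child read a book"))
```

Under this definition, "WER < 30% on a 1,000-word story" means that fewer than 300 word-level edits are needed to turn the system output into the reference translation. WER can exceed 100% when the output is much longer than the reference.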

Skills and qualifications

Ongoing major: second-year Computer Science student with a minor in Math
Relevant technical skills: Python, C/C++, SQL (intermediate), Git (intermediate), Bash (intermediate), HTML5/CSS3 (advanced)
Languages: French (native), Lingala (native), English (fluent), Swahili (proficient), Tshiluba (proficient), Twi (elementary)

Non-Summer-of-Code plans

None.