User:Eden/GSOC2020 Swahili-Lingala

From Apertium


Create a usable ‘Swahili-Lingala’ language pair.

Swahili and Lingala resources

Here is a list of open and public-domain resources (dictionaries, grammar books, texts, etc.) for Swahili:

  • Corpus/frequency list/bigram

- ~7M-word corpus (needs a little more work)
- An Crúbadán

  • Dictionary

- Swa-Eng and Eng-Swa
- Madan, A. C., 1846; Madan, A. C., 1902; Charles, W. R.
- Freedict

  • Grammar rules

- Wikipedia's Grammar Rules
- Burt, A. E., 1910
- Follome
- Steere, Edward

  • Other

- Language Archive

Work plan

Community bonding period (May 4-June 1)

- See User:Eden/GSoC_progress
Week 1 (June 1-7):
- Add nouns (from frequency list) to the lin transducer
- Add nouns (from frequency list) to the swa transducer
- Work on vowels
- Constraint grammar for nouns
- Add verbs
Week 2 (June 8-14):
- Add pronouns and adjectives to the swa transducer
- Continue work on verbs
- Reference: kaz and lin transducers
- Add prepositions, pronouns, and conjunctions
- Work on numerals
- CG for all the above
Week 3 (June 15-21):
- Regression testing
- Test and polish the transducer (work on bigrams)
- Finish adding adverbs, conjunctions, prepositions, etc.
- Start work on bilingual dictionary
Week 4 (June 22-28):
- Add nouns and adjectives to the bidix
- Transfer rules for nouns and adjectives (both directions)
- Disambiguation rules
  • Deliverable #1 (June 29): Advanced Swahili transducer (>10k entries) with a basic bilingual dictionary
Week 5 (June 29-July 5):
- Continue work on the bidix: add nouns and verbs
- Focus on verbs
- Transfer rules from eng-lin, kaz-eng, and eng-fre
- Transfer rules for verbs in both directions
Week 6 (July 6-12):
- Add pronouns and transfer rules for them
- Add adverbs
- Work on compound Swahili words
- Transfer rules for pronouns, adverbs, and compound nouns (both directions)
Week 7 (July 13-19):
- Goal: well-defined macros for verbs and pronouns
- WER < 35% on a 500-word story
- Add/polish rules for concordance between verbs and pronouns
Week 8 (July 20-26):
- Continue work on transfer rules
- Work on disambiguation rules
- Lots of testing and improvements
- WER < 30% in both directions on a 1,000-word story
  • Deliverable #2 (July 3): Advanced bilingual dictionary (~15,000 words) and transfer rules ...
Week 9 (July 27-August 2):
- Continue work on disambiguation (both directions)
- Testvoc and improvements
- Filling bidix
Week 10 (August 3-9):
- Work on transfer rules
- Goal: WER ~30% on a story longer than 1,000 words
Week 11 (August 10-16):
- Continue work on transfer rules and testing
- Wikipedia article translations
- Continue filling bidix
Week 12 (August 17-23):
- Continue filling the bidix with miscellaneous words
- Detailed analysis of work completed (wiki)
- (If the work goes well, start working on new pairs)
- Evaluation of results and documentation
  • Submit code and final evaluations (August 24-31): WER < 30% (with ~20,000 words in bidix) in both directions on most texts
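
Several milestones in this plan are stated as WER (word error rate) targets. As a rough illustration of what that number measures, here is a minimal Python sketch that computes WER as the word-level edit distance between a reference translation and the system output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; this is not the evaluation tool the project itself would use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption for this sketch: tokens are split on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not real Swahili or Lingala text.
    reference = "w1 w2 w3 w4"
    hypothesis = "w1 x w3"
    print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

Under this definition, a target such as "WER < 30% on a 1,000-word story" means that fewer than 300 word-level insertions, deletions, or substitutions are needed to turn the system output into the reference translation. Note that WER can exceed 100% when the output is much longer than the reference.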