Indonesian and Malaysian/Work plan
This is the work plan for developing the Indonesian and Malaysian translator during Google Summer of Code 2012.
Week | Dates | Main activities | Coverage reached (wp) | Evaluation | WER reached |
---|---|---|---|---|---|
0 | 23/04–21/05 | Translating the story to get a baseline WER. | - | 500 words | 4.68% (id->ms) |
1 | 21/05–27/05 | Working on Indonesian analyzer/generator. | - | - | - |
2 | 28/05–03/06 | Working on Indonesian analyzer/generator. Translating Malaysian Wikipedia articles to Indonesian to get a parallel corpus. Bilingual dictionaries will be extracted from the corpus. | 72.9%, 29.9% | - | - |
3 | 04/06–10/06 | Translating Malaysian Wikipedia articles to Indonesian. Working on Malaysian analyzer/generator. | 74.9%, 46.4% | - | - |
4 | 11/06–17/06 | Working on Malaysian analyzer/generator and bidix. | 75.6%, 72.9% | - | - |
5 | 18/06–24/06 | Working on Malaysian analyzer/generator and bidix. | 80.1%, 77.5% | 300 words | 2.97% (id->ms) |
6 | 25/06–01/07 | Working on bidix. | | | |
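The WER column reports word error rate. As a rough illustration of what that number measures, the sketch below computes the word-level edit distance between a reference translation and the system output, divided by the reference length. It is not the tool used to produce the figures in the table (that tool is not named on this page); the function name and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from output
                curr[j - 1] + 1,     # extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative id->ms example: one wrong word out of five.
reference = "saya mahu pergi ke pejabat"  # hypothetical post-edited Malay
hypothesis = "saya mau pergi ke pejabat"  # hypothetical system output
print(f"{word_error_rate(reference, hypothesis):.2%}")
```

Because the score is normalised by the reference length, it depends on the size of the evaluation text, which is why the table records the number of words evaluated next to each WER figure.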
Ideas for getting Indonesian-Malaysian bilingual dictionaries
- Filtering the Indonesian lemma list. For each lemma, check whether it is also a valid Malaysian word.
- Interlanguage wiki links.
- Extracting bilingual dictionaries from a parallel corpus.
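The first idea, filtering the Indonesian lemma list, could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the lemmas and the Malaysian word list are already available as plain Python sequences, matching is case-insensitive, and the part-of-speech tag is supplied by the caller. The function names and the sample words are hypothetical and do not come from the project's scripts.

```python
from xml.sax.saxutils import escape


def split_shared_lemmas(id_lemmas, ms_words):
    """Split Indonesian lemmas into those found in the Malaysian list and the rest."""
    ms_lookup = {word.lower() for word in ms_words}
    shared, unmatched = [], []
    for lemma in id_lemmas:
        if lemma.lower() in ms_lookup:
            shared.append(lemma)
        else:
            unmatched.append(lemma)
    return shared, unmatched


def identity_bidix_entry(lemma, tag):
    """Render one bidix entry mapping a lemma to itself (tag is an assumption)."""
    side = f'{escape(lemma)}<s n="{tag}"/>'
    return f"<e><p><l>{side}</l><r>{side}</r></p></e>"


# Hypothetical sample data.
id_lemmas = ["makan", "kantor", "rumah"]
ms_words = ["makan", "rumah", "pejabat"]

shared, unmatched = split_shared_lemmas(id_lemmas, ms_words)
for lemma in shared:
    print(identity_bidix_entry(lemma, "n"))
print("still need a translation:", unmatched)
```

Identical spelling does not guarantee identical meaning, so the generated entries are candidates that would still need manual review, and the unmatched lemmas are the ones left for the other two ideas.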
Todo list
- Convert the Malaysian dictionary to Apertium format
- Create a script to get the Indonesian word list
- Adding missing words from the story
- Adding conjunctions and interjections
- Assigning the correct parameter for what will be reduplicated, for verbs with meN- (id)
- Passive form for verbs with meN- (id) (Done: V -> V no suffix, no suffix + -kan; N -> V -kan)
- ter-, se-, peN-an, per-an
- Alternative POS for each word
- diper-, ber-an, ber-kan
- For each inflected and derived form, check whether its lemma has been added as a separate entry
- ke-an variations -> better tag naming