Difference between revisions of "Indonesian and Malaysian/Work plan"
Line 3: | Line 3: | ||
{|class=wikitable |
{|class=wikitable |
||
|- |
|- |
||
! Week !! Dates !! Main activities !! Coverage reached (wp) !! Evaluation !! WER reached |
! Week !! Dates !! Main activities !! Coverage reached (wp) !! Trimmed coverage reached (wp) !! Testvoc clean !! Evaluation !! WER reached |
||
|- |
|- |
||
| 0 || <s>23/04—21/05</s> || Translating the story to get a baseline WER. || |
| 0 || <s>23/04—21/05</s> || Translating the story to get a baseline WER. || || || || 500 words || 4.68% (id->ms) |
||
|- |
|- |
||
| 1 || <s>21/05—27/05</s> || Working on Indonesian analyzer/generator. || |
| 1 || <s>21/05—27/05</s> || Working on Indonesian analyzer/generator. || || || || || |
||
|- |
|- |
||
| 2 || <s>28/05—03/06</s> || Working on Indonesian analyzer/generator.<br/>Translating Malaysian wikipedia articles to Indonesian to get a parallel corpus.<br/>Bilingual dictionaries will be [[Extracting bilingual dictionaries with Giza++|extracted]] from the corpus. || 72.9%, 29.9% || |
| 2 || <s>28/05—03/06</s> || Working on Indonesian analyzer/generator.<br/>Translating Malaysian wikipedia articles to Indonesian to get a parallel corpus.<br/>Bilingual dictionaries will be [[Extracting bilingual dictionaries with Giza++|extracted]] from the corpus. || 72.9%, 29.9% || || || || |
||
|- |
|- |
||
| 3 || <s>04/06—10/06</s> || Translating Malaysian wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. || 74.9%, 46.4% || |
| 3 || <s>04/06—10/06</s> || Translating Malaysian wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. || 74.9%, 46.4% || || || || |
||
|- |
|- |
||
| 4 || <s>11/06—17/06</s> || Working on Malaysian analyzer/generator and bidix. || 75.6%, 72.9% || |
| 4 || <s>11/06—17/06</s> || Working on Malaysian analyzer/generator and bidix. || 75.6%, 72.9% || || || || |
||
|- |
|- |
||
| 5 || <s>18/06—24/06</s> || Working on Malaysian analyzer/generator and bidix. || 80.1%, 77.5% || 300 words || 2.97% (ms->id) |
| 5 || <s>18/06—24/06</s> || Working on Malaysian analyzer/generator and bidix. || 80.1%, 77.5% || || || 300 words || 2.97% (ms->id) |
||
|- |
|- |
||
| 6 || 25/06—01/07 || Working on bidix. || || || |
| 6 || 25/06—01/07 || Working on bidix. || 80.1%, || 73.3%, || <code><ij> <cnjcoo> <cnjsub> <cnjadv> <det> <pr> <num> <prn> <np> <adv></code>|| || |
||
|} |
|} |
||
Revision as of 12:10, 2 July 2012
This is a workplan for development efforts for the Indonesian and Malaysian translator in Google Summer of Code 2012.
Week | Dates | Main activities | Coverage reached (wp) | Trimmed coverage reached (wp) | Testvoc clean | Evaluation | WER reached |
---|---|---|---|---|---|---|---|
0 | <s>23/04—21/05</s> | Translating the story to get a baseline WER. | | | | 500 words | 4.68% (id->ms) |
1 | <s>21/05—27/05</s> | Working on Indonesian analyzer/generator. | | | | | |
2 | <s>28/05—03/06</s> | Working on Indonesian analyzer/generator.<br/>Translating Malaysian Wikipedia articles to Indonesian to get a parallel corpus.<br/>Bilingual dictionaries will be extracted from the corpus. | 72.9%, 29.9% | | | | |
3 | <s>04/06—10/06</s> | Translating Malaysian Wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. | 74.9%, 46.4% | | | | |
4 | <s>11/06—17/06</s> | Working on Malaysian analyzer/generator and bidix. | 75.6%, 72.9% | | | | |
5 | <s>18/06—24/06</s> | Working on Malaysian analyzer/generator and bidix. | 80.1%, 77.5% | | | 300 words | 2.97% (ms->id) |
6 | 25/06—01/07 | Working on bidix. | 80.1%, | 73.3%, | `<ij> <cnjcoo> <cnjsub> <cnjadv> <det> <pr> <num> <prn> <np> <adv>` | | |
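The "WER reached" and "Coverage reached" columns can be read with the following minimal sketch in mind. It is not the tooling used for this project: it assumes WER is the word-level edit distance divided by the reference length, and that coverage is the share of tokens that receive at least one analysis, with unknown words marked by a leading `*` in the Apertium stream format. The function names and the sample tags are hypothetical.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One lexical unit in the stream format looks like ^surface/analysis$;
# an unknown word is written as ^surface/*surface$.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Fraction of lexical units whose analysis is not marked unknown."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found")
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


if __name__ == "__main__":
    # Illustrative sentences, not taken from the evaluation story.
    print(word_error_rate("saya makan nasi goreng", "saya minum nasi goreng"))
    # Illustrative analyser output; the tags are placeholders.
    print(naive_coverage("^saya/saya<prn>$ ^makan/makan<vblex>$ ^xyz/*xyz$"))
```

Under these assumptions, a WER of 4.68% on a 500-word text corresponds to roughly 23 word-level edits against the reference translation.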
Ideas for getting Indonesian-Malaysian bilingual dictionaries
- Filtering the Indonesian lemma list. For each lemma, check whether it is also a valid Malaysian word.
- Interlanguage wiki links.
- Extracting bilingual dictionaries from parallel corpus.
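The first idea, filtering the Indonesian lemma list, could be sketched roughly as follows. This is only an illustration under stated assumptions: both inputs are taken to be plain iterables of words, matching is case-insensitive, and the function name `split_shared_lemmas` and the sample words are hypothetical. Identical spelling does not guarantee identical meaning, so the shared lemmas are candidates for manual review rather than finished bidix entries.

```python
from typing import Iterable, List, Tuple


def split_shared_lemmas(
    indonesian_lemmas: Iterable[str],
    malaysian_words: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split Indonesian lemmas into those also found in the Malaysian
    word list (candidate identical entries) and those that are not."""
    ms_vocab = {w.strip().lower() for w in malaysian_words if w.strip()}
    shared: List[str] = []
    id_only: List[str] = []
    seen = set()
    for lemma in indonesian_lemmas:
        key = lemma.strip().lower()
        if not key or key in seen:
            continue  # skip blanks and duplicates
        seen.add(key)
        if key in ms_vocab:
            shared.append(key)
        else:
            id_only.append(key)
    return shared, id_only


if __name__ == "__main__":
    # Tiny illustrative lists; real input would come from the dictionaries.
    shared, id_only = split_shared_lemmas(
        ["rumah", "mobil", "makan", "Rumah"],
        ["rumah", "kereta", "makan"],
    )
    print("shared:", shared)
    print("Indonesian only:", id_only)
```

The lemmas left in the second list are the ones that still need a translation from another source, such as interlanguage links or a parallel corpus.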
Todo list
- Convert the Malaysian dictionary to Apertium format
- Create a script to get the Indonesian word list
- Adding missing words from the story
- Adding conjunctions and interjections
- Assigning the correct parameter specifying which part is reduplicated, for verbs with meN- (id)
- Passive form for verbs with meN- (id) (Done: V -> V no suffix, no suffix + -kan; N -> V -kan)
- ter-, se-, peN-an, per-an
- Alternative POS for each word
- diper-, ber-an, ber-kan
- For each inflected and derived form, check whether its lemma has been added as a separate entry
- ke-an variations -> better tag naming