Difference between revisions of "Indonesian and Malaysian/Work plan"
(15 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
This is a workplan for development efforts for the [[Indonesian and Malaysian]] translator in [[Google Summer of Code]] 2012. |
This is a workplan for development efforts for the [[Indonesian and Malaysian]] translator in [[Google Summer of Code]] 2012. |
||
==Work plan== |
|||
{|class=wikitable |
{|class=wikitable |
||
|- |
|- |
||
! Week !! Dates !! |
! Week !! Dates !! Main activities !! Coverage reached (wp) !! Trimmed coverage reached (wp) !! Testvoc clean !! Evaluation !! WER reached |
||
|- |
|- |
||
| 0 || 23/04—21/05 || Translating the story to get a baseline WER. || |
| 0 || <s>23/04—21/05</s> || Translating the story to get a baseline WER. || || || || 500 words || 4.68% (id->ms) |
||
|- |
|- |
||
| 1 || 21/05—27/05 || Working on Indonesian analyzer/generator. || |
| 1 || <s>21/05—27/05</s> || Working on Indonesian analyzer/generator. || || || || || |
||
|- |
|- |
||
| 2 || 28/05—03/06 || Working on Indonesian analyzer/generator.<br/>Translating Malaysian wikipedia articles to Indonesian to get a parallel corpus. |
| 2 || <s>28/05—03/06</s> || Working on Indonesian analyzer/generator.<br/>Translating Malaysian wikipedia articles to Indonesian to get a parallel corpus.<br/>Bilingual dictionaries will be [[Extracting bilingual dictionaries with Giza++|extracted]] from the corpus. || 72.9%, 29.9% || || || || |
||
|- |
|- |
||
| 3 || 04/06—10/06 || Translating Malaysian wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. || 74.9%, 46.4% || |
| 3 || <s>04/06—10/06</s> || Translating Malaysian wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. || 74.9%, 46.4% || || || || |
||
|- |
|- |
||
| 4 || 11/06—17/06 || Working on Malaysian analyzer/generator and bidix. || || |
| 4 || <s>11/06—17/06</s> || Working on Malaysian analyzer/generator and bidix. || 75.6%, 72.9% || || || || |
||
|- |
|||
| 5 || <s>18/06—24/06</s> || Working on Malaysian analyzer/generator and bidix. || 80.1%, 77.5% || || || 300 words || 2.97% (ms->id) |
|||
|- |
|||
| 6 || <s>25/06—01/07</s> || Working on bidix. || 80.1%, - || 73.3%, - || <code><ij> <cnjcoo> <cnjsub> <cnjadv> <det> <pr> <num> <prn> <np> <adv></code>|| || |
|||
|- |
|||
| 7 || <s>02/07—08/07</s> || Working on bidix. || 80.3%, 77.1% || 76.5%, 74.6% || || 500 words || 24.34% (ms->id) |
|||
|- |
|||
| 8 || <s>09/07—15/07</s> || Parallel corpus development. || || || || || |
|||
|- |
|||
| 9 || <s>16/07—22/07</s> || Working on bidix. || || || || || |
|||
|- |
|||
| 10 || <s>23/07—29/07</s> || A little break during this period. || || || || || |
|||
|- |
|||
| 11 || <s>30/07—5/08</s> || Working on bidix. || || || || || |
|||
|- |
|||
| 12 || <s>6/08—12/08</s> || Cleaning up. || 80.7%, 80.1% || 80.7%, 80.1% || ''all categories clean'' || 2,000 words || 14.43% (id->ms), 7.58% (ms->id) |
|||
|} |
|} |
||
Line 21: | Line 39: | ||
# Interlanguage wiki links. |
# Interlanguage wiki links. |
||
# Extracting bilingual dictionaries from parallel corpus. |
# Extracting bilingual dictionaries from parallel corpus. |
||
<!-- |
|||
==Todo list== |
==Todo list== |
||
# <s>Convert the Malaysian dictionary to Apertium format</s> |
# <s>Convert the Malaysian dictionary to Apertium format</s> |
||
Line 32: | Line 50: | ||
# Alternative POS for each word |
# Alternative POS for each word |
||
# diper-, ber-an, ber-kan |
# diper-, ber-an, ber-kan |
||
# Check from the inflected and derived form, whether the lemma has been added as a separate entry |
|||
# ke-an variations -> better tag naming |
|||
--> |
|||
==See also== |
|||
* [[Building dictionaries]] |
|||
* [[Extracting bilingual dictionaries with Giza++]] |
|||
* [[Generating lexical-selection rules from a parallel corpus]] |
|||
==External links== |
==External links== |
||
Line 38: | Line 64: | ||
* [http://kateglo.bahtera.org/api.php kateglo's website] |
* [http://kateglo.bahtera.org/api.php kateglo's website] |
||
* [http://opus.lingfil.uu.se/ OPUS project] |
* [http://opus.lingfil.uu.se/ OPUS project] |
||
* [http://wiki.apertium.org/wiki/Building_dictionaries#Getting_cheap_bilingual_dictionary_entries Getting cheap bilingual dictionary entries] |
|||
* [http://wiki.apertium.org/wiki/Extracting_bilingual_dictionaries_with_Giza%2B%2B Extracting bilingual dictionaries with Giza++] |
|||
[[Category:Indonesian and Malaysian]] |
[[Category:Indonesian and Malaysian]] |
Latest revision as of 17:05, 22 August 2012
This is a workplan for development efforts for the Indonesian and Malaysian translator in Google Summer of Code 2012.
Work plan
Week | Dates | Main activities | Coverage reached (wp) | Trimmed coverage reached (wp) | Testvoc clean | Evaluation | WER reached |
---|---|---|---|---|---|---|---|
0 | 23/04–21/05 | Translating the story to get a baseline WER. | | | | 500 words | 4.68% (id->ms) |
1 | 21/05–27/05 | Working on Indonesian analyzer/generator. | | | | | |
2 | 28/05–03/06 | Working on Indonesian analyzer/generator. Translating Malaysian Wikipedia articles to Indonesian to get a parallel corpus. Bilingual dictionaries will be extracted from the corpus. | 72.9%, 29.9% | | | | |
3 | 04/06–10/06 | Translating Malaysian Wikipedia articles to Indonesian. Working on Malaysian analyzer/generator. | 74.9%, 46.4% | | | | |
4 | 11/06–17/06 | Working on Malaysian analyzer/generator and bidix. | 75.6%, 72.9% | | | | |
5 | 18/06–24/06 | Working on Malaysian analyzer/generator and bidix. | 80.1%, 77.5% | | | 300 words | 2.97% (ms->id) |
6 | 25/06–01/07 | Working on bidix. | 80.1%, - | 73.3%, - | <ij> <cnjcoo> <cnjsub> <cnjadv> <det> <pr> <num> <prn> <np> <adv> | | |
7 | 02/07–08/07 | Working on bidix. | 80.3%, 77.1% | 76.5%, 74.6% | | 500 words | 24.34% (ms->id) |
8 | 09/07–15/07 | Parallel corpus development. | | | | | |
9 | 16/07–22/07 | Working on bidix. | | | | | |
10 | 23/07–29/07 | A little break during this period. | | | | | |
11 | 30/07–5/08 | Working on bidix. | | | | | |
12 | 6/08–12/08 | Cleaning up. | 80.7%, 80.1% | 80.7%, 80.1% | all categories clean | 2,000 words | 14.43% (id->ms), 7.58% (ms->id) |
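For readers unfamiliar with the "WER reached" column, word error rate is commonly computed as the word-level edit distance between a reference translation and the system output, divided by the number of reference words. The following is a minimal sketch under that assumption. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, and this is not the evaluation tool that produced the figures in the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace; no case folding or
    punctuation handling is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("w1 w2 w3 w4", "w1 x w3 w4")` returns `0.25`: one substitution over four reference words.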
Ideas for getting Indonesian-Malaysian bilingual dictionaries
- Filtering the Indonesian lemma list. For each lemma, check whether it is also a valid Malaysian word.
- Interlanguage wiki links.
- Extracting bilingual dictionaries from parallel corpus.
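The first idea in the list above, filtering the Indonesian lemma list against Malaysian vocabulary, could be sketched as follows. This is only an illustration: the function name `shared_lemmas`, the case-insensitive comparison, and the sample words are assumptions, and a lemma that exists in both languages may still differ in meaning or part of speech, so the output is a list of candidates for manual review rather than finished dictionary entries.

```python
from typing import Iterable, List


def shared_lemmas(indonesian_lemmas: Iterable[str],
                  malaysian_words: Iterable[str]) -> List[str]:
    """Return Indonesian lemmas that also occur in a Malaysian word list.

    Assumption: both inputs are plain word forms, compared after
    stripping whitespace and lowercasing. Order follows the Indonesian
    list, and duplicates are dropped.
    """
    malaysian = {w.strip().lower() for w in malaysian_words if w.strip()}
    seen = set()
    candidates = []
    for lemma in indonesian_lemmas:
        key = lemma.strip().lower()
        if key and key in malaysian and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates


# Illustrative input only; real lists would come from the two dictionaries.
print(shared_lemmas(["Buku", "rumah", "mobil", "buku"],
                    ["buku", "rumah", "kereta"]))
```

Each surviving lemma could then be proposed as an identical pair in the bilingual dictionary, while lemmas that are filtered out would need one of the other approaches in the list.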
See also
- Building dictionaries
- Extracting bilingual dictionaries with Giza++
- Generating lexical-selection rules from a parallel corpus