Difference between revisions of "Indonesian and Malaysian/Work plan"

This is the work plan for developing the [[Indonesian and Malaysian]] translator during [[Google Summer of Code]] 2012.
==Community bonding period (April 24 - May 21)==

===Todo list===
# Convert the Malaysian dictionary to Apertium format
# Create a script to get Indonesian and Malaysian word lists
# Adding missing words from the story
# Adding closed categories (conjunctions and interjections)

===Resources for the word list===

==Work plan==

{|class=wikitable
|-
! Week !! Dates !! Main activities !! Coverage reached (wp) !! Trimmed coverage reached (wp) !! Testvoc clean !! Evaluation !! WER reached
|-
| 0 || <s>23/04&mdash;21/05</s> || Translating the story to get a baseline WER. || || || || 500 words || 4.68% (id->ms)
|-
| 1 || <s>21/05&mdash;27/05</s> || Working on Indonesian analyzer/generator. || || || || ||
|-
| 2 || <s>28/05&mdash;03/06</s> || Working on Indonesian analyzer/generator.<br/>Translating Malaysian wikipedia articles to Indonesian to get a parallel corpus.<br/>Bilingual dictionaries will be [[Extracting bilingual dictionaries with Giza++|extracted]] from the corpus. || 72.9%, 29.9% || || || ||
|-
| 3 || <s>04/06&mdash;10/06</s> || Translating Malaysian wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. || 74.9%, 46.4% || || || ||
|-
| 4 || <s>11/06&mdash;17/06</s> || Working on Malaysian analyzer/generator and bidix. || 75.6%, 72.9% || || || ||
|-
| 5 || <s>18/06&mdash;24/06</s> || Working on Malaysian analyzer/generator and bidix. || 80.1%, 77.5% || || || 300 words || 2.97% (ms->id)
|-
| 6 || <s>25/06&mdash;01/07</s> || Working on bidix. || 80.1%, - || 73.3%, - || <code><ij> <cnjcoo> <cnjsub> <cnjadv> <det> <pr> <num> <prn> <np> <adv></code>|| ||
|-
| 7 || <s>02/07&mdash;08/07</s> || Working on bidix. || 80.3%, 77.1% || 76.5%, 74.6% || || 500 words || 24.34% (ms->id)
|-
| 8 || <s>09/07&mdash;15/07</s> || Parallel corpus development. || || || || ||
|-
| 9 || <s>16/07&mdash;22/07</s> || Working on bidix. || || || || ||
|-
| 10 || <s>23/07&mdash;29/07</s> || A little break during this period. || || || || ||
|-
| 11 || <s>30/07&mdash;5/08</s> || Working on bidix. || || || || ||
|-
| 12 || <s>6/08&mdash;12/08</s> || Cleaning up. || 80.7%, 80.1% || 80.7%, 80.1% || ''all categories clean'' || 2,000 words || 14.43% (id->ms), 7.58% (ms->id)
|}
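
The WER column above reports word error rate on the evaluation texts. The page does not say which tool produced these figures, so the following is only a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), assuming whitespace tokenisation. The function name <code>word_error_rate</code> is hypothetical and not part of Apertium.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: tokens are separated by whitespace; punctuation and
    case are compared as-is.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Here the reference would be the post-edited translation and the hypothesis the raw translator output. Multiply the result by 100 to compare it with the percentages in the table.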

==Ideas for getting Indonesian-Malaysian bilingual dictionaries==

# Filtering the Indonesian lemma list. For each lemma, check whether it is also a valid Malaysian word.
# Interlanguage wiki links.
# Extracting bilingual dictionaries from parallel corpus.
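
The first idea in the list above can be sketched as follows. This is an illustrative sketch only, not the script actually used in the project: the function names <code>shared_lemmas</code> and <code>bidix_entry</code> are hypothetical, the part-of-speech tag is assumed to be supplied by the caller, and only single-word lemmas are handled. An identical spelling does not guarantee an identical meaning, so the output should be treated as a list of candidates for manual review.

```python
from xml.sax.saxutils import escape


def shared_lemmas(indonesian_lemmas, malaysian_words):
    """Return Indonesian lemmas that also occur in the Malaysian word list.

    Comparison is case-insensitive; the order of the Indonesian list is
    preserved and duplicates are dropped.
    """
    malaysian = {w.strip().lower() for w in malaysian_words if w.strip()}
    seen = set()
    shared = []
    for lemma in indonesian_lemmas:
        key = lemma.strip().lower()
        if key and key in malaysian and key not in seen:
            seen.add(key)
            shared.append(key)
    return shared


def bidix_entry(lemma, tag):
    """Format one identity entry in Apertium bilingual-dictionary XML.

    Assumption: the same tag applies on both sides and the lemma
    contains no spaces.
    """
    word = escape(lemma)
    return (
        f'<e><p><l>{word}<s n="{tag}"/></l>'
        f'<r>{word}<s n="{tag}"/></r></p></e>'
    )


if __name__ == "__main__":
    candidates = shared_lemmas(["Rumah", "mobil", "rumah "], ["rumah", "kereta"])
    for lemma in candidates:
        print(bidix_entry(lemma, "n"))
```

Lemmas that fail the filter are the ones that need a real translation, for example from interlanguage links or a parallel corpus, as in the other two ideas.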
<!--
==Todo list==
# <s>Convert the Malaysian dictionary to Apertium format</s>
# <s>Create a script to get Indonesian word list</s>
# <s>Adding missing words from the story</s>
# <s>Adding conjunctives and interjections</s>
# Assigning correct parameter which will be reduplicated, for verbs with meN- (id)
# Passive form for verbs with meN- (id) (Done: V -> V no suffix, no suffix + -kan; N -> V -kan)
# ter-, se-, peN-an, per-an
# Alternative POS for each word
# diper-, ber-an, ber-kan
# Check from the inflected and derived form, whether the lemma has been added as a separate entry
# ke-an variations -> better tag naming
-->

==See also==
* [[Building dictionaries]]
* [[Extracting bilingual dictionaries with Giza++]]
* [[Generating lexical-selection rules from a parallel corpus]]

==External links==
* [http://pusatbahasa.kemdiknas.go.id/kbbi/ KBBI Daring]
* [http://prpm.dbp.gov.my/ PRPM's website]
* [http://dictionary.bhanot.net/index.html Dr Bhanot's Malay-English Dictionary] (10,000 words)
* [http://kateglo.bahtera.org/api.php kateglo's website]
* [http://opus.lingfil.uu.se/ OPUS project]


[[Category:Indonesian and Malaysian|*]]
[[Category:Indonesian and Malaysian]]

Latest revision as of 17:05, 22 August 2012

This is the work plan for developing the Indonesian and Malaysian translator during Google Summer of Code 2012.

Work plan

{|class=wikitable
|-
! Week !! Dates !! Main activities !! Coverage reached (wp) !! Trimmed coverage reached (wp) !! Testvoc clean !! Evaluation !! WER reached
|-
| 0 || 23/04&mdash;21/05 || Translating the story to get a baseline WER. || || || || 500 words || 4.68% (id->ms)
|-
| 1 || 21/05&mdash;27/05 || Working on Indonesian analyzer/generator. || || || || ||
|-
| 2 || 28/05&mdash;03/06 || Working on Indonesian analyzer/generator.<br/>Translating Malaysian wikipedia articles to Indonesian to get a parallel corpus.<br/>Bilingual dictionaries will be extracted from the corpus. || 72.9%, 29.9% || || || ||
|-
| 3 || 04/06&mdash;10/06 || Translating Malaysian wikipedia articles to Indonesian.<br/>Working on Malaysian analyzer/generator. || 74.9%, 46.4% || || || ||
|-
| 4 || 11/06&mdash;17/06 || Working on Malaysian analyzer/generator and bidix. || 75.6%, 72.9% || || || ||
|-
| 5 || 18/06&mdash;24/06 || Working on Malaysian analyzer/generator and bidix. || 80.1%, 77.5% || || || 300 words || 2.97% (ms->id)
|-
| 6 || 25/06&mdash;01/07 || Working on bidix. || 80.1%, - || 73.3%, - || <code><ij> <cnjcoo> <cnjsub> <cnjadv> <det> <pr> <num> <prn> <np> <adv></code> || ||
|-
| 7 || 02/07&mdash;08/07 || Working on bidix. || 80.3%, 77.1% || 76.5%, 74.6% || || 500 words || 24.34% (ms->id)
|-
| 8 || 09/07&mdash;15/07 || Parallel corpus development. || || || || ||
|-
| 9 || 16/07&mdash;22/07 || Working on bidix. || || || || ||
|-
| 10 || 23/07&mdash;29/07 || A little break during this period. || || || || ||
|-
| 11 || 30/07&mdash;5/08 || Working on bidix. || || || || ||
|-
| 12 || 6/08&mdash;12/08 || Cleaning up. || 80.7%, 80.1% || 80.7%, 80.1% || ''all categories clean'' || 2,000 words || 14.43% (id->ms), 7.58% (ms->id)
|}

Ideas for getting Indonesian-Malaysian bilingual dictionaries

  1. Filtering the Indonesian lemma list. For each lemma, check whether it is also a valid Malaysian word.
  2. Interlanguage wiki links.
  3. Extracting bilingual dictionaries from parallel corpus.

See also

External links