User:Francis Tyers/An MT system in one thousand steps
Latest revision as of 18:52, 7 November 2016
The idea of this page is to split the creation of a new language pair into bite-sized chunks, each of which an experienced developer could complete in around two hours or less. One use of the page might be to organise work into tasks for the Google Code-in, or to parallelise development between multiple people.
Research
- Amass resources (1 task)
  - Find a grammar of language X and of language Y
  - Find a bilingual dictionary X-Y
  - Find bilingual dictionaries X-Z and Y-Z
  - Find 1-3 large monolingual corpora of language X and language Y
  - Find a parallel corpus of language X and language Y
Morphological analysers (~200 tasks)
For languages X and Y:
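- Add closed categories
  - Add adpositions and conjunctions (1 task)
  - Add determiners (1 task)
  - Add pronouns (1 task)
  - Add numerals (1 task)
    - At least 1-100, leaving out compositional numerals
- Categorise and lemmatise words by frequency
  - Create frequency lists from your corpora
  - Categorise words (15 tasks) (read more: Task_ideas_for_Google_Code-in/Categorise_words_from_frequency_list)
  - Lemmatise words (15 tasks) (read more: Task_ideas_for_Google_Code-in/Lemmatise_words_from_frequency_list)
- Add open categories by frequency
  - Add nouns (26 tasks) (read more: Task_ideas_for_Google_Code-in/Add_nouns_from_frequency_list)
  - Add proper nouns (16 tasks)
  - Add adjectives (15 tasks)
  - Add adverbs (3 tasks)
  - Add verbs (20 tasks)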
For adding the open categories, we assume around 100 words per task.
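Working through words by frequency in batches of around 100 could be organised along the following lines. This is only a minimal sketch: it assumes the corpus is already tokenised into a list of words, and the function names `frequency_list` and `split_into_tasks` are made up for illustration, not part of any Apertium tool.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(token.lower() for token in tokens)
    # Sort by descending count, then alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def split_into_tasks(words, words_per_task=100):
    """Cut an ordered word list into consecutive chunks, one chunk per task."""
    return [words[i:i + words_per_task]
            for i in range(0, len(words), words_per_task)]


if __name__ == "__main__":
    # Toy corpus; a real run would read tokens from the monolingual corpora.
    corpus = "the cat saw the dog and the dog saw the cat".split()
    ranked = [word for word, _ in frequency_list(corpus)]
    for number, chunk in enumerate(split_into_tasks(ranked, words_per_task=2), start=1):
        print(f"task {number}: {' '.join(chunk)}")
```

With the default of 100 words per task, a ranked list of 2,600 nouns would yield the 26 noun tasks listed above.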
Bilingual dictionary
- Add closed categories (1 task)
- Morphologically analyse and word align parallel corpus
  - Extract bilingual dictionary candidates
  - Proofread and add open category candidates by frequency
- Take freely available dictionaries online
  - Convert to lttoolbox format
    - Add and check nouns (2 tasks)
    - Add and check verbs (2 tasks)
    - Add and check adjectives (2 tasks)
    - Add and check adverbs (2 tasks)
- Add entries manually by frequency
  - Add nouns (10 tasks)
  - Add verbs (10 tasks)
  - Add adjectives (10 tasks)
  - Add adverbs (10 tasks)
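The conversion of an existing word list to lttoolbox format might look roughly like the sketch below. It assumes the source dictionary has already been reduced to tab-separated lemma pairs for a single part of speech, and that the tag passed in (here `n`) is declared in the dictionary's symbol definitions. The helper names are hypothetical, and the output still has to be checked by hand, as the tasks above say.

```python
from xml.sax.saxutils import escape


def side(lemma, tag):
    """Render one side of an entry: the lemma followed by its tag symbol."""
    # Spaces inside multiword lemmas are written as <b/> in lttoolbox dictionaries.
    text = "<b/>".join(escape(part) for part in lemma.split(" "))
    return f'{text}<s n="{tag}"/>'


def bidix_entry(left, right, tag):
    """Build a single bilingual dictionary entry with the same tag on both sides."""
    return f"<e><p><l>{side(left, tag)}</l><r>{side(right, tag)}</r></p></e>"


def convert(lines, tag):
    """Turn tab-separated 'source<TAB>target' lines into a list of entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines rather than guessing
        left, right = line.split("\t", 1)
        entries.append(bidix_entry(left.strip(), right.strip(), tag))
    return entries


if __name__ == "__main__":
    for entry in convert(["house\tcasa", "ice cream\thelado"], tag="n"):
        print(entry)
```

Keeping one part of speech per input file matches the split into separate "add and check" tasks for nouns, verbs, adjectives and adverbs.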
Disambiguation
- Make a list of most frequent ambiguities (both lemma ambiguities and POS ambiguities) (1 task)
- Write disambiguation rules for most frequent POS+lemma ambiguities (15 tasks)
- Write disambiguation rules for most frequent POS ambiguities (15 tasks)
- Hand-annotate 500 words of running text (20 tasks) (read more: Task_ideas_for_Google_Code-in/Manually disambiguate text)
- Train statistical POS tagger (1 task)
- Find bad POS disambiguation leading to bad translation (15 tasks)
- Write rules to fix bad POS disambiguation (15 tasks)
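The first step, listing the most frequent ambiguities, can be approximated by counting ambiguous units in the analyser output. The sketch below assumes the text is in the Apertium stream format (`^surface/analysis1/analysis2$`), treats the first tag of each analysis as its part of speech, and ignores escaped characters. The function name and the category labels are illustrative only.

```python
import re
from collections import Counter

UNIT = re.compile(r"\^([^$]*)\$")     # one ^...$ lexical unit
FIRST_TAG = re.compile(r"<([^>]+)>")  # first <tag> in an analysis


def ambiguity_counts(analysed_text):
    """Count ambiguous surface forms, labelled by the kind of ambiguity."""
    counts = Counter()
    for unit in UNIT.findall(analysed_text):
        surface, *analyses = unit.split("/")
        if len(analyses) < 2:
            continue  # unambiguous or unanalysed unit
        lemmas = {analysis.split("<", 1)[0] for analysis in analyses}
        tags = set()
        for analysis in analyses:
            match = FIRST_TAG.search(analysis)
            tags.add(match.group(1) if match else "")
        pos_differs = len(tags) > 1
        lemma_differs = len(lemmas) > 1
        if pos_differs and lemma_differs:
            kind = "POS+lemma"
        elif pos_differs:
            kind = "POS"
        elif lemma_differs:
            kind = "lemma"
        else:
            kind = "other"  # analyses differ only in later tags
        counts[(surface.lower(), kind)] += 1
    return counts


if __name__ == "__main__":
    sample = ("^the/the<det><def><sp>$ ^saw/saw<n><sg>/see<vblex><past>$ "
              "^books/book<n><pl>/book<vblex><pri><p3><sg>$ "
              "^saw/saw<n><sg>/see<vblex><past>$")
    for (surface, kind), count in ambiguity_counts(sample).most_common():
        print(count, surface, kind)
```

Sorting the result by count gives a work list for the rule-writing tasks, with POS+lemma and POS-only cases already separated.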
Lexical selection
- POS tag and word align parallel corpus (1 task)
- Extract default translation rules (1 task)
- Extract context rules (maximum entropy) (1 task)
- Make a list of most frequent ambiguities (both lemma ambiguities and POS ambiguities) (1 task)
- Write lexical selection rules for frequent ambiguities (10 tasks)
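As a rough illustration of the "default translation" and "most frequent ambiguities" steps, the sketch below picks the most frequently aligned target lemma for each source lemma and lists the source lemmas that have more than one translation. It assumes the word-aligned corpus has already been reduced to (source lemma, target lemma) pairs. It is a simple counting stand-in, not the rule-extraction tooling itself, and it does not cover the maximum-entropy context rules.

```python
from collections import Counter, defaultdict


def translation_counts(aligned_pairs):
    """Map each source lemma to a Counter of the target lemmas aligned to it."""
    table = defaultdict(Counter)
    for source, target in aligned_pairs:
        table[source][target] += 1
    return table


def default_translations(table):
    """Choose the most frequently aligned target as the default translation."""
    return {source: targets.most_common(1)[0][0]
            for source, targets in table.items()}


def ambiguous_sources(table):
    """List source lemmas with several translations, most frequent first."""
    ambiguous = [(sum(targets.values()), source)
                 for source, targets in table.items() if len(targets) > 1]
    return [source for _, source in sorted(ambiguous, key=lambda x: (-x[0], x[1]))]


if __name__ == "__main__":
    pairs = [("bank", "banco"), ("bank", "banco"),
             ("bank", "orilla"), ("house", "casa")]
    table = translation_counts(pairs)
    print(default_translations(table))
    print(ambiguous_sources(table))
```

The list of ambiguous source lemmas is where hand-written lexical selection rules are most likely to pay off.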
Transfer rules
- Write a contrastive grammar
Testvoc
Evaluation
- Translate 500 words of text, post-edit and calculate word error rate (WER) (4 tasks)
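WER can be computed as the word-level edit distance between the machine translation and its post-edited version, divided by the length of the post-edited text. The following is a minimal sketch under that definition, assuming whitespace tokenisation and case-sensitive matching. The function name is illustrative.

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance from reference to hypothesis, over reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # reference word missing from hypothesis
                               current[j - 1] + 1,       # extra word in hypothesis
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    machine_output = "the cat sat on mat"
    post_edited = "the cat sat on the mat"
    print(f"WER: {word_error_rate(machine_output, post_edited):.2%}")
```

Here the machine output plays the role of the hypothesis and the post-edited text is the reference, so a lower figure means less post-editing was needed.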