Difference between revisions of "User:Francis Tyers/An MT system in one thousand steps"

 

Latest revision as of 18:52, 7 November 2016

The idea of this page is to split the creation of a new language pair into bite-sized chunks, each of which an experienced developer could complete in around two hours or less. The page could be used to organise work into tasks for the Google Code-in, or to parallelise development between multiple people.

Research

  • Amass resources (1 task)
    • Find a grammar of language X and of language Y
    • Find a bilingual dictionary X-Y
    • Find bilingual dictionaries X-Z and Y-Z
    • Find 1-3 large monolingual corpora of language X and language Y
    • Find a parallel corpus of language X and language Y

Morphological analysers (~200 tasks)

For languages X and Y:

  • Add closed categories
    • Add adpositions and conjunctions (1 task)
    • Add determiners (1 task)
    • Add pronouns (1 task)
    • Add numerals (1 task)
      • At least 1-100 leaving out compositional numerals
  • Categorise and lemmatise words by frequency
    • Create frequency lists from your corpora
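    • Categorise words (15 tasks) (read more: Task_ideas_for_Google_Code-in/Categorise_words_from_frequency_list)
    • Lemmatise words (15 tasks) (read more: Task_ideas_for_Google_Code-in/Lemmatise_words_from_frequency_list)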
  • Add open categories by frequency
    • Add nouns (26 tasks) (read more: Task_ideas_for_Google_Code-in/Add_nouns_from_frequency_list)
    • Add proper nouns (16 tasks)
    • Add adjectives (15 tasks)
    • Add adverbs (3 tasks)
    • Add verbs (20 tasks)

For adding the open categories, we assume around 100 words per task.
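
The frequency-list step and the 100-words-per-task assumption can be sketched as follows. This is a minimal illustration, not the project's own tooling: the function names `frequency_list` and `split_into_tasks` are hypothetical, and the tokenisation (lowercased runs of letters) is a naive assumption that would need adapting for proper nouns and for scripts without word spacing.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    # Assumption: a token is a run of letters, optionally joined by apostrophes.
    # Lowercasing merges "The" and "the" but also hides proper nouns.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def split_into_tasks(entries, size=100):
    """Cut a frequency list into consecutive chunks of at most `size` entries."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


if __name__ == "__main__":
    sample = "The cat saw the dog. The dog ran."
    for word, count in frequency_list(sample):
        print(count, word)
```

Each chunk returned by `split_into_tasks` would then correspond to one task, so the most frequent words are handled first.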

Bilingual dictionary

  • Add closed categories (1 task)
  • Morphologically analyse and word align parallel corpus
    • Extract bilingual dictionary candidates
    • Proofread and add open category candidates by frequency
  • Take freely available dictionaries online
    • Convert to lttoolbox format
      • Add and check nouns (2 tasks)
      • Add and check verbs (2 tasks)
      • Add and check adjectives (2 tasks)
      • Add and check adverbs (2 tasks)
  • Add entries manually by frequency
    • Add nouns (10 tasks)
    • Add verbs (10 tasks)
    • Add adjectives (10 tasks)
    • Add adverbs (10 tasks)
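
As an illustration of the "Convert to lttoolbox format" step, the sketch below turns tab-separated lines into bilingual dictionary entries. The input layout (source lemma, target lemma, part-of-speech tag, separated by tabs) and the function names are assumptions made for this example; a real online dictionary will need its own parsing, and every tag used has to be declared among the symbol definitions of the dictionary the entries are pasted into.

```python
from xml.sax.saxutils import escape


def bidix_entry(left, right, pos):
    """Build one <e> element pairing two lemmas that share a part-of-speech tag."""
    def side(lemma):
        # Escape XML special characters, then mark spaces in multiword lemmas with <b/>.
        return escape(lemma).replace(" ", "<b/>") + f'<s n="{pos}"/>'
    return f"<e><p><l>{side(left)}</l><r>{side(right)}</r></p></e>"


def convert(tsv_lines):
    """Convert 'source<TAB>target<TAB>pos' lines (assumed layout) into entries."""
    entries = []
    for line in tsv_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # leave malformed lines for manual checking
        entries.append(bidix_entry(*fields))
    return entries


if __name__ == "__main__":
    for entry in convert(["house\tcasa\tn", "ice cream\thelado\tn"]):
        print(entry)
```

The generated entries still need the "add and check" pass listed above, since automatic conversion says nothing about whether a translation pair is correct.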

Disambiguation

  • Make a list of most frequent ambiguities (both lemma ambiguities and POS ambiguities) (1 task)
  • Write disambiguation rules for most frequent POS+lemma ambiguities (15 tasks)
  • Write disambiguation rules for most frequent POS ambiguities (15 tasks)
  • Hand-annotate 500 words of running text (20 tasks) (read more: Task_ideas_for_Google_Code-in/Manually disambiguate text)
  • Train statistical POS tagger (1 task)
  • Find bad POS disambiguation leading to bad translation (15 tasks)
  • Write rules to fix bad POS disambiguation (15 tasks)
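
The first item in this list, finding the most frequent ambiguities, could be approached along these lines. The sketch assumes analyser output in the Apertium stream format, where each lexical unit looks like `^surface/lemma<tag>.../lemma<tag>...$`; it treats the first tag of an analysis as its part of speech and does not handle escaped `^`, `$` or `/` characters. The function name `ambiguity_counts` is hypothetical.

```python
import re
from collections import Counter

# One lexical unit: everything between ^ and $ (escaped delimiters are not handled).
LEXICAL_UNIT = re.compile(r"\^([^^$]*)\$")
FIRST_TAG = re.compile(r"<([^>]+)>")


def ambiguity_counts(analysed_text):
    """Count POS ambiguities and lemma ambiguities per surface form."""
    pos_ambiguities = Counter()
    lemma_ambiguities = Counter()
    for match in LEXICAL_UNIT.finditer(analysed_text):
        surface, *analyses = match.group(1).split("/")
        # Analyses starting with * mark unknown words, which carry no ambiguity.
        analyses = [a for a in analyses if not a.startswith("*")]
        if len(analyses) < 2:
            continue
        lemmas = {a.split("<", 1)[0] for a in analyses}
        tags = set()
        for analysis in analyses:
            tag = FIRST_TAG.search(analysis)
            if tag:
                tags.add(tag.group(1))
        if len(tags) > 1:
            pos_ambiguities[(surface.lower(), tuple(sorted(tags)))] += 1
        if len(lemmas) > 1:
            lemma_ambiguities[(surface.lower(), tuple(sorted(lemmas)))] += 1
    return pos_ambiguities, lemma_ambiguities


if __name__ == "__main__":
    text = "^saw/saw<n><sg>/see<vblex><past>$ ^book/book<n><sg>/book<vblex><inf>$"
    pos, lemma = ambiguity_counts(text)
    print(pos.most_common(10))
    print(lemma.most_common(10))
```

Sorting both counters with `most_common` gives a ranked list from which the rule-writing tasks can be drawn.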

Lexical selection

  • POS tag and word align parallel corpus (1 task)
  • Extract default translation rules (1 task)
  • Extract context rules (maximum entropy) (1 task)
  • Make a list of most frequent ambiguities (both lemma ambiguities and POS ambiguities) (1 task)
  • Write lexical selection rules for frequent ambiguities (10 tasks)
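
One simple reading of "extract default translation rules" is to pick, for every source lemma, the target lemma it is aligned with most often. The sketch below shows only that counting step, under the assumption that word alignment has already produced (source lemma, target lemma) pairs; it does not reproduce the output format of any particular rule compiler, and `default_translations` is a hypothetical name.

```python
from collections import Counter, defaultdict


def default_translations(aligned_pairs):
    """Map each source lemma to its most frequently aligned target lemma."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    # most_common(1) returns the single highest-count target for each source lemma.
    return {source: targets.most_common(1)[0][0] for source, targets in counts.items()}


if __name__ == "__main__":
    pairs = [("bank", "banco"), ("bank", "banco"), ("bank", "orilla")]
    print(default_translations(pairs))
```

Source lemmas whose second-best translation is nearly as frequent as the default are natural candidates for the context rules and hand-written lexical selection rules in the same list.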

Transfer rules

  • Write a contrastive grammar

Testvoc

Evaluation

  • Translate 500 words of text, postedit and calculate WER (4 tasks)
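
For the WER calculation, a minimal sketch is given below. It assumes whitespace tokenisation and takes the post-edited text as the reference and the raw system output as the hypothesis; WER is then the word-level edit distance between the two divided by the number of reference words. The function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    postedited = "the cat sat on the mat"
    machine_output = "the cat sit on mat"
    print(f"WER: {word_error_rate(postedited, machine_output):.2%}")
```

Because the denominator is the reference length, WER can exceed 100% when the system output is much longer than the post-edited text.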