Difference between revisions of "User:Francis Tyers/An MT system in one thousand steps"

From Apertium

Revision as of 15:03, 30 October 2013

The idea of this page is to split the creation of a new language pair into bite-sized chunks, each of which an experienced developer could complete in around two hours or less. One use of the page might be to organise work into tasks for the Google Code-in, or to parallelise development between multiple people.

Research

  • Amass resources (1 task)
    • Find a grammar of language X and of language Y
    • Find a bilingual dictionary X-Y
    • Find bilingual dictionaries X-Z and Y-Z
    • Find 1-3 large monolingual corpora of language X and language Y
    • Find a parallel corpus of language X and language Y

Morphological analysers (~200 tasks)

For languages X and Y:

  • Add closed categories
    • Add adpositions and conjunctions (1 task)
    • Add determiners (1 task)
    • Add pronouns (1 task)
    • Add numerals (1 task)
      • At least 1-100 leaving out compositional numerals
  • Categorise words by frequency
    • Create frequency lists from your corpora
    • Categorise words (15 tasks)
  • Add open categories by frequency
    • Add nouns (26 tasks)
    • Add proper nouns (16 tasks)
    • Add adjectives (15 tasks)
    • Add adverbs (3 tasks)
    • Add verbs (20 tasks)

For adding the open categories, we assume around 100 words per task.
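
The frequency-list and batching steps above can be sketched in Python. This is a minimal illustration rather than part of the Apertium toolchain: the function names `frequency_list` and `split_into_tasks` are hypothetical, and the `\w+` tokenisation is a crude assumption that will need adapting for many languages.

```python
import re
from collections import Counter

WORDS_PER_TASK = 100  # batch size assumed in the text above


def frequency_list(lines):
    """Count lower-cased word tokens across an iterable of text lines.

    Returns (word, count) pairs, most frequent first; ties are
    ordered alphabetically so the result is deterministic.
    """
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\w+", line.lower()))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def split_into_tasks(words, size=WORDS_PER_TASK):
    """Cut a frequency-ordered word list into batches of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


if __name__ == "__main__":
    sample = ["the cat saw the dog", "The dog saw a bird"]
    ranked = [word for word, _ in frequency_list(sample)]
    for number, batch in enumerate(split_into_tasks(ranked, size=3), start=1):
        print(number, batch)
```

At 100 words per task, the 26 noun tasks listed above would cover roughly the 2,600 most frequent nouns of each language.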

Bilingual dictionary

  • Morphologically analyse and word-align the parallel corpus
    • Extract bilingual dictionary candidates
    • Proofread and add candidates by frequency
  • Find freely available dictionaries online
    • Convert to lttoolbox format
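
As a rough sketch of the conversion step, the Python below turns tab-separated lines into lttoolbox bilingual dictionary entries of the form `<e><p><l>…</l><r>…</r></p></e>`. The input layout (source lemma, target lemma, part-of-speech symbol) is an assumption made for illustration, as are the function names; real dictionaries will need their own parsing, and the resulting entries still have to be placed inside a `<section>` of a dictionary file whose symbols are declared.

```python
from xml.sax.saxutils import escape, quoteattr


def side(lemma, tags):
    """Render one side of an entry: escaped lemma followed by <s n="..."/> symbols.

    Spaces inside multiword lemmas are written as <b/>.
    """
    text = escape(lemma).replace(" ", "<b/>")
    symbols = "".join(f"<s n={quoteattr(tag)}/>" for tag in tags)
    return text + symbols


def bidix_entry(src, trg, src_tags, trg_tags):
    """Build a single <e> element for a bilingual dictionary."""
    return f"<e><p><l>{side(src, src_tags)}</l><r>{side(trg, trg_tags)}</r></p></e>"


def convert_tsv(lines):
    """Convert 'source<TAB>target<TAB>pos' lines (assumed layout) to entries."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src, trg, pos = line.split("\t")
        yield bidix_entry(src, trg, [pos], [pos])


if __name__ == "__main__":
    for entry in convert_tsv(["house\tcasa\tn", "in front\tdelante\tadv"]):
        print(entry)
```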

Lexical selection

  • POS-tag and word-align the parallel corpus
    • Extract default translation rules
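
One simple way to pick defaults, sketched below in Python, is to count how often each tagged source word is aligned to each target word and keep the most frequent option. This is an assumption about the approach, not a description of Apertium's lexical selection tools: the input is taken to be already-extracted (source, target) pairs, and the function name `default_translations` is hypothetical. Writing the result out as actual rules is left aside.

```python
from collections import Counter, defaultdict


def default_translations(aligned_pairs):
    """Map each source word to its most frequently aligned target word.

    `aligned_pairs` is an iterable of (source, target) strings taken from
    a word-aligned, POS-tagged corpus. Ties are broken alphabetically.
    """
    counts = defaultdict(Counter)
    for src, trg in aligned_pairs:
        counts[src][trg] += 1

    defaults = {}
    for src, targets in counts.items():
        best, _ = min(targets.items(), key=lambda kv: (-kv[1], kv[0]))
        defaults[src] = best
    return defaults


if __name__ == "__main__":
    pairs = [
        ("bank<n>", "banco<n>"),
        ("bank<n>", "banco<n>"),
        ("bank<n>", "orilla<n>"),
    ]
    print(default_translations(pairs))
```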


Disambiguation

  • Make a list of most frequent ambiguities
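
A list like this can be approximated by counting ambiguity classes in morphologically analysed text. The Python sketch below assumes input in the Apertium stream format, `^surface/analysis1/analysis2$`, and groups each ambiguous word by the first tag of its analyses. It ignores escaped characters and other details of the format, and the function name `ambiguity_classes` is hypothetical.

```python
import re
from collections import Counter

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")   # one ^...$ unit in the stream
FIRST_TAG = re.compile(r"<([^>]+)>")      # first <tag> of an analysis


def ambiguity_classes(analysed_text):
    """Count ambiguity classes among words with two or more analyses.

    The class of a word is the sorted set of first tags of its analyses,
    joined with ' / ' (for example 'n / vblex').
    """
    counts = Counter()
    for unit in LEXICAL_UNIT.findall(analysed_text):
        _surface, *analyses = unit.split("/")
        if len(analyses) < 2:
            continue  # unambiguous or unknown word
        tags = set()
        for analysis in analyses:
            match = FIRST_TAG.search(analysis)
            tags.add(match.group(1) if match else "?")
        counts[" / ".join(sorted(tags))] += 1
    return counts


if __name__ == "__main__":
    text = (
        "^the/the<det><def><sp>$ "
        "^books/book<n><pl>/book<vblex><pri><p3><sg>$ "
        "^saw/saw<n><sg>/see<vblex><past>$"
    )
    for label, count in ambiguity_classes(text).most_common():
        print(count, label)
```

Grouping by the first tag only keeps the list short; using the full tag string instead would give finer-grained classes.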

Transfer rules

  • Write a contrastive grammar

Evaluation