User:Francis Tyers/An MT system in one thousand steps

This page splits the creation of a new language pair into bite-sized tasks, each of which an experienced developer could complete in around two hours or less. The list could be used to organise work into tasks for the Google Code-in, or to parallelise development among several people.

Research

Amass resources (1 task)
- Find a grammar of language X and of language Y
- Find a bilingual dictionary X-Y
- Find bilingual dictionaries X-Z and Y-Z
- Find 1-3 large monolingual corpora of language X and language Y
- Find a parallel corpus of language X and language Y

Morphological analysers (~200 tasks)

For languages X and Y:

Add closed categories
- Add adpositions and conjunctions (1 task)
- Add determiners (1 task)
- Add pronouns (1 task)
- Add numerals (1 task)
  - At least 1-100, leaving out compositional numerals
Categorise and lemmatise words by frequency
- Create frequency lists from your corpora
- Categorise words (15 tasks)
- Lemmatise words (15 tasks)
Add open categories by frequency
- Add nouns (26 tasks)
- Add proper nouns (16 tasks)
- Add adjectives (15 tasks)
- Add adverbs (3 tasks)
- Add verbs (20 tasks)

For adding the open categories, we assume around 100 words per task.
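
As a rough illustration of the frequency-list step and the 100-words-per-task assumption, the following is a minimal sketch in Python. The tokenisation (lowercased runs of word characters) and the function names are assumptions made for the example, not part of any existing tool; a real corpus would need language-specific tokenisation.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    """
    # Naive tokenisation (an assumption): lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def batches(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Toy corpus; a batch size of 3 stands in for the 100 words per task.
corpus = "the cat saw the dog and the dog saw a cat"
freq = frequency_list(corpus)
tasks = batches([word for word, _ in freq], size=3)

for number, words in enumerate(tasks, start=1):
    print(number, words)
```

Each batch can then be handed out as one task, starting with the most frequent words.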

Bilingual dictionary

Add closed categories (1 task)
Morphologically analyse and word-align the parallel corpus
- Extract bilingual dictionary candidates
- Proofread and add open category candidates by frequency
Take freely available online dictionaries
- Convert to lttoolbox format
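
The conversion step could look something like the sketch below. It assumes the source dictionary has already been reduced to tab-separated lines of the form source lemma, target lemma, part-of-speech tag; that input layout and the function names are hypothetical. The output uses the usual lttoolbox entry shape (`<e>`, `<p>`, `<l>`, `<r>`, `<s n="..."/>`, with `<b/>` for spaces inside multiwords), and every tag used would still need to be declared in the dictionary's `<sdefs>` section.

```python
from xml.sax.saxutils import escape


def lemma_xml(lemma):
    """Escape a lemma for XML, writing spaces in multiwords as <b/>."""
    return "<b/>".join(escape(part) for part in lemma.split(" "))


def bidix_entry(left, right, tag):
    """Build one bilingual dictionary entry with the same tag on both sides."""
    return '<e><p><l>{}<s n="{}"/></l><r>{}<s n="{}"/></r></p></e>'.format(
        lemma_xml(left), tag, lemma_xml(right), tag)


def convert(lines):
    """Convert tab-separated 'left<TAB>right<TAB>tag' lines to entries.

    Blank lines and lines starting with '#' are skipped (an assumption
    about the hypothetical input format).
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        left, right, tag = line.split("\t")
        entries.append(bidix_entry(left, right, tag))
    return entries


for entry in convert(["house\tcasa\tn", "ice cream\thelado\tn"]):
    print(entry)
```

The generated entries still need proofreading, since freely available dictionaries often list several senses per headword.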

Lexical selection

POS-tag and word-align the parallel corpus
- Extract default translation rules
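
One simple way to pick default translations is to count how often each source word is aligned to each target word and keep the most frequent target. The sketch below assumes alignments in the common "i-j" format (source index, target index, zero-based), and it works on surface tokens for brevity, whereas a real pipeline would count tagged lemmas. The data and function names are illustrative only.

```python
from collections import Counter, defaultdict


def count_pairs(sentences):
    """Count aligned (source word, target word) pairs.

    `sentences` is an iterable of (source, target, alignment) strings,
    where alignment is a space-separated list of 'i-j' links.
    """
    counts = defaultdict(Counter)
    for src, tgt, alignment in sentences:
        src_toks, tgt_toks = src.split(), tgt.split()
        for link in alignment.split():
            i, j = map(int, link.split("-"))
            counts[src_toks[i]][tgt_toks[j]] += 1
    return counts


def default_translations(counts):
    """Pick the most frequent target per source word (ties: alphabetical)."""
    return {src: min(targets, key=lambda t: (-targets[t], t))
            for src, targets in counts.items()}


# Toy aligned corpus (hypothetical data).
aligned = [
    ("la estación de tren", "the train station", "0-0 1-2 3-1"),
    ("una estación del año", "a season of the year", "0-0 1-1 2-2 2-3 3-4"),
    ("la estación central", "the central station", "0-0 1-2 2-1"),
]
defaults = default_translations(count_pairs(aligned))
print(defaults["estación"])
```

Words whose counts are close between two targets are good candidates for context-dependent rules rather than a single default.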

Disambiguation

Make a list of most frequent ambiguities
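
A list like this can be drawn up by running a corpus through the morphological analyser and counting the units that receive more than one analysis. The sketch below assumes analyser output in the Apertium stream format, `^surface/analysis1/analysis2$`, and ignores escaped characters for simplicity. The sample text is invented for the example.

```python
import re
from collections import Counter

# One lexical unit: everything between '^' and '$' (escapes not handled).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def ambiguities(analysed_text):
    """Count lexical units that have more than one analysis.

    Keys are (lowercased surface form, sorted tuple of analyses).
    """
    counts = Counter()
    for unit in LEXICAL_UNIT.findall(analysed_text):
        surface, *analyses = unit.split("/")
        if len(analyses) > 1:
            counts[(surface.lower(), tuple(sorted(analyses)))] += 1
    return counts


sample = (
    "^the/the<det><def><sp>$ ^saw/saw<n><sg>/see<vblex><past>$ "
    "^a/a<det><ind><sg>$ ^saw/saw<n><sg>/see<vblex><past>$ "
    "^books/book<n><pl>/book<vblex><pri><p3><sg>$"
)

for (surface, analyses), n in ambiguities(sample).most_common():
    print(n, surface, " | ".join(analyses))
```

Sorting by count puts the ambiguities that matter most for disambiguation at the top of the list.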

Transfer rules

Write a contrastive grammar
