User:Francis Tyers/Sandbox2

Constraint-based lexical selection for rule-based machine translation

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses 
531436  lines with >1 translation (30%)
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
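
The fertility ratio above can be illustrated with a minimal sketch. It assumes the dictionary is available as a list of (source word, target word) pairs and counts over word types; whether the figure above was computed over types or over corpus tokens is not stated here, so treat this as one possible reading. The function name and the toy entries are hypothetical.

```python
from collections import defaultdict


def fertility(entries):
    """Average number of distinct translations per source word.

    entries: iterable of (source_word, target_word) pairs
    (an assumed representation of the bilingual dictionary).
    """
    translations = defaultdict(set)
    for src, tgt in entries:
        translations[src].add(tgt)
    # total number of word:word translations / total number of words
    total_pairs = sum(len(tgts) for tgts in translations.values())
    return total_pairs / len(translations)


# Toy data: one ambiguous word and two unambiguous ones.
toy = [
    ("estació", "station"),
    ("estació", "season"),
    ("gat", "cat"),
    ("casa", "house"),
]
print(fertility(toy))
```

A value close to 1.0, as reported above, means that most words have a single translation and only a few are ambiguous.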


Test corpus:

  • 150 test words
  • 1,500 sentences
  • 10 per test word
  • Randomly selected from the subset of sentences which were found in the corpus.
  • Only words with >100 example sentences included
  • Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
  • Discarded because of bad tagging (less than 60% correct): 'to', 'sol', 'portada', 'cap'
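
The selection criteria above could be sketched roughly as follows. This assumes the example sentences are already grouped by test word in a dict; the function name, parameter names and the fixed random seed are hypothetical and not part of the original setup.

```python
import random


def build_test_corpus(examples, min_examples=100, per_word=10, seed=0):
    """Sample a fixed number of sentences for each sufficiently frequent word.

    examples: dict mapping a test word to the list of corpus sentences
    that contain it (an assumed input format).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    selected = {}
    for word, sentences in examples.items():
        if len(sentences) <= min_examples:
            continue  # only words with >100 example sentences are included
        selected[word] = rng.sample(sentences, per_word)
    return selected


# Toy data: "estació" is frequent enough, "gat" is not.
examples = {
    "estació": ["sentence %d" % i for i in range(150)],
    "gat": ["sentence %d" % i for i in range(50)],
}
test_corpus = build_test_corpus(examples)
print({word: len(sents) for word, sents in test_corpus.items()})
```

With 150 test words and 10 sentences per word this gives the 1,500 sentences listed above.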

Training corpus:

Baselines:

  • TL Frequency-best
  • TLM-best
  • Linguist set
  • Full analysis:full analysis dictionary from GIZA++
  • Rules from phrase table


Process for using GIZA++:

  • Tag both sides of the corpus (Europarl, en-ca, first 1,700,000 sentences) with the Apertium language pair.
  • Extract the model/lex.f2e file.
  • Take the top-scoring analysis:analysis results where the POS matches.
  • Where the word is already ambiguous in the Apertium dictionaries, add the possibilities from GIZA++ to the dictionary so that they may be chosen; these are only added with a POS tag.
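
The filtering step above might look something like the following sketch. It rests on several assumptions that are not confirmed here: that each line of the lexical table has three whitespace-separated fields (two analyses and a probability), that the first field is the target side and the second the source side, that analyses contain no spaces, and that the first tag of an analysis is its POS. The function names are hypothetical.

```python
import re
from collections import defaultdict

TAG = re.compile(r"<([^>]+)>")


def pos_of(analysis):
    """Return the first tag of an analysis such as 'estació<n><f><sg>', or None."""
    match = TAG.search(analysis)
    return match.group(1) if match else None


def best_translations(lines, top_n=1):
    """Keep the top-scoring analysis:analysis pairs whose POS tags match.

    lines: iterable of 'target source probability' strings
    (the column order is an assumption).
    """
    candidates = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not fit the assumed format
        tgt, src, prob = parts[0], parts[1], float(parts[2])
        if pos_of(src) is None or pos_of(src) != pos_of(tgt):
            continue  # discard pairs where the POS does not match
        candidates[src].append((prob, tgt))
    return {
        src: [tgt for _, tgt in sorted(pairs, reverse=True)[:top_n]]
        for src, pairs in candidates.items()
    }


# Toy lines in the assumed format.
lines = [
    "station<n><sg> estació<n><f><sg> 0.6",
    "season<n><sg> estació<n><f><sg> 0.3",
    "stationed<vblex><pp> estació<n><f><sg> 0.1",
]
print(best_translations(lines, top_n=2))
```

Raising top_n keeps more than one candidate per analysis, which is what is needed for words that are already ambiguous in the dictionaries.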