Constraint-based lexical selection for rule-based machine translation

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses 
531436  lines with >1 translation (30%)
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of the dictionary over the corpus (i.e. total number of word:word translations / total number of words)
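
As a concrete illustration of the fertility figure above, here is a minimal Python sketch with toy data: the total number of word:word translations the bilingual dictionary offers for the corpus tokens it covers, divided by the number of those tokens. The helper and the toy entries are illustrative only, not the original tooling.

<pre>
# Illustrative sketch (toy data, not the original tooling) of the fertility
# figure: total word:word translations offered by the bilingual dictionary
# for the corpus tokens it covers, divided by the number of those tokens.
def fertility(corpus_tokens, bidix):
    """corpus_tokens: (lemma, pos) pairs seen in the corpus.
    bidix: dict mapping (lemma, pos) -> set of target-language translations."""
    total_words = 0
    total_translations = 0
    for tok in corpus_tokens:
        if tok in bidix:
            total_words += 1
            total_translations += len(bidix[tok])
    return total_translations / total_words if total_words else 0.0

toy_bidix = {("casa", "n"): {"house"},
             ("gat", "n"): {"cat"},
             ("estació", "n"): {"station", "season"}}
toy_corpus = [("casa", "n"), ("gat", "n"), ("estació", "n")]
print(round(fertility(toy_corpus, toy_bidix), 2))   # 4 translations / 3 words = 1.33
</pre>

With the toy entries, two words have one translation each and one has two, giving 4/3 ≈ 1.33; the real figure of 1.03 reflects how few corpus words are ambiguous in the dictionary.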


Test corpus:

  • 150 test words
  • 1,500 sentences
  • 10 per test word
  • Sentences randomly selected from the subset of corpus sentences containing each test word (see the sketch after this list).
  • Only words with >100 example sentences in the corpus were included.
  • Rationale: the dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
  • Discarded because of bad tagging/MWE recognition: 'to', 'sol', 'portada', 'cap', 'cop', 'marxa' (less than 60% correct).
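
A rough sketch of how such a test corpus could be assembled from a sentence-indexed corpus. The thresholds mirror the figures above, but the code itself is an assumed reconstruction, not the script that was used.

<pre>
# Assumed reconstruction (not the original script) of the test-corpus build:
# 10 randomly chosen sentences per ambiguous test word, keeping only words
# with more than 100 example sentences in the corpus.
import random
from collections import defaultdict

SENTENCES_PER_WORD = 10
MIN_EXAMPLES = 100

def build_test_corpus(sentences, ambiguous_words):
    """sentences: list of (sentence_text, set_of_lemma_pos_pairs) tuples.
    ambiguous_words: set of (lemma, pos) with >1 translation in the bidix."""
    by_word = defaultdict(list)
    for text, words in sentences:
        for w in words & ambiguous_words:
            by_word[w].append(text)
    test = {}
    for w, sents in by_word.items():
        if len(sents) > MIN_EXAMPLES:          # only well-attested test words
            test[w] = random.sample(sents, SENTENCES_PER_WORD)
    return test
</pre>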

Training corpus:

Baselines:

  • TL frequency-best (see the sketch after this list)
  • TLM-best
  • Linguist set
  • Full-analysis:full-analysis dictionary from GIZA++
  • Rules from the phrase table
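
One plausible reading of the TL frequency-best baseline, sketched below: for each ambiguous source word, choose the candidate translation that is most frequent in a target-language corpus. Both the helper and the toy counts are hypothetical.

<pre>
# Hypothetical sketch of a TL frequency-best baseline: among the translations
# the bilingual dictionary offers for a source word, choose the one that is
# most frequent in a target-language corpus. The counts below are invented.
def tl_frequency_best(candidates, tl_counts):
    """candidates: target-language options from the bidix.
    tl_counts: dict mapping target lemma -> frequency in a TL corpus."""
    return max(candidates, key=lambda t: tl_counts.get(t, 0))

# Toy usage: 'estació' can be 'station' or 'season'.
print(tl_frequency_best(["station", "season"], {"station": 1200, "season": 3400}))
</pre>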


Process for using GIZA++:

  • Tag both sides of the corpus (Europarl, en-ca, first 1,700,000 sentences) with the Apertium language pair.
  • Extract the model/lex.f2e file.
  • Take the top-scoring analysis:analysis pairs where the POS tags match (see the sketch after this list).
  • Where a word is already ambiguous in the Apertium dictionaries, add the possibilities from GIZA++ to the dictionary so that they can be chosen -- entries are added with the POS tag only.
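
A hedged sketch of the lex.f2e step: read the lexical translation table and keep, for each source analysis, the top-scoring target analysis whose first (POS) tag matches. The column order of the lex.* files differs between setups, so the unpacking below is an assumption to check against your own files.

<pre>
# Minimal sketch (assumptions marked): read a GIZA++/Moses model/lex.f2e
# lexical translation table and keep, for each source analysis, the
# top-scoring target analysis whose first tag (the POS) matches.
import re

TAG = re.compile(r"<([^>]+)>")

def pos(analysis):
    """First tag of an Apertium-style analysis, e.g. 'casa<n><f><sg>' -> 'n'."""
    m = TAG.search(analysis)
    return m.group(1) if m else None

def top_pos_matched(lex_path):
    best = {}                                  # source analysis -> (target, prob)
    with open(lex_path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 3:
                continue
            # Assumed column order: target, source, probability -- check which
            # order your Moses/GIZA++ setup writes before relying on this.
            tgt, src, prob = parts[0], parts[1], float(parts[2])
            if pos(src) is None or pos(src) != pos(tgt):
                continue                       # skip NULLs and POS mismatches
            if src not in best or prob > best[src][1]:
                best[src] = (tgt, prob)
    return best
</pre>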

Annotation process

  1. Translate the corpus (native speaker of English, competent in Catalan), adding missing translations to the bilingual dictionary as options.
    • Words with too many tagging or MWE errors are left out.
  2. Proofread corpus
  3. Run corpus up to lexical transfer stage
  4. Annotate the output of lexical transfer (see the sketch below).
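
To make step 4 concrete, here is a small sketch that pulls the ambiguous units out of lexical-transfer (biltrans-style) output, i.e. units of the form ^source/target1/target2$, so an annotator can mark the correct target. Only the stream format is taken from Apertium; the helper itself is illustrative.

<pre>
# Illustrative helper (the stream format is Apertium's; the function is not):
# list the ambiguous units in lexical-transfer (biltrans) output, i.e. units
# like ^estació<n><f><sg>/station<n><sg>/season<n><sg>$, for annotation.
import re

LU = re.compile(r"\^([^$]+)\$")               # one lexical unit between ^ and $

def ambiguous_units(biltrans_line):
    for m in LU.finditer(biltrans_line):
        parts = m.group(1).split("/")
        source, targets = parts[0], parts[1:]
        if len(targets) > 1:                  # more than one translation offered
            yield source, targets

line = ("^el<det><def><f><sg>/the<det><def><sg>$ "
        "^estació<n><f><sg>/station<n><sg>/season<n><sg>$")
for src, tgts in ambiguous_units(line):
    print(src, "->", " | ".join(tgts))
</pre>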