Difference between revisions of "User:Francis Tyers/Sandbox2"
Edit summary of the earlier revision: m (→Agenda: (What about en-gl, which wasn't even testvoc'ed?))
(14 intermediate revisions by the same user not shown)

Removed in this revision:

==Agenda==

For http://xixona.dlsi.ua.es/freerbmt09/

* Logging on xixona, knowing what people are translating (which language pair etc.)
** Possible applications:
** quality control
** encourage language pair maintainers
** give an idea of missing terms (on a temporal basis? What's in the news?) - getting the information so we can adapt the translators to what people are translating: if certain topics are coming up in the news ('swine flu' etc.), try to catch them
* Making a 3.2 release -- x-stage transfer, some changes in lttoolbox
* Planning for new releases, apertium 3.4, apertium 4.0?
* Webservices -- what, when, where?
* Should we have a concentrated effort on Revo Vortaro import?
** Reta Vortaro is fairly consistent; it has clear delineation between simple, unambiguous terms; terms with more than one possible translation (where the first one listed is the preferred default); and polysemous words. There's even an XML version.
*** Who will do the tagging and quality control? Every bidix item would need to be proofed.
* Dix profiling - finding out (on a corpus or on testvoc) how often each entry is used, i.a. for removing unused .dix entries - demo by Jacob
* Managing user expectations... every released pair should have an evaluation which gives details of the quality a user can expect, e.g. [[Translation quality statistics]] -- These numbers should not just get lost. (What about en-gl, which wasn't even testvoc'ed?)

Added in this revision: the full text of the latest revision, shown below.
Latest revision as of 08:11, 30 September 2011
Constraint-based lexical selection for rule-based machine translation
Corpus: cawiki-20110616-pages-articles.xml.bz2
cleaned with `aq-wikicrp'
1758582 lines
531983 unique analyses
531436 lines with >1 translation (30%)
2740 analyses with >1 translation
287 words (lemma+pos) with >1 translation in corpus
712 words in dictionary with >1 translation
1.03 fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
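The fertility figure can be illustrated with a minimal sketch. It assumes fertility is counted over corpus tokens that the dictionary covers; the notes do not say whether tokens or types were counted, so that choice, the function name `fertility` and the toy dictionary are all illustrative assumptions.

```python
def fertility(bidix, corpus_words):
    """Average number of dictionary translations per covered corpus word.

    bidix: dict mapping a (lemma, pos) pair to its list of translations.
    corpus_words: iterable of (lemma, pos) tokens seen in the corpus.
    Assumption: words missing from the dictionary are skipped.
    """
    total_translations = 0
    total_words = 0
    for word in corpus_words:
        translations = bidix.get(word)
        if not translations:
            continue  # not covered by the dictionary
        total_translations += len(translations)
        total_words += 1
    return total_translations / total_words if total_words else 0.0


# Toy data (hypothetical entries, not taken from the Apertium dictionaries)
toy_bidix = {
    ("estació", "n"): ["station", "season"],
    ("gat", "n"): ["cat"],
    ("casa", "n"): ["house"],
}
toy_corpus = [("gat", "n"), ("casa", "n"), ("estació", "n"), ("gat", "n")]
toy_fertility = fertility(toy_bidix, toy_corpus)  # (1 + 1 + 2 + 1) / 4
```

A value close to 1, such as the 1.03 reported above, means that most corpus words have exactly one translation in the dictionary.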
Test corpus:
- 150 test words
- 1,500 sentences
- 10 per test word
- Randomly selected from the subset of sentences which were found in the corpus.
- Only words with >100 example sentences included
- Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
- Discarded because of bad tagging/MWE recognition (less than 60% correct): 'to', 'sol', 'portada', 'cap', 'cop', 'marxa'
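The selection criteria in the list above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_test_corpus`, its parameters and the fixed random seed are hypothetical, and the notes do not say how the 150 test words themselves were chosen, so that step is not modelled.

```python
import random


def build_test_corpus(sentences_by_word, min_examples=100, per_word=10, seed=0):
    """Sample a fixed number of sentences for each sufficiently frequent word.

    sentences_by_word: dict mapping a test word to the corpus sentences
    that contain it. Only words with more than `min_examples` sentences
    are kept; `per_word` sentences are drawn at random without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for word in sorted(sentences_by_word):
        sentences = sentences_by_word[word]
        if len(sentences) > min_examples:
            sample[word] = rng.sample(sentences, per_word)
    return sample


# Toy data: one word just above the threshold, one exactly at it
toy_sentences = {
    "estació": [f"sentence {i} with estació" for i in range(101)],
    "gat": [f"sentence {i} with gat" for i in range(100)],
}
toy_sample = build_test_corpus(toy_sentences)
```

With 150 test words and 10 sentences each, this scheme yields the 1,500 sentences listed above.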
Training corpus:
Baselines:
- TL Frequency-best
- TLM-best
- Linguist set
- Full analysis:full analysis dictionary from GIZA++
- Rules from phrase table
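The notes only name the baselines. As a rough sketch, assuming "TL Frequency-best" means choosing, among the candidate translations, the one that is most frequent in a target-language corpus, it could look like the code below. The function name, the tie-breaking rule and the toy counts are assumptions, not the author's implementation.

```python
from collections import Counter


def frequency_best(candidates, tl_counts):
    """Return the candidate translation with the highest target-language count.

    candidates: list of possible translations for one source word.
    tl_counts: Counter of word frequencies in a target-language corpus.
    Ties are broken by the order of `candidates` (an assumption).
    """
    return max(candidates, key=lambda translation: tl_counts[translation])


# Toy target-language corpus (hypothetical)
toy_tl_counts = Counter("the station is near the station in this season".split())
toy_choice = frequency_best(["season", "station"], toy_tl_counts)
```

Under the same reading, "TLM-best" would replace the frequency count with a target-language model score over the whole translated sentence.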
Process for using GIZA++:
- Tag both sides of the corpus (europarl, en-ca, first 1,700,000 sentences) with the Apertium language pair.
- Extract the model/lex.f2e file.
- Take the top scoring analysis:analysis results where the POS matches
- Where the word is already ambiguous in the Apertium dictionaries, add the possibilities from GIZA to the dictionary so that they may be chosen -- only added with POS tag.
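The filtering and dictionary-extension steps above can be sketched as follows. Several things here are assumptions and should be checked against the real data: that each line of the lexical table holds three whitespace-separated fields in the order source token, target token, probability; that tokens look like `lemma<pos>`; and that only the single best translation per source word is kept. All function names are hypothetical.

```python
def pos_of(token):
    """Return the first tag of a token such as 'gat<n>', or None if untagged."""
    start = token.find("<")
    if start == -1:
        return None
    end = token.find(">", start)
    return token[start + 1:end] if end != -1 else None


def best_pos_matching(lex_lines):
    """Keep, per source token, the highest-probability target with the same POS."""
    best = {}
    for line in lex_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        src, tgt, prob = parts[0], parts[1], float(parts[2])  # assumed column order
        src_pos = pos_of(src)
        if src_pos is None or src_pos != pos_of(tgt):
            continue  # POS mismatch: discard
        if src not in best or prob > best[src][1]:
            best[src] = (tgt, prob)
    return best


def extend_ambiguous(bidix, best):
    """Add the GIZA++ candidate only to entries that are already ambiguous."""
    for src, (tgt, _prob) in best.items():
        translations = bidix.get(src, [])
        if len(translations) > 1 and tgt not in translations:
            translations.append(tgt)
    return bidix


# Toy lexical table and dictionary (hypothetical entries)
toy_lex = [
    "estació<n> station<n> 0.6",
    "estació<n> season<n> 0.3",
    "estació<n> stop<vblex> 0.9",
    "gat<n> cat<n> 0.95",
]
toy_best = best_pos_matching(toy_lex)
toy_dict = extend_ambiguous(
    {"estació<n>": ["season<n>", "resort<n>"], "gat<n>": ["kitty<n>"]}, toy_best
)
```

In the toy data, `stop<vblex>` is dropped despite its higher score because its POS does not match, and `gat<n>` is left untouched because its entry is not ambiguous.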
Annotation process
1. Translate corpus (native speaker of English, competent Catalan), adding missing translations to bilingual dictionary options.
   - Words with too many tagging errors or MWE errors are left out.
2. Proofread corpus
3. Run corpus up to lexical transfer stage
4. Annotate output of lexical transfer