Difference between revisions of "User:Francis Tyers/Sandbox2"
Revision as of 11:49, 14 September 2011
Constraint-based lexical selection for rule-based machine translation
Corpus: cawiki-20110616-pages-articles.xml.bz2 cleaned with `aq-wikicrp'
- 1,758,582 lines
- 531,983 unique analyses
- 531,436 lines with >1 translation (30%)
- 2,740 analyses with >1 translation
- 287 words (lemma+pos) with >1 translation in corpus
- 712 words in dictionary with >1 translation
- 1.03 fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
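The fertility figure can be illustrated with a small sketch. It assumes fertility is counted over word tokens in the corpus and that words missing from the dictionary are skipped; the notes do not state either point, and the function name and toy data are hypothetical.

```python
def dictionary_fertility(corpus_words, translations):
    """Total word:word translations divided by total words (tokens).

    Assumption: only tokens covered by the dictionary are counted.
    """
    total_translations = 0
    total_words = 0
    for word in corpus_words:
        options = translations.get(word)
        if not options:
            continue  # not in the dictionary, skip
        total_translations += len(options)
        total_words += 1
    return total_translations / total_words if total_words else 0.0


# Toy (lemma, pos) dictionary: one ambiguous word, two unambiguous ones.
translations = {
    ("estació", "n"): ["station", "season"],
    ("casa", "n"): ["house"],
    ("gat", "n"): ["cat"],
}
corpus_words = [("casa", "n"), ("gat", "n"), ("estació", "n"), ("casa", "n")]

fertility = dictionary_fertility(corpus_words, translations)
print(fertility)  # (1 + 1 + 2 + 1) / 4 = 1.25
```

A value close to 1, such as the 1.03 reported here, means that most running words have a single translation in the dictionary.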
Test corpus:
- 150 test words
- 1,500 sentences
- 10 per test word
- Randomly selected from the subset of sentences which were found in the corpus.
- Only words with >100 example sentences included
- Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
- Discarded because of bad tagging: 'to', 'sol' (less than 50% correct)
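A minimal sketch of how such a test corpus could be drawn, assuming the example sentences are already grouped by test word. The function name, the fixed seed and the toy data are hypothetical; the thresholds (more than 100 examples, 10 sentences per word) come from the list above.

```python
import random


def build_test_corpus(sentences_by_word, n_per_word=10, min_examples=100, seed=0):
    """Randomly pick n_per_word sentences for each sufficiently frequent word."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    test_corpus = {}
    for word, sentences in sentences_by_word.items():
        if len(sentences) <= min_examples:
            continue  # only words with >100 example sentences are included
        test_corpus[word] = rng.sample(sentences, n_per_word)
    return test_corpus


# Toy data: one word with enough examples, one at the limit, one below it.
sentences_by_word = {
    "estació": [f"sentence {i} with estació" for i in range(150)],
    "banc": [f"sentence {i} with banc" for i in range(100)],
    "pic": [f"sentence {i} with pic" for i in range(50)],
}

test_corpus = build_test_corpus(sentences_by_word)
print({word: len(sents) for word, sents in test_corpus.items()})
```

With 150 test words and 10 sentences each, this gives the 1,500 sentences listed above.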
Training corpus:
Baselines:
- TL Frequency-best
- TLM-best
- Linguist set
- Full analysis:full analysis dictionary from GIZA++
- Rules from phrase table
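As an illustration of the first baseline, the sketch below assumes that "TL Frequency-best" means choosing, among the dictionary translations of a word, the one that is most frequent in a target-language corpus. That reading is an assumption, and the function name and the toy corpus are hypothetical.

```python
from collections import Counter


def frequency_best(candidates, tl_counts):
    """Return the candidate translation with the highest target-language count.

    Ties are broken by the order of the candidates (max keeps the first one).
    """
    return max(candidates, key=lambda translation: tl_counts[translation])


# Toy target-language corpus; a Counter returns 0 for unseen words.
tl_counts = Counter("the station is near the old station in every season".split())

choice = frequency_best(["season", "station"], tl_counts)
print(choice)  # station
```

This baseline ignores the source context entirely, which is what the rule-based and language-model approaches are meant to improve on.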
Process for using GIZA++:
- Tag both sides of the corpus (europarl, en-ca, first 1,700,000 sentences) with the Apertium language pair.
- Extract the model/lex.f2e file.
- Take the top-scoring analysis:analysis results where the POS matches.
- Where the word is already ambiguous in the Apertium dictionaries, add the possibilities from GIZA++ to the dictionary so that they may be chosen (only added with POS tag).
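The last three steps can be sketched as follows. Several things here are assumptions rather than facts from these notes: each line of the lexical table is taken to hold three whitespace-separated fields (target analysis, source analysis, probability), analyses are taken to look like `lemma<pos><other tags>` without spaces, and the POS is taken to be the first tag. All function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def pos_of(analysis):
    """First tag of an analysis such as 'estació<n><f><sg>', or None."""
    start = analysis.find("<")
    end = analysis.find(">", start)
    return analysis[start + 1:end] if start != -1 and end != -1 else None


def best_translations(lex_lines, top_n=1):
    """Keep the top_n highest-scoring targets per source where the POS matches."""
    by_source = defaultdict(list)
    for line in lex_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        target, source, prob = parts[0], parts[1], float(parts[2])
        src_pos = pos_of(source)
        if src_pos is None or src_pos != pos_of(target):
            continue  # drops NULL alignments and POS mismatches
        by_source[source].append((prob, target))
    return {
        source: [target for _, target in sorted(scored, reverse=True)[:top_n]]
        for source, scored in by_source.items()
    }


def extend_ambiguous(dictionary, giza_best):
    """Add GIZA++ candidates only to entries that are already ambiguous."""
    for source, targets in giza_best.items():
        existing = dictionary.get(source, [])
        if len(existing) > 1:
            existing.extend(t for t in targets if t not in existing)
    return dictionary


lex_lines = [
    "station<n><sg> estació<n><f><sg> 0.62",
    "season<n><sg> estació<n><f><sg> 0.31",
    "be<vbser><inf> estació<n><f><sg> 0.05",
    "NULL estació<n><f><sg> 0.02",
    "home<n><sg> casa<n><f><sg> 0.40",
]
giza_best = best_translations(lex_lines, top_n=2)

dictionary = {
    "estació<n><f><sg>": ["station<n><sg>", "resort<n><sg>"],  # already ambiguous
    "casa<n><f><sg>": ["house<n><sg>"],                        # unambiguous, untouched
}
extended = extend_ambiguous(dictionary, giza_best)
print(extended)
```

Restricting the additions to entries that are already ambiguous keeps the unambiguous part of the dictionary unchanged.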