Difference between revisions of "User:Francis Tyers/Sandbox2"

Revision as of 10:00, 12 September 2011

Constraint-based lexical selection for rule-based machine translation

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses 
531436  lines with >1 translation (30%)
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
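
A minimal sketch of how the fertility figure could be computed, assuming it is averaged over the corpus tokens that the dictionary covers (the notes do not say how uncovered tokens are handled). The function name `fertility` and the toy Catalan entries are illustrative only and are not taken from the actual dictionary.

```python
def fertility(tokens, translations):
    """Average number of dictionary translations per covered corpus token.

    tokens       -- list of source-language words as they occur in the corpus
    translations -- dict mapping a source word to its list of translations
    Assumption: tokens missing from the dictionary are left out of the average.
    """
    known = [t for t in tokens if t in translations]
    if not known:
        raise ValueError("no corpus token is covered by the dictionary")
    # Numerator: total number of word:word translation pairs over the corpus.
    pairs = sum(len(translations[t]) for t in known)
    # Denominator: total number of covered words.
    return pairs / len(known)


# Toy data, invented for illustration.
toy_translations = {
    "estació": ["station", "season"],
    "casa": ["house"],
    "gat": ["cat"],
}
toy_tokens = ["estació", "casa", "casa", "gat"]
toy_fertility = fertility(toy_tokens, toy_translations)
```

A value close to 1, such as the 1.03 reported above, means that almost every token has a single translation, so ambiguous words are rare in running text.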

Test corpus:

150 test words
1,500 sentences
10 per test word
Randomly selected from the subset of sentences which were found in the corpus.
Only words with >100 example sentences included
Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.

Discarded because of bad tagging: 'to', 'sol' (less than 50% tagged correctly)
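
The selection described above could be sketched as follows. This is only an illustration under stated assumptions: `build_test_set`, its parameters and the fixed random seed are hypothetical, and the notes do not say how the 150 test words themselves were chosen, so the sketch only applies the stated filters (more than 100 example sentences, badly tagged words excluded) and draws 10 sentences per word.

```python
import random


def build_test_set(examples, min_examples=100, per_word=10,
                   excluded=("to", "sol"), seed=0):
    """Draw `per_word` random sentences for each eligible test word.

    examples -- dict mapping a word to the list of corpus sentences containing it
    A word is eligible if it is not in `excluded` and has more than
    `min_examples` example sentences.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    test_set = {}
    for word, sentences in sorted(examples.items()):
        if word in excluded or len(sentences) <= min_examples:
            continue
        # Sample without replacement so no sentence is repeated for a word.
        test_set[word] = rng.sample(sentences, per_word)
    return test_set


# Toy input, invented for illustration.
toy_examples = {
    "estació": [f"sentence {i}" for i in range(150)],
    "rar": [f"sentence {i}" for i in range(50)],    # too few examples
    "sol": [f"sentence {i}" for i in range(200)],   # excluded: bad tagging
}
toy_test_set = build_test_set(toy_examples)
```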

Training corpus:

Baselines:

TL Frequency-best
TLM-best
Linguist set
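
The notes only name the baselines. As a rough sketch of one possible reading of "TL Frequency-best", the translation chosen is the candidate that occurs most often in a target-language corpus. The function name `tl_frequency_best` and the toy counts are assumptions made for illustration, not the setup actually used.

```python
from collections import Counter


def tl_frequency_best(candidates, tl_counts):
    """Return the candidate translation with the highest target-language count.

    candidates -- list of possible translations for one source word
    tl_counts  -- Counter of word frequencies in a target-language corpus
    Unseen candidates count as 0; ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("no candidate translations given")
    return max(candidates, key=lambda word: tl_counts[word])


# Toy target-language corpus, invented for illustration.
toy_tl_counts = Counter("the station is near the station in this season".split())
toy_choice = tl_frequency_best(["season", "station"], toy_tl_counts)
```

Under this reading the baseline ignores the source context entirely, so it always picks the same translation for a given word.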

Full analysis: full-analysis dictionary from Giza++
Rules from phrase table
