Difference between revisions of "User:Francis Tyers/Sandbox2"

From Apertium

Revision as of 16:02, 2 August 2011

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
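
The fertility figure above can be illustrated with a minimal sketch. It assumes the ratio is taken over corpus tokens that have an entry in the dictionary; the page does not say whether tokens or types are counted, so that choice is an assumption. The names `fertility`, `bidix` and `corpus_words`, and the toy entries, are hypothetical and not taken from the original scripts.

```python
from collections import Counter


def fertility(corpus_words, bidix):
    """Average number of dictionary translations per covered corpus word.

    corpus_words: iterable of (lemma, pos) pairs, one per corpus token.
    bidix: dict mapping (lemma, pos) to a list of translations.
    Assumption: tokens missing from the dictionary are skipped.
    """
    counts = Counter(w for w in corpus_words if w in bidix)
    total_words = sum(counts.values())
    if total_words == 0:
        return 0.0
    # Each occurrence of a word contributes all of its translations.
    total_translations = sum(n * len(bidix[w]) for w, n in counts.items())
    return total_translations / total_words


# Toy data, invented for illustration only.
bidix = {
    ("estació", "n"): ["station", "season"],
    ("casa", "n"): ["house"],
    ("gat", "n"): ["cat"],
}
corpus_words = [("casa", "n"), ("estació", "n"), ("gat", "n"), ("casa", "n")]

print(fertility(corpus_words, bidix))  # 5 translations / 4 words = 1.25
```

A value close to 1, such as the 1.03 reported above, means that almost every covered word has a single translation.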


Test corpus:

  • 2,000 sentences
  • 10 per test word
  • Randomly selected from the subset of sentences which were found in the corpus.
  • Only words with >100 example sentences included
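
The selection procedure listed above can be sketched as follows. This is only an illustration under stated assumptions: `examples` is a hypothetical mapping from each ambiguous word to the corpus sentences containing it, and the function name, the fixed seed and the toy data are invented here rather than taken from the original setup.

```python
import random


def build_test_corpus(examples, per_word=10, min_examples=100, seed=0):
    """Pick `per_word` random sentences for every sufficiently frequent word.

    examples: dict mapping a word to the list of sentences containing it.
    Only words with more than `min_examples` sentences are included.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    selected = []
    for word in sorted(examples):
        sentences = examples[word]
        if len(sentences) > min_examples:
            # Sample without replacement from this word's sentences.
            selected.extend((word, s) for s in rng.sample(sentences, per_word))
    return selected


# Toy data, invented for illustration only.
examples = {
    "estació<n>": [f"sentence {i} with estació" for i in range(150)],
    "gat<n>": [f"sentence {i} with gat" for i in range(50)],
}

test_corpus = build_test_corpus(examples)
print(len(test_corpus))  # only "estació<n>" qualifies, so 10 sentences
```

At 10 sentences per word, a test corpus of 2,000 sentences corresponds to 200 test words.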

Rationale:

  • The dictionary does not provide enough coverage to produce statistically significant results over a whole corpus.