Difference between revisions of "User:Francis Tyers/Sandbox2"
Revision as of 16:02, 2 August 2011
Corpus: cawiki-20110616-pages-articles.xml.bz2 cleaned with `aq-wikicrp'
- 1758582 lines
- 531983 unique analyses
- 2740 analyses with >1 translation
- 287 words (lemma+pos) with >1 translation in corpus
- 712 words in dictionary with >1 translation
- 1.03 fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
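The fertility figure is a ratio of translation pairs to words. A minimal sketch of that ratio follows, assuming the dictionary is held as a mapping from (lemma, pos) to a set of translations and that "over corpus" means only entries seen in the corpus are counted. The function name, the data layout and the toy entries are hypothetical, not taken from the actual tooling.

```python
def fertility(dictionary, corpus_words):
    """Total word:word translations / total words, over entries seen in the corpus.

    dictionary   -- dict mapping (lemma, pos) to a set of translations (assumed layout)
    corpus_words -- set of (lemma, pos) pairs observed in the corpus (assumed layout)
    """
    seen = [word for word in dictionary if word in corpus_words]
    if not seen:
        return 0.0
    total_translations = sum(len(dictionary[word]) for word in seen)
    return total_translations / len(seen)


# Toy data for illustration only; not drawn from the real dictionary.
toy_dictionary = {
    ("w1", "n"): {"t1", "t2"},   # ambiguous entry: two translations
    ("w2", "n"): {"t3"},
    ("w3", "vblex"): {"t4"},     # not in the toy corpus, so not counted
}
toy_corpus_words = {("w1", "n"), ("w2", "n")}

toy_fertility = fertility(toy_dictionary, toy_corpus_words)
```

A value close to 1, such as the 1.03 reported here, means that nearly every word counted has a single translation.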
Test corpus:
- 2,000 sentences
- 10 per test word
- Randomly selected from the subset of sentences which were found in the corpus.
- Only words with >100 example sentences included
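The selection procedure above can be sketched as follows. This is a rough illustration under stated assumptions, not the script actually used: `build_test_corpus`, its parameters and the `sentences_by_word` layout (test word mapped to a list of corpus sentences containing it) are all hypothetical. With 10 sentences per word, a 2,000-sentence test corpus would correspond to 200 test words.

```python
import random


def build_test_corpus(sentences_by_word, per_word=10, min_examples=100, seed=0):
    """Sample `per_word` sentences for every word with more than `min_examples` sentences.

    sentences_by_word -- dict mapping a test word to a list of sentences (assumed layout)
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    test_corpus = {}
    for word in sorted(sentences_by_word):
        sentences = sentences_by_word[word]
        # ">100 example sentences" is read as strictly more than 100.
        if len(sentences) > min_examples:
            test_corpus[word] = rng.sample(sentences, per_word)
    return test_corpus


# Toy input for illustration only.
toy_sentences = {
    "frequent": ["frequent sentence %d" % i for i in range(101)],
    "borderline": ["borderline sentence %d" % i for i in range(100)],
}
toy_test_corpus = build_test_corpus(toy_sentences)
```

Sampling without replacement (`random.Random.sample`) keeps the 10 sentences for each word distinct, and iterating over the words in sorted order makes the result depend only on the seed.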
Rationale:
- The dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.