User:Francis Tyers/Sandbox2
Revision as of 15:11, 2 September 2011 by Francis Tyers (talk | contribs)
Corpus: cawiki-20110616-pages-articles.xml.bz2, cleaned with `aq-wikicrp'
- 1758582 lines
- 531983 unique analyses
- 531436 lines with >1 translation (30%)
- 2740 analyses with >1 translation
- 287 words (lemma+pos) with >1 translation in corpus
- 712 words in dictionary with >1 translation
- 1.03 fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
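The fertility figure above is a ratio of translation pairs to words. A minimal sketch of that ratio follows, assuming it is computed over word types (lemma+pos); if it was computed over corpus tokens instead, each entry would need weighting by its frequency. The function name `fertility` and the toy dictionary are hypothetical and are not taken from the real bilingual dictionary.

```python
def fertility(translations):
    """Total number of word:word translations divided by total number of words."""
    total_pairs = sum(len(targets) for targets in translations.values())
    return total_pairs / len(translations)


# Hypothetical toy dictionary: source word (lemma+pos) -> target translations.
toy_dictionary = {
    "patró<n>": ["patron<n>", "owner<n>", "pattern<n>"],
    "sant<adj>": ["saint<adj>"],
    "venerar<vblex>": ["venerate<vblex>"],
    "seu<adj>": ["his<adj>"],
}

# 6 translation pairs over 4 words.
toy_fertility = fertility(toy_dictionary)
```

A value close to 1, such as the 1.03 reported here, means that very few words carry more than one translation.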
Test corpus:
- 2,000 sentences
- 10 per test word
- Randomly selected from the subset of sentences which were found in the corpus.
- Only words with >100 example sentences included
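The selection described in the list above can be sketched as follows. This is only an illustration under stated assumptions: the input is assumed to be a mapping from each ambiguous word to the corpus sentences containing it, and the names `build_test_corpus`, `per_word`, `min_examples` and `seed` are hypothetical.

```python
import random


def build_test_corpus(sentences_by_word, per_word=10, min_examples=100, seed=0):
    """Pick `per_word` random sentences for every word with >min_examples sentences."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = {}
    for word, sentences in sorted(sentences_by_word.items()):
        # Strictly more than min_examples, matching ">100 example sentences".
        if len(sentences) > min_examples:
            selected[word] = rng.sample(sentences, per_word)
    return selected


# Hypothetical toy input: one frequent word and one rare word.
toy_sentences = {
    "patró<n>": ["sentence %d with patró" % i for i in range(150)],
    "sant<adj>": ["sentence %d with sant" % i for i in range(40)],
}
toy_test_corpus = build_test_corpus(toy_sentences)
```

With 10 sentences per word, a test corpus of 2,000 sentences corresponds to 200 test words.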
Baseline:
- Idea: a full analysis to full analysis dictionary extracted with Giza++
- This would require a parallel corpus.
Rationale:
- The dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
Testing
- Input
- Les Carmelites el veneren com a sant patró seu.
^El<det><def><f><pl>/The<det><def><f><pl>$ ^*Carmelites/*Carmelites$ ^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$ ^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$ ^com a<pr>/as a<pr>$ ^sant<adj><m><sg>/saint<adj><m><sg>$ ^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$ ^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$
- Reference
235626 ]^El<det><def><f><pl>/The<det><def><f><pl>$ ^*Carmelites/*Carmelites$ ^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$ ^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$ ^com a<pr>/as a<pr>$ ^sant<adj><m><sg>/saint<adj><m><sg>$ ^patró<n><m><sg>/patron<n><sg>$ ^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$[
- Test 1 (1/6)
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$
- Test 2 (1/1)
^patró<n><m><sg>/patron<n><sg>$
- Test 3 (1/4)
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$
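The figures in parentheses after each test can be reproduced with a short sketch. It assumes, since the page does not state it, that a test scores 1/n when the reference translation is among its n remaining candidates and 0 otherwise. The helper names `translations` and `score` are hypothetical, and the naive split on "/" does not handle escaped slashes inside a lexical unit.

```python
from fractions import Fraction


def translations(unit):
    """Split '^source/target1/target2$' into its list of target translations."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    return body[1:-1].split("/")[1:]


def score(test_unit, reference_unit):
    """Return 1/n if the reference translation is among the n candidates, else 0."""
    candidates = translations(test_unit)
    reference = translations(reference_unit)[0]
    if reference not in candidates:
        return Fraction(0)
    return Fraction(1, len(candidates))


reference = "^patró<n><m><sg>/patron<n><sg>$"
test1 = ("^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>"
         "/head<n><sg>/pattern<n><sg>/employer<n><sg>$")
test2 = "^patró<n><m><sg>/patron<n><sg>$"
test3 = "^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$"

scores = [score(t, reference) for t in (test1, test2, test3)]
```

Under this reading the three units above score 1/6, 1/1 and 1/4, which matches the labels on Test 1, Test 2 and Test 3.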