Difference between revisions of "User:Francis Tyers/Sandbox2"

From Apertium
m (→‎Agenda: (What about en-gl, which wasn't even testvoc'ed?))
==Agenda==
 
   
For http://xixona.dlsi.ua.es/freerbmt09/

* Logging on xixona, knowing what people are translating (which language pair etc.)

** Possible applications:

** quality control
 
** encourage language pair maintainers
 
** give an idea of missing terms (on a temporal basis? What's in the news?) - gather this information so we can adapt the translators to what people are translating: if certain topics come up in the news ('swine flu' etc.), try to catch them
 
* Making a 3.2 release -- x-stage transfer, some changes in lttoolbox
 
* Planning for new releases, apertium 3.4, apertium 4.0?
 
* Webservices -- what, when, where?
 
* Should we have a concentrated effort on Reta Vortaro (ReVo) import?
 
** Reta Vortaro is fairly consistent; it clearly delineates simple, unambiguous terms; terms with more than one possible translation (where the first one listed is the preferred default); and polysemous words. There's even an XML version.
 
*** Who will do the tagging and quality control? Every bidix item would need to be proofread.
 
* Dix profiling - finding out (on a corpus or on testvoc) how often each entry is used, among other things for removing unused .dix entries - demo by Jacob
 
* Managing user expectations... every released pair should have an evaluation which gives details of the quality a user can expect, e.g. [[Translation quality statistics]] -- These numbers should not just get lost. (What about en-gl, which wasn't even testvoc'ed?)
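
The dix profiling item above could be sketched roughly as follows. This is only an illustration, not the tool Jacob demonstrated: the function name `profile_entries`, the `lemma<pos>` string form of an entry, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def profile_entries(dictionary_entries, corpus_analyses):
    """Count how often each dictionary entry occurs in an analysed corpus.

    Both arguments are iterables of hypothetical 'lemma<pos>' strings.
    Returns (counts, unused): a Counter of hits per entry, and a sorted
    list of entries that never occurred in the corpus.
    """
    entries = set(dictionary_entries)
    # Only count analyses that correspond to a dictionary entry.
    counts = Counter(a for a in corpus_analyses if a in entries)
    # Entries with no hits are candidates for review or removal.
    unused = sorted(entries - set(counts))
    return counts, unused


# Made-up sample data, for illustration only.
counts, unused = profile_entries(
    ["gat<n>", "casa<n>", "banc<n>"],
    ["gat<n>", "gat<n>", "banc<n>", "riu<n>"],
)
print(counts.most_common())
print(unused)
```

An entry that is unused on one corpus may still be needed elsewhere, so a list like `unused` is better treated as a starting point for review than as a deletion list.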
 

Revision as of 15:53, 2 August 2011

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses
2740    analyses with >1 translation
289     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
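
A minimal sketch of how a fertility figure like the one above might be computed, assuming "words" means the distinct lemma+pos items that occur in the corpus and are present in the dictionary. The function name `fertility`, the `lemma<pos>` string form, and the sample data are hypothetical. The script that produced the numbers above is not shown on this page and may count differently.

```python
from collections import defaultdict


def fertility(bidix_pairs, corpus_words):
    """Average number of translations per dictionary word seen in the corpus.

    bidix_pairs:  iterable of (source, target) 'lemma<pos>' string pairs.
    corpus_words: iterable of source-side 'lemma<pos>' strings from the corpus.
    """
    translations = defaultdict(set)
    for source, target in bidix_pairs:
        translations[source].add(target)

    # Distinct corpus words that the dictionary can translate.
    seen = {w for w in corpus_words if w in translations}
    if not seen:
        return 0.0

    # Total word:word translations divided by total number of words.
    total_translations = sum(len(translations[w]) for w in seen)
    return total_translations / len(seen)


# Made-up sample data, for illustration only.
pairs = [
    ("banc<n>", "bank<n>"),
    ("banc<n>", "bench<n>"),
    ("gat<n>", "cat<n>"),
    ("casa<n>", "house<n>"),
]
corpus = ["gat<n>", "banc<n>", "gat<n>", "xyz<n>"]
print(fertility(pairs, corpus))
```

A value close to 1 means that almost every dictionary word found in the corpus has a single translation.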