Difference between revisions of "User:Francis Tyers/Sandbox2"

Latest revision as of 08:11, 30 September 2011

Constraint-based lexical selection for rule-based machine translation

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses 
531436  lines with >1 translation (30%)
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
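
The fertility figure above can be reproduced with a short sketch. It assumes the ratio is taken over corpus tokens that the dictionary covers; the notes do not say whether tokens or types were counted. The `bidix` layout, the function name and the toy data are all hypothetical.

```python
def fertility(bidix, corpus_words):
    """Average number of dictionary translations per covered corpus word.

    bidix: dict mapping a source word to its list of translations
           (hypothetical in-memory layout, not the Apertium .dix format)
    corpus_words: iterable of source words as they occur in the corpus
    """
    total_words = 0
    total_translations = 0
    for word in corpus_words:
        translations = bidix.get(word)
        if not translations:
            continue  # words missing from the dictionary are not counted
        total_words += 1
        total_translations += len(translations)
    return total_translations / total_words if total_words else 0.0


# Toy data, invented for illustration only
bidix = {"cap": ["head", "no"], "gat": ["cat"], "casa": ["house"]}
corpus = ["gat", "casa", "cap", "gat"]
print(fertility(bidix, corpus))  # 5 translations / 4 words = 1.25
```

A value close to 1, such as the 1.03 reported here, means that most running words have a single translation in the dictionary.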

Test corpus:

150 test words
1,500 sentences
10 per test word
Randomly selected from the subset of sentences which were found in the corpus.
Only words with >100 example sentences included
Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.

Discarding because of bad tagging/MWE recognition: 'to', 'sol', 'portada', 'cap', 'cop', 'marxa' (less than 60% correct)
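
A minimal sketch of the sampling described above, assuming the example sentences have already been grouped by test word. The function name, the `sentences_by_word` mapping and the fixed seed are illustrative assumptions, not part of the original setup.

```python
import random


def sample_test_corpus(sentences_by_word, min_examples=100, per_word=10, seed=0):
    """Pick `per_word` random sentences for every word with enough examples.

    sentences_by_word: dict mapping a test word to the list of corpus
                       sentences containing it (hypothetical layout)
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for word, sentences in sentences_by_word.items():
        if len(sentences) > min_examples:  # only words with >100 examples
            k = min(per_word, len(sentences))
            sample[word] = rng.sample(sentences, k)
    return sample


# Toy data: one frequent word and one rare word
toy = {
    "banc": [f"sentence {i}" for i in range(150)],
    "rar": [f"sentence {i}" for i in range(20)],
}
picked = sample_test_corpus(toy)
print(sorted(picked), len(picked["banc"]))
```

With 150 qualifying words and 10 sentences each, this yields the 1,500 test sentences listed above. Words rejected for bad tagging or MWE recognition would be dropped from `sentences_by_word` before sampling.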

Training corpus:

Baselines:

TL Frequency-best
TLM-best
Linguist set

Full analysis:full analysis dictionary from GIZA++
Rules from phrase table
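
The notes do not define the baselines. As a hedged illustration, the sketch below assumes that "TL Frequency-best" means choosing, for each ambiguous word, the translation that occurs most often in a target-language corpus. The function name and the toy counts are hypothetical.

```python
from collections import Counter


def frequency_best(options, tl_counts):
    """Return the translation option with the highest target-language count.

    options: candidate translations from the bilingual dictionary
    tl_counts: Counter of word frequencies in a target-language corpus
    Ties are resolved in favour of the option listed first.
    """
    return max(options, key=lambda t: tl_counts[t])


# Toy target-language corpus, invented for illustration
tl_counts = Counter("the head of the house spoke to the head of state".split())
print(frequency_best(["chief", "head"], tl_counts))  # head
```

This baseline ignores context entirely: every occurrence of a source word receives the same translation.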

Process for using GIZA++:

Tag both sides of the corpus (Europarl, en-ca, first 1,700,000 sentences) with the Apertium language pair.
Extract the model/lex.f2e file.
Take the top-scoring analysis:analysis pairs where the POS matches.
Where the word is already ambiguous in the Apertium dictionaries, add the candidates from GIZA++ to the dictionary so that they can be selected. They are added with the POS tag only.
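
The last two steps can be sketched as follows. Several things here are assumptions: that each line of the lexical table has three whitespace-separated columns (two tagged tokens and a score), that the first column is the source side, that tokens look like `lemma<pos>`, and that only the single best candidate per source analysis is kept. The helper names and the toy lines are hypothetical.

```python
import re

TAG_RE = re.compile(r"<([^>]+)>")


def first_tag(analysis):
    """Return the first tag of an analysis such as 'cap<n>', or None."""
    match = TAG_RE.search(analysis)
    return match.group(1) if match else None


def best_translations(lines):
    """Keep the top-scoring pair per source analysis where the POS matches."""
    best = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not fit the assumed layout
        src, tgt, score = parts[0], parts[1], float(parts[2])
        pos = first_tag(src)
        if pos is None or pos != first_tag(tgt):
            continue  # untagged token or POS mismatch
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return best


def extend_ambiguous(bidix, candidates):
    """Add candidates only to entries that already have >1 translation."""
    for src, (tgt, _score) in candidates.items():
        options = bidix.get(src, [])
        if len(options) > 1 and tgt not in options:
            options.append(tgt)
    return bidix


# Toy data, invented for illustration
lines = [
    "cap<n> chief<n> 0.55",
    "cap<n> head<n> 0.30",
    "cap<n> no<det> 0.15",
    "gat<n> cat<n> 0.90",
]
bidix = {"cap<n>": ["head<n>", "end<n>"], "gat<n>": ["kitty<n>"]}
extend_ambiguous(bidix, best_translations(lines))
print(bidix)
```

Entries with a single translation are left untouched, so the alignment output only widens choices the dictionary already treats as ambiguous.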

Annotation process

Translate corpus (translator: native speaker of English, competent in Catalan), adding missing translations to the bilingual dictionary options.
- Words with too many tagging or MWE errors are left out.
Proofread corpus
Run corpus up to lexical transfer stage
Annotate output of lexical transfer

@@ Line 1: / Line 1: @@
+Constraint-based lexical selection for rule-based machine translation
-==Agenda==
+<pre>
+Corpus: cawiki-20110616-pages-articles.xml.bz2
+          cleaned with `aq-wikicrp'
+1758582 lines
-For http://xixona.dlsi.ua.es/freerbmt09/
+531983  unique analyses
+531436  lines with >1 translation (30%)
+2740    analyses with >1 translation
+287     words (lemma+pos) with >1 translation in corpus
+712     words in dictionary with >1 translation
+1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
-* Logging on xixona, knowing what people are translating (which language pair etc.)
-** Possible applications:
+</pre>
-** quality control
-** encourage language pair maintainers
-** give an idea of missing terms (on a temporal basis? What's in the news?) - getting the information so we can adapt the translators to what people are translating: if certain topics are coming up in the news ('swine flu' etc.), try to catch them
+Test corpus:
-* Making a 3.2 release -- x-stage transfer, some changes in lttoolbox
-* Planning for new releases, apertium 3.4, apertium 4.0?
+* 150 test words
-* Webservices -- what, when, where ?
+* 1,500 sentences
-* Should we have a concentrated effort on Revo Vortaro import?
+* 10 per test word
-** Reta Vortaro is fairly consistent; it has clear delineation between simple, unambiguous terms; terms with more than one possible translation (where the first one listed is the preferred default); and polysemous words. Theres even an XML version
+* Randomly selected from the subset of sentences which were found in the corpus.
-*** Who will do the tagging and quality control ? Every bidix item would need to be proofed
+* Only words with >100 example sentences included
-* Dix profiling - finding out (on a corpus or on testvoc) how often each entry is used, i.a. for removing unused .dix entries - demo by Jacob
+* Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
-* Managing user expectations... every released pair should have an evaluation which gives details of the quality a user can expect, e.g. [[Translation quality statistics]] -- These numbers should not just get lost. (What about en-gl, which wasn't even testvoc'ed?)
+* Discarding because of bad tagging/MWE recognition: 'to', 'sol', 'portada', 'cap', 'cop', 'marxa' (less than 60% correct)
+Training corpus:
+Baselines:
+* TL Frequency-best
+* TLM-best
+* Linguist set
+* Full analysis:Full analysis dic from Giza++
+* Rules from phrase table
+Process for using GIZA++:
+* Tag both sides of the corpus (europarl, en-ca, first 1,700,000 sentences) with the Apertium language pair.
+* Extract the model/lex.f2e file.
+* Take the top scoring analysis:analysis results where the POS matches
+* Where the word is already ambiguous in the Apertium dictionaries, add the possibilities from GIZA to the dictionary so that they may be chosen -- only added with POS tag.
+==Annotation process==
+# Translate corpus (native speaker of English, competent Catalan), adding missing translations to bilingual dictionary options.
+#* Words with too many tagging errors, or MWE errors are left out.
+# Proofread corpus
+# Run corpus up to lexical transfer stage
+# Annotate output of lexical transfer
