Difference between revisions of "User:Francis Tyers/Sandbox2"

Revision as of 11:05, 7 September 2011

Constraint-based lexical selection for rule-based machine translation

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses 
531436  lines with >1 translation (30%)
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
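
The fertility figure above can be sketched as a simple ratio. This is a minimal illustration, not the script used to produce the 1.03 figure: the function name `fertility` and the toy dictionary are hypothetical, and it assumes that "words" means distinct source words (types) seen in the corpus, which the notes do not state.

```python
def fertility(translations):
    """Average number of translations per source word.

    `translations` maps each source word to its list of target-language
    translations. Assumption: one entry per distinct word (type), not per token.
    """
    if not translations:
        raise ValueError("empty dictionary")
    total_pairs = sum(len(targets) for targets in translations.values())
    return total_pairs / len(translations)


# Toy data for illustration only (not taken from the real dictionary counts).
toy = {
    "patró": ["patron", "owner", "master"],
    "sant": ["saint"],
    "seu": ["his"],
    "venerar": ["venerate"],
}
toy_fertility = fertility(toy)  # 6 word:word pairs over 4 words
print(toy_fertility)
```

A value close to 1, such as the 1.03 reported here, would mean that most words seen in the corpus have a single translation in the dictionary.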

Test corpus:

150 test words
1,500 sentences
10 per test word
Randomly selected from the subset of sentences which were found in the corpus.
Only words with >100 example sentences included
Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
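
The selection procedure described above could look roughly like the following. It is a sketch under stated assumptions: the function `build_test_corpus` and its parameters are hypothetical names, the input is assumed to be (word, sentence) pairs already extracted from the corpus, and the notes do not say how the 150 test words themselves were chosen, so here they are sampled at random from the eligible words.

```python
import random
from collections import defaultdict


def build_test_corpus(examples, min_examples=100, per_word=10, max_words=150, seed=0):
    """Pick `per_word` random sentences for each eligible test word.

    `examples` is an iterable of (word, sentence) pairs. A word is eligible
    only if it has MORE than `min_examples` sentences, as in the notes.
    """
    by_word = defaultdict(list)
    for word, sentence in examples:
        by_word[word].append(sentence)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    eligible = sorted(w for w, sents in by_word.items() if len(sents) > min_examples)
    # Assumption: if too many words qualify, sample the test words at random.
    chosen = eligible if len(eligible) <= max_words else rng.sample(eligible, max_words)
    return {w: rng.sample(by_word[w], per_word) for w in chosen}


# Toy input: "patró" has 101 sentences (eligible), "sant" has exactly 100 (not eligible).
toy_examples = [("patró", f"sentence {i}") for i in range(101)]
toy_examples += [("sant", f"sentence {i}") for i in range(100)]
toy_test = build_test_corpus(toy_examples)
```

With 150 test words and 10 sentences each, this yields the 1,500 sentences listed above.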

Training corpus:

Baselines:

TL Frequency-best
TLM-best
Linguist set

Full analysis:Full analysis dic from Giza++
Rules from phrase table

@@ Line 1: / Line 1: @@
+Constraint-based lexical selection for rule-based machine translation
 <pre>
 Corpus: cawiki-20110616-pages-articles.xml.bz2
@@ Line 17: / Line 19: @@
 Test corpus:
+* 150 test words
-* 2,000 sentences
+* 1,500 sentences
 * 10 per test word
 * Randomly selected from the subset of sentences which were found in the corpus.
 * Only words with >100 example sentences included
+* Rationale: Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
+Training corpus:
-Baseline:
+Baselines:
-* Idea: Full analysis:Full analysis dic from Giza++
-*: This would require a parallel corpus.
+* TL Frequency-best
-Rationale:
+* TLM-best
+* Linguist set
+* Full analysis:Full analysis dic from Giza++
-* Dictionary doesn't provide good enough coverage to produce statistically significant results over a whole corpus.
+* Rules from phrase table
-==Testing==
-;Input: Les Carmelites el veneren com a sant patró seu.
-<pre>
-^El<det><def><f><pl>/The<det><def><f><pl>$
-^*Carmelites/*Carmelites$
-^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$
-^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$
-^com a<pr>/as a<pr>$ ^sant<adj><m><sg>/saint<adj><m><sg>$
-^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$
-^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$
-</pre>
-;Reference:
-<pre>
-	]^El<det><def><f><pl>/The<det><def><f><pl>$
-^*Carmelites/*Carmelites$
-^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$
-^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$
-^com a<pr>/as a<pr>$
-^sant<adj><m><sg>/saint<adj><m><sg>$
-^patró<n><m><sg>/patron<n><sg>$
-^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$[
-</pre>
-;Test 1 (1/6)
-<pre>
-^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$
-</pre>
-;Test 2 (1/1)
-<pre>
-^patró<n><m><sg>/patron<n><sg>$
-</pre>
-;Test 3 (1/4)
-<pre>
-^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$
-</pre>
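
The scores in the removed Testing section (1/6, 1/1, 1/4) appear to be one over the number of translations left on the ambiguous word, given that the reference translation is still among them. The sketch below reproduces that reading. The function names are hypothetical, the zero score for a missing reference is an assumption, and the parser is deliberately naive: it splits on every `/` and ignores escaping.

```python
from fractions import Fraction


def parse_lexical_unit(unit):
    """Split '^source/target1/target2$' into (source, [targets])."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    source, *targets = body[1:-1].split("/")
    return source, targets


def score(output_unit, reference_unit):
    """Return 1/n when the reference translation is among the n output translations."""
    _, out_targets = parse_lexical_unit(output_unit)
    _, ref_targets = parse_lexical_unit(reference_unit)
    if not out_targets or not any(t in out_targets for t in ref_targets):
        return Fraction(0)  # assumption: a lost reference translation scores zero
    return Fraction(1, len(out_targets))


reference = "^patró<n><m><sg>/patron<n><sg>$"
test1 = ("^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>"
         "/head<n><sg>/pattern<n><sg>/employer<n><sg>$")
test2 = "^patró<n><m><sg>/patron<n><sg>$"
test3 = "^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$"

scores = [score(t, reference) for t in (test1, test2, test3)]
```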
