Difference between revisions of "User:Francis Tyers/Sandbox2"

From Apertium
Constraint-based lexical selection for rule-based machine translation
 
==Testing==

;Input: Les Carmelites el veneren com a sant patró seu.

<pre>
^El<det><def><f><pl>/The<det><def><f><pl>$
^*Carmelites/*Carmelites$
^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$
^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$
^com a<pr>/as a<pr>$ ^sant<adj><m><sg>/saint<adj><m><sg>$
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$
^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$
</pre>

;Reference:

<pre>
235626 ]^El<det><def><f><pl>/The<det><def><f><pl>$
^*Carmelites/*Carmelites$
^prpers<prn><pro><p3><m><sg>/prpers<prn><obj><p3><nt><sg>$
^venerar<vblex><pri><p3><pl>/venerate<vblex><pri><p3><pl>$
^com a<pr>/as a<pr>$
^sant<adj><m><sg>/saint<adj><m><sg>$
^patró<n><m><sg>/patron<n><sg>$
^seu<adj><pos><m><sg>/his<adj><pos><m><sg>$^.<sent>/.<sent>$[
</pre>

;Test 1 (1/6)

<pre>
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/head<n><sg>/pattern<n><sg>/employer<n><sg>$
</pre>

;Test 2 (1/1)

<pre>
^patró<n><m><sg>/patron<n><sg>$
</pre>

;Test 3 (1/4)

<pre>
^patró<n><m><sg>/patron<n><sg>/owner<n><sg>/master<n><sg>/employer<n><sg>$
</pre>

Revision as of 11:05, 7 September 2011

Corpus: cawiki-20110616-pages-articles.xml.bz2
          cleaned with `aq-wikicrp'

1758582 lines
531983  unique analyses 
531436  lines with >1 translation (30%)
2740    analyses with >1 translation
287     words (lemma+pos) with >1 translation in corpus
712     words in dictionary with >1 translation

1.03    fertility of dictionary over corpus (i.e. total number of word:word translations / total number of words)
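As a rough illustration of the fertility figure, the sketch below divides the total number of dictionary translations over all corpus words by the number of corpus words. The function name `fertility`, the toy dictionary and the choice to count running words (tokens) rather than distinct words are all assumptions; the notes above do not say which was used.

```python
def fertility(corpus_words, translations):
    """Average number of dictionary translations per corpus word.

    corpus_words: list of words (lemma+pos) as they occur in the corpus.
    translations: dict mapping each word to its list of translations.
    Words missing from the dictionary are skipped (an assumption).
    """
    known = [w for w in corpus_words if w in translations]
    if not known:
        raise ValueError("no corpus word is covered by the dictionary")
    total = sum(len(translations[w]) for w in known)
    return total / len(known)


# Toy data, invented for illustration only.
toy_dictionary = {
    "patró<n>": ["patron<n>", "owner<n>"],
    "sant<adj>": ["saint<adj>"],
}
toy_corpus = ["patró<n>", "sant<adj>", "patró<n>", "*Carmelites"]

# (2 + 1 + 2) translations over 3 covered words
print(fertility(toy_corpus, toy_dictionary))
```

A value close to 1, such as the 1.03 reported above, means that most words in the corpus have a single translation in the dictionary.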


Test corpus:

  • 150 test words
  • 1,500 sentences
  • 10 per test word
  • Randomly selected from the subset of sentences which were found in the corpus.
  • Only words with >100 example sentences included
  • Rationale: The dictionary's coverage is too low to produce statistically significant results over a whole corpus.
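The selection described in the list above could be sketched as follows. This is only an illustration under stated assumptions: the function name `build_test_corpus`, the input layout (a dict from test word to its example sentences) and the fixed random seed are hypothetical, and the notes do not say how the 150 words themselves were picked, so here they are drawn at random from the eligible ones.

```python
import random


def build_test_corpus(sentences_by_word, n_words=150, per_word=10,
                      min_examples=100, seed=0):
    """Pick `per_word` random sentences for each of up to `n_words` words.

    Only words with more than `min_examples` example sentences are eligible.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    eligible = sorted(word for word, sentences in sentences_by_word.items()
                      if len(sentences) > min_examples)
    if len(eligible) > n_words:
        chosen = rng.sample(eligible, n_words)
    else:
        chosen = eligible
    return {word: rng.sample(sentences_by_word[word], per_word)
            for word in chosen}


# Toy data, invented for illustration only.
toy = {
    "patró": ["sentence %d with patró" % i for i in range(250)],
    "sant": ["sentence %d with sant" % i for i in range(100)],   # not > 100
    "seu": ["sentence %d with seu" % i for i in range(101)],
}
test_corpus = build_test_corpus(toy)
for word, sentences in sorted(test_corpus.items()):
    print(word, len(sentences))
```

With 150 eligible words and 10 sentences each, this yields the 1,500 sentences listed above.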

Training corpus:

Baselines:

  • TL Frequency-best
  • TLM-best
  • Linguist set
  • Full analysis:Full analysis dictionary from GIZA++
  • Rules from phrase table