{{TOCD}}
==TODO==
* <s>Do LER in/out-of-domain testing for the en-es setup with news commentary.</s>
* <s>Do BLEU in/out-of-domain testing for the en-es setup with news commentary.</s>
* <s>mk-en: why is TLM LER/BLEU so much better?</s>
** (partial) answer: 0-context rules (e.g. defaults) were not applying properly. Fixed by running them in series. This "solves" the LER issue.
** (partial) answer: preposition selection is much better. We could try running with the ling-default prepositions.
* Do pairwise bootstrap resampling for each of the best baseline + best rules (see the sketch after this list).
** (done) for parallel
* <s>Why do the eu-es rules not improve over freq?</s>
** (partial) answer: some rules do not apply because of how the tags are handled. See line #129774 in the test corpus. Need to define better how tags work. Perhaps only include tags where ambiguous?
* <s>Why do the Breton numbers for monolingual rules not approach TLM?</s>
** Because the crispiness is too low.
* <s>Why do the results get worse when we add more data?</s>
** Because the crispiness is too low.
* Rerun the mk-en experiments with fractional counts.
* Run the br-fr test with the large data set.
* Try decreasing C with corpus size.
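
For the bootstrap resampling item above, a minimal sketch of how a paired bootstrap over sentence-level scores could look (the score lists, sample count and function names are illustrative, not the actual evaluation scripts):

<pre>
# Paired bootstrap resampling: a minimal sketch, not the actual evaluation script.
# scores_a / scores_b are assumed to be per-sentence scores for the best baseline
# and the best rule set, in the same sentence order.
import random

def paired_bootstrap(scores_a, scores_b, samples=1000):
    """Return the proportion of resamples in which system A's total beats system B's."""
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    wins_a = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]   # resample sentences with replacement
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins_a += 1
    return wins_a / samples
</pre>

A proportion close to 0.0 or 1.0 suggests the difference between the two systems is unlikely to be due to sampling; for error rates such as LER, remember that lower is better.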
==Corpus stats==

{|class=wikitable
! Pair !! Corpus !! Lines !! Words (src) !! SL cov. !! Extracted !! Extracted (%) !! Lines (train) !! Lines (test) !! Lines (dev) !! Uniq. tokens >1 trans. !! Avg. trans./word
|-
| br-fr || oab || 57,305 || 702,328 || 94.47% || 4,668 || 8.32 || 2,668 || 1,000 || 1,000 || 603 || 1.07
|-
| en-es || europarl || 1,467,708 || 30,154,098 || 98.08% || 312,162 || 22.18 || 310,162 || 1,000 || 1,000 || 2,082 || 1.08
|-
| eu-es || opendata.euskadi.net || 765,115 || 10,190,079 || 91.70% || 87,907 || 11.48 || 85,907 || 1,000 || 1,000 || 1,806 || 1.30
|-
| mk-en || setimes || 190,493 || 4,259,338 || 92.17% || 19,747 || 10.94 || 17,747 || 1,000 || 1,000 || 13,134 || 1.86
|-
| sh-mk || setimes || || || || || || || || || ||
|-
|}
   
===Evaluation corpus===

====Out of domain====

{|class=wikitable
! Pair !! Lines !! Words (L1) !! Words (L2) !! Ambig. tokens !! Ambig. types !! Ambig. tokens/type !! % ambig !! Avg. trans./word
|-
| en-es || 434 || 9,463 || 10,280 || 619 || 303 || 2.04 || 6.54% || -
|-
|}
   
====In domain====

{|class=wikitable
! Pair !! Lines !! Words (L1) !! Words (L2) !! Ambig. tokens !! Ambig. types !! Ambig. tokens/type !! % ambig !! Avg. trans./word
|-
| br-fr || 1,000 || 13,854 || 13,878 || 1,163 || 372 || 3.13 || 8.39% || -
|-
| en-es || 1,000 || 19,882 || 20,944 || 1,469 || 337 || 4.35 || 7.38% || -
|-
| eu-es || 1,000 || 7,967 || 11,476 || 1,360 || 412 || 3.30 || 17.07% || -
|-
| mk-en || 1,000 || 13,441 || 14,228 || 3,872 || 1,289 || 3.00 || 28.80% || -
|-
|}
   
* % ambig = percentage of SL tokens with more than one translation
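
A rough sketch of how the ambiguity columns above could be computed from biltrans output (<code>lt-proc -b</code>), assuming lexical units of the form <code>^sl/tl1/tl2$</code>; the exact stream format should be checked before relying on this:

<pre>
# Rough sketch: count ambiguous tokens and types in lt-proc -b output.
# Assumes each lexical unit looks like ^sl<tags>/tl1<tags>/tl2<tags>$,
# with one slash per translation; check this against the real stream.
import re
import sys
from collections import Counter

lu_re = re.compile(r'\^([^$]*)\$')
tokens = 0
ambig = Counter()                      # ambiguous SL forms -> occurrences

for line in sys.stdin:
    for lu in lu_re.findall(line):
        parts = lu.split('/')
        tokens += 1
        if len(parts) > 2:             # SL form plus more than one translation
            ambig[parts[0]] += 1

ambig_tokens = sum(ambig.values())
print('tokens:', tokens)
print('ambig. tokens:', ambig_tokens)
print('ambig. types:', len(ambig))
print('ambig. tokens/type: %.2f' % (ambig_tokens / len(ambig)))
print('%% ambig: %.2f%%' % (100.0 * ambig_tokens / tokens))
</pre>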
==EAMT-style results==
   
===Out of domain===
====LER====
{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| en-es || <small>&mdash;</small><br/>[44.5, 52.0] || <small>&mdash;</small><br/>[34.7, 41.9] || <small>667</small><br/>[24.7, 31.9] || <small>630</small><br/>'''[21.4, 28.4]''' || <small>2881</small><br/>'''[20.2, 27.2]''' || <small>2728</small><br/>'''[20.2, 27.2]''' || <small>1683</small><br/>[20.7, 27.6] || <small>1578</small><br/>[20.7, 27.6] || <small>1242</small><br/>[20.7, 27.6] || <small>1197</small><br/>[20.7, 27.6]
|-
|}
   
====BLEU====

<small>
{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| en-es || [0.1885, 0.2133] || [0.1953, 0.2201] || [0.1832, 0.2067] || [0.1832, 0.2067] || [0.1831, 0.2067] || [0.1830, 0.2067] || [0.1828, 0.2063] || [0.1828, 0.2063] || [0.1828, 0.2063] ||
|-
|}
</small>
   
===In domain===

====LER====
<math>c</math> is the "crispiness" ratio: the number of times an alternative translation is seen in a given context relative to the default translation. So a <math>c</math> of 2.0 means that, in that context, the alternative translation appears twice as often as the default.
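
A minimal sketch of how such a threshold could be applied when extracting rules, assuming the context/translation counts and the default translations have already been collected (the data structures here are illustrative):

<pre>
# Minimal sketch of the crispiness filter; the counting itself happens earlier.
# context_counts: {(context, sl_word, translation): count}   -- assumed input
# defaults:       {sl_word: default_translation}             -- assumed input

def crispy_rules(context_counts, defaults, c=2.0):
    """Keep (context, word, translation) rules whose alternative translation is
    seen at least c times as often as the default translation in that context."""
    rules = []
    for (context, sl, tl), count in context_counts.items():
        default_tl = defaults[sl]
        if tl == default_tl:
            continue                                  # only alternatives become rules
        default_count = context_counts.get((context, sl, default_tl), 0)
        ratio = count / default_count if default_count else float('inf')
        if ratio >= c:
            rules.append((context, sl, tl, ratio))
    return rules
</pre>

Raising <math>c</math> keeps fewer, more clear-cut rules, which is why the tables below report one column per threshold.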
{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| br-fr || <small>&mdash;</small><br/>[58.9, 64.8] || <small>&mdash;</small><br/>[44.2, 50.5] || <small>168</small><br/>[54.8, 60.7] || <small>115</small><br/>[28.5, 34.1] || <small>221</small><br/>[27.8, 33.3] ||align="center"| <small>213</small><br/>[27.6, 33.0] || <small>159</small><br/>[26.3, 31.8] || <small>150</small><br/>'''[26.1, 31.6]''' || <small>135</small><br/>[27.2, 32.8] || <small>135</small><br/> [27.2, 32.8]
   
|-
| en-es || <small>&mdash;</small><br/>[21.0, 25.3] || <small>&mdash;</small><br/>[15.1, 18.9] || <small>667</small><br/>[20.7, 25.1] || <small>630</small><br/>[7.2, 10.0] || <small>2881</small><br/>[5.9, 8.6] || <small>2728</small><br/>[6.0, 8.6] || <small>1683</small><br/>'''[5.7, 8.3]''' || <small>1578</small><br/>'''[5.7, 8.3]''' || <small>1242</small><br/>[6.0, 8.5] || <small>1197</small><br/>[5.9, 8.6]
|-
| eu-es || <small>&mdash;</small><br/>[41.1, 46.6] || <small>&mdash;</small><br/>[38.8, 44.2] || <small>697</small><br/>[47.8, 53.0] || <small>598</small><br/>[16.5, 20.8] || <small>2253</small><br/>[20.2, 24.7] || <small>2088</small><br/>[17.2, 21.7] || <small>1382</small><br/>[16.8, 21.0] || <small>1266</small><br/>[16.1, 20.4] || <small>1022</small><br/>'''[15.9, 20.2]''' || <small>995</small><br/>[16.0, 20.3]
|-
| mk-en || <small>&mdash;</small><br/>[42.4, 46.3] || <small>&mdash;</small><br/>[27.1, 30.8] || <small>1385</small><br/>[28.8, 32.6] || <small>1079</small><br/>[19.0, 22.2] || <small>1684</small><br/>'''[18.5, 21.5]''' || <small>1635</small><br/>[18.6, 21.6] || <small>1323</small><br/>[19.1, 22.2] || <small>1271</small><br/>[19.0, 22.0] || <small>1198</small><br/>[19.1, 22.1] || <small>1079</small><br/>[19.1, 22.1]
|-
|}
   
====BLEU====

<small>
{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| br-fr || <small>&mdash;</small><br/>[0.1247, 0.1420] || <small>&mdash;</small><br/><b>[0.1397, 0.1572]</b> || <small>168</small><br/>[0.1325, 0.1503] || <small>115</small><br/>[0.1344, 0.1526] || <small>221</small><br/>[0.1367, 0.1551] ||align="center"| <small>213</small><br/>[0.1367, 0.1549] || <small>159</small><br/><b>[0.1374, 0.1554]</b> || <small>150</small><br/>[0.1364, 0.1543] || <small>135</small><br/>[0.1352, 0.1535] || <small>135</small><br/>[0.1352, 0.1535]
|-
| en-es || <small>&mdash;</small><br/>[0.2151, 0.2340] || <small>&mdash;</small><br/>[0.2197, 0.2384] || <small>667</small><br/>[0.2148, 0.2337] || <small>630</small><br/>[0.2208, 0.2398] || <small>2881</small><br/>[0.2217, 0.2405] || <small>2728</small><br/>[0.2217, 0.2406] || <small>1683</small><br/><b>[0.2217, 0.2407]</b> || <small>1578</small><br/><b>[0.2217, 0.2407]</b> || <small>1242</small><br/>[0.2217, 0.2407] || <small>1197</small><br/>[0.2217, 0.2408]
|-
| eu-es || <small>&mdash;</small><br/>[0.0873, 0.1038] || <small>&mdash;</small><br/>[0.0921, 0.1093] || <small>697</small><br/>[0.0870, 0.1030] || <small>598</small><br/>[0.0972, 0.1149] || <small>2253</small><br/>[0.0965, 0.1142] || <small>2088</small><br/>[0.0971, 0.1147] || <small>1382</small><br/>[0.0971, 0.1148] || <small>1266</small><br/>[0.0971, 0.1148] || <small>1022</small><br/>'''[0.0973, 0.1150]''' || <small>995</small><br/>'''[0.0973, 0.1150]'''
|-
| mk-en || <small>&mdash;</small><br/>[0.2300, 0.2511] || <small>&mdash;</small><br/>'''[0.2976, 0.3230]''' || <small>1385</small><br/>[0.2337, 0.2563] || <small>1079</small><br/>[0.2829, 0.3064] || <small>1684</small><br/>'''[0.2838, 0.3071]''' || <small>1635</small><br/>[0.2834, 0.3067] || <small>1323</small><br/> [0.2825, 0.3058] || <small>1271</small><br/>[0.2827, 0.3059] || <small>1198</small><br/>[0.2827, 0.3059] || <small>1079</small><br/>
|-
|}
</small>
   
==Learning monolingually (winner-takes-all)==

Setup:

* SL side of the training corpus
* All possibilities translated and scored
* Absolute winners taken
* Rules generated by counting n-grams in the same way as with the parallel corpus, only no alignment is needed, since the result behaves like an annotated corpus (see the sketch below).
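
A sketch of the winner-takes-all step under the assumptions above; <code>translate_all()</code> and <code>score_with_tlm()</code> are stand-ins for the real translation pipeline and target-language model, not actual scripts:

<pre>
# Winner-takes-all sketch: keep only the best-scoring translation of each
# SL sentence and treat the result as an annotated corpus.
# translate_all() and score_with_tlm() are placeholders for the real pipeline.

def annotate_corpus(sl_corpus, translate_all, score_with_tlm):
    """For each SL sentence, generate every combination of translation choices,
    score the resulting TL sentences, and keep only the absolute winner."""
    annotated = []
    for sl_sentence in sl_corpus:
        hypotheses = translate_all(sl_sentence)       # [(choices, tl_sentence), ...]
        best_choices, _best_score = max(
            ((choices, score_with_tlm(tl)) for choices, tl in hypotheses),
            key=lambda pair: pair[1])
        annotated.append((sl_sentence, best_choices))
    return annotated

# Rules are then generated by counting n-grams over the annotated corpus,
# exactly as for the parallel corpus, but with no word alignment step.
</pre>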
===Out of domain===

====LER====

{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| en-es || [44.5, 52.0] || [34.7, 41.9] || [24.7, 31.9] || || [30.2, 37.9] || [30.2, 37.9] || [29.2, 37.0] || [29.3, 36.8] || '''[29.0, 36.4]''' || [29.1, 36.5]
|-
|}
====BLEU====

<small>
{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| en-es || [0.1885, 0.2133] || [0.1953, 0.2201] || [0.1832, 0.2067] || || [0.1806, 0.2042] || [0.1806, 0.2042] || [0.1808, 0.2043] || [0.1810, 0.2046] || [0.1809, 0.2045] || [0.1809, 0.2045]
|-
|}
</small>
===In domain===

====LER====

{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| br-fr || <small>&mdash;</small><br/>[58.9, 64.8] || <small>&mdash;</small><br/>'''[44.2, 50.5]''' || <small>168</small><br/>[54.8, 60.7] || <small>115</small><br/> || <small>261</small><br/>[53.5, 59.2] ||align="center"| <small>247</small><br/>[52.1, 58.2] || <small>172</small><br/>[54.3, 60.2] || <small>165</small><br/>[52.7, 58.4] || <small>138</small><br/>'''[50.5, 56.3]''' || <small>136</small><br/>[50.6, 56.6]
|-
| en-es || <small>&mdash;</small><br/>[21.0, 25.3] || <small>&mdash;</small><br/>[15.1, 18.9] || <small>667</small><br/>[20.7, 25.1] || <small>?</small><br/>? || <small>2595</small><br/>[15.0, 19.0] || <small>2436</small><br/>[15.1, 19.1] || <small>1520</small><br/>[13.7, 17.6] || <small>1402</small><br/>'''[13.6, 17.3]''' || <small>1065</small><br/>[13.9, 17.7] || <small>1024</small><br/>[13.9, 17.8]
|-
| eu-es || <small>&mdash;</small><br/>[41.1, 46.6] || <small>&mdash;</small><br/>'''[38.8, 44.2]''' || <small>?</small><br/>[47.8, 53.0] || <small>?</small><br/> || <small>2631</small><br/>[40.9, 46.4] || <small>2427</small><br/>[40.9, 46.5] || <small>1186</small><br/>[40.7, 46.1] || <small>1025</small><br/>[40.7, 46.2] || <small>685</small><br/>'''[40.5, 45.9]''' || <small>641</small><br/>'''[40.5, 45.9]'''
|-
| mk-en || <small>&mdash;</small><br/>[42.4, 46.3] || <small>&mdash;</small><br/>'''[27.1, 30.8]''' || <small>1385</small><br/>[28.8, 32.6] || <small>?</small><br/><!--[27.2, 30.8]--> || <small>1698</small><br/>[27.8, 31.5] || <small>1662</small><br/>[27.8, 31.4] || <small>1321</small><br/>[27.8, 31.4] || <small>1285</small><br/>[27.8, 31.4] || <small>1186</small><br/>'''[27.7, 31.4]''' || <small>1180</small><br/>'''[27.7, 31.4]'''
|-
|}
====BLEU====
<small>
{|class=wikitable style="text-align: center"
! Pair !! freq !! tlm !! ling !! alig !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
|-
| br-fr || <small>&mdash;</small><br/>[0.1247, 0.1420] || <small>&mdash;</small><br/><b>[0.1397, 0.1572]</b> || <small>168</small><br/>[0.1325, 0.1503] || <small>115</small><br/> || <small>261</small><br/>[0.1250, 0.1425] ||align="center"| <small>247</small><br/>[0.1252, 0.1429] || <small>172</small><br/>[0.1240, 0.1412] || <small>165</small><br/>[0.1243, 0.1416] || <small>138</small><br/><b>[0.1255, 0.1429]</b> || <small>136</small><br/><b>[0.1255, 0.1429]</b>
|-
| en-es || <small>&mdash;</small><br/>[0.2151, 0.2340] || <small>&mdash;</small><br/><b>[0.2197, 0.2384]</b> || <small>667</small><br/>[0.2148, 0.2337] || <small>?</small><br/> || <small>2595</small><br/>[0.2180, 0.2371] || <small>2436</small><br/>[0.2180, 0.2372] || <small>1520</small><br/>[0.2190, 0.2380] || <small>1402</small><br/><b>[0.2190, 0.2381]</b> || <small>1065</small><br/>[0.2189, 0.2380] || <small>1024</small><br/>[0.2189, 0.2380]
|-
| eu-es || <small>&mdash;</small><br/>[0.0873, 0.1038] || <small>&mdash;</small><br/>'''[0.0921, 0.1093]''' || <small>?</small><br/>[0.0870, 0.1030] || <small>?</small><br/> || <small>2631</small><br/>[0.0875, 0.1040] || <small>2427</small><br/>[0.0878, 0.1042] || <small>1186</small><br/>[0.0878, 0.1043] || <small>1025</small><br/>[0.0878, 0.1043] || <small>685</small><br/>'''[0.0879, 0.1043]''' || <small>641</small><br/>'''[0.0879, 0.1043]'''
|-
| mk-en || <small>&mdash;</small><br/>[0.2300, 0.2511] || <small>&mdash;</small><br/>'''[0.2976, 0.3230]''' || <small>1385</small><br/>[0.2567, 0.2798] || || <small>1698</small><br/>[0.2694, 0.2930] || <small>1662</small><br/>[0.2695, 0.2931] || <small>1321</small><br/>'''[0.2696, 0.2935]''' || <small>1285</small><br/>'''[0.2696, 0.2935]''' || <small>1186</small><br/>[0.2696, 0.2934] || <small>1180</small><br/> [0.2696, 0.2934]
|-
|}
</small>
==Learning monolingually (fractional counts)==

Setup:

* SL side of the training corpus
* All possibilities translated and scored
* Probabilities normalised into fractional counts (i.e. summed to get a total, then each probability divided by that total)
** Log probabilities are converted back into ordinary probabilities using exp10()
* Rules generated by counting the fractional counts from the translated file (see the sketch below)
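
A minimal sketch of the normalisation for one ambiguous word, assuming the TL model returns a log10 probability per hypothesis (the names and example numbers are illustrative):

<pre>
# Fractional-counts sketch: turn the log10 probabilities of the hypotheses for
# one ambiguous word back into probabilities with exp10() and normalise them so
# they sum to 1. The log10 scores are assumed to come from the TL model.

def fractional_counts(log10_probs):
    """log10_probs: {translation: log10 probability of its full hypothesis}."""
    probs = {t: 10.0 ** lp for t, lp in log10_probs.items()}   # exp10()
    total = sum(probs.values())
    return {t: p / total for t, p in probs.items()}

# e.g. fractional_counts({'maison': -12.3, 'foyer': -12.9})
# -> roughly {'maison': 0.80, 'foyer': 0.20}; these fractions are what get
#    counted when generating the rules, instead of a single winner.
</pre>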
===In domain===

====LER====

====BLEU====

===Out of domain===

====LER====

====BLEU====
==MaxEnt==

===With alignments===

{|class=wikitable
! Pair !! alig !! rule-best !! ME (>5) !! ME (>3)
|-
| br-fr || 33.4 || 31.5 || 31.8 || 29.9
|-
| mk-en || 19.9 || 19.8 || 18.9 || 17.8
|-
| eu-es || 18.5 || 17.9 || 17.4 || 19.9
|-
| en-es || 8.6 || 7.0 || 6.3 || 6.3
|-
|}
===With fractional counts===

{|class=wikitable
! Pair !! alig !! rule-best !! ME (>5) !! ME (>3) !! ME (>1) !! ME (>0)
|-
| br-fr || 43.4 || 43.1 || 61.9 || 46.2 || 48.2 || 49.9
|-
| mk-en || 29.5 || || || || ||
|-
| eu-es || 41.2 || || 43.9 || 44.4 || ||
|-
| en-es || 11.9 || 11.7 || 11.4 || 11.9 || ||
|-
|}
   
==Notes==

* [http://acl.ldc.upenn.edu/W/W07/W07-0733.pdf Experiments in Domain Adaptation for Statistical Machine Translation]
* [http://www.cs.sfu.ca/~anoop/papers/pdf/ssl-smt-mtjournal07.pdf Semi-supervised model adaptation for statistical machine translation]
* [http://www.mt-archive.info/WMT-2009-Bertoldi.pdf Domain Adaptation for Statistical Machine Translation with Monolingual Resources]
*: "We found that the largest gain (25% relative) is achieved when in-domain data are available for the target language. A smaller performance improvement is still observed (5% relative) if source adaptation data are available. We also observed that the most important role is played by the LM adaptation, while the adaptation of the TM and RM gives consistent but small improvement."
