Difference between revisions of "User:Francis Tyers/Experiments"
(→LER)
(49 intermediate revisions by the same user not shown)
==Processing==

===Basque→Spanish===

<pre>
# Convert the TMX translation memories from UTF-16 to UTF-8
$ cat europako_testuak_memoria_2010.tmx | iconv -f utf-16 -t utf-8 > europako_testuak_memoria_2010.tmx.u8
$ cat 2010_memo_orokorra.tmx | iconv -f utf-16 -t utf-8 > 2010_memo_orokorra.tmx.u8

# Extract the text, split it by language prefix and concatenate both sources
$ python3 process-tmx.py europako_testuak_memoria_2010.tmx.u8 > europako_testuak_memoria_2010.txt
$ python3 process-tmx.py 2010_memo_orokorra.tmx.u8 > 2010_memo_orokorra.txt
$ cat 2010_memo_orokorra.txt | grep '^es' | cut -f2- > 2010_memo_orokorra.es.txt
$ cat 2010_memo_orokorra.txt | grep '^eu' | cut -f2- > 2010_memo_orokorra.eu.txt
$ cat europako_testuak_memoria_2010.txt | grep '^es' | cut -f2- > europako_testuak_memoria_2010.es.txt
$ cat europako_testuak_memoria_2010.txt | grep '^eu' | cut -f2- > europako_testuak_memoria_2010.eu.txt
$ cat europako_testuak_memoria_2010.es.txt 2010_memo_orokorra.es.txt > opendata.es
$ cat europako_testuak_memoria_2010.eu.txt 2010_memo_orokorra.eu.txt > opendata.eu
$ wc -l opendata.e*
782325 opendata.es
782325 opendata.eu

# Keep only sentence pairs of 1 to 40 tokens
$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl opendata eu es opendata.clean 1 40

# Analyse and tag both sides up to pretransfer
$ cat opendata.clean.eu | apertium-destxt | apertium -f none -d ~/source/apertium-eu-es/ eu-es-pretransfer > opendata.tagged.eu
$ cat opendata.clean.es | apertium-destxt | apertium -f none -d ~/source/apertium-eu-es/ es-eu-pretransfer > opendata.tagged.es &

# Number the lines and keep only those that contain a '<'
$ seq 1 771238 > opendata.lines
$ paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f1 > opendata.lines.new
$ paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f2 > opendata.tagged.eu.new
$ paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f3 > opendata.tagged.es.new
$ mv opendata.lines.new opendata.lines
$ mv opendata.tagged.es.new opendata.tagged.es
$ mv opendata.tagged.eu.new opendata.tagged.eu

# Bilingual dictionary lookup, with the standard and the noRL dictionaries
$ cat opendata.tagged.eu | lt-proc -b ~/source/apertium-eu-es/eu-es.autobil.bin >/tmp/eu-es.bil1
$ cat opendata.tagged.eu | lt-proc -b ~/source/apertium-eu-es/eu-es.autobil-noRL.bin >/tmp/eu-es.bil2
$ tail -n 1 /tmp/*.poly
==> /tmp/eu-es.bil1.poly <==
1.00240014637
==> /tmp/eu-es.bil2.poly <==
1.3015831681
$ mv /tmp/eu-es.bil2 opendata.biltrans.eu-es

# Tokenise the tagger and bilingual-lookup output
$ cat opendata.tagged.es | python /home/fran/source/apertium-lex-tools/scripts/process-tagger-output.py es > opendata.token.es
$ cat opendata.tagged.eu | python /home/fran/source/apertium-lex-tools/scripts/process-tagger-output.py eu > opendata.token.eu
$ cat opendata.biltrans.eu-es | python /home/fran/source/apertium-lex-tools/scripts/process-biltrans-output.py > opendata.token.eu-es &

# Train the alignment model
$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
/home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus opendata.token -f eu -e es -alignment grow-diag-final-and \
-reordering msd-bidirectional-fe -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

# Drop line pairs containing ten or more '*' (unknown-word) marks
$ paste opendata.lines opendata.token.eu opendata.token.es | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f1 > opendata.lines.new &
$ paste opendata.lines opendata.token.eu opendata.token.es | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f2 > opendata.eu.new &
$ paste opendata.lines opendata.token.eu opendata.token.es | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f3 > opendata.es.new &
$ mv opendata.lines.new opendata.lines
$ mv opendata.es.new opendata.token.es
$ mv opendata.eu.new opendata.token.eu

# Drop line pairs containing six or more <sent> tags
$ paste opendata.lines opendata.token.eu opendata.token.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f1 > opendata.lines.new
$ paste opendata.lines opendata.token.eu opendata.token.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f2 > opendata.eu.new &
$ paste opendata.lines opendata.token.eu opendata.token.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f3 > opendata.es.new &
$ mv opendata.lines.new opendata.lines
$ mv opendata.es.new opendata.token.es
$ mv opendata.eu.new opendata.token.eu

# Strip trailing spaces
$ cat opendata.token.es | sed 's/ *$//g' > opendata.token.es.new
$ cat opendata.token.eu | sed 's/ *$//g' > opendata.token.eu.new
$ mv opendata.token.es.new opendata.token.es
$ mv opendata.token.eu.new opendata.token.eu
</pre>
===English→Spanish===

<pre>
# Keep only sentence pairs of 1 to 40 tokens
$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl europarl en es europarl.clean 1 40

# Analyse and tag both sides up to pretransfer
$ cat europarl.clean.en | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
$ cat europarl.clean.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &

# Keep only lines that contain a '<'
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f1 > europarl.lines.new
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.en.new
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.es.new

# Drop line pairs with six or more <sent> tags or ten or more '*' (unknown-word) marks
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f1 >europarl.lines.new
$ bg
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f2 >europarl.en.new &
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f3 >europarl.es.new &

# Tokenise the tagger and bilingual-lookup output
$ nohup cat europarl.tagged.en | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py en > europarl.token.en &
$ nohup cat europarl.tagged.es | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py es > europarl.token.es &
$ nohup cat europarl.biltrans.en-es | python ~/source/apertium-lex-tools/scripts/process-biltrans-output.py > europarl.token.en-es &
</pre>
===Macedonian→English===

<pre>
# Vim substitutions restoring ѓ where the corpus has f in its place
:%s/еfу/еѓу/g
:%s/аfа/аѓа/g
:%s/оfа/оѓа/g
:%s/уfе/уѓе/g
:%s/нfи/нѓи/g
:%s/Ѓиниfиќ/Ѓинѓиќ/g
:%s/еfе/еѓе/g
:%s/уfм/уѓм/g
:%s/рfи/рѓи/g
:%s/ fе / ѓе /g
:%s/рfе/рѓе/g
:%s/уfи/уѓи/g
:%s/ fу/ ѓу/g
:%s/Караfорѓевиќ/Караѓорѓевиќ/g
:%s/Холанfанец/Холанѓанец/g
:%s/реfаваат/реѓаваат/g
:%s/Швеfанката/Швеѓанката/g
:%s/Новозеланfани/Новозеланѓани/g
:%s/Мрfан/Мрѓан/g
:%s/Анfелка/Анѓелка/g
:%s/рfосаната/рѓосаната/g
:%s/оттуfуваоето/оттуѓуваоето/g
:%s/Енfел/Енѓел/g
:%s/Караfорѓевиќ/Караѓорѓевиќ/g
:%s/маfународната/маѓународната/g
:%s/Пеfа/Пеѓа/g
:%s/маfепсник/маѓепсник/g
:%s/Караfорѓе/Караѓорѓе/g
:%s/Граfевинар/Граѓевинар/g
:%s/Меfаши/Меѓаши/g
:%s/Ванfел/Ванѓел/g
:%s/Караfиќ/Караѓиќ/g
:%s/Анfели/Анѓели/g
:%s/саfи/саѓи/g
:%s/маfионичарски/маѓионичарски/g
:%s/Караfорѓевиќ/Караѓорѓевиќ/g
:%s/панаfур/панаѓур/g
:%s/Ѓерf/Ѓерѓ/g
:%s/Ѓинѓиf/Ѓинѓиѓ/g

# Drop sentence pairs where either side starts with '('
$ paste setimes.mk setimes.en | grep -v '^(' | cut -f1 > setimes.mk.new
$ paste setimes.mk setimes.en | grep -v '^(' | cut -f2 > setimes.en.new
$ paste setimes.en.new setimes.mk.new | grep -v '^(' | cut -f1 > setimes.en
$ paste setimes.en.new setimes.mk.new | grep -v '^(' | cut -f2 > setimes.mk

# Keep only sentence pairs of 1 to 40 tokens
$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl setimes mk en setimes.clean 1 40

# Analyse and tag both sides up to pretransfer
$ cat setimes.clean.mk | apertium-destxt | apertium -f none -d ~/source/apertium-mk-en/ mk-en-pretransfer > setimes.tagged.mk &
$ cat setimes.clean.en | apertium-destxt | apertium -f none -d ~/source/apertium-mk-en/ en-mk-pretransfer > setimes.tagged.en &

# Number the lines and keep only those that contain a '<'
$ seq 1 190503 > setimes.lines
$ paste setimes.lines setimes.tagged.mk setimes.tagged.en | grep '<' | cut -f1 > setimes.lines.new
$ paste setimes.lines setimes.tagged.mk setimes.tagged.en | grep '<' | cut -f2 > setimes.mk.new
$ paste setimes.lines setimes.tagged.mk setimes.tagged.en | grep '<' | cut -f3 > setimes.en.new
$ mv setimes.en.new setimes.tagged.en
$ mv setimes.mk.new setimes.tagged.mk
$ mv setimes.lines.new setimes.lines

# Bilingual dictionary lookup, then tokenise the tagger output
$ nohup cat setimes.tagged.mk | lt-proc -b ~/source/apertium-mk-en/mk-en.autobil.bin > setimes.biltrans.mk-en &
$ cat setimes.tagged.mk | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py mk > setimes.token.mk &
$ cat setimes.tagged.en | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py en > setimes.token.en &
</pre>
Latest revision as of 10:49, 22 November 2012
TODO
- <s>Do LER in/out domain testing for the en-es setup with news commentary.</s>
- <s>Do BLEU in/out domain testing for the en-es setup with news commentary.</s>
- <s>mk-en: why is TLM LER/BLEU so much better?</s>
  - (partial) answer: 0-context rules (e.g. defaults) not applying properly. Fixed by running in series. This "solves" the LER issue.
  - (partial) answer: preposition selection is much better. We could try running with ling-default preps.
- Do pairwise bootstrap resampling for each of best baseline + best rules.
  - (done) for parallel
- <s>Why do eu-es rules not improve over freq?</s>
  - (partial) answer: some rules do not apply because of tag wankery. See line #129774 in the test corpus. Need to define better how tags work. Perhaps only include tags where ambiguous?
- <s>Why do Breton numbers for monolingual rules not approach TLM?</s>
  - Because crispiness is too low.
- <s>Why do the results get worse when we add more data?</s>
  - Because crispiness is too low.
- Rerun the mk-en stuff with frac counts.
- Run br-fr test with huge data.
- Try decreasing the C with corpus size.
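The pairwise bootstrap resampling item in the list above could be approached along the following lines. This is only a sketch: the function name, the per-sentence error lists and the default of 1000 samples are assumptions made for illustration, not the evaluation script used for these experiments.

```python
import random


def paired_bootstrap(errors_a, errors_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A makes fewer errors than B.

    errors_a and errors_b hold per-sentence error counts for the same test
    sentences (hypothetical input format). Because both systems are scored on
    the same resampled sentences, comparing summed errors is equivalent to
    comparing error rates over those sentences.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(samples):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_a[i] for i in idx) < sum(errors_b[i] for i in idx):
            wins += 1
    return wins / samples


# Hypothetical per-sentence lexical-selection errors for two systems.
rules_errors = [0, 1, 0, 0, 2, 0]
baseline_errors = [1, 1, 0, 1, 2, 1]
print(paired_bootstrap(rules_errors, baseline_errors))
```

A value close to 1.0 would suggest that the first system is reliably better on this test set; the usual cut-off (for example 0.95) is a convention rather than something fixed by these notes.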
Corpus stats
Pair | Corpus | Lines | W. (src) | SL cov. | Extracted | Extracted (%) | L. (train) | L. (test) | L. (dev) | Uniq. tokens >1 trad. | Avg. trad / word | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
br-fr | oab | 57,305 | 702,328 | 94.47% | 4,668 | 8.32 | 2,668 | 1,000 | 1,000 | 603 | 1.07 | |
en-es | europarl | 1,467,708 | 30,154,098 | 98.08% | 312,162 | 22.18 | 310,162 | 1,000 | 1,000 | 2,082 | 1.08 | |
eu-es | opendata.euskadi.net | 765,115 | 10,190,079 | 91.70% | 87,907 | 11.48 | 85,907 | 1,000 | 1,000 | 1,806 | 1.30 | |
mk-en | setimes | 190,493 | 4,259,338 | 92.17% | 19,747 | 10.94 | 17,747 | 1,000 | 1,000 | 13,134 | 1.86 | |
sh-mk | setimes |
Evaluation corpus
Out of domain
Pair | Lines | Words (L1) | Words (L2) | Ambig. tokens | Ambig. types | Ambig token/type | % ambig | Av. trad/word |
---|---|---|---|---|---|---|---|---|
en-es | 434 | 9,463 | 10,280 | 619 | 303 | 2.04 | 6.54% | - |
In domain
Pair | Lines | Words (L1) | Words (L2) | Ambig. tokens | Ambig. types | Ambig token/type | % ambig | Av. trad/word |
---|---|---|---|---|---|---|---|---|
br-fr | 1,000 | 13,854 | 13,878 | 1,163 | 372 | 3.13 | 8.39% | - |
en-es | 1,000 | 19,882 | 20,944 | 1,469 | 337 | 4.35 | 7.38% | - |
eu-es | 1,000 | 7,967 | 11,476 | 1,360 | 412 | 3.30 | 17.07% | - |
mk-en | 1,000 | 13,441 | 14,228 | 3,872 | 1,289 | 3.00 | 28.80% | - |
- % ambig = percentage of SL tokens with more than one translation (Ambig. tokens / Words (L1))
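The derived columns in the evaluation-corpus tables can be recomputed from the raw counts. The sketch below uses a hypothetical helper, `ambiguity_stats`, and assumes that "% ambig" is ambiguous tokens over source-language words and that "Ambig token/type" is ambiguous tokens over ambiguous types; both assumptions reproduce the br-fr and eu-es rows.

```python
def ambiguity_stats(words_sl, ambig_tokens, ambig_types):
    """Return (% ambig, ambiguous tokens per type), rounded to two decimals."""
    pct_ambig = 100.0 * ambig_tokens / words_sl
    tokens_per_type = ambig_tokens / ambig_types
    return round(pct_ambig, 2), round(tokens_per_type, 2)


# br-fr in-domain row: 13,854 SL words, 1,163 ambiguous tokens, 372 ambiguous types.
print(ambiguity_stats(13854, 1163, 372))  # (8.39, 3.13)
```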
EAMT-style results
Out of domain
LER
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
en-es | - [44.5, 52.0] | - [34.7, 41.9] | 667 [24.7, 31.9] | 630 [21.4, 28.4] | 2881 [20.2, 27.2] | 2728 [20.2, 27.2] | 1683 [20.7, 27.6] | 1578 [20.7, 27.6] | 1242 [20.7, 27.6] | 1197 [20.7, 27.6] |
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
en-es | [0.1885, 0.2133] | [0.1953, 0.2201] | [0.1832, 0.2067] | [0.1832, 0.2067] | [0.1831, 0.2067] | [0.1830, 0.2067] | [ [0.1828, 0.2063] | [0.1828, 0.2063] | [0.1828, 0.2063] |
In domain
LER
c is the "crispiness" ratio: the number of times an alternative translation is seen in a given context compared to the default translation. So a c of 2.0 means that the alternative translation appears twice as frequently as the default.
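As a rough illustration of the crispiness threshold used in the "rules (c>...)" columns, the sketch below computes the ratio for one context from two counts and keeps a rule only when it exceeds the cut-off. The function names, the example counts and the handling of a zero default count are assumptions; the actual rule extractor may count and filter differently.

```python
def crispiness(alt_count, default_count):
    """Ratio of alternative-translation count to default-translation count in one context."""
    if default_count == 0:
        # Assumption: if the default is never seen in this context, treat the
        # alternative as infinitely crisp.
        return float("inf")
    return alt_count / default_count


def keep_rule(alt_count, default_count, threshold):
    """Keep a rule only if its crispiness is strictly above the threshold (c > threshold)."""
    return crispiness(alt_count, default_count) > threshold


# Hypothetical counts: the alternative is seen 6 times in this context, the default 3 times.
print(crispiness(6, 3))       # 2.0
print(keep_rule(6, 3, 1.5))   # True
print(keep_rule(6, 3, 2.0))   # False, because the comparison is strict
```

Raising the threshold keeps fewer rules, which matches the falling first number in each cell as c grows, if that number is the rule count.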
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
br-fr | - [58.9, 64.8] | - [44.2, 50.5] | 168 [54.8, 60.7] | 115 [28.5, 34.1] | 221 [27.8, 33.3] | 213 [27.6, 33.0] | 159 [26.3, 31.8] | 150 [26.1, 31.6] | 135 [27.2, 32.8] | 135 [27.2, 32.8] |
en-es | - [21.0, 25.3] | - [15.1, 18.9] | 667 [20.7, 25.1] | 630 [7.2, 10.0] | 2881 [5.9, 8.6] | 2728 [6.0, 8.6] | 1683 [5.7, 8.3] | 1578 [5.7, 8.3] | 1242 [6.0, 8.5] | 1197 [5.9, 8.6] |
eu-es | - [41.1, 46.6] | - [38.8, 44.2] | 697 [47.8, 53.0] | 598 [16.5, 20.8] | 2253 [20.2, 24.7] | 2088 [17.2, 21.7] | 1382 [16.8, 21.0] | 1266 [16.1, 20.4] | 1022 [15.9, 20.2] | 995 [16.0, 20.3] |
mk-en | - [42.4, 46.3] | - [27.1, 30.8] | 1385 [28.8, 32.6] | 1079 [19.0, 22.2] | 1684 [18.5, 21.5] | 1635 [18.6, 21.6] | 1323 [19.1, 22.2] | 1271 [19.0, 22.0] | 1198 [19.1, 22.1] | 1079 [19.1, 22.1] |
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
br-fr | - [0.1247, 0.1420] | - [0.1397, 0.1572] | 168 [0.1325, 0.1503] | 115 [0.1344, 0.1526] | 221 [0.1367, 0.1551] | 213 [0.1367, 0.1549] | 159 [0.1374, 0.1554] | 150 [0.1364, 0.1543] | 135 [0.1352, 0.1535] | 135 [0.1352, 0.1535] |
en-es | - [0.2151, 0.2340] | - [0.2197, 0.2384] | 667 [0.2148, 0.2337] | 630 [0.2208, 0.2398] | 2881 [0.2217, 0.2405] | 2728 [0.2217, 0.2406] | 1683 [0.2217, 0.2407] | 1578 [0.2217, 0.2407] | 1242 [0.2217, 0.2407] | 1197 [0.2217, 0.2408] |
eu-es | - [0.0873, 0.1038] | - [0.0921, 0.1093] | 697 [0.0870, 0.1030] | 598 [0.0972, 0.1149] | 2253 [0.0965, 0.1142] | 2088 [0.0971, 0.1147] | 1382 [0.0971, 0.1148] | 1266 [0.0971, 0.1148] | 1022 [0.0973, 0.1150] | 995 [0.0973, 0.1150] |
mk-en | - [0.2300, 0.2511] | - [0.2976, 0.3230] | 1385 [0.2337, 0.2563] | 1079 [0.2829, 0.3064] | 1684 [0.2838, 0.3071] | 1635 [0.2834, 0.3067] | 1323 [0.2825, 0.3058] | 1271 [0.2827, 0.3059] | 1198 [0.2827, 0.3059] | 1079 |
Learning monolingually (winner-takes-all)
Setup:
- SL side of the training corpus
- All possibilities translated and scored
- Absolute winners taken
- Rules generated by counting ngrams in the same way as with the parallel corpus, only no alignment needed as it works like an annotated corpus.
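The "absolute winners taken" step in the setup above can be pictured as follows. This is a minimal sketch under assumptions: each candidate translation of a sentence is taken to arrive as a (translation, score) pair where a higher score is better, and the function name and example values are hypothetical.

```python
def winner_takes_all(candidates):
    """Return the single highest-scoring translation from (translation, score) pairs."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first of several equally scored candidates.
    return max(candidates, key=lambda pair: pair[1])[0]


# Hypothetical target-language-model scores (log probabilities) for one ambiguous word.
candidates = [("banco", -12.5), ("orilla", -14.1)]
print(winner_takes_all(candidates))  # banco
```

Only the winner then contributes to the n-gram counts, which is what distinguishes this setup from the fractional-count variant.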
Out of domain
LER
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
en-es | [44.5, 52.0] | [34.7, 41.9] | [24.7, 31.9] | [30.2, 37.9] | [30.2, 37.9] | [29.2, 37.0] | [29.3, 36.8] | [29.0, 36.4] | [29.1, 36.5] |
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
en-es | [0.1885, 0.2133] | [0.1953, 0.2201] | [0.1832, 0.2067] | [0.1806, 0.2042] | [0.1806, 0.2042] | [0.1808, 0.2043] | [0.1810, 0.2046] | [0.1809, 0.2045] | [0.1809, 0.2045] |
In domain
LER
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
br-fr | - [58.9, 64.8] | - [44.2, 50.5] | 168 [54.8, 60.7] | 115 | 261 [53.5, 59.2] | 247 [52.1, 58.2] | 172 [54.3, 60.2] | 165 [52.7, 58.4] | 138 [50.5, 56.3] | 136 [50.6, 56.6] |
en-es | - [21.0, 25.3] | - [15.1, 18.9] | 667 [20.7, 25.1] | ? ? | 2595 [15.0, 19.0] | 2436 [15.1, 19.1] | 1520 [13.7, 17.6] | 1402 [13.6, 17.3] | 1065 [13.9, 17.7] | 1024 [13.9, 17.8] |
eu-es | - [41.1, 46.6] | - [38.8, 44.2] | ? [47.8, 53.0] | ? | 2631 [40.9, 46.4] | 2427 [40.9, 46.5] | 1186 [40.7, 46.1] | 1025 [40.7, 46.2] | 685 [40.5, 45.9] | 641 [40.5, 45.9] |
mk-en | - [42.4, 46.3] | - [27.1, 30.8] | 1385 [28.8, 32.6] | ? | 1698 [27.8, 31.5] | 1662 [27.8, 31.4] | 1321 [27.8, 31.4] | 1285 [27.8, 31.4] | 1186 [27.7, 31.4] | 1180 [27.7, 31.4] |
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0) |
---|---|---|---|---|---|---|---|---|---|---|
br-fr | - [0.1247, 0.1420] | - [0.1397, 0.1572] | 168 [0.1325, 0.1503] | 115 | 261 [0.1250, 0.1425] | 247 [0.1252, 0.1429] | 172 [0.1240, 0.1412] | 165 [0.1243, 0.1416] | 138 [0.1255, 0.1429] | 136 [0.1255, 0.1429] |
en-es | - [0.2151, 0.2340] | - [0.2197, 0.2384] | 667 [0.2148, 0.2337] | ? | 2595 [0.2180, 0.2371] | 2436 [0.2180, 0.2372] | 1520 [0.2190, 0.2380] | 1402 [0.2190, 0.2381] | 1065 [0.2189, 0.2380] | 1024 [0.2189, 0.2380] |
eu-es | - [0.0873, 0.1038] | - [0.0921, 0.1093] | ? [0.0870, 0.1030] | ? | 2631 [0.0875, 0.1040] | 2427 [0.0878, 0.1042] | 1186 [0.0878, 0.1043] | 1025 [0.0878, 0.1043] | 685 [0.0879, 0.1043] | 641 [0.0879, 0.1043] |
mk-en | - [0.2300, 0.2511] | - [0.2976, 0.3230] | 1385 [0.2567, 0.2798] | | 1698 [0.2694, 0.2930] | 1662 [0.2695, 0.2931] | 1321 [0.2696, 0.2935] | 1285 [0.2696, 0.2935] | 1186 [0.2696, 0.2934] | 1180 [0.2696, 0.2934] |
Learning monolingually (fractional counts)
Setup:
- SL side of the training corpus
- All possibilities translated and scored
- Probabilities normalised into fractional counts (i.e. add them up to get a total, then divide each probability by the total).
  - Log probabilities are converted into plain probabilities using exp10().
- Rules generated by counting fractions from the translated file.
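The normalisation described in the setup above can be sketched as below. The function name and the example scores are hypothetical, and subtracting the best score before exponentiating is an added numerical-stability step that is not mentioned in these notes; it leaves the resulting fractions unchanged.

```python
def fractional_counts(log10_scores):
    """Turn log10 scores of competing translations into fractional counts summing to 1."""
    # Assumption (not in the notes): shift by the best score so that very
    # negative log probabilities do not all underflow to zero.
    best = max(log10_scores)
    # The exp10 step: convert each log10 probability back into a plain probability.
    probs = [10.0 ** (score - best) for score in log10_scores]
    # Add the probabilities up, then divide each one by the total.
    total = sum(probs)
    return [p / total for p in probs]


# Hypothetical log10 scores for two competing translations of one sentence.
counts = fractional_counts([-2.0, -3.0])
print(counts)
```

With these scores the first translation is ten times as probable as the second, so it receives ten elevenths of the count instead of all of it, as it would under winner-takes-all.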
In domain
LER
BLEU
Out of domain
LER
BLEU
MaxEnt
With alignments
Pair | alig | rule-best | ME (>5) | ME (>3) |
---|---|---|---|---|
br-fr | 33.4 | 31.5 | 31.8 | 29.9 |
mk-en | 19.9 | 19.8 | 18.9 | 17.8 |
eu-es | 18.5 | 17.9 | 17.4 | 19.9 |
en-es | 8.6 | 7.0 | 6.3 | 6.3 |
With fractional counts
Pair | alig | rule-best | ME (>5) | ME (>3) | ME (>1) | ME (>0) |
---|---|---|---|---|---|---|
br-fr | 43.4 | 43.1 | 61.9 | 46.2 | 48.2 | 49.9 |
mk-en | 29.5 | | | | | |
eu-es | 41.2 | | 43.9 | 44.4 | | |
en-es | 11.9 | 11.7 | 11.4 | 11.9 | | |
Notes
- Experiments in Domain Adaptation for Statistical Machine Translation (http://acl.ldc.upenn.edu/W/W07/W07-0733.pdf)
- Semi-supervised model adaptation for statistical machine translation (http://www.cs.sfu.ca/~anoop/papers/pdf/ssl-smt-mtjournal07.pdf)
- Domain Adaptation for Statistical Machine Translation with Monolingual Resources (http://www.mt-archive.info/WMT-2009-Bertoldi.pdf)
  - "We found that the largest gain (25% relative) is achieved when in-domain data are available for the target language. A smaller performance improvement is still observed (5% relative) if source adaptation data are available. We also observed that the most important role is played by the LM adaptation, while the adaptation of the TM and RM gives consistent but small improvement."