User:Francis Tyers/Experiments
TODO
- Do LER in/out-of-domain testing for the en-es setup with news commentary.
- Do BLEU in/out-of-domain testing for the en-es setup with news commentary.
- mk-en: why is TLM LER/BLEU so much better?
- (partial) answer: 0-context rules (e.g. defaults) were not applying properly. Fixed by running in series. This "solves" the LER issue.
- (partial) answer: preposition selection is much better. We could try running with ling-default preps.
- Do pairwise bootstrap resampling for each of best baseline + best rules (see the sketch below).
- Why do the Breton numbers for monolingual rules not approach TLM?
- Why do the eu-es rules not improve over freq?
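A possible starting point for the pairwise bootstrap resampling item above. This is a minimal sketch of the usual paired bootstrap setup (resample test sentences with replacement, count how often one system's corpus-level score beats the other's); all names are made up and it is not the exact script used for the numbers on this page.

```python
# Sketch of pairwise bootstrap resampling between two systems scored on the
# same test set. Phrased for LER-like metrics (lower error rate = better);
# flip the comparison for BLEU.
import random

def paired_bootstrap(errors_a, errors_b, lengths, n_samples=1000):
    """Return the fraction of resamples in which system A beats system B."""
    assert len(errors_a) == len(errors_b) == len(lengths)
    n = len(lengths)
    wins_a = 0
    for _ in range(n_samples):
        sample = [random.randrange(n) for _ in range(n)]  # resample sentences with replacement
        rate_a = sum(errors_a[i] for i in sample) / sum(lengths[i] for i in sample)
        rate_b = sum(errors_b[i] for i in sample) / sum(lengths[i] for i in sample)
        if rate_a < rate_b:
            wins_a += 1
    return wins_a / n_samples

# e.g. a result >= 0.95 is usually read as "A better than B at p < 0.05"
```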
Corpus stats
Pair | Corpus | Lines | Words (src) | SL cov. | Extracted | Extracted (%) | Lines (train) | Lines (test) | Lines (dev) | Uniq. tokens with >1 transl. | Avg. transl./word
---|---|---|---|---|---|---|---|---|---|---|---
br-fr | oab | 57,305 | 702,328 | 94.47% | 4,668 | 8.32 | 2,668 | 1,000 | 1,000 | 603 | 1.07
en-es | europarl | 1,467,708 | 30,154,098 | 98.08% | 312,162 | 22.18 | 310,162 | 1,000 | 1,000 | 2,082 | 1.08
eu-es | opendata.euskadi.net | 765,115 | 10,190,079 | 91.70% | 87,907 | 11.48 | 85,907 | 1,000 | 1,000 | 1,806 | 1.30
mk-en | setimes | 190,493 | 4,259,338 | 92.17% | 19,747 | 10.94 | 17,747 | 1,000 | 1,000 | 13,134 | 1.86
sh-mk | setimes | | | | | | | | | |
Evaluation corpus
Out of domain
Pair | Lines | Words (L1) | Words (L2) | Ambig. tokens | Ambig. types | Ambig. tokens/type | % ambig | Avg. transl./word
---|---|---|---|---|---|---|---|---
en-es | 434 | 9,463 | 10,280 | 619 | 303 | 2.04 | 6.54% | - |
In domain
Pair | Lines | Words (L1) | Words (L2) | Ambig. tokens | Ambig. types | Ambig. tokens/type | % ambig | Avg. transl./word
---|---|---|---|---|---|---|---|---
br-fr | 1,000 | 13,854 | 13,878 | 1,163 | 372 | 3.13 | 8.39% | - |
en-es | 1,000 | 19,882 | 20,944 | 1,469 | 337 | 4.35 | 7.38% | - |
eu-es | 1,000 | 7,967 | 11,476 | 1,360 | 412 | 3.30 | 17.07% | - |
mk-en | 1,000 | 13,441 | 14,228 | 3,872 | 1,289 | 3.00 | 28.80% | - |
- % ambig = percentage of SL tokens with more than one translation (Ambig. tokens / Words (L1); e.g. for eu-es, 1,360 / 7,967 ≈ 17.07%).
EAMT-style results
Out of domain
LER
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
en-es | — [44.5, 52.0] | — [34.7, 41.9] | 667 [24.7, 31.9] | 630 [21.4, 28.4] | 2881 [20.2, 27.2] | 2728 [20.2, 27.2] | 1683 [20.7, 27.6] | 1578 [20.7, 27.6] | 1242 [20.7, 27.6] | 1197 [20.7, 27.6]
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
en-es | [0.1885, 0.2133] | [0.1953, 0.2201] | [0.1832, 0.2067] | [0.1832, 0.2067] | [0.1831, 0.2067] | [0.1830, 0.2067] | [0.1828, 0.2063] | [0.1828, 0.2063] | [0.1828, 0.2063] |
In domain
LER
is the "crispiness" ratio, the amount of times an alternative translation is seen in a given context compared to the default translation. So, a of 2.0 means that the translation appears twice as frequently as the default.
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
br-fr | — [58.9, 64.8] | — [44.2, 50.5] | 168 [54.8, 60.7] | 115 [28.5, 34.1] | 221 [27.8, 33.3] | 213 [27.6, 33.0] | 159 [26.3, 31.8] | 150 [26.1, 31.6] | 135 [27.2, 32.8] | 135 [27.2, 32.8]
en-es | — [21.0, 25.3] | — [15.1, 18.9] | 667 [20.7, 25.1] | 630 [9.7, 12.7] | 2881 [8.6, 11.6] | 2728 [8.6, 11.6] | 1683 [8.3, 11.4] | 1578 [8.3, 11.4] | 1242 [8.5, 11.6] | 1197 [8.5, 11.5]
eu-es | — [41.1, 46.6] | — [38.8, 44.2] | 697 [47.8, 53.0] | 598 [16.5, 20.8] | 2253 [20.2, 24.7] | 2088 [17.2, 21.7] | 1382 [16.8, 21.0] | 1266 [16.1, 20.4] | 1022 [15.9, 20.2] | 995 [16.0, 20.3]
mk-en | — [42.4, 46.3] | — [27.1, 30.8] | 1385 [28.8, 32.6] | 1079 [19.0, 22.2] | 1684 [18.5, 21.5] | 1635 [18.6, 21.6] | 1323 [19.1, 22.2] | 1271 [19.0, 22.0] | 1198 [19.1, 22.1] | 1079 [19.1, 22.1]
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
br-fr | — [0.1247, 0.1420] | — [0.1397, 0.1572] | 168 [0.1325, 0.1503] | 115 [0.1344, 0.1526] | 221 [0.1367, 0.1551] | 213 [0.1367, 0.1549] | 159 [0.1374, 0.1554] | 150 [0.1364, 0.1543] | 135 [0.1352, 0.1535] | 135 [0.1352, 0.1535]
en-es | — [0.2151, 0.2340] | — [0.2197, 0.2384] | 667 [0.2148, 0.2337] | 630 [0.2208, 0.2398] | 2881 [0.2217, 0.2405] | 2728 [0.2217, 0.2406] | 1683 [0.2217, 0.2407] | 1578 [0.2217, 0.2407] | 1242 [0.2217, 0.2407] | 1197 [0.2217, 0.2408]
eu-es | — [0.0873, 0.1038] | — [0.0921, 0.1093] | 697 [0.0870, 0.1030] | 598 [0.0972, 0.1149] | 2253 [0.0965, 0.1142] | 2088 [0.0971, 0.1147] | 1382 [0.0971, 0.1148] | 1266 [0.0971, 0.1148] | 1022 [0.0973, 0.1150] | 995 [0.0973, 0.1150]
mk-en | — [0.2300, 0.2511] | — [0.2976, 0.3230] | 1385 [0.2337, 0.2563] | 1079 [0.2829, 0.3064] | 1684 [0.2838, 0.3071] | 1635 [0.2834, 0.3067] | 1323 [0.2825, 0.3058] | 1271 [0.2827, 0.3059] | 1198 [0.2827, 0.3059] | 1079
Learning monolingually
Setup:
- SL side of the training corpus
- All possibilities translated and scored
- Absolute winners taken
- Rules are generated by counting n-grams in the same way as with the parallel corpus, except that no alignment is needed, since the scored output behaves like an annotated corpus (see the sketch after this list).
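A minimal sketch of this monolingual loop, under the assumptions above. All function and variable names here are hypothetical; the real pipeline goes through the apertium-lex-tools scripts listed under Processing.

```python
# Sketch: for every ambiguous SL word, translate each candidate, score the
# resulting TL sentence with a language model, keep the highest-scoring
# ("absolute winner") choice, then count n-gram contexts around the winners
# exactly as is done with the aligned parallel corpus.
from collections import Counter, defaultdict

def learn_rules(sl_sentences, candidates, translate, lm_score, c=2.0, n=1):
    """candidates(word) -> possible TL translations; translate(sent, i, t) ->
    the sentence translated with choice t; lm_score -> TL language-model score."""
    ctx_counts = defaultdict(Counter)   # (left, word, right) -> winner counts
    word_counts = defaultdict(Counter)  # word -> winner counts (for defaults)
    for sent in sl_sentences:
        for i, word in enumerate(sent):
            options = candidates(word)
            if len(options) < 2:
                continue
            # absolute winner: the translation giving the best-scoring TL sentence
            winner = max(options, key=lambda t: lm_score(translate(sent, i, t)))
            ctx = (tuple(sent[max(0, i - n):i]), word, tuple(sent[i + 1:i + 1 + n]))
            ctx_counts[ctx][winner] += 1
            word_counts[word][winner] += 1
    rules = []
    for ctx, winners in ctx_counts.items():
        word = ctx[1]
        default, _ = word_counts[word].most_common(1)[0]
        for t, k in winners.items():
            # keep a context rule only if t beats the default by the crispiness ratio c
            if t != default and k >= c * max(winners[default], 1):
                rules.append((ctx, t))
    return rules
```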
Out of domain
LER
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
en-es | [44.5, 52.0] | [34.7, 41.9] | [24.7, 31.9] | | [30.2, 37.9] | [30.2, 37.9] | [29.2, 37.0] | [29.3, 36.8] | [29.0, 36.4] | [29.1, 36.5]
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
en-es | [0.1885, 0.2133] | [0.1953, 0.2201] | [0.1832, 0.2067] | | [0.1806, 0.2042] | [0.1806, 0.2042] | [0.1808, 0.2043] | [0.1810, 0.2046] | [0.1809, 0.2045] | [0.1809, 0.2045]
In domain
LER
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
br-fr | — [58.9, 64.8] | — [44.2, 50.5] | 168 [54.8, 60.7] | 115 | 261 [53.5, 59.2] | 247 [52.1, 58.2] | 172 [54.3, 60.2] | 165 [52.7, 58.4] | 138 [50.5, 56.3] | 136 [50.6, 56.6]
en-es | — [21.0, 25.3] | — [15.1, 18.9] | 667 [20.7, 25.1] | ? ? | 2595 [15.0, 19.0] | 2436 [15.1, 19.1] | 1520 [13.7, 17.6] | 1402 [13.6, 17.3] | 1065 [13.9, 17.7] | 1024
eu-es | — [41.1, 46.6] | — [38.8, 44.2] | ? [47.8, 53.0] | ? | 2631 [40.9, 46.4] | 2427 [40.9, 46.5] | 1186 [40.7, 46.1] | 1025 [40.7, 46.2] | 685 [40.5, 45.9] | 641 [40.5, 45.9]
mk-en | — [42.4, 46.3] | — [27.1, 30.8] | 1385 [28.8, 32.6] | ? | 1698 [27.8, 31.5] | 1662 [27.8, 31.4] | 1321 [27.8, 31.4] | 1285 [27.8, 31.4] | 1186 [27.7, 31.4] | 1180 [27.7, 31.4]
BLEU
Pair | freq | tlm | ling | alig | rules (c>1.5) | rules (c>2.0) | rules (c>2.5) | rules (c>3.0) | rules (c>3.5) | rules (c>4.0)
---|---|---|---|---|---|---|---|---|---|---
br-fr | — [0.1247, 0.1420] | — [0.1397, 0.1572] | 168 [0.1325, 0.1503] | 115 | 261 [0.1250, 0.1425] | 247 [0.1252, 0.1429] | 172 [0.1240, 0.1412] | 165 [0.1243, 0.1416] | 138 [0.1255, 0.1429] | 136 [0.1255, 0.1429]
en-es | — [0.2151, 0.2340] | — [0.2197, 0.2384] | 667 [0.2148, 0.2337] | ? | 2595 [0.2180, 0.2371] | 2436 [0.2180, 0.2372] | 1520 [0.2190, 0.2380] | 1402 [0.2190, 0.2381] | 1065 [0.2189, 0.2380] | 1024 [0.2189, 0.2380]
eu-es | — [0.0873, 0.1038] | — [0.0921, 0.1093] | ? [0.0870, 0.1030] | ? | 2631 [0.0875, 0.1040] | 2427 [0.0878, 0.1042] | 1186 [0.0878, 0.1043] | 1025 [0.0878, 0.1043] | 685 [0.0879, 0.1043] | 641 [0.0879, 0.1043]
mk-en | — [0.2300, 0.2511] | — [0.2976, 0.3230] | 1385 [0.2567, 0.2798] | | 1698 [0.2694, 0.2930] | 1662 [0.2695, 0.2931] | 1321 [0.2696, 0.2935] | 1285 [0.2696, 0.2935] | 1186 [0.2696, 0.2934] | 1180 [0.2696, 0.2934]
Processing
Basque→Spanish
```
# Convert the TMX translation memories from UTF-16 to UTF-8
cat europako_testuak_memoria_2010.tmx | iconv -f utf-16 -t utf-8 > europako_testuak_memoria_2010.tmx.u8
cat 2010_memo_orokorra.tmx | iconv -f utf-16 -t utf-8 > 2010_memo_orokorra.tmx.u8

# Extract the segments and split into the Spanish and Basque sides
python3 process-tmx.py europako_testuak_memoria_2010.tmx.u8 > europako_testuak_memoria_2010.txt
python3 process-tmx.py 2010_memo_orokorra.tmx.u8 > 2010_memo_orokorra.txt
cat 2010_memo_orokorra.txt | grep '^es' | cut -f2- > 2010_memo_orokorra.es.txt
cat 2010_memo_orokorra.txt | grep '^eu' | cut -f2- > 2010_memo_orokorra.eu.txt
cat europako_testuak_memoria_2010.txt | grep '^es' | cut -f2- > europako_testuak_memoria_2010.es.txt
cat europako_testuak_memoria_2010.txt | grep '^eu' | cut -f2- > europako_testuak_memoria_2010.eu.txt

# Concatenate into a single parallel corpus
cat europako_testuak_memoria_2010.es.txt 2010_memo_orokorra.es.txt > opendata.es
cat europako_testuak_memoria_2010.eu.txt 2010_memo_orokorra.eu.txt > opendata.eu
$ wc -l opendata.e*
  782325 opendata.es
  782325 opendata.eu

# Clean (sentence length 1-40) and analyse both sides up to pretransfer
perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl opendata eu es opendata.clean 1 40
cat opendata.clean.eu | apertium-destxt | apertium -f none -d ~/source/apertium-eu-es/ eu-es-pretransfer > opendata.tagged.eu
cat opendata.clean.es | apertium-destxt | apertium -f none -d ~/source/apertium-eu-es/ es-eu-pretransfer > opendata.tagged.es &

# Number the lines and keep only those that were analysed (contain '<')
seq 1 771238 > opendata.lines
paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f1 > opendata.lines.new
paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f2 > opendata.tagged.eu.new
paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f3 > opendata.tagged.es.new
mv opendata.lines.new opendata.lines
mv opendata.tagged.es.new opendata.tagged.es
mv opendata.tagged.eu.new opendata.tagged.eu

# Bilingual lookup with and without RL restrictions; the noRL version
# (average 1.30 translations per word vs. 1.00) is the one kept
cat opendata.tagged.eu | lt-proc -b ~/source/apertium-eu-es/eu-es.autobil.bin > /tmp/eu-es.bil1
cat opendata.tagged.eu | lt-proc -b ~/source/apertium-eu-es/eu-es.autobil-noRL.bin > /tmp/eu-es.bil2
$ tail -n 1 /tmp/*.poly
==> /tmp/eu-es.bil1.poly <==
1.00240014637
==> /tmp/eu-es.bil2.poly <==
1.3015831681
mv /tmp/eu-es.bil2 opendata.biltrans.eu-es

# Tokenise the tagger and biltrans output for alignment
cat opendata.tagged.es | python /home/fran/source/apertium-lex-tools/scripts/process-tagger-output.py es > opendata.token.es
cat opendata.tagged.eu | python /home/fran/source/apertium-lex-tools/scripts/process-tagger-output.py eu > opendata.token.eu
cat opendata.biltrans.eu-es | python /home/fran/source/apertium-lex-tools/scripts/process-biltrans-output.py > opendata.token.eu-es &

# Word alignment with the Moses training script
nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl \
    -scripts-root-dir /home/fran/local/bin/scripts-20120109-1229/ -root-dir . \
    -corpus opendata.token -f eu -e es -alignment grow-diag-final-and \
    -reordering msd-bidirectional-fe -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 > log 2>&1 &

# Drop noisy lines (many unknown-word '*' marks, or many <sent> tokens) and trailing spaces
paste opendata.lines opendata.token.eu opendata.token.es | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f1 > opendata.lines.new &
paste opendata.lines opendata.token.eu opendata.token.es | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f2 > opendata.eu.new &
paste opendata.lines opendata.token.eu opendata.token.es | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f3 > opendata.es.new &
mv opendata.lines.new opendata.lines
mv opendata.es.new opendata.token.es
mv opendata.eu.new opendata.token.eu
paste opendata.lines opendata.token.eu opendata.token.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f1 > opendata.lines.new
paste opendata.lines opendata.token.eu opendata.token.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f2 > opendata.eu.new &
paste opendata.lines opendata.token.eu opendata.token.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f3 > opendata.es.new &
mv opendata.lines.new opendata.lines
mv opendata.es.new opendata.token.es
mv opendata.eu.new opendata.token.eu
cat opendata.token.es | sed 's/ *$//g' > opendata.token.es.new
cat opendata.token.eu | sed 's/ *$//g' > opendata.token.eu.new
mv opendata.token.es.new opendata.token.es
mv opendata.token.eu.new opendata.token.eu
```
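process-tmx.py is not reproduced on this page; judging from the grep '^es' | cut -f2- calls above, it appears to emit one lang<TAB>segment line per translation unit variant. A rough, assumed stand-in (it ignores inline markup inside the seg elements and is not the actual script):

```python
# Assumed re-implementation of process-tmx.py: print every <tuv> in the TMX
# as "lang<TAB>text", so the es/eu sides can be pulled out with grep and cut.
import sys
import xml.etree.ElementTree as ET

XMLLANG = '{http://www.w3.org/XML/1998/namespace}lang'

for _, elem in ET.iterparse(sys.argv[1]):
    if elem.tag != 'tuv':
        continue
    seg = elem.find('seg')
    if seg is None or seg.text is None:
        continue
    lang = (elem.get(XMLLANG) or elem.get('lang') or '').split('-')[0].lower()
    text = ' '.join(seg.text.split())   # collapse internal whitespace/newlines
    print('{}\t{}'.format(lang, text))
```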
English→Spanish
```
# Clean Europarl (sentence length 1-40) and analyse both sides up to pretransfer
perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl europarl en es europarl.clean 1 40
cat europarl.clean.en | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
cat europarl.clean.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &

# Keep only analysed lines (contain '<'), tracked by line number
paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f1 > europarl.lines.new
paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.en.new
paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.es.new

# Drop noisy lines (many <sent> tokens or many unknown-word '*' marks)
paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f1 > europarl.lines.new
bg
paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f2 > europarl.en.new &
paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f3 > europarl.es.new &

# Tokenise the tagger and biltrans output
nohup cat europarl.tagged.en | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py en > europarl.token.en &
nohup cat europarl.tagged.es | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py es > europarl.token.es &
nohup cat europarl.biltrans.en-es | python ~/source/apertium-lex-tools/scripts/process-biltrans-output.py > europarl.token.en-es &
```
Macedonian→English
Fix the mis-encoded letter ѓ (which had come through as a Latin f) on the Macedonian side with vim substitutions:

```
:%s/еfу/еѓу/g
:%s/аfа/аѓа/g
:%s/оfа/оѓа/g
:%s/уfе/уѓе/g
:%s/нfи/нѓи/g
:%s/Ѓиниfиќ/Ѓинѓиќ/g
:%s/еfе/еѓе/g
:%s/уfм/уѓм/g
:%s/рfи/рѓи/g
:%s/ fе / ѓе /g
:%s/рfе/рѓе/g
:%s/уfи/уѓи/g
:%s/ fу/ ѓу/g
:%s/Караfорѓевиќ/Караѓорѓевиќ/g
:%s/Холанfанец/Холанѓанец/g
:%s/реfаваат/реѓаваат/g
:%s/Швеfанката/Швеѓанката/g
:%s/Новозеланfани/Новозеланѓани/g
:%s/Мрfан/Мрѓан/g
:%s/Анfелка/Анѓелка/g
:%s/рfосаната/рѓосаната/g
:%s/оттуfуваоето/оттуѓуваоето/g
:%s/Енfел/Енѓел/g
:%s/Караfорѓевиќ/Караѓорѓевиќ/g
:%s/маfународната/маѓународната/g
:%s/Пеfа/Пеѓа/g
:%s/маfепсник/маѓепсник/g
:%s/Караfорѓе/Караѓорѓе/g
:%s/Граfевинар/Граѓевинар/g
:%s/Меfаши/Меѓаши/g
:%s/Ванfел/Ванѓел/g
:%s/Караfиќ/Караѓиќ/g
:%s/Анfели/Анѓели/g
:%s/саfи/саѓи/g
:%s/маfионичарски/маѓионичарски/g
:%s/Караfорѓевиќ/Караѓорѓевиќ/g
:%s/панаfур/панаѓур/g
:%s/Ѓерf/Ѓерѓ/g
:%s/Ѓинѓиf/Ѓинѓиѓ/g
```

Then filter, clean, tag and tokenise the corpus:

```
# Drop lines starting with '(' from either side, keeping the two files in sync
paste setimes.mk setimes.en | grep -v '^(' | cut -f1 > setimes.mk.new
paste setimes.mk setimes.en | grep -v '^(' | cut -f2 > setimes.en.new
paste setimes.en.new setimes.mk.new | grep -v '^(' | cut -f1 > setimes.en
paste setimes.en.new setimes.mk.new | grep -v '^(' | cut -f2 > setimes.mk

# Clean (sentence length 1-40) and analyse both sides up to pretransfer
perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl setimes mk en setimes.clean 1 40
cat setimes.clean.mk | apertium-destxt | apertium -f none -d ~/source/apertium-mk-en/ mk-en-pretransfer > setimes.tagged.mk &
cat setimes.clean.en | apertium-destxt | apertium -f none -d ~/source/apertium-mk-en/ en-mk-pretransfer > setimes.tagged.en &

# Number the lines and keep only those that were analysed (contain '<')
seq 1 190503 > setimes.lines
paste setimes.lines setimes.tagged.mk setimes.tagged.en | grep '<' | cut -f1 > setimes.lines.new
paste setimes.lines setimes.tagged.mk setimes.tagged.en | grep '<' | cut -f2 > setimes.mk.new
paste setimes.lines setimes.tagged.mk setimes.tagged.en | grep '<' | cut -f3 > setimes.en.new
mv setimes.en.new setimes.tagged.en
mv setimes.mk.new setimes.tagged.mk
mv setimes.lines.new setimes.lines

# Bilingual lookup and tokenisation
nohup cat setimes.tagged.mk | lt-proc -b ~/source/apertium-mk-en/mk-en.autobil.bin > setimes.biltrans.mk-en &
cat setimes.tagged.mk | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py mk > setimes.token.mk &
cat setimes.tagged.en | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py en > setimes.token.en &
```
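All of the :%s substitutions above repair the same mojibake: the Cyrillic letter ѓ had come through as a Latin f. A hypothetical scripted equivalent for the common case (f flanked by Cyrillic letters); the explicit word list is still needed for the word-initial and word-final cases it misses:

```python
# Hypothetical batch version of the vim substitutions above: restore "ѓ"
# wherever a Latin "f" sits between two Cyrillic letters.
import re
import sys

CYRILLIC = '[\u0400-\u04FF]'  # basic Cyrillic block

def fix_gje(line):
    return re.sub('({c})f(?={c})'.format(c=CYRILLIC), r'\1ѓ', line)

for line in sys.stdin:
    sys.stdout.write(fix_gje(line))
```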