Difference between revisions of "User:Francis Tyers/Experiments"

Latest revision as of 10:49, 22 November 2012

TODO

~~Do LER in/out domain testing for the en-es setup with news commentary.~~
~~Do BLEU in/out domain testing for the en-es setup with news commentary.~~
~~mk-en: why is TLM LER/BLEU so much better ?~~
- (partial) answer: 0-context rules (e.g. defaults) not applying properly. Fixed by running in series. This "solves" the LER issue.
- (partial) answer: preposition selection is much better. We could try running with ling-default preps.
Do pairwise bootstrap resampling for each of best baseline + best rules
- (done) for parallel
~~why do eu-es rules not improve over freq ?~~
- (partial) answer: some rules do not apply because of tag wankery. See line #129774 in the test corpus. Need to define better how tags work. Perhaps only include tags where ambiguous ?
~~why do breton numbers for monolingual rules not approach TLM ?~~
- because of crispiness being too low.
~~why when we add more data, do the results get worse ?~~
- because of crispiness being too low.
rerun the mk-en stuff with frac counts.
run br-fr test with huge data.
try decreasing the C with corpus size.
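
The pairwise bootstrap resampling item in the list above could be approached along the following lines. This is a minimal sketch, not the procedure actually used for these experiments: the function name `paired_bootstrap`, the per-sentence error lists and the number of samples are all assumptions made for illustration. Both systems are scored on the same resampled sentences, so comparing summed lexical-selection errors is equivalent to comparing LER on that sample.

```python
import random


def paired_bootstrap(errors_a, errors_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A makes fewer errors than B.

    errors_a, errors_b: per-sentence error counts for the same test sentences
    (hypothetical inputs; any per-sentence score where lower is better works).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(samples):
        # Draw n sentence indices with replacement and reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_a[i] for i in idx) < sum(errors_b[i] for i in idx):
            wins += 1
    return wins / samples


# Invented toy error counts for two systems on five sentences.
rules_errors = [0, 1, 0, 2, 0]
baseline_errors = [1, 1, 1, 2, 1]
print(paired_bootstrap(rules_errors, baseline_errors))
```

A value close to 1.0 would suggest that the first system is reliably better on this test set; the threshold for calling a difference significant is a separate choice.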

Corpus stats

Pair	Corpus	Lines	Words (src)	SL coverage	Extracted	Extracted (%)	Lines (train)	Lines (test)	Lines (dev)	Unique tokens with >1 translation	Avg. translations / word
br-fr	oab	57,305	702,328	94.47%	4,668	8.32	2,668	1,000	1,000	603	1.07
en-es	europarl	1,467,708	30,154,098	98.08%	312,162	22.18	310,162	1,000	1,000	2,082	1.08
eu-es	opendata.euskadi.net	765,115	10,190,079	91.70%	87,907	11.48	85,907	1,000	1,000	1,806	1.30
mk-en	setimes	190,493	4,259,338	92.17%	19,747	10.94	17,747	1,000	1,000	13,134	1.86
sh-mk	setimes

Evaluation corpus

Out of domain

Pair	Lines	Words (L1)	Words (L2)	Ambig. tokens	Ambig. types	Ambig. tokens/type	% ambig	Avg. translations/word
en-es	434	9,463	10,280	619	303	2.04	6.54%	-

In domain

Pair	Lines	Words (L1)	Words (L2)	Ambig. tokens	Ambig. types	Ambig. tokens/type	% ambig	Avg. translations/word
br-fr	1,000	13,854	13,878	1,163	372	3.13	8.39%	-
en-es	1,000	19,882	20,944	1,469	337	4.35	7.38%	-
eu-es	1,000	7,967	11,476	1,360	412	3.30	17.07%	-
mk-en	1,000	13,441	14,228	3,872	1,289	3.00	28.80%	-

% ambig = percentage of SL tokens with more than one translation (ambiguous tokens divided by words in L1)
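
As a quick check of how the "% ambig" column relates to the other columns, the br-fr in-domain row can be recomputed from its token counts. The function name `percent_ambiguous` is an assumption made for this sketch.

```python
def percent_ambiguous(ambiguous_tokens: int, source_words: int) -> float:
    """Share of source-language tokens that have more than one translation, in percent."""
    return 100.0 * ambiguous_tokens / source_words


# In-domain br-fr row: 1,163 ambiguous tokens out of 13,854 source words.
print(f"{percent_ambiguous(1163, 13854):.2f}%")
```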

EAMT-style results

Out of domain

LER

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
en-es	— [44.5, 52.0]	— [34.7, 41.9]	667 [24.7, 31.9]	630 [21.4, 28.4]	2881 [20.2, 27.2]	2728 [20.2, 27.2]	1683 [20.7, 27.6]	1578 [20.7, 27.6]	1242 [20.7, 27.6]	1197 [20.7, 27.6]

BLEU

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
en-es	[0.1885, 0.2133]	[0.1953, 0.2201]	[0.1832, 0.2067]	[0.1832, 0.2067]	[0.1831, 0.2067]	[0.1830, 0.2067]	[0.1828, 0.2063]	[0.1828, 0.2063]	[0.1828, 0.2063]

In domain

LER

$c$ is the "crispiness" ratio: the number of times an alternative translation is seen in a given context, divided by the number of times the default translation is seen in that context. So a $c$ of 2.0 means that the alternative translation appears twice as frequently as the default in that context.
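
The crispiness filter described above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the tables: the function names, the `(context, word, translation)` key layout and the toy counts are invented here, and treating an unseen default as an unbounded ratio is an arbitrary choice.

```python
def crispiness(counts, default, context, word, translation):
    """Count of an alternative translation divided by the default's count in one context."""
    alt_n = counts.get((context, word, translation), 0)
    default_n = counts.get((context, word, default), 0)
    if default_n == 0:
        # Assumption: a default never seen in this context gives an unbounded ratio.
        return float("inf")
    return alt_n / default_n


def select_rules(counts, defaults, threshold):
    """Keep the (context, word, translation) triples whose crispiness exceeds the threshold."""
    kept = []
    for context, word, translation in sorted(counts):
        if translation == defaults[word]:
            continue  # the default itself needs no rule
        c = crispiness(counts, defaults[word], context, word, translation)
        if c > threshold:
            kept.append((context, word, translation, c))
    return kept


# Invented toy counts: how often each translation of "bank" was seen per context.
counts = {
    (("river",), "bank", "orilla"): 6,
    (("river",), "bank", "banco"): 3,
    (("the",), "bank", "orilla"): 2,
    (("the",), "bank", "banco"): 8,
}
defaults = {"bank": "banco"}
print(select_rules(counts, defaults, threshold=1.5))
```

Raising the threshold keeps fewer, more confident rules, which matches the shrinking rule counts in the columns from c>1.5 to c>4.0.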

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
br-fr	— [58.9, 64.8]	— [44.2, 50.5]	168 [54.8, 60.7]	115 [28.5, 34.1]	221 [27.8, 33.3]	213 [27.6, 33.0]	159 [26.3, 31.8]	150 [26.1, 31.6]	135 [27.2, 32.8]	135 [27.2, 32.8]
en-es	— [21.0, 25.3]	— [15.1, 18.9]	667 [20.7, 25.1]	630 [7.2, 10.0]	2881 [5.9, 8.6]	2728 [6.0, 8.6]	1683 [5.7, 8.3]	1578 [5.7, 8.3]	1242 [6.0, 8.5]	1197 [5.9, 8.6]
eu-es	— [41.1, 46.6]	— [38.8, 44.2]	697 [47.8, 53.0]	598 [16.5, 20.8]	2253 [20.2, 24.7]	2088 [17.2, 21.7]	1382 [16.8, 21.0]	1266 [16.1, 20.4]	1022 [15.9, 20.2]	995 [16.0, 20.3]
mk-en	— [42.4, 46.3]	— [27.1, 30.8]	1385 [28.8, 32.6]	1079 [19.0, 22.2]	1684 [18.5, 21.5]	1635 [18.6, 21.6]	1323 [19.1, 22.2]	1271 [19.0, 22.0]	1198 [19.1, 22.1]	1079 [19.1, 22.1]

BLEU

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
br-fr	— [0.1247, 0.1420]	— [0.1397, 0.1572]	168 [0.1325, 0.1503]	115 [0.1344, 0.1526]	221 [0.1367, 0.1551]	213 [0.1367, 0.1549]	159 [0.1374, 0.1554]	150 [0.1364, 0.1543]	135 [0.1352, 0.1535]	135 [0.1352, 0.1535]
en-es	— [0.2151, 0.2340]	— [0.2197, 0.2384]	667 [0.2148, 0.2337]	630 [0.2208, 0.2398]	2881 [0.2217, 0.2405]	2728 [0.2217, 0.2406]	1683 [0.2217, 0.2407]	1578 [0.2217, 0.2407]	1242 [0.2217, 0.2407]	1197 [0.2217, 0.2408]
eu-es	— [0.0873, 0.1038]	— [0.0921, 0.1093]	697 [0.0870, 0.1030]	598 [0.0972, 0.1149]	2253 [0.0965, 0.1142]	2088 [0.0971, 0.1147]	1382 [0.0971, 0.1148]	1266 [0.0971, 0.1148]	1022 [0.0973, 0.1150]	995 [0.0973, 0.1150]
mk-en	— [0.2300, 0.2511]	— [0.2976, 0.3230]	1385 [0.2337, 0.2563]	1079 [0.2829, 0.3064]	1684 [0.2838, 0.3071]	1635 [0.2834, 0.3067]	1323 [0.2825, 0.3058]	1271 [0.2827, 0.3059]	1198 [0.2827, 0.3059]	1079

Learning monolingually (winner-takes-all)

Setup:

SL side of the training corpus
All possibilities translated and scored
Absolute winners taken
Rules generated by counting n-grams in the same way as with the parallel corpus; no alignment is needed, because the winning translations make the corpus work like an annotated corpus.
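
The setup above can be sketched as follows. This is a rough illustration only: the data layout (a context, a source word and a list of scored candidate translations), the function names and the reading of "absolute winner" as an untied top score are all assumptions, not details taken from the experiments.

```python
from collections import Counter


def absolute_winner(candidates):
    """Return the top-scoring translation, or None if there is no untied best score.

    candidates: list of (translation, score) pairs, higher score is better.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: ties produce no training example
    return ranked[0][0]


def count_winners(instances):
    """Count (context, word, translation) triples, one whole count per winner."""
    counts = Counter()
    for context, word, candidates in instances:
        winner = absolute_winner(candidates)
        if winner is not None:
            counts[(context, word, winner)] += 1
    return counts


# Invented toy instances: scores stand in for whatever scorer ranks the translations.
instances = [
    (("river",), "bank", [("orilla", -3.0), ("banco", -5.0)]),
    (("the",), "bank", [("orilla", -4.0), ("banco", -4.0)]),
]
print(count_winners(instances))
```

The resulting counts have the same shape as counts taken from an aligned parallel corpus, so the same crispiness filtering can be applied to them.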

Out of domain

LER

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
en-es	[44.5, 52.0]	[34.7, 41.9]	[24.7, 31.9]		[30.2, 37.9]	[30.2, 37.9]	[29.2, 37.0]	[29.3, 36.8]	[29.0, 36.4]	[29.1, 36.5]

BLEU

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
en-es	[0.1885, 0.2133]	[0.1953, 0.2201]	[0.1832, 0.2067]		[0.1806, 0.2042]	[0.1806, 0.2042]	[0.1808, 0.2043]	[0.1810, 0.2046]	[0.1809, 0.2045]	[0.1809, 0.2045]

In domain

LER

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
br-fr	— [58.9, 64.8]	— [44.2, 50.5]	168 [54.8, 60.7]	115	261 [53.5, 59.2]	247 [52.1, 58.2]	172 [54.3, 60.2]	165 [52.7, 58.4]	138 [50.5, 56.3]	136 [50.6, 56.6]
en-es	— [21.0, 25.3]	— [15.1, 18.9]	667 [20.7, 25.1]	? ?	2595 [15.0, 19.0]	2436 [15.1, 19.1]	1520 [13.7, 17.6]	1402 [13.6, 17.3]	1065 [13.9, 17.7]	1024 [13.9, 17.8]
eu-es	— [41.1, 46.6]	— [38.8, 44.2]	? [47.8, 53.0]	?	2631 [40.9, 46.4]	2427 [40.9, 46.5]	1186 [40.7, 46.1]	1025 [40.7, 46.2]	685 [40.5, 45.9]	641 [40.5, 45.9]
mk-en	— [42.4, 46.3]	— [27.1, 30.8]	1385 [28.8, 32.6]	?	1698 [27.8, 31.5]	1662 [27.8, 31.4]	1321 [27.8, 31.4]	1285 [27.8, 31.4]	1186 [27.7, 31.4]	1180 [27.7, 31.4]

BLEU

Pair	freq	tlm	ling	alig	rules (c>1.5)	rules (c>2.0)	rules (c>2.5)	rules (c>3.0)	rules (c>3.5)	rules (c>4.0)
br-fr	— [0.1247, 0.1420]	— [0.1397, 0.1572]	168 [0.1325, 0.1503]	115	261 [0.1250, 0.1425]	247 [0.1252, 0.1429]	172 [0.1240, 0.1412]	165 [0.1243, 0.1416]	138 [0.1255, 0.1429]	136 [0.1255, 0.1429]
en-es	— [0.2151, 0.2340]	— [0.2197, 0.2384]	667 [0.2148, 0.2337]	?	2595 [0.2180, 0.2371]	2436 [0.2180, 0.2372]	1520 [0.2190, 0.2380]	1402 [0.2190, 0.2381]	1065 [0.2189, 0.2380]	1024 [0.2189, 0.2380]
eu-es	— [0.0873, 0.1038]	— [0.0921, 0.1093]	? [0.0870, 0.1030]	?	2631 [0.0875, 0.1040]	2427 [0.0878, 0.1042]	1186 [0.0878, 0.1043]	1025 [0.0878, 0.1043]	685 [0.0879, 0.1043]	641 [0.0879, 0.1043]
mk-en	— [0.2300, 0.2511]	— [0.2976, 0.3230]	1385 [0.2567, 0.2798]		1698 [0.2694, 0.2930]	1662 [0.2695, 0.2931]	1321 [0.2696, 0.2935]	1285 [0.2696, 0.2935]	1186 [0.2696, 0.2934]	1180 [0.2696, 0.2934]

Learning monolingually (fractional counts)

Setup:

SL side of the training corpus
All possibilities translated and scored
Probabilities normalised into fractional counts (i.e. sum the probabilities of all candidates to get a total, then divide each probability by that total).
- Log probabilities are first converted into ordinary probabilities using exp10().
Rules generated by counting fractions from the translated file.
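
The normalisation step above can be sketched as follows. This is a minimal illustration, assuming base-10 log probabilities (as the mention of exp10() suggests) and an invented data layout; Python has no `exp10`, so `10.0 ** x` is used in its place. Subtracting the largest log probability first leaves the ratios unchanged and avoids underflow on long sentences.

```python
from collections import defaultdict


def fractional_counts(candidates):
    """Map each translation to its share of the total probability mass.

    candidates: list of (translation, log10_probability) pairs with distinct translations.
    """
    top = max(logp for _, logp in candidates)
    probs = [(t, 10.0 ** (logp - top)) for t, logp in candidates]
    total = sum(p for _, p in probs)
    return {t: p / total for t, p in probs}


def accumulate(instances):
    """Sum fractional counts per (context, word, translation) over all instances."""
    counts = defaultdict(float)
    for context, word, candidates in instances:
        for translation, fraction in fractional_counts(candidates).items():
            counts[(context, word, translation)] += fraction
    return dict(counts)


# Invented toy instance: two candidate translations with log10 probabilities.
instances = [(("river",), "bank", [("orilla", -1.0), ("banco", -2.0)])]
print(accumulate(instances))
```

Unlike the winner-takes-all setup, every candidate contributes to the counts here, in proportion to its probability.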

In domain

LER

BLEU

Out of domain

LER

BLEU

MaxEnt

With alignments

Pair	alig	rule-best	ME (>5)	ME (>3)
br-fr	33.4	31.5	31.8	29.9
mk-en	19.9	19.8	18.9	17.8
eu-es	18.5	17.9	17.4	19.9
en-es	8.6	7.0	6.3	6.3

With fractional counts

Pair	alig	rule-best	ME (>5)	ME (>3)	ME (>1)	ME (>0)
br-fr	43.4	43.1	61.9	46.2	48.2	49.9
mk-en	29.5
eu-es	41.2		43.9	44.4
en-es	11.9	11.7	11.4	11.9

Notes

Experiments in Domain Adaptation for Statistical Machine Translation
Semi-supervised model adaptation for statistical machine translation
Domain Adaptation for Statistical Machine Translation with Monolingual Resources
"We found that the largest gain (25% relative) is achieved when in-domain data are available for the target language. A smaller performance improvement is still observed (5% relative) if source adaptation data are available. We also observed that the most important role is played by the LM adaptation, while the adaptation of the TM and RM gives consistent but small improvement."

@@ Line 1: / Line 1: @@
 {{TOCD}}
+==TODO==
+* <s>Do LER in/out domain testing for the en-es setup with news commentary.</s>
+* <s>Do BLEU in/out domain testing for the en-es setup with news commentary.</s>
+* <s>mk-en: why is TLM LER/BLEU so much better ?</s>
+** (partial) answer: 0-context rules (e.g. defaults) not applying properly. Fixed by running in series. This "solves" the LER issue.
+** (partial) answer: preposition selection is much better. We could try running with ling-default preps.
+* Do pairwise bootstrap resampling for each of best baseline + best rules
+** (done) for parallel
+* <s>why do eu-es rules not improve over freq ?</s>
+** (partial) answer: some rules do not apply because of tag wankery. See line #129774 in the test corpus. Need to define better how tags work. Perhaps only include tags where ambiguous ?
+* <s>why do breton numbers for monolingual rules not approach TLM ? </s>
+** because of crispiness being too low.
+* <s>why when we add more data, do the results get worse ? </s>
+** because of crispiness being too low.
+* rerun the mk-en stuff with frac counts.
+* run br-fr test with huge data.
+* try decreasing the C with corpus size.
+==Corpus stats==
 {|class=wikitable
-! Language pair !! Corpus               !! Lines     !! W. (src)   !! SL cov. ||  L. (train) !! L. (test) !! Uniq. tokens >1 trad. !! Avg. trad / word !!
+! Pair !! Corpus               !! Lines     !! W. (src)   !! SL cov. || Extracted || Extracted (%) || L. (train) !! L. (test) !! L (dev) !! Uniq. tokens >1 trad. !! Avg. trad / word !!
+|-
+| br-fr         || oab                  || 57,305    || 702,328    || 94.47% || 4,668  || 8.32 || 2,668  || 1,000 || 1,000  || 603    || 1.07
 |-
-| br-fr         || oab                  ||           ||            ||        ||   ||   ||     ||
+| en-es         || europarl             || 1,467,708 || 30,154,098 || 98.08% || 312,162  || 22.18 || 310,162 || 1,000|| 1,000  || 2,082  || 1.08
 |-
-| en-es         || europarl             || 1,467,708 || 30,154,098 || 98.08% || - || - || 2,082 || 1.08
+| eu-es         || opendata.euskadi.net || 765,115   || 10,190,079 || 91.70% || 87,907 || 11.48 || 85,907 || 1,000|| 1,000  || 1,806  || 1.30
 |-
-| eu-es         || opendata.euskadi.net || 765,115   || 10,190,079 || 91.70% || - || - || 1,806 || 1.30
+| mk-en         || setimes              || 190,493   || 4,259,338  || 92.17% || 19,747  || 10.94 || 17,747  || 1,000|| 1,000   || 13,134 || 1.86
 |-
-| mk-en         || setimes              ||           ||            ||        ||   ||   ||     ||
+| sh-mk         || setimes              ||           ||            ||        ||   || ||   ||   ||     || ||
 |-
-| sh-mk         || setimes              ||           ||            ||        ||   ||   ||     ||
 |}
+===Evaluation corpus===
-==Processing==
-===Basque→Spanish===
+====Out of domain====
-<pre>
-cat europako_testuak_memoria_2010.tmx | iconv -f utf-16 -t utf-8 > europako_testuak_memoria_2010.tmx.u8
-cat 2010_memo_orokorra.tmx | iconv -f utf-16 -t utf-8 > 2010_memo_orokorra.tmx.u8
-python3 process-tmx.py europako_testuak_memoria_2010.tmx.u8 > europako_testuak_memoria_2010.txt
-python3 process-tmx.py 2010_memo_orokorra.tmx.u8 > 2010_memo_orokorra.txt
-cat 2010_memo_orokorra.txt | grep '^es' | cut -f2- > 2010_memo_orokorra.es.txt
-cat 2010_memo_orokorra.txt | grep '^eu' | cut -f2- > 2010_memo_orokorra.eu.txt
-cat europako_testuak_memoria_2010.txt | grep '^es' | cut -f2- > europako_testuak_memoria_2010.es.txt
-cat europako_testuak_memoria_2010.txt | grep '^eu' | cut -f2- > europako_testuak_memoria_2010.eu.txt
-cat europako_testuak_memoria_2010.es.txt 2010_memo_orokorra.es.txt > opendata.es
-cat europako_testuak_memoria_2010.eu.txt 2010_memo_orokorra.eu.txt > opendata.eu
+{|class=wikitable
+! Pair    || Lines || Words (L1) || Words (L2) || Ambig. tokens || Ambig. types || Ambig token/type || % ambig || Av. trad/word
+|-
+| en-es   || 434   || 9,463      ||  10,280    || 619           || 303          ||  2.04            || 6.54%   || -
+|-
+|}
+====In domain====
-$ wc -l opendata.e*
-opendata.es
-opendata.eu
+{|class=wikitable
+! Pair    || Lines || Words (L1) || Words (L2) || Ambig. tokens || Ambig. types || Ambig token/type || % ambig || Av. trad/word
+|-
+| br-fr   || 1,000 || 13,854     || 13,878     || 1,163         || 372          || 3.13             || 8.39% || -
+|-
+| en-es   || 1,000 || 19,882     || 20,944     || 1,469         || 337          || 4.35             || 7.38% || -
+|-
+| eu-es   || 1,000 || 7,967      || 11,476     || 1,360         || 412          || 3.30             || 17.07% || -
+|-
+| mk-en   || 1,000 || 13,441     || 14,228     || 3,872         || 1,289        || 3.00             || 28.80% || -
+|-
+|}
+* % ambig = percentage of SL tokens with more than one translation (ambiguous tokens divided by words in L1)
-perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl opendata eu es opendata.clean 1 80
+==EAMT-style results==
-cat opendata.clean.eu |apertium-destxt | apertium -f none -d ~/source/apertium-eu-es/ eu-es-pretransfer > opendata.tagged.eu
-cat opendata.clean.es |apertium-destxt | apertium -f none -d ~/source/apertium-eu-es/ es-eu-pretransfer > opendata.tagged.es &
+===Out of domain===
+====LER====
-seq 1 771238 > opendata.lines
-paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f1 > opendata.lines.new
-paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f2 > opendata.tagged.eu.new
-paste opendata.lines opendata.tagged.eu opendata.tagged.es | grep '<' | cut -f3 > opendata.tagged.es.new
+{|class=wikitable style="text-align: center"
-mv opendata.lines.new opendata.lines
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
-mv opendata.tagged.es.new opendata.tagged.es
+|-
-mv opendata.tagged.eu.new opendata.tagged.eu
+| en-es   || <small>&mdash;</small><br/>[44.5, 52.0] || <small>&mdash;</small><br/>[34.7, 41.9] || <small>667</small><br/>[24.7, 31.9] || <small>630</small><br/>'''[ 21.4 , 28.4 ]''' || <small>2881</small><br/>'''[20.2, 27.2]''' || <small>2728</small><br/>'''[20.2, 27.2]''' || <small>1683</small><br/>[20.7, 27.6] ||  <small>1578</small><br/>[20.7, 27.6] || <small>1242</small><br/>[20.7, 27.6] || <small>1197</small><br/>[20.7, 27.6]
+|-
+|}
+====BLEU====
-cat opendata.tagged.eu | lt-proc -b ~/source/apertium-eu-es/eu-es.autobil.bin  >/tmp/eu-es.bil1
+<small>
-cat opendata.tagged.eu | lt-proc -b ~/source/apertium-eu-es/eu-es.autobil-noRL.bin  >/tmp/eu-es.bil2
+{|class=wikitable style="text-align: center"
+! Pair    !!  freq              !! tlm                 !! ling                    !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
+|-
+| en-es   ||  [0.1885, 0.2133]   || [0.1953, 0.2201]   ||    [0.1832, 0.2067]    ||  [0.1832, 0.2067]  ||   [0.1831, 0.2067]  || [0.1830, 0.2067] || [0.1828, 0.2063]     ||[0.1828, 0.2063] ||[0.1828, 0.2063] ||
+|-
+|}
+</small>
+===In domain===
-$ tail -n 1 /tmp/*.poly
-==> /tmp/eu-es.bil1.poly <==
-.00240014637
+====LER====
-==> /tmp/eu-es.bil2.poly <==
-.3015831681
+<math>c</math> is the "crispiness" ratio, the amount of times an alternative translation is seen in a given context compared to the default translation. So, a <math>c</math> of 2.0 means that the translation appears twice as frequently as the default.
-mv /tmp/eu-es.bil2 opendata.biltrans.eu-es
+{|class=wikitable style="text-align: center"
-cat opendata.tagged.es | python /home/fran/source/apertium-lex-tools/scripts/process-tagger-output.py es > opendata.token.es
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
-cat opendata.tagged.eu |  python /home/fran/source/apertium-lex-tools/scripts/process-tagger-output.py eu > opendata.token.eu
+|-
-cat opendata.biltrans.eu-es | python /home/fran/source/apertium-lex-tools/scripts/process-biltrans-output.py > opendata.token.eu-es &
+| br-fr   ||  <small>&mdash;</small><br/>[58.9, 64.8]     ||   <small>&mdash;</small><br/>[44.2, 50.5]   || <small>168</small><br/>[54.8, 60.7] || <small>115</small><br/>[28.5, 34.1]  || <small>221</small><br/>[27.8, 33.3]  ||align="center"| <small>213</small><br/>[27.6, 33.0] || <small>159</small><br/>[26.3, 31.8]  || <small>150</small><br/>'''[26.1, 31.6]'''   || <small>135</small><br/>[27.2, 32.8]    || <small>135</small><br/> [27.2, 32.8]
+|-
-$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
+| en-es   ||  <small>&mdash;</small><br/>[21.0, 25.3]     ||   <small>&mdash;</small><br/>[15.1, 18.9]   || <small>667</small><br/>[20.7, 25.1] || <small>630</small><br/>[7.2, 10.0]  ||   <small>2881</small><br/>[5.9, 8.6]     ||  <small>2728</small><br/>[6.0, 8.6]        ||   <small>1683</small><br/>'''[5.7, 8.3]'''     ||  <small>1578</small><br/>'''[5.7, 8.3]'''        ||  <small>1242</small><br/>[6.0, 8.5]   || <small>1197</small><br/>[5.9, 8.6]
- /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus opendata.token -f eu -e es -alignment grow-diag-final-and \
+|-
- -reordering msd-bidirectional-fe  -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &
+| eu-es   ||   <small>&mdash;</small><br/>[41.1, 46.6]    ||   <small>&mdash;</small><br/>[38.8, 44.2]   || <small>697</small><br/>[47.8, 53.0] || <small>598</small><br/>[16.5, 20.8]   ||   <small>2253</small><br/>[20.2, 24.7]     ||   <small>2088</small><br/>[17.2, 21.7]     || <small>1382</small><br/>[16.8, 21.0]      || <small>1266</small><br/>[16.1, 20.4]        || <small>1022</small><br/>'''[15.9, 20.2]'''  ||<small>995</small><br/>[16.0, 20.3]
+|-
+| mk-en   ||   <small>&mdash;</small><br/>[42.4, 46.3]      ||  <small>&mdash;</small><br/>[27.1, 30.8]    || <small>1385</small><br/>[28.8, 32.6] || <small>1079</small><br/>[19.0, 22.2] || <small>1684</small><br/>'''[18.5, 21.5]'''   || <small>1635</small><br/>[18.6, 21.6]  ||   <small>1323</small><br/>[19.1, 22.2]     ||    <small>1271</small><br/>[19.0, 22.0]    ||  <small>1198</small><br/>[19.1, 22.1]  ||  <small>1079</small><br/> [19.1, 22.1]
+|-
+|}
+====BLEU====
-paste opendata.lines opendata.token.eu opendata.token.es  | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f1 > opendata.lines.new&
-paste opendata.lines opendata.token.eu opendata.token.es  | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f2 > opendata.eu.new &
-paste opendata.lines opendata.token.eu opendata.token.es  | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f3 > opendata.es.new &
+<small>
+{|class=wikitable style="text-align: center"
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
+|-
+| br-fr   ||  <small>&mdash;</small><br/>[0.1247, 0.1420]     ||   <small>&mdash;</small><br/><b>[0.1397, 0.1572]</b>   || <small>168</small><br/>[0.1325, 0.1503] || <small>115</small><br/>[0.1344, 0.1526]  || <small>221</small><br/>[0.1367, 0.1551]  ||align="center"| <small>213</small><br/>[0.1367, 0.1549] || <small>159</small><br/><b>[0.1374, 0.1554]</b>  || <small>150</small><br/>[0.1364, 0.1543] || <small>135</small><br/>[0.1352, 0.1535]   || <small>135</small><br/>[0.1352, 0.1535]
+|-
+| en-es   || <small>&mdash;</small><br/>[0.2151, 0.2340]    ||  <small>&mdash;</small><br/>[0.2197, 0.2384]    || <small>667</small><br/>[0.2148, 0.2337] || <small>630</small><br/>[0.2208, 0.2398] || <small>2881</small><br/>[0.2217, 0.2405] || <small>2728</small><br/>[0.2217, 0.2406] || <small>1683</small><br/><b>[0.2217, 0.2407]</b> || <small>1578</small><br/><b>[0.2217, 0.2407]</b> || <small>1242</small><br/>[0.2217, 0.2407] || <small>1197</small><br/>[0.2217, 0.2408]
+|-
+| eu-es || <small>&mdash;</small><br/>[0.0873, 0.1038] || <small>&mdash;</small><br/>[0.0921, 0.1093] || <small>697</small><br/>[0.0870, 0.1030] || <small>598</small><br/>[0.0972, 0.1149] ||  <small>2253</small><br/>[0.0965, 0.1142] ||  <small>2088</small><br/>[0.0971, 0.1147] || <small>1382</small><br/>[0.0971, 0.1148] || <small>1266</small><br/>[0.0971, 0.1148] || <small>1022</small><br/>'''[0.0973, 0.1150]''' || <small>995</small><br/>'''[0.0973, 0.1150]'''
+|-
+| mk-en   ||   <small>&mdash;</small><br/>[0.2300, 0.2511]    ||  <small>&mdash;</small><br/>'''[0.2976, 0.3230]'''    || <small>1385</small><br/>[0.2337, 0.2563] || <small>1079</small><br/>[0.2829, 0.3064] || <small>1684</small><br/>'''[0.2838, 0.3071]'''  || <small>1635</small><br/>[0.2834, 0.3067] ||   <small>1323</small><br/> [0.2825, 0.3058]   ||    <small>1271</small><br/>[0.2827, 0.3059]    ||  <small>1198</small><br/>[0.2827, 0.3059]  ||  <small>1079</small><br/>
+|-
+|}
+</small>
+==Learning monolingually (winner-takes-all)==
-mv opendata.lines.new opendata.lines
-mv opendata.es.new opendata.token.es
-mv opendata.eu.new opendata.token.eu
+Setup:
+* SL side of the training corpus
-paste opendata.lines opendata.token.eu opendata.token.es  | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f1 > opendata.lines.new
+* All possibilities translated and scored
-paste opendata.lines opendata.token.eu opendata.token.es  | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f2 > opendata.eu.new &
+* Absolute winners taken
-paste opendata.lines opendata.token.eu opendata.token.es  | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | cut -f3 > opendata.es.new &
+* Rules generated by counting ngrams in the same way as with the parallel corpus, only no alignment needed as it works like an annotated corpus.
-mv opendata.lines.new opendata.lines
-mv opendata.es.new opendata.token.es
-mv opendata.eu.new opendata.token.eu
+===Out of domain===
-cat opendata.token.es | sed 's/ *$//g' > opendata.token.es.new
-cat opendata.token.eu | sed 's/ *$//g' > opendata.token.eu.new
-mv opendata.token.es.new opendata.token.es
-mv opendata.token.eu.new opendata.token.eu
+====LER====
+{|class=wikitable style="text-align: center"
-</pre>
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
+|-
+| en-es   ||  [44.5, 52.0]  ||  [34.7, 41.9]    || [24.7, 31.9]           ||               ||  [30.2, 37.9] || [30.2, 37.9] || [29.2, 37.0] || [29.3, 36.8] || '''[29.0, 36.4]''' || [29.1, 36.5]
+|-
+|}
-===English→Spanish===
+====BLEU====
-<pre>
+<small>
+{|class=wikitable style="text-align: center"
-cat europarl.clean.en | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
-cat europarl.clean.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &
+|-
+| en-es   ||  [0.1885, 0.2133]   || [0.1953, 0.2201]   ||    [0.1832, 0.2067]       ||               ||  [0.1806, 0.2042] ||  [0.1806, 0.2042] || [0.1808, 0.2043]  || [0.1810, 0.2046]  || [0.1809, 0.2045]  ||  [0.1809, 0.2045]
+|-
+|}
+</small>
+===In domain===
-paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f1 > europarl.lines.new
-paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.en.new
-paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.es.new
+====LER====
-paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f1 >europarl.lines.new
-bg
-paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f2 >europarl.en.new&
-paste europarl.lines europarl.tagged.en europarl.tagged.es | grep -v '<sent>.*<sent>.*<sent>.*<sent>.*<sent>.*<sent>' | grep -v '\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*.*\*' | cut -f3 >europarl.es.new&
+{|class=wikitable style="text-align: center"
-nohup cat europarl.tagged.en | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py en > europarl.token.en &
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
-nohup cat europarl.tagged.es | python ~/source/apertium-lex-tools/scripts/process-tagger-output.py es > europarl.token.es &
+|-
-nohup cat europarl.biltrans.en-es | python ~/source/apertium-lex-tools/scripts/process-biltrans-output.py > europarl.token.en-es &
+| br-fr   ||  <small>&mdash;</small><br/>[58.9, 64.8]     ||   <small>&mdash;</small><br/>'''[44.2, 50.5]'''    || <small>168</small><br/>[54.8, 60.7]  || <small>115</small><br/>  || <small>261</small><br/>[53.5, 59.2]    ||align="center"| <small>247</small><br/>[52.1, 58.2] || <small>172</small><br/>[54.3, 60.2]   || <small>165</small><br/>[52.7, 58.4]  || <small>138</small><br/>'''[50.5, 56.3]'''   || <small>136</small><br/>[50.6, 56.6]
+|-
+| en-es   || <small>&mdash;</small><br/>[21.0, 25.3] || <small>&mdash;</small><br/>[15.1, 18.9] || <small>667</small><br/>[20.7, 25.1] || <small>?</small><br/>? || <small>2595</small><br/>[15.0, 19.0] || <small>2436</small><br/>[15.1, 19.1] || <small>1520</small><br/>[13.7, 17.6] || <small>1402</small><br/>'''[13.6, 17.3]''' || <small>1065</small><br/>[13.9, 17.7]   || <small>1024</small><br/>[13.9, 17.8]
+|-
+| eu-es   || <small>&mdash;</small><br/>[41.1, 46.6] || <small>&mdash;</small><br/>'''[38.8, 44.2]''' || <small>?</small><br/>[47.8, 53.0]  || <small>?</small><br/> || <small>2631</small><br/>[40.9, 46.4]  || <small>2427</small><br/>[40.9, 46.5]  || <small>1186</small><br/>[40.7, 46.1]  || <small>1025</small><br/>[40.7, 46.2]  || <small>685</small><br/>'''[40.5, 45.9]'''  || <small>641</small><br/>'''[40.5, 45.9]'''
+|-
+| mk-en   || <small>&mdash;</small><br/>[42.4, 46.3] || <small>&mdash;</small><br/>'''[27.1, 30.8]'''  || <small>1385</small><br/>[28.8, 32.6]  ||  <small>?</small><br/><!--[27.2, 30.8]--> ||  <small>1698</small><br/>[27.8, 31.5] ||  <small>1662</small><br/>[27.8, 31.4]  ||  <small>1321</small><br/>[27.8, 31.4] ||  <small>1285</small><br/>[27.8, 31.4] ||  <small>1186</small><br/>'''[27.7, 31.4]''' ||  <small>1180</small><br/>'''[27.7, 31.4]'''
+|-
+|}
+====BLEU====
-</pre>
+<small>
+{|class=wikitable style="text-align: center"
+! Pair    !!  freq !! tlm  !! ling         !! alig          !! rules<br/>(c>1.5) !! rules<br/>(c>2.0) !! rules<br/>(c>2.5) !! rules<br/>(c>3.0) !! rules<br/>(c>3.5) !! rules<br/>(c>4.0)
+|-
+| br-fr   ||  <small>&mdash;</small><br/>[0.1247, 0.1420]     ||   <small>&mdash;</small><br/><b>[0.1397, 0.1572]</b>   || <small>168</small><br/>[0.1325, 0.1503]   || <small>115</small><br/> || <small>261</small><br/>[0.1250, 0.1425] ||align="center"| <small>247</small><br/>[0.1252, 0.1429] || <small>172</small><br/>[0.1240, 0.1412]  || <small>165</small><br/>[0.1243, 0.1416] || <small>138</small><br/><b>[0.1255, 0.1429]</b>   || <small>136</small><br/><b>[0.1255, 0.1429]</b>
+|-
+| en-es   || <small>&mdash;</small><br/>[0.2151, 0.2340] || <small>&mdash;</small><br/><b>[0.2197, 0.2384]</b> || <small>667</small><br/>[0.2148, 0.2337]  || <small>?</small><br/> || <small>2595</small><br/>[0.2180, 0.2371] || <small>2436</small><br/>[0.2180, 0.2372] || <small>1520</small><br/>[0.2190, 0.2380] || <small>1402</small><br/><b>[0.2190, 0.2381]</b> || <small>1065</small><br/>[0.2189, 0.2380] || <small>1024</small><br/>[0.2189, 0.2380]
+|-
+| eu-es   || <small>&mdash;</small><br/>[0.0873, 0.1038] || <small>&mdash;</small><br/>'''[0.0921, 0.1093]''' || <small>?</small><br/>[0.0870, 0.1030]  || <small>?</small><br/> || <small>2631</small><br/>[0.0875, 0.1040]  || <small>2427</small><br/>[0.0878, 0.1042]  || <small>1186</small><br/>[0.0878, 0.1043]  || <small>1025</small><br/>[0.0878, 0.1043]  || <small>685</small><br/>'''[0.0879, 0.1043]'''  || <small>641</small><br/>'''[0.0879, 0.1043]'''
+|-
+| mk-en   || <small>&mdash;</small><br/>[0.2300, 0.2511]  || <small>&mdash;</small><br/>'''[0.2976, 0.3230]'''  || <small>1385</small><br/>[0.2567, 0.2798]  || ||  <small>1698</small><br/>[0.2694, 0.2930] ||  <small>1662</small><br/>[0.2695, 0.2931] ||  <small>1321</small><br/>'''[0.2696, 0.2935]''' ||  <small>1285</small><br/>'''[0.2696, 0.2935]'''  ||  <small>1186</small><br/>[0.2696, 0.2934]  ||  <small>1180</small><br/> [0.2696, 0.2934]
+|-
+|}
-===Macedonian→English===
-<pre>
+</small>
+==Learning monolingually (fractional counts)==
+Setup:
+* SL side of the training corpus
-:%s/еfу/еѓу/g
+* All possibilities translated and scored
-:%s/аfа/аѓа/g
+* Probabilities normalised into fractional counts (e.g. add them up to get a total, then divide each prob by the total).
-:%s/оfа/оѓа/g
+** log prob converted into normal prob using exp10()
-:%s/уfе/уѓе/g
+* Rules generated by counting fractions from the translated file.
-:%s/нfи/нѓи/g
-:%s/Ѓиниfиќ/Ѓинѓиќ/g
-:%s/еfе/еѓе/g
-:%s/уfм/уѓм/g
-:%s/рfи/рѓи/g
-:%s/ fе / ѓе /g
-:%s/рfе/рѓе/g
-:%s/уfи/уѓи/g
-:%s/ fу/ ѓу/g
-:%s/Караfорѓевиќ/Караѓорѓевиќ/g
-:%s/Холанfанец/Холанѓанец/g
-:%s/реfаваат/реѓаваат/g
-:%s/Швеfанката/Швеѓанката/g
-:%s/Новозеланfани/Новозеланѓани/g
-:%s/Мрfан/Мрѓан/g
-:%s/Анfелка/Анѓелка/g
-:%s/рfосаната/рѓосаната/g
-:%s/оттуfуваоето/оттуѓуваоето/g
-:%s/Енfел/Енѓел/g
-:%s/Караfорѓевиќ/Караѓорѓевиќ/g
-:%s/маfународната/маѓународната/g
-:%s/Пеfа/Пеѓа/g
-:%s/маfепсник/маѓепсник/g
-:%s/Караfорѓе/Караѓорѓе/g
-:%s/Граfевинар/Граѓевинар/g
-:%s/Меfаши/Меѓаши/g
-:%s/Ванfел/Ванѓел/g
-:%s/Караfиќ/Караѓиќ/g
-:%s/Анfели/Анѓели/g
-:%s/саfи/саѓи/g
-:%s/маfионичарски/маѓионичарски/g
-:%s/Караfорѓевиќ/Караѓорѓевиќ/g
-:%s/панаfур/панаѓур/g
-:%s/Ѓерf/Ѓерѓ/g
-:%s/Ѓинѓиf/Ѓинѓиѓ/g
+===In domain===
+====LER====
-paste setimes.mk setimes.en| grep -v '^(' | cut -f1 > setimes.mk.new
-paste setimes.mk setimes.en| grep -v '^(' | cut -f2 > setimes.en.new
-paste setimes.en.new setimes.mk.new | grep -v '^(' | cut -f1 > setimes.en
-paste setimes.en.new setimes.mk.new | grep -v '^(' | cut -f2 > setimes.mk
+====BLEU====
-perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl setimes mk en setimes.clean 1 40
+===Out of domain===
+====LER====
-cat setimes.clean.mk | apertium-destxt | apertium -f none -d ~/source/apertium-mk-en/ mk-en-pretransfer > setimes.tagged.mk&
-cat setimes.clean.en | apertium-destxt | apertium -f none -d ~/source/apertium-mk-en/ en-mk-pretransfer > setimes.tagged.en&
+====BLEU====
+==MaxEnt==
+===With alignments===
+{|class=wikitable
+! Pair  || alig || rule-best ||   ME (>5) || ME (>3)
+|-
+| br-fr || 33.4 ||    31.5   ||   31.8    || 29.9
+|-
+| mk-en || 19.9 ||   19.8    ||   18.9   || 17.8
+|-
+| eu-es || 18.5 ||  17.9   ||   17.4    || 19.9
+|-
+| en-es || 8.6  ||   7.0     ||    6.3    || 6.3
+|-
+|}
+===With fractional counts===
+{|class=wikitable
+! Pair  || alig || rule-best ||   ME (>5) || ME (>3) || ME (>1) || ME (>0)
+|-
+| br-fr || 43.4 ||   43.1    ||   61.9   || 46.2    || 48.2   || 49.9
+|-
+| mk-en || 29.5 ||      ||      ||   ||
+|-
+| eu-es || 41.2  ||       ||   43.9    || 44.4  ||
+|-
+| en-es || 11.9 ||   11.7     ||  11.4  || 11.9  ||
+|-
+|}
+==Notes==
+* [http://acl.ldc.upenn.edu/W/W07/W07-0733.pdf Experiments in Domain Adaptation for Statistical Machine Translation]
-</pre>
+* [http://www.cs.sfu.ca/~anoop/papers/pdf/ssl-smt-mtjournal07.pdf Semi-supervised model adaptation for statistical machine translation]
+* [http://www.mt-archive.info/WMT-2009-Bertoldi.pdf Domain Adaptation for Statistical Machine Translation with Monolingual Resources]
+*: "We found that the largest gain (25% relative) is achieved when in-domain data are available for the target language. A smaller performance improvement is still observed (5% relative) if source adaptation data are available. We also observed that the most important role is played by the LM adaptation, while the adaptation of the TM and RM gives consistent but small improvement."
