{{TOCD}}

==Rationale==

* We have an existing translator for a language pair where the source language has productive compounding.
* The source dictionary and bilingual dictionary are fairly small, so we miss frequent words, and we also miss compounds.
* We want to be able to translate compounds, but not necessarily ''on-line'', because of the high failure rate, whether from disambiguation errors, transfer errors, or simply incomplete entries.
* We can try to generate compound entries for the bilingual dictionary and the monolingual target-language dictionary ''offline'', proofread them, and include them just like normal entries.
* This process should be automated as much as possible.
==Process==

===Likely compound wordlist===

Ideas: Remove words that have >1 capital letter, remove words with <7 letters.

<pre>
$ tail unk.is.txt
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi
</pre>
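These heuristics can be sketched in a few lines of Python. This is a hypothetical illustration, not part of the original pipeline; the function name and thresholds are ours, mirroring the ideas above.

```python
# Hypothetical illustration of the wordlist-filtering heuristics above;
# the function name and thresholds are ours, not from the pipeline.
def is_likely_compound(word):
    # Real compounds are rarely shorter than 7 letters.
    if len(word) < 7:
        return False
    # More than one capital letter usually means an acronym or noise.
    if sum(1 for c in word if c.isupper()) > 1:
        return False
    return True

words = ["þyrluflug", "USA", "bók", "þýskukennari"]
print([w for w in words if is_likely_compound(w)])
# → ['þyrluflug', 'þýskukennari']
```

Note that a single initial capital (as in "Þýðingarstofu" above) survives the filter, so capitalised compounds are not lost.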
===Decompound them with lttoolbox===

<pre>
$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt > dev/unk.a.txt
</pre>

Note: if you are using regular expressions (e.g. in is-en, for proper-name recognition), it might be a good idea to disable them, as they can sometimes mess up the formatting.

Check that you get the same number of lines:

<pre>
$ wc -l unk.*
  17181 unk.a.txt
  17181 unk.is.txt
  34362 total
</pre>
===Remove words which still don't have analyses===

<pre>
$ paste unk.is.txt unk.a.txt | grep -v '*' > unk.is-a.txt
</pre>
===Generate correspondences===

Between the original surface form and the analyses:

<pre>
$ cat > correspondence.sh
for i in `cat $1 | sed 's/\t/;/g' | sed 's/ /_/g'`; do
	surface=`echo $i | cut -f1 -d';'`;
	analyses=`echo $i | sed 's/\$//g'| cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g' | sed 's/\$//g'`;
	for j in `echo $analyses | sed 's/\//\n/g'`; do
		echo $j | grep '\$' > /dev/null
		if [ $? -eq 1 ]; then
			echo -e "["$surface"].[]\t^"$j"\$";
		else
			echo -e "["$surface"].[]\t^"$j;
		fi
	done;
done
$ sh correspondence.sh unk.is-a.txt > unk.1line.txt
</pre>

As we are considering all possible analyses, there will be many bad analyses along with the good ones:

Good, "bardagaíþróttir" = "combat sports":

<pre>
[bardagaíþróttir].[]	^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$
</pre>

Bad, "bikarmeistara" != "tar master":

<pre>
[bikarmeistara].[]	^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$
</pre>

So, in order to sift the good compounds from the bad, we can try translating them and ranking the translations with a language model.

<pre>
$ wc -l unk.1line.txt
48248 unk.1line.txt
</pre>
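For readers less comfortable with the shell quoting in correspondence.sh, the same expansion can be sketched in Python 3. This is an illustrative rewrite, not the original script; it assumes input lines of the form "surface&lt;TAB&gt;^surface/analysis1/analysis2/…$", as produced by pasting the wordlist against the lt-proc output.

```python
# Illustrative Python 3 rewrite of correspondence.sh (a sketch, not the
# original script). Assumes lines of the form
# "surface<TAB>^surface/analysis1/analysis2/...$".
def correspondences(line):
    surface, analyses = line.split('\t', 1)
    body = analyses.strip().lstrip('^').rstrip('$')
    # The first slash-separated field repeats the surface form; skip it.
    return ['[%s].[]\t^%s$' % (surface, a) for a in body.split('/')[1:]]

line = ('bardagaíþróttir\t^bardagaíþróttir/'
        'bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$')
for c in correspondences(line):
    print(c)
```

Each ambiguous analysis becomes its own line, exactly as in the unk.1line.txt examples above.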
===Translate===

Run the file you just made through the rest of the pipeline:

<pre>
$ cat unk.1line.txt | sed 's/$/^.<sent>$/g' | /home/fran/local/bin/apertium-pretransfer |\
apertium-transfer /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t1x /home/fran/local/share/apertium/apertium-is-en/is-en.t1x.bin /home/fran/local/share/apertium/apertium-is-en/is-en.autobil.bin |\
apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t2x /home/fran/local/share/apertium/apertium-is-en/is-en.t2x.bin |\
apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t3x /home/fran/local/share/apertium/apertium-is-en/is-en.t3x.bin |\
apertium-postchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t4x /home/fran/local/share/apertium/apertium-is-en/is-en.t4x.bin |\
lt-proc -g /home/fran/local/share/apertium/apertium-is-en/is-en.autogen.bin |\
lt-proc -p /home/fran/local/share/apertium/apertium-is-en/is-en.autopgen.bin > unk.trans.txt
</pre>
And check that the file lengths are the same:

<pre>
$ wc -l unk.1line.txt unk.trans.txt
  48248 unk.1line.txt
  48248 unk.trans.txt
  96496 total
</pre>
===Score with LM===

<pre>
$ cat unk.trans.txt | cut -f2- | sed 's/\./ ./g' | sed 's/#//g' | sh ~/scripts/lowercase.sh > unk.trans.toscore.txt
$ cat unk.trans.toscore.txt | ~/source/apertium/trunk/apertium-lex-learner/irstlm-ranker ~/corpora/català/en.blm > unk.trans.scored.txt
</pre>
<pre>
$ paste unk.trans.scored.txt unk.1line.txt | head
-1.04283 || plan machine . [áætlunarvél].[] ^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>$
-1.04283 || plan machine . [áætlunarvél].[] ^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><acc><ind>$
-1.04283 || plan machine . [áætlunarvél].[] ^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><dat><ind>$
-1.65361 || abbot experienced . [ábótavant].[] ^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><nom><sta>$
-1.65361 || abbot experienced . [ábótavant].[] ^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><acc><sta>$
-1.65361 || abbot experienced . [ábótavant].[] ^ábóti<n><m><sg><dat><ind>+vanur<adj><pst><nt><sg><nom><sta>$
</pre>
And sort:

<pre>
$ paste unk.trans.scored.txt unk.1line.txt | sort -gr | head
-0.999912 || customs gate . [tollhlið].[] ^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$
-0.999912 || customs gate . [tollhlið].[] ^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><acc><ind>$
-0.999877 || employees rents . [starfsmannaleigum].[] ^starfsmaður<n><m><pl><gen><ind>+leiga<n><f><pl><dat><ind>$
-0.999793 || health effort . [heilsuátaki].[] ^heilsa<n><f><sg><gen><ind>+átak<n><nt><sg><dat><ind>$
-0.999793 || health effort . [heilsuátaki].[] ^heilsa<n><f><sg><dat><ind>+átak<n><nt><sg><dat><ind>$
-0.999793 || health effort . [heilsuátaki].[] ^heilsa<n><f><sg><acc><ind>+átak<n><nt><sg><dat><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><nom><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><acc><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><nom><ind>+greinir<n><m><pl><nom><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><dat><ind>+grein<n><f><pl><nom><ind>$
</pre>
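As the sorted output shows, one surface form can keep several near-identical analyses. A small helper could retain only the best-scoring line per surface form before conversion. This is a hypothetical sketch, not part of the original pipeline; the tab-separated field layout is assumed from the sample `paste` output above.

```python
# Hypothetical helper (not in the original pipeline): keep only the
# best-scoring line per surface form. Assumed field layout, from the
# sample output above:
#   "score ||<TAB>translation .<TAB>[surface].[]<TAB>^analysis$"
def best_per_surface(lines):
    best = {}
    for line in lines:
        fields = line.split('\t')
        score = float(fields[0].split()[0])  # "-0.999912 ||" -> -0.999912
        surface = fields[-2]                 # e.g. "[tollhlið].[]"
        if surface not in best or score > best[surface][0]:
            best[surface] = (score, line)
    # Highest-scoring surface forms first.
    return [line for _, line in sorted(best.values(), reverse=True)]
```

This plays the same role as the `sort -ur` de-duplication applied to the generated bidix entries below, but earlier in the process.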
===Convert to bidix===

<pre>
#!/usr/bin/python2.5
# -*- coding: utf-8 -*-

import sys, codecs, commands;

sys.stdin = codecs.getreader('utf-8')(sys.stdin);
sys.stdout = codecs.getwriter('utf-8')(sys.stdout);
sys.stderr = codecs.getwriter('utf-8')(sys.stderr);

# Input lines look like:
# -0.999014 ||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><acc><ind>+grein<n><f><pl><nom><ind>$

def generate(s): #{
	# Generate the surface form of an analysis with the monolingual generator.
	gen = '^' + s + '$';
	cmd = 'echo "' + gen + '" | lt-proc -g ' + sys.argv[2];
	return commands.getstatusoutput(cmd)[1];
#}

def tags_to_bidix(t): #{
	# Keep only the part of speech and the gender for the bidix entry.
	pos = t[0].strip('><');
	gender = t[1].strip('><');
	return '<s n="' + pos + '"/><s n="' + gender + '"/>';
#}

for line in file(sys.argv[1]).read().split('\n'): #{
	if line.strip() == '': #{
		continue;
	#}
	row = line.split('\t');
	prob = row[0];
	translation = row[2].strip('. ');
	analysis = row[4].strip('^$');
	analysis_row = analysis.split('+');
	head = analysis_row[-1];     # final element of the compound
	queue = analysis_row[0:-1];  # non-final elements
	q = '';
	for part in queue: #{
		q = q + generate(part);
	#}
	tags = head.split('<')[1:];
	lemma = head.split('<')[0];
	left = q + lemma + tags_to_bidix(tags);
	print '<!-- p: ' + str(prob) + ' --> <e><p><l>' + left + '</l><r>' + translation.replace(' ', '<b/>') + '<s n="n"/></r></p></e>';
#}
</pre>
<pre>
$ python sorted-to-bidix.py unk.trans.sorted.txt ../is.gen.bin > unk.bidix
$ cat unk.bidix | sort -ur
<!-- p: -0.999912 --> <e><p><l>tollhlið<s n="n"/><s n="nt"/></l><r>customs<b/>gate<s n="n"/></r></p></e>
<!-- p: -0.999877 --> <e><p><l>starfsmannaleiga<s n="n"/><s n="f"/></l><r>employees<b/>rents<s n="n"/></r></p></e>
<!-- p: -0.999793 --> <e><p><l>heilsuátak<s n="n"/><s n="nt"/></l><r>health<b/>effort<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagrein<s n="n"/><s n="f"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagreinir<s n="n"/><s n="m"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>íþróttgrein<s n="n"/><s n="f"/></l><r>sport<b/>articles<s n="n"/></r></p></e>
</pre>
===Perspectives for improvement===

* Removing impossible combinations (e.g. <code><dat> + <nom></code> or <code><pr> + <nom></code> compounds).
* Special transfer rules for translating separated compound words.
** In {{sc|nom nom}} compounds, the first noun must always be singular in English, e.g.
*** vörumerkjasafn → <nowiki>*</nowiki>trademarks museum → trademark museum
* Post-processing to merge equivalent entries, e.g.
** If we have two entries, one for the singular and one for the plural, they should be merged.
* Alternative decompounding strategies, e.g.
** We cannot decompound "þýskumælandi" 'German speaker' with {{sc|lrlm}} because "þýskum" is a dative form of 'German' and is the longest match.
* Read the bilingual dictionary and generate all possible <code>slr</code> entries.
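The first improvement above can be sketched as a filter over the analyses. This is an illustration only: which case tags are actually impossible in non-final compound elements is an assumption here, not a worked-out linguistic rule set.

```python
# Sketch of the "removing impossible combinations" idea: reject analyses
# whose non-final elements carry a case tag assumed impossible in
# compounds. The tag set below is an illustrative assumption.
import re

IMPOSSIBLE_NONFINAL_CASES = {'dat'}  # assumed; extend as needed

def is_possible(analysis):
    # analysis: "lemma<tags>+lemma<tags>+...+head<tags>"
    for part in analysis.split('+')[:-1]:
        tags = set(re.findall(r'<([^>]+)>', part))
        if tags & IMPOSSIBLE_NONFINAL_CASES:
            return False
    return True

print(is_possible('tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>'))   # → True
print(is_possible('íþrótt<n><f><sg><dat><ind>+grein<n><f><pl><nom><ind>'))  # → False
```

Run between the correspondence and translation steps, such a filter would shrink the candidate list before the (comparatively slow) translation and LM-scoring passes.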
[[Category:Development]]
[[Category:Documentation in English]]