Automatically generating compound bidix entries

Rationale

  • We have an existing translator for a language pair whose source language has productive compounding.
  • The source dictionary and the bilingual dictionary are fairly small, so we miss frequent words as well as compounds.
  • We want to be able to translate compounds, but not necessarily at run time, because the failure rate is high, whether due to disambiguation, transfer, or simply incomplete entries.
  • Instead, we can generate compound entries for the bilingual dictionary and the monolingual target-language dictionary offline, proofread them, and include them like any other entry.
  • This process should be automated as much as possible.

Process

Likely compound wordlist

Ideas: remove words that contain more than one capital letter, and remove words with fewer than 7 letters.

$ tail unk.is.txt 
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi
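
The two filtering ideas above could be sketched in Python roughly as follows. This is only an illustration: the function name `likely_compounds` and its parameters are assumptions made for the example, not part of any Apertium tool, and the sample words other than those from the list above are made up.

```python
def likely_compounds(words, min_letters=7, max_capitals=1):
    """Keep words that are long enough to be compounds and have few capitals."""
    kept = []
    for word in words:
        capitals = sum(1 for ch in word if ch.isupper())
        if capitals > max_capitals:
            continue  # more than one capital letter: probably an acronym
        if len(word) < min_letters:
            continue  # too short to be a likely compound
        kept.append(word)
    return kept


# "þyrluflug" and "þynnist" come from the list above; the rest are made-up inputs.
sample = ["þyrluflug", "RÚV", "þynnist", "hús", "ESB-ríki"]
print(likely_compounds(sample))
```

Length alone does not separate compounds from long simple words (such as "þynnist"), which is why the later steps decompound and rank the candidates.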

Decompound them with lttoolbox

$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt  > dev/unk.a.txt

Note: If your dictionary uses regular expressions (e.g. in is-en for proper name recognition), consider disabling them for this step, as they sometimes break the formatting.

Check to see you get the same number of lines:

$ wc -l unk.*
  17181 unk.a.txt
  17181 unk.is.txt
  34362 total

Remove words which still don't have analyses

$ paste unk.is.txt unk.a.txt  | grep -v '*' > unk.is-a.txt
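
For readers who prefer to see the logic spelled out, the pairing and filtering done by `paste` and `grep -v` might look like this in Python. It is a sketch under the assumption, taken from the command above, that any line containing `*` still has an unanalysed word; `pair_analysed` is a hypothetical helper name.

```python
def pair_analysed(surfaces, analyses):
    """Pair each surface form with its analysis line, dropping unanalysed ones."""
    if len(surfaces) != len(analyses):
        # Mirrors the line-count check done with wc -l above.
        raise ValueError("files differ in length: %d vs %d"
                         % (len(surfaces), len(analyses)))
    pairs = []
    for surface, analysis in zip(surfaces, analyses):
        if "*" in surface or "*" in analysis:
            continue  # lt-proc marks unknown words with '*'
        pairs.append((surface, analysis))
    return pairs


surfaces = ["bardagaíþróttir", "þyrptust"]
analyses = [
    "^bardagaíþróttir/bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$",
    "^þyrptust/*þyrptust$",
]
for surface, analysis in pair_analysed(surfaces, analyses):
    print(surface + "\t" + analysis)
```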

Generate correspondences

Between original surface form and analyses:

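$ cat > correspondence.sh
# Usage: sh correspondence.sh unk.is-a.txt
# Prints one "[surface].[]<TAB>^analysis$" line for every analysis of every word.
# Tabs become ';' and spaces '_' so that each input line is a single shell word.
for i in $(sed 's/\t/;/g; s/ /_/g' "$1"); do
        surface=$(echo "$i" | cut -f1 -d';')
        # Drop the leading "^surface/" and the closing "$", keeping only the analyses.
        analyses=$(echo "$i" | cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g; s/\$$//')
        # Emit each '/'-separated analysis as its own lexical unit.
        for j in $(echo "$analyses" | sed 's/\//\n/g'); do
                printf '[%s].[]\t^%s$\n' "$surface" "$j"
        done
done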


$ sh correspondence.sh unk.is-a.txt > unk.1line.txt
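
The core of this step, turning one ambiguous lexical unit into one line per analysis, can also be sketched in Python. The function name `split_analyses` is hypothetical; the output format is the one shown in the examples that follow.

```python
def split_analyses(surface, lexical_unit):
    """Return one '[surface].[]<TAB>^analysis$' line per analysis."""
    body = lexical_unit.strip()
    if body.startswith("^"):
        body = body[1:]
    if body.endswith("$"):
        body = body[:-1]
    # The first '/'-separated field is the surface form; the rest are analyses.
    analyses = body.split("/")[1:]
    return ["[%s].[]\t^%s$" % (surface, analysis) for analysis in analyses]


unit = ("^áætlunarvél/áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>"
        "/áætlun<n><f><sg><gen><ind>+vél<n><f><sg><acc><ind>$")
for line in split_analyses("áætlunarvél", unit):
    print(line)
```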

As we are considering every possible analysis, there will be many bad analyses along with the good ones:

Good: "bardagaíþróttir" = "combat sports":

[bardagaíþróttir].[]	^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$

Bad: "bikarmeistara" != "tar master":

[bikarmeistara].[]	^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$

So, in order to try and sift out the good compounds from the bad, we can try and translate them and rank them with a language model.

$ wc -l unk.1line.txt 
48248 unk.1line.txt

Translate

Run the file you just made through the rest of the pipeline:

$ cat unk.1line.txt | sed 's/$/^.<sent>$/g' | /home/fran/local/bin/apertium-pretransfer|\
  apertium-transfer /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t1x  /home/fran/local/share/apertium/apertium-is-en/is-en.t1x.bin  /home/fran/local/share/apertium/apertium-is-en/is-en.autobil.bin |\
  apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t2x  /home/fran/local/share/apertium/apertium-is-en/is-en.t2x.bin |\
  apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t3x  /home/fran/local/share/apertium/apertium-is-en/is-en.t3x.bin |\
  apertium-postchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t4x  /home/fran/local/share/apertium/apertium-is-en/is-en.t4x.bin |\
  lt-proc -g  /home/fran/local/share/apertium/apertium-is-en/is-en.autogen.bin |\
  lt-proc -p /home/fran/local/share/apertium/apertium-is-en/is-en.autopgen.bin > unk.trans.txt

And check that the file lengths are the same:

$ wc -l unk.1line.txt unk.trans.txt 
  48248 unk.1line.txt
  48248 unk.trans.txt
  96496 total

Score with LM

$ cat unk.trans.txt | cut -f2- | sed 's/\./ ./g' | sed 's/#//g' | sh ~/scripts/lowercase.sh > unk.trans.toscore.txt
$ cat unk.trans.toscore.txt | ~/source/apertium/trunk/apertium-lex-learner/irstlm-ranker ~/corpora/català/en.blm > unk.trans.scored.txt
$ paste unk.trans.scored.txt unk.1line.txt | head
-1.04283	||	plan machine .	[áætlunarvél].[]	^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>$
-1.04283	||	plan machine .	[áætlunarvél].[]	^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><acc><ind>$
-1.04283	||	plan machine .	[áætlunarvél].[]	^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><dat><ind>$
-1.65361	||	abbot experienced .	[ábótavant].[]	^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><nom><sta>$
-1.65361	||	abbot experienced .	[ábótavant].[]	^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><acc><sta>$
-1.65361	||	abbot experienced .	[ábótavant].[]	^ábóti<n><m><sg><dat><ind>+vanur<adj><pst><nt><sg><nom><sta>$
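
The clean-up applied before scoring (the `cut` and `sed` calls plus lowercasing) might be written in Python as below. This is a sketch with two assumptions: that `lowercase.sh`, which is not shown here, simply lowercases its input, and that each line of unk.trans.txt has the bracketed surface form in the first tab-separated field. The sample line is made up to exercise every step.

```python
def prepare_for_lm(line):
    """Clean one translated line so it can be scored by the language model."""
    line = line.rstrip("\n")
    parts = line.split("\t", 1)
    # Like `cut -f2-`: drop the first field, or keep the line if it has no tab.
    text = parts[1] if len(parts) > 1 else parts[0]
    text = text.replace(".", " .")  # separate full stops, as sed 's/\./ ./g'
    text = text.replace("#", "")    # remove generation-failure marks
    return text.lower()


# Hypothetical input line, shaped like the output of the translation step.
print(prepare_for_lm("[tollhlið].[]\tCustoms #gate."))
```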

And sort, saving the result for the conversion step below:

$ paste unk.trans.scored.txt unk.1line.txt | sort -gr > unk.trans.sorted.txt
$ head unk.trans.sorted.txt
-0.999912	||	customs gate .	[tollhlið].[]	^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$
-0.999912	||	customs gate .	[tollhlið].[]	^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><acc><ind>$
-0.999877	||	employees rents .	[starfsmannaleigum].[]	^starfsmaður<n><m><pl><gen><ind>+leiga<n><f><pl><dat><ind>$
-0.999793	||	health effort .	[heilsuátaki].[]	^heilsa<n><f><sg><gen><ind>+átak<n><nt><sg><dat><ind>$
-0.999793	||	health effort .	[heilsuátaki].[]	^heilsa<n><f><sg><dat><ind>+átak<n><nt><sg><dat><ind>$
-0.999793	||	health effort .	[heilsuátaki].[]	^heilsa<n><f><sg><acc><ind>+átak<n><nt><sg><dat><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><nom><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><acc><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><nom><ind>+greinir<n><m><pl><nom><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><dat><ind>+grein<n><f><pl><nom><ind>$
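
To work with the ranked list programmatically rather than with `sort`, the pasted lines can be parsed into fields first. The sketch below assumes the five tab-separated columns shown above (score, `||`, translation, bracketed surface form, analysis); `parse_scored` is a hypothetical helper name.

```python
def parse_scored(line):
    """Split one pasted line into (score, translation, surface, analysis)."""
    fields = line.rstrip("\n").split("\t")
    score = float(fields[0])
    translation = fields[2]
    surface = fields[3][1:fields[3].index("]")]  # "[tollhlið].[]" -> "tollhlið"
    analysis = fields[4]
    return score, translation, surface, analysis


lines = [
    "-1.04283\t||\tplan machine .\t[áætlunarvél].[]\t"
    "^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>$",
    "-0.999912\t||\tcustoms gate .\t[tollhlið].[]\t"
    "^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$",
]
# Highest (least negative) score first, as with `sort -gr`.
ranked = sorted((parse_scored(l) for l in lines),
                key=lambda row: row[0], reverse=True)
for score, translation, surface, _ in ranked:
    print(score, surface, translation)
```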

Convert to bidix

#!/usr/bin/env python3
# Convert scored, sorted compound analyses into bidix entries.
# Usage: python3 sorted-to-bidix.py <sorted scored file> <generator .bin>
#
# Input lines are tab-separated and look like:
#-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><acc><ind>+grein<n><f><pl><nom><ind>$

import subprocess
import sys

def generate(s, generator):
    """Return the surface form that lt-proc generates for one analysis."""
    gen = '^' + s + '$'
    result = subprocess.run(['lt-proc', '-g', generator], input=gen + '\n',
                            capture_output=True, encoding='utf-8')
    return result.stdout.strip()

def tags_to_bidix(t):
    """Build bidix <s> symbols from the first two tags (part of speech, gender)."""
    pos = t[0].strip('><')
    gender = t[1].strip('><')
    return '<s n="' + pos + '"/><s n="' + gender + '"/>'

with open(sys.argv[1], encoding='utf-8') as infile:
    for line in infile:
        row = line.rstrip('\n').split('\t')
        if len(row) < 5:
            continue  # skip blank or malformed lines
        prob = row[0]
        translation = row[2].strip('. ')
        analysis = row[4].strip('^$')
        analysis_row = analysis.split('+')
        head = analysis_row[-1]    # last element of the compound
        queue = analysis_row[:-1]  # modifiers, kept in their inflected surface form
        q = ''.join(generate(part, sys.argv[2]) for part in queue)
        tags = head.split('<')[1:]
        lemma = head.split('<')[0]
        left = q + lemma + tags_to_bidix(tags)
        # The target side is always tagged as a noun.
        print('<!-- p: ' + prob + ' --> <e><p><l>' + left + '</l><r>'
              + translation.replace(' ', '<b/>') + '<s n="n"/></r></p></e>')

$ python3 sorted-to-bidix.py unk.trans.sorted.txt ../is.gen.bin > unk.bidix

$ cat unk.bidix  | sort -ur
<!-- p: -0.999912 --> <e><p><l>tollhlið<s n="n"/><s n="nt"/></l><r>customs<b/>gate<s n="n"/></r></p></e>
<!-- p: -0.999877 --> <e><p><l>starfsmannaleiga<s n="n"/><s n="f"/></l><r>employees<b/>rents<s n="n"/></r></p></e>
<!-- p: -0.999793 --> <e><p><l>heilsuátak<s n="n"/><s n="nt"/></l><r>health<b/>effort<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagrein<s n="n"/><s n="f"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagreinir<s n="n"/><s n="m"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>íþróttgrein<s n="n"/><s n="f"/></l><r>sport<b/>articles<s n="n"/></r></p></e>

Perspectives for improvement

  • Removing impossible combinations (e.g. <dat> + <nom> or <pr> + <nom> compounds)
  • Special transfer rules for translating separated compound words.
    • In nom nom compounds, the first noun must always be singular in English, e.g.
      • vörumerkjasafn → *trademarks museum → trademark museum
  • Post-processing to merge equivalent entries, e.g.
    • If we have two entries, one for singular and one for plural, then they should be merged.
  • Alternative decompounding strategies, e.g.
    • We cannot decompound "þýskumælandi" 'German speaker' with LRLM (left-to-right longest match), because "þýskum" is a dative form of 'German' and is therefore chosen as the longest match.
  • Read the bilingual dictionary and generate all possible slr entries.
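
The first idea in the list, removing impossible combinations, could be prototyped as a filter over the analyses before translation. The sketch below is only an assumption about how such a filter might look: the blocked pairs are just the two examples given above (`<dat>` + `<nom>` and `<pr>` + `<nom>`), and the name `is_plausible` is hypothetical.

```python
import re


def tags(element):
    """Return the set of tags in one compound element, e.g. {'n', 'f', 'sg'}."""
    return set(re.findall(r"<([^>]+)>", element))


def is_plausible(analysis, blocked=(("dat", "nom"), ("pr", "nom"))):
    """Reject analyses whose modifier/head tags match a blocked pair."""
    parts = analysis.strip("^$").split("+")
    head_tags = tags(parts[-1])
    for modifier in parts[:-1]:
        modifier_tags = tags(modifier)
        for modifier_tag, head_tag in blocked:
            if modifier_tag in modifier_tags and head_tag in head_tags:
                return False
    return True


candidates = [
    "^ábóti<n><m><sg><dat><ind>+vanur<adj><pst><nt><sg><nom><sta>$",
    "^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$",
]
for analysis in candidates:
    print(is_plausible(analysis), analysis)
```

Which tag pairs are genuinely impossible is a linguistic decision for each language pair, so the `blocked` list is best treated as a starting point to be extended by hand.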