Difference between revisions of "Automatically generating compound bidix entries"

Revision as of 00:06, 10 March 2010

Likely compound wordlist

Ideas for filtering: remove words that have more than one capital letter, and remove words with fewer than 7 letters.

$ tail unk.is.txt 
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi
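The two filtering ideas above could be sketched in Python as follows. This is only an illustration: the function names and the default thresholds (`min_length=7`, `max_capitals=1`) are assumptions read off the ideas listed above, not part of any existing tool.

```python
def is_likely_compound(word, min_length=7, max_capitals=1):
    """Return True if the word passes both heuristic filters."""
    # Count upper-case letters; proper names written in capitals are dropped.
    capitals = sum(1 for ch in word if ch.isupper())
    return len(word) >= min_length and capitals <= max_capitals


def filter_wordlist(words):
    """Keep only the words that look like compound candidates."""
    return [w for w in words if is_likely_compound(w)]


# Small demonstration on a few words; the last two are too short to be kept.
sample = ["Þýðingarstofu", "þynnist", "þyrluflug", "NATO", "stofa"]
print(filter_wordlist(sample))
```

Both thresholds are left as parameters because the right cut-off is likely to depend on the language and the corpus.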

Decompound them with lttoolbox

$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt  > dev/unk.a.txt

Note: If the analyser uses regular expressions (e.g. in is-en for proper name recognition), consider disabling them, as they sometimes break the formatting of the output.

Check to see you get the same number of lines:

$ wc -l unk.*
  17181 unk.a.txt
  17181 unk.is.txt
  34362 total

Remove words which still don't have analyses

$ paste unk.is.txt unk.a.txt  | grep -v '*' > unk.is-a.txt
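For reference, the line-count check and the filtering step can also be done together in Python. The sketch below is not part of the original pipeline; the function name `pair_analysed` is hypothetical, and it assumes that an unknown word is marked with `*` in the analyser output, which is what the `grep -v '*'` above relies on.

```python
def pair_analysed(surface_lines, analysis_lines):
    """Pair each surface form with its analysis, dropping unanalysed words."""
    # Both files must have the same number of lines, otherwise the
    # pairing would silently go out of step.
    if len(surface_lines) != len(analysis_lines):
        raise ValueError("files are out of step: %d vs %d lines"
                         % (len(surface_lines), len(analysis_lines)))
    pairs = []
    for surface, analysis in zip(surface_lines, analysis_lines):
        # An asterisk in the analysis marks a word that is still unknown.
        if "*" in analysis:
            continue
        pairs.append(surface + "\t" + analysis)
    return pairs


# Toy input: the second word has no analysis and is dropped.
print(pair_analysed(["a", "b"], ["^a/a<n>$", "^b/*b$"]))
```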

Generate correspondences

Between original surface form and analyses:

$ cat > correspondence.sh
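# For each "surface<TAB>^surface/analysis1/analysis2$" line, print one
# "[surface].[]<TAB>^analysis$" line per analysis.
# Tabs become ';' and spaces '_' so that each input line is a single word for the loop.
for i in `cat $1 | sed 's/\t/;/g' | sed 's/ /_/g'`; do
        surface=`echo $i | cut -f1 -d';'`;
        # Drop the surface form and the leading '^'; the last analysis keeps its '$'.
        analyses=`echo $i | sed 's/\$//g'| cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g' | sed 's/\$//g' | sed 's/$//g'`;
        for j in `echo $analyses | sed 's/\//\n/g'`; do
                # Add the closing '$' to every analysis that does not already have one.
                echo $j | grep '\$' > /dev/null
                if [ $? -eq 1 ]; then
                        echo -e "["$surface"].[]\t^"$j"\$";
                else
                        echo -e "["$surface"].[]\t^"$j;
                fi
        done;
done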


$ sh correspondence.sh unk.is-a.txt > unk.1line.txt
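The same transformation can be sketched in Python, which avoids the shell's word-splitting workarounds. This is a rough equivalent, not the script used above: the function name `correspondences` is hypothetical, and the input format (`surface<TAB>^surface/analysis1/analysis2$`) is assumed from the commands shown so far.

```python
def correspondences(line):
    """Turn 'surface<TAB>^surface/ana1/ana2$' into one output line per analysis."""
    surface, unit = line.rstrip("\n").split("\t", 1)
    unit = unit.strip()
    # Remove the ^...$ delimiters of the lexical unit.
    if unit.startswith("^"):
        unit = unit[1:]
    if unit.endswith("$"):
        unit = unit[:-1]
    # The first slash-separated field repeats the surface form; skip it.
    analyses = unit.split("/")[1:]
    return ["[%s].[]\t^%s$" % (surface, a) for a in analyses]


line = ("tollhlið\t^tollhlið"
        "/tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>"
        "/tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><acc><ind>$")
for out in correspondences(line):
    print(out)
```

Unlike the shell version, this sketch leaves spaces inside a line untouched rather than turning them into underscores.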

As we are considering all possible analyses, there will be many bad ones along with the good ones:

Good: "bardagaíþróttir" = "combat sports":

[bardagaíþróttir].[]	^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$

Bad: "bikarmeistara" != "tar master":

[bikarmeistara].[]	^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$

So, to sift the good compounds from the bad, we can translate them and rank the translations with a language model.

$ wc -l unk.1line.txt 
48248 unk.1line.txt

Translate

Run the file you just made through the rest of the pipeline:

$ cat unk.1line.txt | sed 's/$/^.<sent>$/g' | /home/fran/local/bin/apertium-pretransfer|\
  apertium-transfer /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t1x  /home/fran/local/share/apertium/apertium-is-en/is-en.t1x.bin  /home/fran/local/share/apertium/apertium-is-en/is-en.autobil.bin |\
  apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t2x  /home/fran/local/share/apertium/apertium-is-en/is-en.t2x.bin |\
  apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t3x  /home/fran/local/share/apertium/apertium-is-en/is-en.t3x.bin |\
  apertium-postchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t4x  /home/fran/local/share/apertium/apertium-is-en/is-en.t4x.bin |\
  lt-proc -g  /home/fran/local/share/apertium/apertium-is-en/is-en.autogen.bin |\
  lt-proc -p /home/fran/local/share/apertium/apertium-is-en/is-en.autopgen.bin > unk.trans.txt

And check that the file lengths are the same:

$ wc -l unk.1line.txt unk.trans.txt 
  48248 unk.1line.txt
  48248 unk.trans.txt
  96496 total

Score with LM

$ cat unk.trans.txt | cut -f2- | sed 's/\./ ./g' | sed 's/#//g' | sh ~/scripts/lowercase.sh > unk.trans.toscore.txt
$ cat unk.trans.toscore.txt | ~/source/apertium/trunk/apertium-lex-learner/irstlm-ranker ~/corpora/català/en.blm > unk.trans.scored.txt
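The clean-up applied before scoring (drop the first column, separate full stops, remove `#` marks, lowercase) might look like this in Python. It is a sketch only: the function name is hypothetical, and since `~/scripts/lowercase.sh` is not shown here, it is assumed to do nothing more than lowercase its input.

```python
def prepare_for_scoring(line):
    """Mimic: cut -f2- | sed 's/\\./ ./g' | sed 's/#//g' | lowercase."""
    fields = line.rstrip("\n").split("\t")
    # cut -f2- drops the first tab-separated field; a line without any
    # tab is passed through unchanged, as cut does.
    text = "\t".join(fields[1:]) if len(fields) > 1 else fields[0]
    text = text.replace(".", " .")   # make the full stop a separate token
    text = text.replace("#", "")     # remove '#' marks left in the output
    return text.lower()


print(prepare_for_scoring("[tollhlið].[]\tCustoms gate."))
```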

$ paste unk.trans.scored.txt unk.1line.txt | head
-1.04283	||	plan machine .	[áætlunarvél].[]	^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>$
-1.04283	||	plan machine .	[áætlunarvél].[]	^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><acc><ind>$
-1.04283	||	plan machine .	[áætlunarvél].[]	^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><dat><ind>$
-1.65361	||	abbot experienced .	[ábótavant].[]	^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><nom><sta>$
-1.65361	||	abbot experienced .	[ábótavant].[]	^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><acc><sta>$
-1.65361	||	abbot experienced .	[ábótavant].[]	^ábóti<n><m><sg><dat><ind>+vanur<adj><pst><nt><sg><nom><sta>$

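And sort by score, best first (redirect this output to unk.trans.sorted.txt to produce the input for the conversion step below):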

$ paste unk.trans.scored.txt unk.1line.txt | sort -gr | head
-0.999912	||	customs gate .	[tollhlið].[]	^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$
-0.999912	||	customs gate .	[tollhlið].[]	^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><acc><ind>$
-0.999877	||	employees rents .	[starfsmannaleigum].[]	^starfsmaður<n><m><pl><gen><ind>+leiga<n><f><pl><dat><ind>$
-0.999793	||	health effort .	[heilsuátaki].[]	^heilsa<n><f><sg><gen><ind>+átak<n><nt><sg><dat><ind>$
-0.999793	||	health effort .	[heilsuátaki].[]	^heilsa<n><f><sg><dat><ind>+átak<n><nt><sg><dat><ind>$
-0.999793	||	health effort .	[heilsuátaki].[]	^heilsa<n><f><sg><acc><ind>+átak<n><nt><sg><dat><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><nom><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><acc><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><nom><ind>+greinir<n><m><pl><nom><ind>$
-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><dat><ind>+grein<n><f><pl><nom><ind>$
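Because each surface form appears once per analysis, the sorted list is quite repetitive. One possible follow-up, sketched here and not part of the procedure above, is to keep only the best-scoring line for each surface form. The function name `best_per_surface` is hypothetical, and the column layout (score, `||`, translation, surface form, analysis) is taken from the pasted output shown above.

```python
def best_per_surface(lines):
    """Keep the highest-scoring line for each surface form, best first."""
    best = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 5:
            continue  # skip blank or malformed lines
        score, surface = float(row[0]), row[3]
        # On ties the first line seen for a surface form is kept.
        if surface not in best or score > best[surface][0]:
            best[surface] = (score, line.rstrip("\n"))
    ranked = sorted(best.values(), key=lambda entry: entry[0], reverse=True)
    return [entry[1] for entry in ranked]


lines = [
    "-1.04283\t||\tplan machine .\t[áætlunarvél].[]\t^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>$",
    "-0.999912\t||\tcustoms gate .\t[tollhlið].[]\t^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$",
    "-0.999912\t||\tcustoms gate .\t[tollhlið].[]\t^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><acc><ind>$",
]
for kept in best_per_surface(lines):
    print(kept)
```

Lines with equal scores usually differ only in case or number, so which of them is kept matters little for ranking, but it does decide which tags end up in the bidix entry.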

Convert to bidix

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Usage: python3 sorted-to-bidix.py <sorted scored file> <generator binary>

import subprocess
import sys

# Write UTF-8 regardless of the locale (needs Python 3.7 or later).
sys.stdout.reconfigure(encoding='utf-8')

# Example input line (tab-separated):
#-0.999014	||	sport articles .	[íþróttgreinar].[]	^íþrótt<n><f><sg><acc><ind>+grein<n><f><pl><nom><ind>$

def generate(s):
	# Generate the surface form of a single analysis with lt-proc.
	gen = '^' + s + '$'
	cmd = ['lt-proc', '-g', sys.argv[2]]
	result = subprocess.run(cmd, input=gen + '\n', stdout=subprocess.PIPE, encoding='utf-8')
	return result.stdout.rstrip('\n')

def tags_to_bidix(t):
	# Only the first two tags (part of speech and gender) go into the bidix entry.
	pos = t[0].strip('><')
	gender = t[1].strip('><')
	return '<s n="' + pos + '"/><s n="' + gender + '"/>'

with open(sys.argv[1], encoding='utf-8') as infile:
	for line in infile:
		row = line.rstrip('\n').split('\t')
		if len(row) < 5:
			continue  # skip blank or malformed lines

		prob = row[0]
		translation = row[2].strip('. ')
		analysis = row[4].strip('^$')
		analysis_row = analysis.split('+')
		head = analysis_row[-1]    # the last part is the head of the compound
		queue = analysis_row[:-1]  # the preceding parts are generated as surface forms
		q = ''
		for part in queue:
			q = q + generate(part)
		tags = head.split('<')[1:]
		lemma = head.split('<')[0]
		left = q + lemma + tags_to_bidix(tags)
		print('<!-- p: ' + prob + ' --> <e><p><l>' + left + '</l><r>' + translation.replace(' ', '<b/>') + '<s n="n"/></r></p></e>')

$ python3 sorted-to-bidix.py unk.trans.sorted.txt ../is.gen.bin > unk.bidix

$ cat unk.bidix  | sort -ur
<!-- p: -0.999912 --> <e><p><l>tollhlið<s n="n"/><s n="nt"/></l><r>customs<b/>gate<s n="n"/></r></p></e>
<!-- p: -0.999877 --> <e><p><l>starfsmannaleiga<s n="n"/><s n="f"/></l><r>employees<b/>rents<s n="n"/></r></p></e>
<!-- p: -0.999793 --> <e><p><l>heilsuátak<s n="n"/><s n="nt"/></l><r>health<b/>effort<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagrein<s n="n"/><s n="f"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagreinir<s n="n"/><s n="m"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>íþróttgrein<s n="n"/><s n="f"/></l><r>sport<b/>articles<s n="n"/></r></p></e>

Perspectives for improvement

- Removing impossible combinations (e.g. <dat> + <nom> or <pr> + <nom> compounds).
- Special transfer rules for translating separated compound words.
  - In nom nom compounds, the first noun must always be singular in English, e.g.
    - vörumerkjasafn → *trademarks museum → trademark museum
- Post-processing to merge equivalent entries, e.g.
  - If we have two entries, one for singular and one for plural, then they should be merged.
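The first idea, removing impossible combinations, could be prototyped as a filter over the analyses before translation. The sketch below is an assumption about how that might be done: the names `FORBIDDEN`, `tags` and `is_plausible` are hypothetical, and the list of forbidden pairs contains only the two examples mentioned above, so it would need extending.

```python
import re

# (tag on a non-head part, tag on the head) pairs treated as impossible.
# Only the two examples from the list above are included here.
FORBIDDEN = [("dat", "nom"), ("pr", "nom")]


def tags(part):
    """Return the tag names of one analysis part, e.g. ['n', 'f', 'sg']."""
    return re.findall(r"<([^>]+)>", part)


def is_plausible(analysis, forbidden=FORBIDDEN):
    """Return False if any non-head part forms a forbidden pair with the head."""
    parts = analysis.strip("^$").split("+")
    head_tags = tags(parts[-1])
    for modifier in parts[:-1]:
        modifier_tags = tags(modifier)
        for mod_tag, head_tag in forbidden:
            if mod_tag in modifier_tags and head_tag in head_tags:
                return False
    return True


print(is_plausible("^íþrótt<n><f><sg><dat><ind>+grein<n><f><pl><nom><ind>$"))
print(is_plausible("^heilsa<n><f><sg><gen><ind>+átak<n><nt><sg><dat><ind>$"))
```

Filtering at this stage would also shrink the file that has to be translated and scored.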
