Difference between revisions of "Automatically generating compound bidix entries"
Revision as of 00:06, 10 March 2010
Likely compound wordlist
Ideas: remove words that have more than one capital letter, and remove words with fewer than 7 letters.
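The two filtering ideas above could be sketched in Python roughly as follows. The names `is_likely_compound`, `filter_wordlist` and `min_length` are made up for this illustration and are not part of any Apertium tool; the thresholds are simply the ones suggested above.

```python
def is_likely_compound(word, min_length=7):
    """Hypothetical helper: True if `word` passes both heuristics."""
    # More than one capital letter usually means an acronym or a name.
    capitals = sum(1 for ch in word if ch.isupper())
    if capitals > 1:
        return False
    # Words shorter than `min_length` letters are unlikely to be compounds.
    if len(word) < min_length:
        return False
    return True


def filter_wordlist(words, min_length=7):
    """Keep only the words that look like compound candidates."""
    return [w for w in words if is_likely_compound(w, min_length)]
```

Both thresholds are heuristics, so it is probably worth checking a sample of the rejected words before discarding them.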
$ tail unk.is.txt
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi
Decompound them with lttoolbox
$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt > dev/unk.a.txt
Note: If you are using regular expressions (e.g. in is-en for proper name recognition), it might be a good idea to disable them, as they sometimes mess up the formatting.
Check to see you get the same number of lines:
$ wc -l unk.*
 17181 unk.a.txt
 17181 unk.is.txt
 34362 total
Remove words which still don't have analyses
$ paste unk.is.txt unk.a.txt | grep -v '*' > unk.is-a.txt
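For readers who prefer to do this step in a script, here is a minimal Python sketch of the same idea as the `paste | grep -v '*'` command. It assumes, as the command above does, that a word without an analysis is marked with an asterisk in the analyser output. The function name `keep_analysed` is hypothetical.

```python
def keep_analysed(surface_forms, analyses):
    """Pair each surface form with its analysis line, dropping unanalysed words."""
    if len(surface_forms) != len(analyses):
        raise ValueError("the two lists are not aligned line by line")
    pairs = []
    for surface, analysis in zip(surface_forms, analyses):
        # Same test as grep -v '*': discard any row that contains an asterisk.
        if "*" in surface or "*" in analysis:
            continue
        # Tab-separated, like the output of paste.
        pairs.append(surface + "\t" + analysis)
    return pairs
```

The length check plays the same role as the `wc -l` comparison: if the two files are not aligned, the pairs would be meaningless.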
Generate correspondences
Between original surface form and analyses:
$ cat > correspondence.sh
for i in `cat $1 | sed 's/\t/;/g' | sed 's/ /_/g'`; do
    surface=`echo $i | cut -f1 -d';'`;
    analyses=`echo $i | sed 's/\$//g'| cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g' | sed 's/\$//g' | sed 's/$//g'`;
    for j in `echo $analyses | sed 's/\//\n/g'`; do
        echo $j | grep '\$' > /dev/null
        if [ $? -eq 1 ]; then
            echo -e "["$surface"].[]\t^"$j"\$";
        else
            echo -e "["$surface"].[]\t^"$j;
        fi
    done;
done

$ sh correspondence.sh unk.is-a.txt > unk.1line.txt
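The core of the shell script above (one output row per analysis) can also be sketched in Python. This is a simplified illustration rather than a drop-in replacement: it assumes each input line has the form surface, tab, `^surface/analysis1/analysis2$`, that no analysis contains an escaped slash, and it skips the space-to-underscore substitution that the shell loop needs. The function name `expand_analyses` is hypothetical.

```python
def expand_analyses(line):
    """Turn 'surface<TAB>^surface/a1/a2$' into one '[surface].[]<TAB>^a$' row per analysis."""
    surface, lexical_unit = line.rstrip("\n").split("\t", 1)
    body = lexical_unit.strip()
    # Remove the lexical unit delimiters.
    if body.startswith("^"):
        body = body[1:]
    if body.endswith("$"):
        body = body[:-1]
    # The first slash-separated field repeats the surface form;
    # the remaining fields are the alternative analyses.
    analyses = body.split("/")[1:]
    return ["[%s].[]\t^%s$" % (surface, analysis) for analysis in analyses]
```

Each returned row has the same shape as the lines in unk.1line.txt, so an ambiguous word yields several rows that share the same bracketed surface form.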
As we are considering all possible analyses, there will be many bad analyses along with the good ones:
Good: "bardagaíþróttir" = "combat sports":
[bardagaíþróttir].[] ^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$
Bad: "bikarmeistara" != "tar master":
[bikarmeistara].[] ^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$
So, in order to sift out the good compounds from the bad, we can translate them and rank the translations with a language model.
$ wc -l unk.1line.txt
48248 unk.1line.txt
Translate
Run the file you just made through the rest of the pipeline:
$ cat unk.1line.txt | sed 's/$/^.<sent>$/g' | /home/fran/local/bin/apertium-pretransfer |\
  apertium-transfer /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t1x /home/fran/local/share/apertium/apertium-is-en/is-en.t1x.bin /home/fran/local/share/apertium/apertium-is-en/is-en.autobil.bin |\
  apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t2x /home/fran/local/share/apertium/apertium-is-en/is-en.t2x.bin |\
  apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t3x /home/fran/local/share/apertium/apertium-is-en/is-en.t3x.bin |\
  apertium-postchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t4x /home/fran/local/share/apertium/apertium-is-en/is-en.t4x.bin |\
  lt-proc -g /home/fran/local/share/apertium/apertium-is-en/is-en.autogen.bin |\
  lt-proc -p /home/fran/local/share/apertium/apertium-is-en/is-en.autopgen.bin > unk.trans.txt
And check that the file lengths are the same:
$ wc -l unk.1line.txt unk.trans.txt
 48248 unk.1line.txt
 48248 unk.trans.txt
 96496 total
Score with LM
$ cat unk.trans.txt | cut -f2- | sed 's/\./ ./g' | sed 's/#//g' | sh ~/scripts/lowercase.sh > unk.trans.toscore.txt
$ cat unk.trans.toscore.txt | ~/source/apertium/trunk/apertium-lex-learner/irstlm-ranker ~/corpora/català/en.blm > unk.trans.scored.txt

$ paste unk.trans.scored.txt unk.1line.txt | head
-1.04283 || plan machine . [áætlunarvél].[] ^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><nom><ind>$
-1.04283 || plan machine . [áætlunarvél].[] ^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><acc><ind>$
-1.04283 || plan machine . [áætlunarvél].[] ^áætlun<n><f><sg><gen><ind>+vél<n><f><sg><dat><ind>$
-1.65361 || abbot experienced . [ábótavant].[] ^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><nom><sta>$
-1.65361 || abbot experienced . [ábótavant].[] ^ábóti<n><m><sg><acc><ind>+vanur<adj><pst><nt><sg><acc><sta>$
-1.65361 || abbot experienced . [ábótavant].[] ^ábóti<n><m><sg><dat><ind>+vanur<adj><pst><nt><sg><nom><sta>$
And sort by score, best first:
$ paste unk.trans.scored.txt unk.1line.txt | sort -gr | head
-0.999912 || customs gate . [tollhlið].[] ^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><nom><ind>$
-0.999912 || customs gate . [tollhlið].[] ^tollur<n><m><sg><acc><ind>+hlið<n><nt><sg><acc><ind>$
-0.999877 || employees rents . [starfsmannaleigum].[] ^starfsmaður<n><m><pl><gen><ind>+leiga<n><f><pl><dat><ind>$
-0.999793 || health effort . [heilsuátaki].[] ^heilsa<n><f><sg><gen><ind>+átak<n><nt><sg><dat><ind>$
-0.999793 || health effort . [heilsuátaki].[] ^heilsa<n><f><sg><dat><ind>+átak<n><nt><sg><dat><ind>$
-0.999793 || health effort . [heilsuátaki].[] ^heilsa<n><f><sg><acc><ind>+átak<n><nt><sg><dat><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><nom><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><nom><ind>+grein<n><f><pl><acc><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><nom><ind>+greinir<n><m><pl><nom><ind>$
-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><dat><ind>+grein<n><f><pl><nom><ind>$
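The sorted list still contains several rows per surface form. One possible way to thin it out, sketched here in Python, is to keep only the best-scoring row for each surface form. This is an illustration and not part of the original procedure. It assumes the pasted rows are tab-separated with the score in field 0 and the bracketed surface form in field 3 (the layout implied by the field indices used in the conversion script), and that a higher score is better. The function name `best_per_surface` is hypothetical.

```python
def best_per_surface(rows):
    """Keep the highest-scoring row for each surface form, best first."""
    best = {}
    for row in rows:
        row = row.rstrip("\n")
        fields = row.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        score = float(fields[0])
        surface = fields[3]
        # On a tie, the row seen first is kept.
        if surface not in best or score > best[surface][0]:
            best[surface] = (score, row)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [row for _, row in ranked]
```

Because tied analyses of the same word usually differ only in case, keeping the first one loses little for bidix purposes, where only the lemma and the first two tags of the head are used.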
Convert to bidix
#!/usr/bin/python2.5
# coding=utf-8
# -*- encoding: utf-8 -*-

# Usage: python sorted-to-bidix.py <sorted scored file> <generator .bin>

import sys, codecs, copy, re, commands, os;

sys.stdin = codecs.getreader('utf-8')(sys.stdin);
sys.stdout = codecs.getwriter('utf-8')(sys.stdout);
sys.stderr = codecs.getwriter('utf-8')(sys.stderr);

# Example input row:
#-0.999014 || sport articles . [íþróttgreinar].[] ^íþrótt<n><f><sg><acc><ind>+grein<n><f><pl><nom><ind>$

# Generate the surface form of one compound part with lt-proc -g.
def generate(s): #{
    gen = '^' + s + '$';
    cmd = 'echo "' + gen + '" | lt-proc -g ' + sys.argv[2];
    return commands.getstatusoutput(cmd)[1];
#}

# Turn the first two tags of the head (part of speech, gender) into bidix symbols.
def tags_to_bidix(t): #{
    pos = t[0].strip('><');
    gender = t[1].strip('><');
    bdtags = '<s n="' + pos + '"/><s n="' + gender + '"/>';
    return bdtags;
#}

for line in file(sys.argv[1]).read().split('\n'): #{
    # Skip blank lines, e.g. the empty string after the final newline.
    if line.strip() == '': #{
        continue;
    #}
    row = line.split('\t');
    prob = row[0];
    translation = row[2].strip('. ');
    analysis = row[4].strip('^$');
    analysis_row = analysis.split('+');
    # The last part is the head; everything before it is generated as a surface form.
    head = analysis_row[len(analysis_row)-1];
    queue = analysis_row[0:-1];
    q = '';
    for part in queue: #{
        q = q + generate(part);
    #}
    #print prob , queue , q + head , translation;
    tags = head.split('<')[1:];
    lemma = head.split('<')[0];
    left = q + lemma + tags_to_bidix(tags);
    print '<!-- p: ' + str(prob) + ' --> <e><p><l>' + left + '</l><r>' + translation.replace(' ', '<b/>') + '<s n="n"/></r></p></e>';
#}
$ python sorted-to-bidix.py unk.trans.sorted.txt ../is.gen.bin > unk.bidix
$ cat unk.bidix | sort -ur
<!-- p: -0.999912 --> <e><p><l>tollhlið<s n="n"/><s n="nt"/></l><r>customs<b/>gate<s n="n"/></r></p></e>
<!-- p: -0.999877 --> <e><p><l>starfsmannaleiga<s n="n"/><s n="f"/></l><r>employees<b/>rents<s n="n"/></r></p></e>
<!-- p: -0.999793 --> <e><p><l>heilsuátak<s n="n"/><s n="nt"/></l><r>health<b/>effort<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagrein<s n="n"/><s n="f"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>viðskiptagreinir<s n="n"/><s n="m"/></l><r>businesses<b/>articles<s n="n"/></r></p></e>
<!-- p: -0.999014 --> <e><p><l>íþróttgrein<s n="n"/><s n="f"/></l><r>sport<b/>articles<s n="n"/></r></p></e>
Perspectives for improvement
- Removing impossible combinations (e.g. <dat> + <nom> or <pr> + <nom> compounds)
- Special transfer rules for translating separated compound words.
  - In nom nom compounds, the first noun must always be singular in English, e.g.
    - vörumerkjasafn → *trademarks museum → trademark museum
- Post-processing to merge equivalent entries, e.g.
  - If we have two entries, one for singular and one for plural, then they should be merged.
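As an illustration of the first idea in the list, the following Python sketch rejects analyses whose tag combination is considered impossible. It is an assumption-laden example, not an existing tool: the `FORBIDDEN` set only contains the two pairs mentioned above, and the names `tags_of` and `is_possible` are hypothetical.

```python
import re

# (tag on a non-head part, tag on the head) pairs treated as impossible.
# Only the two examples from the list above; extend as needed.
FORBIDDEN = {("dat", "nom"), ("pr", "nom")}


def tags_of(part):
    """Return the set of tags in a part such as 'grein<n><f><pl><nom><ind>'."""
    return set(re.findall(r"<([^>]+)>", part))


def is_possible(analysis, forbidden=FORBIDDEN):
    """False if any non-head part combines with the head in a forbidden way."""
    parts = analysis.strip("^$").split("+")
    head_tags = tags_of(parts[-1])
    for modifier in parts[:-1]:
        modifier_tags = tags_of(modifier)
        for modifier_tag, head_tag in forbidden:
            if modifier_tag in modifier_tags and head_tag in head_tags:
                return False
    return True
```

Applied to the rows of unk.1line.txt before translation, a filter like this would also reduce the number of candidates that have to go through the pipeline and the language model.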