Difference between revisions of "Automatically generating compound bidix entries"
Revision as of 13:58, 9 March 2010
Likely compound wordlist
Ideas: remove words that contain more than one capital letter, and remove words with fewer than 7 letters.
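The two filtering ideas above could be sketched as a small shell function. This is only an illustration: filter_candidates is a hypothetical helper name, not part of Apertium, and both patterns are locale-dependent, so a UTF-8 locale is assumed for Icelandic letters such as Þ.

```shell
# Sketch only: filter_candidates is a hypothetical helper, not an Apertium tool.
# Assumes one word per line in the input file and a UTF-8 locale.
#
# Step 1: drop every word that contains two or more capital letters.
# Step 2: keep only the words that are at least 7 characters long.
filter_candidates() {
    grep -v '[[:upper:]].*[[:upper:]]' "$1" | grep -E '^.{7,}$'
}
```

It would be run as filter_candidates unk.is.txt, with the output redirected to a file name of your choice.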
$ tail unk.is.txt
Þýðingarstofu
þynnist
þynntur
þyrluflug
þyrluflugmaður
þyrlupall
þyrluslysi
þyrptust
þýskukennari
þýskumælandi
Decompound them with lttoolbox
$ cat dev/unk.is.txt | apertium-destxt | lt-proc -e is-en.cmpd.bin | apertium-retxt > dev/unk.a.txt
Note: If you are using regular expressions (e.g. in is-en for proper name recognition), it may be worth disabling them, as they sometimes mess up the formatting.
Check to see you get the same number of lines:
$ wc -l unk.*
17181 unk.a.txt
17181 unk.is.txt
34362 total
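The counts matter because the next step pairs the two files line by line with paste. A minimal sketch of the same check as a script-friendly test follows; same_line_count is a hypothetical helper name introduced here for illustration.

```shell
# Sketch only: same_line_count is a hypothetical helper.
# Succeeds (exit status 0) only if both files have the same number of lines.
same_line_count() {
    a=$(wc -l < "$1")
    b=$(wc -l < "$2")
    # Arithmetic expansion strips any padding that wc adds on some systems.
    [ $((a)) -eq $((b)) ]
}
```

For example: same_line_count unk.is.txt unk.a.txt || echo "line counts differ"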
Remove words which still don't have analyses
$ paste unk.is.txt unk.a.txt | grep -v '*' > unk.is-a.txt
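lt-proc marks a word it could not analyse with an asterisk, which is why filtering on '*' works. The sketch below wraps the same command in a function and uses grep -F so the asterisk is always matched literally. drop_unanalysed is a hypothetical helper name, and the sample analysis in the comment is made up for illustration.

```shell
# Sketch only: drop_unanalysed is a hypothetical helper.
# $1 holds the surface forms and $2 the lt-proc output, one entry per line.
# Illustrative (made-up) input lines:
#   ^hestur/hestur<n><m><sg><nom><ind>$   is kept
#   ^xyzzyq/*xyzzyq$                      is dropped (still unknown)
drop_unanalysed() {
    paste "$1" "$2" | grep -v -F '*'
}
```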
Generate correspondences
Between original surface form and analyses:
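$ cat > correspondence.sh
# Read "surface<TAB>analyses" lines; tabs become ';' and spaces '_' so that
# the for loop sees each input line as a single word.
for i in `cat $1 | sed 's/\t/;/g' | sed 's/ /_/g'`; do
    # The surface form is the first ';'-separated field.
    surface=`echo $i | cut -f1 -d';'`;
    # Take the second field, drop the leading surface form and the ^ and $ delimiters.
    analyses=`echo $i | sed 's/\$//g'| cut -f2 -d';' | cut -f2- -d'/' | sed 's/\^//g' | sed 's/\$//g' | sed 's/$//g'`;
    # Print one line per analysis.
    for j in `echo $analyses | sed 's/\//\n/g'`; do
        echo $j | grep '\$' > /dev/null
        if [ $? -eq 1 ]; then
            echo -e "["$surface"].[]\t^"$j"\$";
        else
            echo -e "["$surface"].[]\t^"$j;
        fi
    done;
done
$ sh correspondence.sh unk.is-a.txt > unk.1line.txt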
Because we keep every possible analysis, there will be many bad analyses along with the good ones:
Good: "bardagaíþróttir" = "combat sports":
[bardagaíþróttir].[] ^bardagi<n><m><sg><gen><def>+íþrótt<n><f><pl><nom><ind>$
Bad: "bikarmeistara" != "tar master":
[bikarmeistara].[] ^bika<vblex><actv><pri><p3><sg>+meistari<n><m><sg><acc><ind>$
So, to sift the good compounds from the bad, we can translate them all and rank the translations with a language model.
$ wc -l unk.1line.txt
48248 unk.1line.txt
Translate
Run the file you just made through the rest of the pipeline:
$ cat unk.1line.txt | sed 's/$/^.<sent>$/g' | /home/fran/local/bin/apertium-pretransfer|\
apertium-transfer /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t1x /home/fran/local/share/apertium/apertium-is-en/is-en.t1x.bin /home/fran/local/share/apertium/apertium-is-en/is-en.autobil.bin |\
apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t2x /home/fran/local/share/apertium/apertium-is-en/is-en.t2x.bin |\
apertium-interchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t3x /home/fran/local/share/apertium/apertium-is-en/is-en.t3x.bin |\
apertium-postchunk /home/fran/local/share/apertium/apertium-is-en/apertium-is-en.is-en.t4x /home/fran/local/share/apertium/apertium-is-en/is-en.t4x.bin |\
lt-proc -g /home/fran/local/share/apertium/apertium-is-en/is-en.autogen.bin |\
lt-proc -p /home/fran/local/share/apertium/apertium-is-en/is-en.autopgen.bin > unk.trans.txt