User:Ragib06/GSoC2011 Challenge Answer


Challenge 1: Updating Bengali-English Bilingual Dictionary

Difficulty: Easy

Title: Update the Bengali-English bilingual dictionary with 100 new entries. Instructions on how to add more entries are given in the article "Updating Bilingual Dictionary".

Deliverable: A patch file

Answer

Assuming that we are in the apertium-bn-en folder, we can extract the adjectives from the Bengali monolingual dictionary apertium-bn-en.bn.dix with:

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
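
For readers who find the shell pipeline hard to follow, the same extraction can be sketched in Python. This is a minimal sketch, not part of the original answer: the function name `extract_analyses` and the sample lines are made up for illustration, and it assumes that `dev/uniq.py` only removes duplicate lines.

```python
import re

def extract_analyses(expanded_lines, tag="<adj>"):
    """Return the unique analyses containing `tag`, in first-seen order."""
    seen = set()
    result = []
    for line in expanded_lines:
        line = line.rstrip("\n")
        if tag not in line:                      # grep '<adj>'
            continue
        # Direction-restricted entries appear as surface:>:analysis or
        # surface:<:analysis; reduce both to surface:analysis (the two seds).
        line = line.replace(":>:", ":").replace(":<:", ":")
        parts = line.split(":")
        # cut -f2 -d':' keeps the second field (the whole line if no colon).
        analysis = parts[1] if len(parts) > 1 else line
        analysis = re.sub(r"<comp>|<sup>", "", analysis)
        if analysis not in seen:                 # assumed job of dev/uniq.py
            seen.add(analysis)
            result.append(analysis)
    return result

# Hypothetical lt-expand lines, for illustration only.
sample = [
    "অজানা:অজানা<adj><mf>",
    "তরল:>:তরল<adj><mf>",
    "তরল:তরল<adj><mf>",
    "সূত্র:সূত্র<n><nt><nn>",
]
print(extract_analyses(sample))
```

Passing `tag="<n>"` would select nouns instead; the comparative and superlative tags are dropped so that each adjective is counted once.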

Say this expands to #1 adjectives. To get the list of adjectives that the current system already handles, we can run the following (taken from the article mentioned above):

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
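
The filter at the end of this pipeline can be written out in Python. This is a minimal sketch, assuming the three temporary files line up row by row as `paste` requires; the function name and the sample strings are hypothetical, and real transfer output will look different. In Apertium output, `@` marks a lemma that was not found in the bilingual dictionary.

```python
def covered_entries(inputs, transferred, generated):
    """Keep inputs whose row contains neither '+' nor '@' in any column."""
    kept = []
    for src, mid, out in zip(inputs, transferred, generated):
        row = "\t".join((src, mid, out))          # paste foo1 foo2 foo3
        if "+" in row or "@" in row:              # egrep -v '\+' | egrep -v '@'
            continue
        kept.append(src)                          # cut -f1
    return kept

# Hypothetical rows, for illustration only.
inputs = ["lemma1<adj><mf>", "lemma2<adj><mf>"]
transferred = ["^good<adj>$", "^@lemma2<adj><mf>$"]
generated = ["good", "@lemma2"]
print(covered_entries(inputs, transferred, generated))
```

Only the first entry survives, because the second one carries the `@` mark in its transfer and generation output.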

Say this gives #2 adjectives. After saving both lists, a simple script can pick around 100 adjectives from the (#1 - #2) entries that are not yet covered. Say we save these new adjectives in a file "newadj.txt". The English lemma for each entry then has to be added by hand, separated from the Bengali entry by a tab. A sample:

অজানা<adj><mf>	unknown
তরল<adj><mf>	liquid
মানসিক<adj><mf>	mental
ঝুঁকিপূর্ণ<adj><mf>	risky

Now, for POS tagging the English lemmas, we can again use the code from the article mentioned above:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 |  lt-proc en-bn.automorf.bin |  \
perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3
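
The tagging logic can be summarised in Python. This is a simplified sketch rather than a translation of the regular expressions above: it takes the first reading that contains `<adj>` and otherwise falls back to `word<adj>`, as the awk step does. The function name and the sample analyser lines are assumptions for illustration; real `lt-proc` output depends on the English dictionary.

```python
import re

def pick_adjective(word, analyser_line):
    """Return the first <adj> reading of an analyser line, or word<adj>."""
    body = analyser_line.strip()
    # Analyser output has the shape ^surface/reading1/reading2$.
    if body.startswith("^") and body.endswith("$"):
        body = body[1:-1]
    readings = body.split("/")[1:]               # drop the surface form
    for reading in readings:
        reading = re.sub(r"<comp>|<sup>", "", reading)
        if "<adj>" in reading:
            return reading
    # No adjective reading found: assume the word is a plain adjective.
    return word + "<adj>"

# Hypothetical analyser lines, for illustration only.
print(pick_adjective("risky", "^risky/risky<adj><sint>$"))
print(pick_adjective("liquid", "^liquid/liquid<n><sg>/liquid<adj>$"))
print(pick_adjective("foo", "^foo/*foo$"))
```

The fallback means that an English lemma missing from the analyser still receives an `<adj>` tag, so such lemmas have to be checked against the English dictionary afterwards.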

Finally, to put things together we run this:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | 
perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print "    <e><p><l>"$1"</l><r>"$2"</r></p></e>" }' > apertium-bn-en.bn-en.dix.patch

And we have our new entries in apertium-bn-en.bn-en.dix.patch. Some samples:

<e><p><l>অজানা<s n="adj"/><s n="mf"/></l><r>unknown<s n="adj"/></r></p></e>
<e><p><l>তরল<s n="adj"/><s n="mf"/></l><r>liquid<s n="adj"/></r></p></e>
<e><p><l>মানসিক<s n="adj"/><s n="mf"/></l><r>mental<s n="adj"/></r></p></e>
<e><p><l>ঝুঁকিপূর্ণ<s n="adj"/><s n="mf"/></l><r>risky<s n="adj"/><s n="sint"/></r></p></e>
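
The conversion from tagged pairs to dictionary entries can also be sketched as a small Python function. This is an illustrative equivalent of the perl and awk steps, with a hypothetical function name; like the shell version, it does not escape XML special characters.

```python
import re

def bidix_entry(left, right):
    """Build one bilingual dictionary <e> line from two tagged lemmas."""
    def to_xml(side):
        # Spaces inside multiword lemmas become <b/> first, then every
        # <tag> becomes <s n="tag"/>; <b/> is untouched by the regex.
        side = side.replace(" ", "<b/>")
        return re.sub(r"<(\w+)>", r'<s n="\1"/>', side)
    return "    <e><p><l>%s</l><r>%s</r></p></e>" % (to_xml(left), to_xml(right))

print(bidix_entry("অজানা<adj><mf>", "unknown<adj>"))
print(bidix_entry("ঝুঁকিপূর্ণ<adj><mf>", "risky<adj><sint>"))
```

Each call produces one line in the same shape as the samples above, with four leading spaces of indentation.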

Now we need to check that the English lemmas in the new list are actually present in apertium-bn-en.en.dix. To do this, we generate the list of lemmas we intend to add and the list of lemmas already in the dictionary, and then compare the two:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 > myadjs.txt
lt-expand apertium-bn-en.en.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<.*//g' | python dev/uniq.py > existingadjs.txt

Comparing the two lists showed 40 entries that are not in apertium-bn-en.en.dix. These need to be added to the English monodix for the transfer system to work with the new entries.
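
The comparison itself is not shown above; one way to do it is sketched below. The function name is hypothetical, and the sketch assumes that both files hold one bare lemma per line, as the two commands produce.

```python
def missing_lemmas(wanted_lines, existing_lines):
    """Return the wanted lemmas absent from the existing ones, in order."""
    existing = {line.strip() for line in existing_lines if line.strip()}
    missing = []
    for line in wanted_lines:
        lemma = line.strip()
        if lemma and lemma not in existing and lemma not in missing:
            missing.append(lemma)
    return missing

# Made-up lists, for illustration only.
wanted = ["unknown\n", "liquid\n", "risky\n"]
existing = ["liquid\n", "big\n"]
print(missing_lemmas(wanted, existing))
```

With the real files, the two arguments could be the open file objects for myadjs.txt and existingadjs.txt, both opened with `encoding="utf-8"`.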


Answer: More (100 new nouns)

Grab the nouns from the monodix:

lt-expand apertium-bn-en.bn.dix| grep '<n>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/(<sg>|<pl>).*//g' | python dev/uniq.py > noun.total

Grab the nouns already covered by the bidix:

lt-expand apertium-bn-en.bn.dix| grep '<n>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' |
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin |
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 |
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 |
perl -pe 's/(<sg>|<pl>).*//g' | python dev/uniq.py > noun.existing

Python code to grab the unmatched entries (getnew.py):

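#!/usr/bin/env python3
# getnew.py: print up to LIMIT lines that occur in FILE1 but not in FILE2.
import sys

if len(sys.argv) != 4:
    print('usage: getnew.py FILE1 FILE2 LIMIT (LIMIT = -1 prints everything)')
    sys.exit(1)
FILE1 = sys.argv[1]
FILE2 = sys.argv[2]
limit = int(sys.argv[3])
with open(FILE1, 'r', encoding='utf-8') as f1:
    li1 = f1.readlines()
with open(FILE2, 'r', encoding='utf-8') as f2:
    li2 = f2.readlines()
# Lines present in FILE1 only; a set difference has no guaranteed order.
r = list(set(li1) - set(li2))
# Never index past the end of the result; -1 means "no limit".
if limit == -1 or limit > len(r):
    limit = len(r)
for i in range(limit):
    print(r[i], end='')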

To use it, we just need to run:

./getnew.py noun.total noun.existing 100

And here's some of the output:

ভরা<n><mf><nn>
সূত্র<n><nt><nn>
নির্দেশ<n><mf><nn>
ধ্বংস<n><mf><nn>
কারাদণ্ড<n><mf><nn>