User:Ragib06/GSoC2011 Challenge Answer

From Apertium

Challenge 1: Updating Bengali-English Bilingual Dictionary

Difficulty: Easy

Title: Update the Bengali-English bilingual dictionary with 100 new entries. Instructions on how to add more entries are described in the article Updating Bilingual Dictionary.

Deliverable: A patch file

Answer

Assuming that we are in the apertium-bn-en folder, we can run this to extract the adjectives from apertium-bn-en.bn.dix:

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

This expands to 866 adjectives. Now, to get the list of adjectives that the current system already handles, we can run this (following the article mentioned above):

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

And we get 87 adjectives. After saving both adjective lists, we can run a simple script to pick around 100 adjectives from the 866 - 87 = 779 entries that are not yet listed. Say we save these new adjectives in a file, "newadj.txt". We then need to add the English lemma for each one by hand, separated from the Bengali entry by a tab. A sample might look like this:

অজানা<adj><mf>	unknown
তরল<adj><mf>	liquid
মানসিক<adj><mf>	mental
অসহায়<adj><mf>	helpless
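The "simple script" that picks the unlisted adjectives is not shown on this page. A minimal sketch of how it could work follows, assuming both lists were saved one entry per line; the function names `read_lines` and `pick_candidates` and the default limit of 100 are illustrative assumptions, not part of the Apertium tooling.

```python
def read_lines(path):
    """Read a UTF-8 file and return its non-empty lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def pick_candidates(all_adjs, known_adjs, limit=100):
    """Return up to `limit` entries of all_adjs that are absent from known_adjs.

    Order follows all_adjs, and duplicate entries are emitted only once.
    """
    skip = set(known_adjs)
    picked = []
    for adj in all_adjs:
        if len(picked) >= limit:
            break
        if adj in skip:
            continue
        skip.add(adj)  # remember the entry so it is not picked twice
        picked.append(adj)
    return picked
```

With the two lists stored in, say, all_adjs.txt and known_adjs.txt (hypothetical file names), the result of `pick_candidates(read_lines("all_adjs.txt"), read_lines("known_adjs.txt"))` can be written to newadj.txt before the English lemmas are filled in.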

Now, for POS tagging, we can again use the code from the article mentioned above:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 |  lt-proc en-bn.automorf.bin |  \
perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3

Finally, to put things together we run this:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | \
perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print "    <e><p><l>"$1"</l><r>"$2"</r></p></e>" }' > apertium-bn-en.bn-en.dix.patch

And we have our new entries in apertium-bn-en.bn-en.dix.patch. Some samples:

    <e><p><l>অজানা<s n="adj"/><s n="mf"/></l><r>unknown<s n="adj"/></r></p></e>
    <e><p><l>তরল<s n="adj"/><s n="mf"/></l><r>liquid<s n="adj"/></r></p></e>
    <e><p><l>মানসিক<s n="adj"/><s n="mf"/></l><r>mental<s n="adj"/></r></p></e>
    <e><p><l>অসহায়<s n="adj"/><s n="mf"/></l><r>helpless<s n="adj"/></r></p></e>
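To make the formatting step easier to follow, here is a rough Python equivalent of what the perl and awk commands above do to each pair of tagged forms. It is only a sketch for illustration; the names `to_symbols` and `make_entry` are assumptions, and the pipeline above remains the method actually used.

```python
import re


def to_symbols(tagged):
    """Rewrite spaces as <b/> and each <tag> as <s n="tag"/>."""
    tagged = tagged.replace(" ", "<b/>")
    # <b/> contains a slash, so this pattern leaves it untouched.
    return re.sub(r"<(\w+)>", r'<s n="\1"/>', tagged)


def make_entry(left, right):
    """Build one bilingual dictionary entry line from two tagged forms."""
    return ("    <e><p><l>" + to_symbols(left) + "</l><r>"
            + to_symbols(right) + "</r></p></e>")
```

For example, `make_entry("তরল<adj><mf>", "liquid<adj>")` yields the second sample line shown above.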

Now we need to check whether the English lemmas in the new list are actually present in apertium-bn-en.en.dix. To do this, we can generate the list of lemmas we intend to add and the list of lemmas that are already there, and then compare the two:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 > myadjs.txt
lt-expand apertium-bn-en.en.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<.*//g' | python dev/uniq.py > existingadjs.txt

Comparing these two files with a simple script, 40 entries were found that are not in "existingadjs.txt" and thus not in apertium-bn-en.en.dix. We need to add those for the transfer system to work with the new entries.
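The comparison script itself is not given here. One possible sketch is below, assuming each input is a list of lines; `lemma_of` and `missing_lemmas` are hypothetical names. It also strips any tags from the existing entries, so it works whether or not the perl step `s/<.*//g` has already been applied.

```python
def lemma_of(analysis):
    """Drop everything from the first '<' onward, leaving the bare lemma."""
    return analysis.split("<", 1)[0]


def missing_lemmas(wanted, existing):
    """Return the sorted lemmas in `wanted` that do not occur in `existing`."""
    known = {lemma_of(line.strip()) for line in existing}
    return sorted({w.strip() for w in wanted if w.strip() not in known})
```

Fed with the lines of myadjs.txt and existingadjs.txt, the returned list contains the English adjectives that still have to be added to apertium-bn-en.en.dix.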