Bengali and English/Updating Bilingual Dictionary

From Apertium

We are going to add more adjective entries to the bn-en bidix (bilingual dictionary). Assuming that we are in the apertium-bn-en folder (checked out from SVN), run the following:

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

We used grep '<adj>' to keep only the adjectives, and perl -pe 's/<comp>|<sup>//g' to remove the inflection tags from every adjective entry. We then used uniq.py rather than the shell's 'uniq' to filter out duplicate entries, because the shell's 'uniq' is not fully Unicode compliant.
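The last two stages of the pipeline can be sketched in Python as follows. This is only an illustration of the idea, not the contents of dev/uniq.py, which are not shown here; the function names strip_inflection and unique_lines are hypothetical. It assumes Python 3, one entry per line on standard input, and a UTF-8 locale.

```python
import re
import sys

# Same pattern as the perl step: drop comparative and superlative tags.
INFLECTION_TAGS = re.compile(r"<comp>|<sup>")


def strip_inflection(line):
    """Remove the trailing newline and any <comp>/<sup> tags from an entry."""
    return INFLECTION_TAGS.sub("", line.rstrip("\n"))


def unique_lines(lines):
    """Yield each distinct line once, in order of first appearance.

    Python 3 strings are Unicode, so Bengali entries are compared
    character by character without depending on collation rules.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    for entry in unique_lines(strip_inflection(raw) for raw in sys.stdin):
        print(entry)
```

Stripping the tags first matters: once <comp> and <sup> are gone, the different degrees of one adjective collapse into identical lines, and the duplicate filter then keeps a single copy. Unlike the shell's 'uniq', this sketch does not need sorted input, since it remembers every line it has already seen.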

Assume that this output is saved in the file dev/bdix/adjective.list. Here is what the file looks like at first glance:

চিহ্নিত<adj><mf>
মঞ্চস্থ<adj><mf>
ভিন্ন<adj><mf>
শিক্ষিত<adj><mf>
শাসক<adj><mf>
অন্তর্ভুক্ত<adj><mf>

Now we are going to add the corresponding English entries to this file. After adding them, the file looks like this:

চিহ্নিত<adj><mf>    marked
মঞ্চস্থ<adj><mf>    #
ভিন্ন<adj><mf>    different
শিক্ষিত<adj><mf>    educated
শাসক<adj><mf>    ruler    !
অন্তর্ভুক্ত<adj><mf>    included
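As a rough sketch of where this file could go next, the annotated lines might be turned into bidix entries with a small script like the one below. Several things here are assumptions rather than part of this page: the columns are taken to be separated by a tab or by two or more spaces, the English side is given only the first tag of the Bengali side, and any line carrying a '#' or '!' marker is simply skipped, because the meaning of those markers is not stated here. The function name bidix_entry is hypothetical.

```python
import re
from xml.sax.saxutils import escape

TAG_RE = re.compile(r"<([^<>]+)>")
# Assumption: columns are separated by tabs or by runs of two or more spaces.
COLUMN_SEP = re.compile(r"\t+| {2,}")
# Assumption: lines flagged with these markers are left for manual handling.
SKIP_MARKERS = {"#", "!"}


def symbols(tags):
    """Render tag names as bidix <s n="..."/> elements."""
    return "".join('<s n="%s"/>' % tag for tag in tags)


def bidix_entry(line):
    """Return a bidix <e> entry for an annotated line, or None if skipped."""
    fields = [f for f in COLUMN_SEP.split(line.strip()) if f]
    if len(fields) < 2 or SKIP_MARKERS.intersection(fields[1:]):
        return None
    source, translation = fields[0], fields[1]
    lemma = source.partition("<")[0]
    tags = TAG_RE.findall(source)
    if not lemma or not tags:
        return None
    # Spaces inside a lemma are written as <b/> in Apertium dictionaries.
    left = escape(lemma).replace(" ", "<b/>") + symbols(tags)
    # Assumption: the English side carries only the part-of-speech tag.
    right = escape(translation).replace(" ", "<b/>") + symbols(tags[:1])
    return "<e><p><l>%s</l><r>%s</r></p></e>" % (left, right)


if __name__ == "__main__":
    sample = [
        "ভিন্ন<adj><mf>    different",
        "মঞ্চস্থ<adj><mf>    #",
        "শাসক<adj><mf>    ruler    !",
    ]
    for line in sample:
        entry = bidix_entry(line)
        if entry is not None:
            print(entry)
```

Which tags belong on the English side depends on the conventions of the language pair, so the generated entries should be checked against existing adjective entries in the bidix before being pasted in.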