Bengali and English/Updating Bilingual Dictionary
We are going to add more adjective entries to the bn-en bdix. Assuming that we are in the apertium-bn-en folder (download it from the svn), try this:
lt-expand apertium-bn-en.bn.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
  sed 's/$/^.<sent>$/g' | apertium-pretransfer | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
  apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin | tee /tmp/foo2 | \
  lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
  perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
We used grep '<adj>' to filter out the adjectives, and perl -pe 's/<comp>|<sup>//g' to remove the inflection tags from every adjective entry. Then we used uniq.py to keep only the unique entries, instead of the shell's 'uniq', which is not fully Unicode compliant.
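The contents of dev/uniq.py are not shown here, so the following is only a minimal sketch of what a Unicode-aware de-duplication step might look like. The function name unique_lines, the NFC normalisation and the order-preserving behaviour are all assumptions made for illustration, not a description of the real script.

```python
import unicodedata


def unique_lines(lines):
    """Yield each distinct line once, keeping first-seen order (hypothetical helper)."""
    seen = set()
    for line in lines:
        # Strip only the line terminator so inner tabs and spaces survive.
        entry = line.rstrip("\n")
        # Assumption: compare NFC-normalised forms so that visually identical
        # Bengali strings with different code point sequences count as one.
        key = unicodedata.normalize("NFC", entry)
        if key and key not in seen:
            seen.add(key)
            yield entry


sample = ["ভিন্ন<adj><mf>\n", "শিক্ষিত<adj><mf>\n", "ভিন্ন<adj><mf>\n"]
for entry in unique_lines(sample):
    print(entry)
```

Because Python 3 strings are sequences of code points, the comparison does not depend on the shell locale, which is the likely reason for avoiding 'uniq' in the pipeline.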
Assume that this output is saved in the file dev/bdix/adjective.list. Here is what the file looks like at first glance:
চিহ্নিত<adj><mf>
মঞ্চস্থ<adj><mf>
ভিন্ন<adj><mf>
শিক্ষিত<adj><mf>
শাসক<adj><mf>
অন্তর্ভুক্ত<adj><mf>
Now we add the corresponding English entries to this file. After adding them, the file looks like this:
চিহ্নিত<adj><mf>	marked
মঞ্চস্থ<adj><mf>	#
ভিন্ন<adj><mf>	different
শিক্ষিত<adj><mf>	educated
শাসক<adj><mf>	ruler !
অন্তর্ভুক্ত<adj><mf>	included
We could add these entries with the help of an external en-bn dix; right now we are adding them manually. If you look closely, the entries are tab separated and there are a few special characters: # marks an unknown entry, ! marks a different POS (here, the entry 'শাসক' was mistakenly tagged as an adjective in the monodix) and ? marks any other confusion. Any entry carrying one of these symbols is discarded in all subsequent processing.
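The discard rule can be sketched in Python as follows. This is an illustration only: the function name keep_clean_entries is hypothetical, and skipping lines that have no translation yet is an extra assumption that the shell pipeline itself does not make.

```python
MARKERS = ("#", "!", "?")


def keep_clean_entries(lines):
    """Return (bengali, english) pairs for lines without review markers."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        # Like egrep -v '\#|\?|\!': drop the whole line if a marker appears anywhere.
        if any(marker in line for marker in MARKERS):
            continue
        fields = line.split("\t")
        # Assumption: a line with no second field has not been translated yet.
        if len(fields) < 2 or not fields[1].strip():
            continue
        pairs.append((fields[0], fields[1].strip()))
    return pairs


sample = [
    "চিহ্নিত<adj><mf>\tmarked\n",
    "মঞ্চস্থ<adj><mf>\t#\n",
    "শাসক<adj><mf>\truler !\n",
    "ভিন্ন<adj><mf>\tdifferent\n",
]
print(keep_clean_entries(sample))
```

Only the 'marked' and 'different' lines survive in this sample; the other two carry # and ! respectively.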
After adding all the entries, we can try POS-tagging the English words. We use the en-es dix for this purpose, since its English monodix has more entries than ours.
cat dev/bdix/adjective.list | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 | lt-proc ../../trunk/apertium-en-es-0.6/en-es.automorf.bin | \
  perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
  perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3
There will certainly be entries that are not in the en-es monodix. Since we know for sure that they are adjectives, we tag them with <adj>.
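The fallback performed by the awk step can be written out in Python to make the rule explicit. This is a sketch only; the function name tag_adjective is hypothetical, and it takes the word and its analysis as two separate arguments rather than splitting on whitespace as awk does by default.

```python
def tag_adjective(word, analysis):
    """Return 'word<TAB>tagged form', falling back to a bare <adj> tag."""
    # Keep the analysis produced by the en-es analyser when it contains <adj>.
    if "<adj>" in analysis:
        return f"{word}\t{analysis}"
    # Otherwise the word was unknown or had no adjective reading: tag it ourselves.
    return f"{word}\t{word}<adj>"


print(tag_adjective("marked", "marked<adj>"))
print(tag_adjective("included", "*included"))
```

The second call shows an analysis without an adjective reading: it is replaced by the word itself followed by <adj>.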
Now we open the /tmp/bar3 file and check the entries, especially the ones marked only as <adj>. We also need to mark them with <sint> if they are synthetic.
We can now merge the tagged English and Bengali parts into bdix entries; we also need to change the tags to the bdix format.
cat dev/bdix/adjective.list | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print " <e><p><l>"$1"</l><r>"$2"</r></p></e>" }'
The output will look something like this.
<e><p><l>চিহ্নিত<s n="adj"/><s n="mf"/></l><r>marked<s n="adj"/></r></p></e>
<e><p><l>ভিন্ন<s n="adj"/><s n="mf"/></l><r>different<s n="adj"/></r></p></e>
<e><p><l>শিক্ষিত<s n="adj"/><s n="mf"/></l><r>educated<s n="adj"/></r></p></e>
<e><p><l>অন্তর্ভুক্ত<s n="adj"/><s n="mf"/></l><r>included<s n="adj"/></r></p></e>
We can now safely add these entries to the appropriate section of the bdix.
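For reference, the tag rewriting and wrapping done by the perl and awk steps can be sketched as one Python function. The name to_bdix_entry is hypothetical and the code is an illustration under that assumption, not part of the apertium-bn-en tools.

```python
import re


def to_bdix_entry(bengali, english):
    """Build one <e> element from two tagged strings such as 'ভিন্ন<adj><mf>'."""

    def convert(side):
        # Spaces inside multiword lemmas become <b/>, as in the perl step.
        side = side.replace(" ", "<b/>")
        # <adj> becomes <s n="adj"/>; <b/> is left alone because '/' is not
        # a word character and so the pattern cannot match it.
        return re.sub(r"<(\w+)>", r'<s n="\1"/>', side)

    return f"<e><p><l>{convert(bengali)}</l><r>{convert(english)}</r></p></e>"


print(to_bdix_entry("ভিন্ন<adj><mf>", "different<adj>"))
```

The order matters in the same way as in the shell version: spaces are converted to <b/> first, and the tag pattern is applied afterwards.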