Bengali and English/Updating Bilingual Dictionary

Revision as of 13:05, 25 August 2009 by Darthxaher

We are going to add more adjective entries to the bn-en bilingual dictionary (bdix). Assuming that we are in the apertium-bn-en folder (check it out from SVN), run:

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

We used grep '<adj>' to select the adjectives, and perl -pe 's/<comp>|<sup>//g' to remove the inflection tags from every adjective entry. We then used uniq.py instead of the shell's 'uniq' to keep only the unique entries, because 'uniq' is not fully Unicode compliant.
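The source of dev/uniq.py is not reproduced on this page, so the following is only a minimal sketch of what an order-preserving de-duplication step could look like in Python 3, where strings are Unicode by default. The function name unique_lines and the sample data are hypothetical.

```python
def unique_lines(lines):
    """Yield each distinct, non-empty line once, in order of first appearance."""
    seen = set()
    for line in lines:
        key = line.rstrip("\n")
        # Blank lines are skipped; duplicates are dropped even when not adjacent.
        if key and key not in seen:
            seen.add(key)
            yield key


# Hypothetical sample: the first entry appears twice after the tags are stripped.
sample = ["ভিন্ন<adj><mf>\n", "শিক্ষিত<adj><mf>\n", "ভিন্ন<adj><mf>\n"]
result = list(unique_lines(sample))
print("\n".join(result))
```

Unlike the shell's 'uniq', this sketch does not require the input to be sorted, because it remembers every entry it has already seen.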

Assume that this output is saved in the file dev/bdix/adjective.list. At first glance the file looks like this:

চিহ্নিত<adj><mf>
মঞ্চস্থ<adj><mf>
ভিন্ন<adj><mf>
শিক্ষিত<adj><mf>
শাসক<adj><mf>
অন্তর্ভুক্ত<adj><mf>

Now we add the corresponding English entries to this file. Afterwards the file looks like this:

চিহ্নিত<adj><mf>    marked
মঞ্চস্থ<adj><mf>    #
ভিন্ন<adj><mf>    different
শিক্ষিত<adj><mf>    educated
শাসক<adj><mf>    ruler    !
অন্তর্ভুক্ত<adj><mf>    included

We could add these entries with the help of an external en-bn dictionary, but for now we are adding them manually. If you look closely, the fields are tab-separated and a few special characters appear. # marks an unknown entry, ! marks a different part of speech (here, the entry 'শাসক' was mistakenly tagged as an adjective in the monodix) and ? marks any other uncertainty. Any entry carrying one of these symbols is discarded in all subsequent processing.
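The marker convention can be sketched in Python as follows. This is an illustration only, not part of the Apertium tools; the names MARKERS and usable_entries are hypothetical, and the sketch assumes that a line is dropped whenever a marker occurs anywhere in it.

```python
MARKERS = ("#", "!", "?")


def usable_entries(lines):
    """Return (bengali, english) pairs from lines that carry no marker symbol."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        # Discard the whole line if any marker occurs anywhere in it.
        if any(marker in line for marker in MARKERS):
            continue
        fields = line.split("\t")
        # Keep only lines that have a non-empty English field.
        if len(fields) >= 2 and fields[1]:
            pairs.append((fields[0], fields[1]))
    return pairs


sample = [
    "চিহ্নিত<adj><mf>\tmarked",
    "মঞ্চস্থ<adj><mf>\t#",
    "শাসক<adj><mf>\truler\t!",
    "ভিন্ন<adj><mf>\tdifferent",
]
result = usable_entries(sample)
for bengali, english in result:
    print(bengali, english, sep="\t")
```

Only the 'marked' and 'different' lines survive, since the other two carry # and ! respectively.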

After adding all the entries, we can try POS-tagging the English words. We use the en-es dictionaries for this purpose, since the English monodix there has more entries than ours.

cat dev/bdix/adjective.list | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 |  lt-proc ../../trunk/apertium-en-es-0.6/en-es.automorf.bin |  \
perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk -F'\t' '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3

There will certainly be entries that are not in the en-es monodix. Since we know that they are adjectives, we simply tag them with <adj>.
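The fallback performed by the awk step can be sketched in Python like this. The function name tag_adjective is hypothetical, and the '*included' input is only an assumed example of what remains for a word the analyser does not know.

```python
def tag_adjective(word, analysis):
    """Keep an analysis that contains <adj>; otherwise fall back to word<adj>."""
    if "<adj>" in analysis:
        return analysis
    # No adjective reading was found, so tag the surface word ourselves.
    return word + "<adj>"


known = tag_adjective("marked", "marked<adj>")
unknown = tag_adjective("included", "*included")
print(known)
print(unknown)
```

Entries produced by the fallback branch are the ones that deserve a manual check afterwards, because nothing but our own assumption says they are adjectives.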

Now we open the /tmp/bar3 file and check the entries, especially the ones tagged only as <adj>. These also need to be marked with <sint> if they are synthetic.

We can now merge the tagged English and Bengali parts into bdix entries. We also need to change the tags to the bdix format.

cat dev/bdix/adjective.list | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | 
perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print "    <e><p><l>"$1"</l><r>"$2"</r></p></e>" }'

The output will look something like this.

    <e><p><l>চিহ্নিত<s n="adj"/><s n="mf"/></l><r>marked<s n="adj"/></r></p></e>
    <e><p><l>ভিন্ন<s n="adj"/><s n="mf"/></l><r>different<s n="adj"/></r></p></e>
    <e><p><l>শিক্ষিত<s n="adj"/><s n="mf"/></l><r>educated<s n="adj"/></r></p></e>
    <e><p><l>অন্তর্ভুক্ত<s n="adj"/><s n="mf"/></l><r>included<s n="adj"/></r></p></e>
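The conversion that produces entries of this shape can be sketched in Python as follows. It mirrors the perl and awk steps only loosely; the names to_bdix_side and bdix_entry are hypothetical.

```python
import re

# Matches morphological tags such as <adj> or <mf>, but not the <b/> blank element.
TAG = re.compile(r"<(\w+)>")


def to_bdix_side(tagged):
    """Turn 'word<adj><mf>' into 'word<s n="adj"/><s n="mf"/>'."""
    # Spaces inside multiword entries become <b/> elements.
    with_blanks = tagged.replace(" ", "<b/>")
    return TAG.sub(r'<s n="\1"/>', with_blanks)


def bdix_entry(left, right):
    """Build one <e> element from a tagged Bengali side and a tagged English side."""
    return "    <e><p><l>{}</l><r>{}</r></p></e>".format(
        to_bdix_side(left), to_bdix_side(right)
    )


entry = bdix_entry("চিহ্নিত<adj><mf>", "marked<adj>")
print(entry)
```

For the pair চিহ্নিত<adj><mf> and marked<adj>, the sketch prints the same line as the first entry shown above.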

We can now safely add these entries to the appropriate section of the bdix.