Difference between revisions of "Bengali and English/Updating Bilingual Dictionary"


Revision as of 12:16, 25 August 2009

We are going to add more adjective entries to the bn-en bilingual dictionary (bdix). Assuming that we are in the apertium-bn-en folder (check it out from the svn), run the following:

lt-expand apertium-bn-en.bn.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g' | apertium-pretransfer | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

We used grep '<adj>' to select the adjectives, and perl -pe 's/<comp>|<sup>//g' to remove the inflection tags (comparative and superlative) from every adjective entry. Then we used uniq.py to keep only the unique entries, instead of the shell's 'uniq', which is not fully Unicode compliant.
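The last two stages of the pipeline (stripping the inflection tags and removing duplicates) can be sketched in Python as follows. This is only an illustration: the contents of dev/uniq.py are not shown on this page, so the function names strip_inflection_tags and unique_lines, and the sample lines, are assumptions and not the actual script.

```python
import re

# Matches the two inflection tags that the perl one-liner removes.
INFLECTION_TAGS = re.compile(r"<comp>|<sup>")


def strip_inflection_tags(line):
    """Remove <comp> and <sup> tags, like perl -pe 's/<comp>|<sup>//g'."""
    return INFLECTION_TAGS.sub("", line)


def unique_lines(lines):
    """Yield each distinct non-empty line once, in first-seen order.

    Python 3 strings are Unicode, so Bengali entries are compared as
    text rather than as raw bytes. (Hypothetical stand-in for uniq.py.)
    """
    seen = set()
    for line in lines:
        entry = line.rstrip("\n")
        if entry and entry not in seen:
            seen.add(entry)
            yield entry


# Illustrative input only; real lines come from the pipeline above.
sample = [
    "ভিন্ন<adj><mf>\n",
    "ভিন্ন<adj><mf><comp>\n",
    "ভিন্ন<adj><mf><sup>\n",
    "শিক্ষিত<adj><mf>\n",
]

for entry in unique_lines(strip_inflection_tags(line) for line in sample):
    print(entry)
```

Unlike the shell's uniq, this sketch does not require the input to be sorted, because it remembers every entry it has already seen.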

Assume that this output is saved in the file dev/bdix/adjective.list. Let's see what the file looks like at first glance.

চিহ্নিত<adj><mf>
মঞ্চস্থ<adj><mf>
ভিন্ন<adj><mf>
শিক্ষিত<adj><mf>
শাসক<adj><mf>
অন্তর্ভুক্ত<adj><mf>

Now we are going to add the corresponding English entries to this file. After adding them, the file looks like this:

চিহ্নিত<adj><mf>    marked
মঞ্চস্থ<adj><mf>    #
ভিন্ন<adj><mf>    different
শিক্ষিত<adj><mf>    educated
শাসক<adj><mf>    ruler    !
অন্তর্ভুক্ত<adj><mf>    included

We could add these entries with the help of an external en-bn dix, but right now we are adding them manually. If you look closely, the entries are tab separated and there are a few special characters: # marks an unknown entry, ! marks a different part of speech (here, the entry 'শাসক' was mistakenly tagged as an adjective in the monodix), and ? marks any other confusion. We are going to discard every entry carrying any of these symbols in later processing.
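The discarding step described above could look roughly like the following Python sketch. It assumes the file is tab separated as stated, with the Bengali entry in the first field and the English word in the second; the name parse_entries and the inline sample data are hypothetical and are not part of the apertium-bn-en tools.

```python
# Markers described above: unknown entry, wrong part of speech, other doubt.
REJECT_MARKERS = {"#", "!", "?"}


def parse_entries(lines):
    """Return (bengali, english) pairs, skipping flagged or incomplete lines."""
    pairs = []
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        fields = [f for f in fields if f]  # drop empty fields
        if len(fields) < 2:
            continue  # no English side was added yet
        if any(f in REJECT_MARKERS for f in fields[1:]):
            continue  # entry carries #, ! or ?
        pairs.append((fields[0], fields[1]))
    return pairs


# The annotated sample from this page, written with explicit tabs.
sample = [
    "চিহ্নিত<adj><mf>\tmarked\n",
    "মঞ্চস্থ<adj><mf>\t#\n",
    "ভিন্ন<adj><mf>\tdifferent\n",
    "শিক্ষিত<adj><mf>\teducated\n",
    "শাসক<adj><mf>\truler\t!\n",
    "অন্তর্ভুক্ত<adj><mf>\tincluded\n",
]

for bengali, english in parse_entries(sample):
    print(bengali, english, sep="\t")
```

On the sample above, the entries for 'মঞ্চস্থ' (marked #) and 'শাসক' (marked !) are dropped and the other four pairs are kept.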