Difference between revisions of "Bengali and English/Updating Bilingual Dictionary"

From Apertium

Revision as of 12:11, 25 August 2009

We are going to add more adjective entries to the bn-en bilingual dictionary (bdix). Assuming you are in the apertium-bn-en folder (check it out from the svn), run this:

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

We used grep '<adj>' to keep only the adjectives, and perl -pe 's/<comp>|<sup>//g' to strip the inflection tags from every adjective entry. We then used uniq.py, rather than the shell's 'uniq', to remove duplicate entries, because the shell's 'uniq' is not fully Unicode compliant.
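
The filtering, tag-stripping and de-duplication steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contents of dev/uniq.py are not shown on this page, so the function name adjective_lemmas and the order-preserving behaviour are hypothetical, and the three steps are applied here to one list of lines rather than at their exact positions in the pipeline.

```python
import re

# Matches the comparative and superlative inflection tags,
# like the pattern given to perl in the pipeline.
INFLECTION_TAGS = re.compile(r"<comp>|<sup>")


def adjective_lemmas(lines):
    """Yield each adjective entry once, in order of first appearance.

    Hypothetical helper: it mimics grep '<adj>', the perl substitution
    and a Unicode-safe de-duplication, not the real dev/uniq.py.
    """
    seen = set()
    for line in lines:
        entry = line.rstrip("\n")
        if "<adj>" not in entry:  # same role as grep '<adj>'
            continue
        entry = INFLECTION_TAGS.sub("", entry)  # same role as the perl step
        if entry not in seen:  # Python str comparison is Unicode-aware
            seen.add(entry)
            yield entry


if __name__ == "__main__":
    # Illustrative input only; the noun line is made up for the example.
    sample = [
        "ভিন্ন<adj><mf>",
        "ভিন্ন<adj><mf><comp>",
        "ভিন্ন<adj><mf><sup>",
        "শাসক<n><mf>",
        "শিক্ষিত<adj><mf>",
    ]
    for entry in adjective_lemmas(sample):
        print(entry)
```

Unlike the shell's 'uniq', which only collapses adjacent duplicate lines, this sketch removes every repeated entry regardless of position, so the input does not need to be sorted first.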

Assume this output is saved in the file dev/bdix/adjective.list. At first glance, the file looks like this:

চিহ্নিত<adj><mf>
মঞ্চস্থ<adj><mf>
ভিন্ন<adj><mf>
শিক্ষিত<adj><mf>
শাসক<adj><mf>
অন্তর্ভুক্ত<adj><mf>

Now we add the corresponding English entries to this file. After adding them, the file looks like this:

চিহ্নিত<adj><mf>    marked
মঞ্চস্থ<adj><mf>    #
ভিন্ন<adj><mf>    different
শিক্ষিত<adj><mf>    educated
শাসক<adj><mf>    ruler    !
অন্তর্ভুক্ত<adj><mf>    included
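
A file in this shape can be read back into structured records before further processing. The sketch below is a minimal example, assuming the columns are separated by a tab or by two or more spaces; the names Entry, FIELD_SEP and parse_line are hypothetical, and the optional third column (such as the ! above) is kept as raw text without interpreting it.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    source: str          # Bengali entry with its tags, e.g. ভিন্ন<adj><mf>
    target: str          # second column, copied as written (a word or #)
    flag: Optional[str]  # optional third column, copied as written


# Assumption: columns are separated by a tab or by two or more spaces.
FIELD_SEP = re.compile(r"\s{2,}|\t")


def parse_line(line):
    """Split one line of the annotated list into an Entry."""
    fields = [f for f in FIELD_SEP.split(line.strip()) if f]
    if len(fields) < 2:
        raise ValueError("no second column in line: %r" % line)
    flag = fields[2] if len(fields) > 2 else None
    return Entry(fields[0], fields[1], flag)


if __name__ == "__main__":
    for raw in ["ভিন্ন<adj><mf>    different", "শাসক<adj><mf>    ruler    !"]:
        print(parse_line(raw))
```

Splitting on a tab or a run of spaces, rather than on any single space, leaves room for a second column that itself contains a space.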