Difference between revisions of "User:Ragib06/GSoC2011 Challenge Answer"
Revision as of 19:27, 6 April 2011
Challenge 1: Updating Bengali-English Bilingual Dictionary
Difficulty: Easy
Title: Update the Bengali English Bilingual Dictionary with 100 new entries. Instructions on how to add more entries are described in the article Updating Bilingual Dictionary
Deliverable: A patch file
Answer
Assuming that we are in the apertium-bn-en folder, we can run this to extract the adjectives from apertium-bn-en.bn.dix:
lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
There are 866 adjectives expanded. Now, to get the adjective list from the current system we can run this (according to the article mentioned above):
lt-expand apertium-bn-en.bn.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g' | apertium-pretransfer | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
And we get 87 adjectives. After saving both adjective lists, we can run a simple script to pick around 100 adjectives from the 866 - 87 = 779 entries that are not yet listed. Say we save these new adjectives into a file "newadj.txt". Next, we need to add the English lemma mappings manually. A sample might look like this:
অজানা<adj><mf> unknown
তরল<adj><mf> liquid
মানসিক<adj><mf> mental
ঝুঁকিপূর্ণ<adj><mf> risky
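The "simple script" for picking the unlisted adjectives is not shown on this page. A minimal sketch in Python, assuming both adjective lists were saved one entry per line, might look like the following. The function names and the `limit` parameter are our own illustration and are not part of the Apertium tools.

```python
def strip_degree_tags(form):
    """Drop <comp>/<sup> so comparative and superlative forms collapse onto the base entry."""
    return form.replace("<comp>", "").replace("<sup>", "")


def pick_new_adjectives(all_forms, known_forms, limit=100):
    """Return up to `limit` entries of all_forms that are absent from known_forms.

    Input order is preserved and duplicates are skipped.
    """
    known = {strip_degree_tags(f.strip()) for f in known_forms}
    picked = []
    seen = set()
    for raw in all_forms:
        form = strip_degree_tags(raw.strip())
        if not form or form in known or form in seen:
            continue
        seen.add(form)
        picked.append(form)
        if len(picked) == limit:
            break
    return picked


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

With the 866-entry list and the 87-entry list saved under names of our choosing, `pick_new_adjectives(read_lines(all_path), read_lines(known_path))` would give the candidates to write into newadj.txt before the English lemmas are filled in by hand.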
Now, for pos tagging we can again use the code from the mentioned article:
cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 | lt-proc en-bn.automorf.bin | \
perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3
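The last awk step in the command above decides which tagged form ends up in the second column of /tmp/bar3. As a rough Python sketch of that rule only (the function name is hypothetical, and the analysis is assumed to be already cleaned by the perl filters):

```python
def tag_adjective(lemma, analysis):
    """Pick the tagged form for one English lemma.

    If the cleaned analyser output already carries an <adj> reading, keep it
    (this preserves extra tags such as <sint>); otherwise fall back to a
    plain lemma<adj> tag.
    """
    if "<adj>" in analysis:
        return analysis
    return lemma + "<adj>"
```

So a lemma that the English analyser knows as an adjective keeps its full tag sequence, while an unknown lemma is simply assumed to be an adjective.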
Finally, to put things together we run this:
cat newadj.txt | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print " <e><p><l>"$1"</l><r>"$2"</r></p></e>" }' > apertium-bn-en.bn-en.dix.patch
And we have our new entries in apertium-bn-en.bn-en.dix.patch. Some samples:
<e><p><l>অজানা<s n="adj"/><s n="mf"/></l><r>unknown<s n="adj"/></r></p></e>
<e><p><l>তরল<s n="adj"/><s n="mf"/></l><r>liquid<s n="adj"/></r></p></e>
<e><p><l>মানসিক<s n="adj"/><s n="mf"/></l><r>mental<s n="adj"/></r></p></e>
<e><p><l>ঝুঁকিপূর্ণ<s n="adj"/><s n="mf"/></l><r>risky<s n="adj"/><s n="sint"/></r></p></e>
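To make the conversion from tagged pairs to dictionary entries easier to follow, here is a small Python sketch that mirrors the perl and awk steps of the final command. It is only an illustration: the function names are ours, and it omits the leading indentation that the awk command prints.

```python
import re


def to_symbols(tagged):
    """Convert 'risky<adj><sint>' to 'risky<s n="adj"/><s n="sint"/>'.

    Spaces inside multi-word lemmas become <b/> first, as in the shell pipeline;
    the tag pattern does not match <b/> because of the slash.
    """
    tagged = tagged.replace(" ", "<b/>")
    return re.sub(r"<(\w+)>", r'<s n="\1"/>', tagged)


def make_entry(left, right):
    """Build one bilingual dictionary <e> element from a Bengali and an English tagged form."""
    return "<e><p><l>{}</l><r>{}</r></p></e>".format(to_symbols(left), to_symbols(right))
```

For example, `make_entry("তরল<adj><mf>", "liquid<adj>")` produces the second sample entry shown above.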
Now we need to check whether the English lemmas entered in the new list are actually present in apertium-bn-en.en.dix. To do this, we can generate two lists, the lemmas we intend to add and the adjectives that are already there, and then compare them:
cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 > myadjs.txt
lt-expand apertium-bn-en.en.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<.*//g' | python dev/uniq.py > existingadjs.txt
Comparing these two files with a simple script, we found 40 entries that are not in "existingadjs.txt" and thus not in apertium-bn-en.en.dix. We need to add those to get the transfer system working with the new entries.
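The comparison script itself is not shown on this page. A minimal Python sketch, assuming both files contain one bare lemma per line (the function names are hypothetical), could be:

```python
def missing_lemmas(wanted, existing):
    """Return the lemmas in `wanted` that do not occur in `existing`.

    Order follows `wanted`; blank lines and repeated lemmas are skipped.
    """
    existing_set = {lemma.strip() for lemma in existing}
    missing = []
    seen = set()
    for lemma in wanted:
        lemma = lemma.strip()
        if lemma and lemma not in existing_set and lemma not in seen:
            seen.add(lemma)
            missing.append(lemma)
    return missing


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

Calling `missing_lemmas(read_lines("myadjs.txt"), read_lines("existingadjs.txt"))` would list the lemmas that still have to be added to apertium-bn-en.en.dix.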