User:Ragib06/GSoC2011 Challenge Answer
Challenge 1: Updating Bengali-English Bilingual Dictionary
Difficulty: Easy
Title: Update the Bengali English Bilingual Dictionary with 100 new entries. Instructions on how to add more entries are described in the article Updating Bilingual Dictionary
Deliverable: A patch file
Answer
Assuming we are in the apertium-bn-en folder, we can run this to extract the adjectives from apertium-bn-en.bn.dix:
<pre>
lt-expand apertium-bn-en.bn.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
</pre>
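Here dev/uniq.py comes from the language pair's dev/ directory; I am assuming it simply prints each distinct input line once, in order of first appearance. Under that assumption, a stand-in would look like this:

<pre>
#!/usr/bin/env python
# Hypothetical stand-in for dev/uniq.py: echo each distinct stdin line once,
# keeping the order of first appearance (assumed behaviour, not the original script).
import sys

seen = set()
for line in sys.stdin:
    if line not in seen:
        seen.add(line)
        sys.stdout.write(line)
</pre>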
Say the expansion gives #1 adjectives (appending | wc -l to the command gives the count). Now, to get the adjective list that the current system can already handle, we can run this (following the article mentioned above):
<pre>
lt-expand apertium-bn-en.bn.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g' | apertium-pretransfer | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
</pre>
And we get, say, #2 adjectives. Now, saving both adjective lists, we can run a simple script to pick around 100 adjectives from the (#1 - #2) non-listed entries (a small set-difference helper like the getnew.py sketched at the end of this answer works). Say we save these new adjectives into a file newadj.txt. Next, we need to add the English lemma mappings manually; a sample may look like this:
<pre>
অজানা<adj><mf>	unknown
তরল<adj><mf>	liquid
মানসিক<adj><mf>	mental
ঝুঁকিপূর্ণ<adj><mf>	risky
</pre>
Now, for POS-tagging we can again use the code from the article mentioned above:
<pre>
cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 | lt-proc en-bn.automorf.bin | \
perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3
</pre>
Finally, to put things together we run this:
<pre>
cat newadj.txt | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print "    <e><p><l>"$1"</l><r>"$2"</r></p></e>" }' > apertium-bn-en.bn-en.dix.patch
</pre>
And we have our new entries in apertium-bn-en.bn-en.dix.patch. Some samples:
<pre>
    <e><p><l>অজানা<s n="adj"/><s n="mf"/></l><r>unknown<s n="adj"/></r></p></e>
    <e><p><l>তরল<s n="adj"/><s n="mf"/></l><r>liquid<s n="adj"/></r></p></e>
    <e><p><l>মানসিক<s n="adj"/><s n="mf"/></l><r>mental<s n="adj"/></r></p></e>
    <e><p><l>ঝুঁকিপূর্ণ<s n="adj"/><s n="mf"/></l><r>risky<s n="adj"/><s n="sint"/></r></p></e>
</pre>
Now we need to check whether the English lemmas entered in the new list are actually present in apertium-bn-en.en.dix. To do this, we can simply generate the list of lemmas we intend to add and the list of adjectives already there, and then compare the two:
<pre>
cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 > myadjs.txt
lt-expand apertium-bn-en.en.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<.*//g' | python dev/uniq.py > existingadjs.txt
</pre>
Comparing these two lists, 40 entries were found that are not in apertium-bn-en.en.dix. We need to add those to the English monodix to get the transfer system working with the new entries.
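Both this comparison and the "simple script" step above are just a set difference between two line lists. A minimal helper along those lines, called getnew.py here and reused for the nouns in the second answer, could look like this:

<pre>
#!/usr/bin/env python
# getnew.py: print the lines of FILE1 that do not appear in FILE2,
# at most LIMIT of them (LIMIT = -1 means "print them all").
# Usage: ./getnew.py FILE1 FILE2 LIMIT
import sys

if len(sys.argv) != 4:
    sys.exit('usage: getnew.py FILE1 FILE2 LIMIT')

file1, file2, limit = sys.argv[1], sys.argv[2], int(sys.argv[3])

with open(file1) as f1:
    lines1 = f1.readlines()
with open(file2) as f2:
    lines2 = f2.readlines()

# Entries in the first list that the second list does not contain.
missing = sorted(set(lines1) - set(lines2))

if limit == -1:
    limit = len(missing)

for line in missing[:limit]:
    sys.stdout.write(line)
</pre>

For example, ./getnew.py myadjs.txt existingadjs.txt -1 prints every intended English lemma that is still missing from the English monodix.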
Answer: More (100 new nouns)
Grab the nouns from the Bengali monodix:
<pre>
lt-expand apertium-bn-en.bn.dix | grep '<n>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/(<sg>|<pl>).*//g' | python dev/uniq.py > noun.total
</pre>
Grab the nouns already covered by the bidix (i.e. handled by the current system):
<pre>
lt-expand apertium-bn-en.bn.dix | grep '<n>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g' | apertium-pretransfer | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/(<sg>|<pl>).*//g' | python dev/uniq.py > noun.existing
</pre>
Grab 100 of the non-listed nouns:
<pre>
diff noun.total noun.existing --normal | egrep -v '[0-9].*' | sed 's/< //g' | head -100 > noun.new100
</pre>
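The getnew.py helper sketched in the first answer does the same job (the exact 100 entries picked may differ, since the helper sorts the set difference before taking the head):

<pre>
./getnew.py noun.total noun.existing 100 > noun.new100
</pre>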
And here's some of the output:
<pre>
বোধ<n><mf><nn>
ঝাপসা<n><nt><nn>
মেধা<n><mf><nn>
গালমন্দ<n><nt><nn>
পারিশ্রমিক<n><nt><nn>
</pre>