User:Ragib06/GSoC2011 Challenge Answer

Revision as of 20:45, 11 April 2011

Challenge 1: Updating Bengali-English Bilingual Dictionary

Difficulty: Easy

Title: Update the Bengali-English bilingual dictionary with 100 new entries. Instructions on how to add more entries are described in the article Updating Bilingual Dictionary.

Deliverable: A patch file

Answer

Assuming that we are in the apertium-bn-en folder, we can run this to extract the adjectives from apertium-bn-en.bn.dix:

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py
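
To keep this list around and get the count referred to as #1 below, we can redirect the output to a file and count its lines; adj.total here is just an example name, mirroring the noun.total file used in the nouns section later:

lt-expand apertium-bn-en.bn.dix | grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py > adj.total
wc -l < adj.total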

Say there are #1 adjectives in the expanded list. Now, to get the list of adjectives already handled by the current system, we can run this (following the article mentioned above):

lt-expand apertium-bn-en.bn.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' | \
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin | \
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 | \
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 | \
perl -pe 's/<comp>|<sup>//g' | python dev/uniq.py

And we get, say, #2 adjectives. Saving both adjective lists, we can then run a simple script to pick around 100 adjectives from the (#1 - #2) entries that are not yet covered (a sketch of such a script is given after the sample below). Say we save these new adjectives into a file "newadj.txt". Now we need to add the English lemma mappings manually. A sample may look like this:

অজানা<adj><mf>	unknown
তরল<adj><mf>	liquid
মানসিক<adj><mf>	mental
ঝুঁকিপূর্ণ<adj><mf>	risky
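
The "simple script" can be as little as a comm one-liner. The sketch below assumes the two adjective lists from the pipelines above have been saved to files, here called adj.total and adj.existing (example names, mirroring noun.total and noun.existing in the nouns section); it prints up to 100 expanded-but-not-yet-translated adjectives, which then get their English side added by hand to form newadj.txt:

comm -13 <(sort adj.existing) <(sort adj.total) | head -n 100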

Now, for POS tagging the English side we can again use the code from the article mentioned above:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 | tee /tmp/bar1 |  lt-proc en-bn.automorf.bin |  \
perl -pe 's/(\^)(.*?)(\/)//' | perl -pe 's/<comp>|<sup>//' | perl -pe 's/\/?\w*?(<vblex>|<n>|<adv>|<pr>|<prn>|<np>|<predet>)(<\w+>)*(\/|\$)?//g' | \
perl -pe 's/\$//g' > /tmp/bar2 && paste /tmp/bar1 /tmp/bar2 | awk '{ if($2 ~ /<adj>/) { print $0 } else { print $1"\t"$1"<adj>" } }' > /tmp/bar3
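
The awk at the end keeps the analysis from lt-proc when it already carries an <adj> tag and otherwise falls back to tagging the English word as a plain adjective, so each row of /tmp/bar3 pairs an English lemma with its tagged form. Judging from the final entries shown further below, the rows would look roughly like this (illustrative, not actual output):

unknown	unknown<adj>
risky	risky<adj><sint>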

Finally, to put things together we run this:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f1 | perl -pe 's/ /<b\/>/g' > /tmp/bar4 && cat /tmp/bar3 | cut -f2 | 
perl -pe 's/ /<b\/>/g' > /tmp/bar5
paste /tmp/bar4 /tmp/bar5 | perl -pe 's/<(\w+)>/<s n="$1"\/>/g' | awk -F'\t' '{ print "    <e><p><l>"$1"</l><r>"$2"</r></p></e>" }' > apertium-bn-en.bn-en.dix.patch

And we have our new entries in apertium-bn-en.bn-en.dix.patch. Some samples:

<e><p><l>অজানা<s n="adj"/><s n="mf"/></l><r>unknown<s n="adj"/></r></p></e>
<e><p><l>তরল<s n="adj"/><s n="mf"/></l><r>liquid<s n="adj"/></r></p></e>
<e><p><l>মানসিক<s n="adj"/><s n="mf"/></l><r>mental<s n="adj"/></r></p></e>
<e><p><l>ঝুঁকিপূর্ণ<s n="adj"/><s n="mf"/></l><r>risky<s n="adj"/><s n="sint"/></r></p></e>
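
The challenge deliverable is a patch file; if an actual unified diff is wanted rather than just these generated <e> lines, one way (assuming an SVN checkout of the language pair) is to paste the entries into the main <section> of apertium-bn-en.bn-en.dix and then run:

svn diff apertium-bn-en.bn-en.dix > bn-en.dix.patch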

Now we need to check whether the English lemmas entered in the new list are actually present in apertium-bn-en.en.dix. To do this, we can simply generate the list of lemmas we intend to add and the list of those already there, and then compare the two:

cat newadj.txt | egrep -v '\#|\?|\!' | cut -f2 > myadjs.txt
lt-expand apertium-bn-en.en.dix| grep '<adj>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/<.*//g' | python dev/uniq.py > existingadjs.txt
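
The comparison itself can be a one-liner; for example, the following lists the English lemmas we intend to use that are still missing from the English monodix (sort -u also folds duplicate lemmas in myadjs.txt):

comm -23 <(sort -u myadjs.txt) <(sort existingadjs.txt)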

Comparing these two lists, 40 entries were found that are not in apertium-bn-en.en.dix. We need to add those to the English monodix to get the transfer system working with the new entries.


Answer: More (100 new nouns)

Grab the nouns from the Bengali monodix:

lt-expand apertium-bn-en.bn.dix| grep '<n>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | perl -pe 's/(<sg>|<pl>).*//g' | python dev/uniq.py > noun.total

Grab the nouns already handled by the bilingual dictionary (bidix):

lt-expand apertium-bn-en.bn.dix| grep '<n>' | sed 's/:>:/:/g' | sed 's/:<:/:/g' | cut -f2 -d':' | tee /tmp/foo1 | sed 's/^/^/g' | sed 's/$/$/g' |
sed 's/$/^.<sent>$/g'  | apertium-pretransfer  | apertium-transfer apertium-bn-en.bn-en.t1x bn-en.t1x.bin bn-en.autobil.bin |
apertium-interchunk apertium-bn-en.bn-en.t2x bn-en.t2x.bin | apertium-postchunk apertium-bn-en.bn-en.t3x bn-en.t3x.bin  | tee /tmp/foo2 |
lt-proc -g bn-en.autogen.bin > /tmp/foo3 && paste /tmp/foo1 /tmp/foo2 /tmp/foo3 | egrep -v '\+' | egrep -v '@' | cut -f1 |
perl -pe 's/(<sg>|<pl>).*//g' | python dev/uniq.py > noun.existing

Python code to grab the unmatched entries (getnew.py):

#!/usr/bin/env python
# getnew.py: print the entries of FILE1 that do not appear in FILE2,
# up to LIMIT entries (pass -1 to print all of them).
import sys

if len(sys.argv) != 4:
    print 'usage: getnew.py FILE1 FILE2 LIMIT'
    sys.exit(1)

FILE1 = sys.argv[1]
FILE2 = sys.argv[2]
limit = int(sys.argv[3])

# read both lists, one entry per line
with open(FILE1, 'r') as f1:
    li1 = f1.readlines()
with open(FILE2, 'r') as f2:
    li2 = f2.readlines()

# set difference: entries present in FILE1 but missing from FILE2
r = list(set(li1) - set(li2))
if limit == -1 or limit > len(r):
    limit = len(r)
for i in range(limit):
    print r[i],  # each entry still ends in '\n', hence the trailing comma

To use this, we just need to run:

./getnew.py noun.total noun.existing 100
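
For ./getnew.py to work, the script needs the shebang line shown above and execute permission; alternatively it can be run through the interpreter directly:

chmod +x getnew.py
python getnew.py noun.total noun.existing 100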

And here's some of the output:

ভরা<n><mf><nn>
সূত্র<n><nt><nn>
নির্দেশ<n><mf><nn>
ধ্বংস<n><mf><nn>
কারাদণ্ড<n><mf><nn>
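
As a side note, passing -1 as the limit makes getnew.py print every remaining untranslated noun, so counting that output shows how many candidates are left in total:

./getnew.py noun.total noun.existing -1 | wc -l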