Difference between revisions of "Bengali and English/Issues"

Revision as of 12:29, 27 June 2011

Morphological Analyzer

Problem analyzing words with the enclitic 'টি': the analyzer handles "বিষয়", giving "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$", but cannot analyze "বিষয়টি". Similarly, it analyzes "বস্তু" but not "বস্তুটি".
Some words have entries in bn.dix, yet they are not analyzed by "lt-proc -a bn-en.automorf.bin".

To check all instances of this problem, I ran this:

 cat apertium-bn-en.bn.dix | grep '<e lm' | sed 's/.*lm=\"//g' | sed 's/\".*//g' | lt-proc -a bn-en.automorf.bin | grep '*'

and the output is:

^রেসলিং/*রেসলিং$
^ইঞ্জিনিয়ারিং/*ইঞ্জিনিয়ারিং$
^শুটিং/*শুটিং$
^ভবিষ্যৎ/*ভবিষ্যৎ$
^মোড়/*মোড়$
^সাক্ষাৎ/*সাক্ষাৎ$
^বোলিং/*বোলিং$
^লং/*লং$
^কোচিং/*কোচিং$
^ব্যাংকিং/*ব্যাংকিং$
^ব্যাটিং/*ব্যাটিং$
^সম্প্রদায়/*সম্প্রদায়$
^উচ্চবিদ্যালয়/*উচ্চবিদ্যালয়$
^বিদ্যুৎ/*বিদ্যুৎ$
^কি/*কি$
^ওই/*ওই$
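
The check above can also be sketched in Python, which makes it easier to keep the list of unanalyzed lemmas for later processing. This is a minimal sketch, not part of the Apertium tooling: the function names `extract_lemmas`, `unknown_words` and `run_check` are hypothetical, and it assumes that `lt-proc -a` marks an unknown word as `^form/*form$`, as in the output listed above. Unlike `grep '<e lm'`, the regular expression also matches entries where `lm` is not the first attribute.

```python
import re
import subprocess

# Matches the lm="..." attribute of an <e> entry (assumed to sit on one line).
LM_ATTR = re.compile(r'<e\s[^>]*\blm="([^"]*)"')
# Matches an lt-proc unit whose only analysis starts with '*' (unknown word).
UNKNOWN = re.compile(r'\^([^/$]*)/\*[^$]*\$')


def extract_lemmas(dix_text):
    """Return the lm values of all <e> entries, in file order."""
    return LM_ATTR.findall(dix_text)


def unknown_words(analysis_text):
    """Return the surface forms that the analyzer marked as unknown."""
    return UNKNOWN.findall(analysis_text)


def run_check(dix_path, automorf_path):
    """Feed every lemma of the dictionary to lt-proc and list the unknown ones.

    Requires lt-proc on PATH; the two paths are supplied by the caller.
    """
    with open(dix_path, encoding="utf-8") as f:
        lemmas = extract_lemmas(f.read())
    result = subprocess.run(
        ["lt-proc", "-a", automorf_path],
        input="\n".join(lemmas) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return unknown_words(result.stdout)
```

With the files named in this section, the call would be `run_check("apertium-bn-en.bn.dix", "bn-en.automorf.bin")`, run from the directory that contains both files.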

Tagset

Confusion about the animacy tag 'elite' (<el>): what is its exact definition? Are these correct examples of <el>: "ক্রেতা", "বিদ্রোহী", "সহকারী"? And are these definitely not <el>: "মেয়র", "ম্যাজিস্ট্রেট", "উপাচার্য"?

Unmatched paradigm: "মামা" and "চাচা" should be <m><hu> and use the pardef "ভাই__n_m" or "লোক__n_m", but neither pardef provides the inflections needed for "মামার" or "চাচার".

What is the exact difference between <mf><nn> and <nt><nn>? What properties are exclusive to each?

The word "মায়ের" probably has no matching pardef: the pardef "মা__n_f" has no inflection like "-য়ের".
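
The paradigm questions above come down to which surface endings a pardef actually lists. A minimal sketch of how that could be checked, assuming the usual .dix layout in which each `<e>` of a `<pardef>` holds a `<p>` pair whose `<l>` element is the surface ending. The helper names `pardef_endings` and `missing_endings` are hypothetical, and the sample pardef content is invented for illustration; it is not copied from apertium-bn-en.bn.dix.

```python
import xml.etree.ElementTree as ET


def pardef_endings(dix_xml, pardef_name):
    """Return the surface-side (<l>) strings of one pardef.

    An empty <l/> yields ''. Returns None if the pardef is not found.
    """
    root = ET.fromstring(dix_xml)
    for pardef in root.iter("pardef"):
        if pardef.get("n") == pardef_name:
            return ["".join(l.itertext()) for l in pardef.iter("l")]
    return None


def missing_endings(dix_xml, pardef_name, wanted):
    """Return the endings from `wanted` that the pardef does not list."""
    found = pardef_endings(dix_xml, pardef_name)
    if found is None:
        raise KeyError(pardef_name)
    return [ending for ending in wanted if ending not in found]


# Hypothetical sample: a pardef with a bare form and one case ending only.
SAMPLE = """<dictionary><pardefs>
  <pardef n="মা__n_f">
    <e><p><l></l><r><s n="n"/><s n="f"/><s n="sg"/><s n="nom"/></r></p></e>
    <e><p><l>কে</l><r><s n="n"/><s n="f"/><s n="sg"/><s n="obj"/></r></p></e>
  </pardef>
</pardefs></dictionary>"""

if __name__ == "__main__":
    # Lists which of the requested endings the sample pardef lacks.
    print(missing_endings(SAMPLE, "মা__n_f", ["কে", "য়ের"]))
```

For the real dictionary, the file contents would be read with `encoding="utf-8"` and passed in place of `SAMPLE`; the same call could then be repeated for "ভাই__n_m" and "লোক__n_m" with the ending "র".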

@@ Line 2: / Line 2: @@
 * Problem analyzing words with enclitic 'টি' - it can analyze "বিষয়" to "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$" but can't analyze "বিষয়টি". Similarly can "বস্তু" but not "বস্তুটি"
-* Some words have entries in bn.dix, yet they are not being analyzed with "lt-proc -a bn-en.automorf.bin". Say, for "সময়", "জাতীয়" we have corresponding entries:
+* Some words have entries in bn.dix, yet they are not being analyzed with "lt-proc -a bn-en.automorf.bin".
-      <e lm="সময়"><i>সময়</i><par n="গড়__n_mf" /></e>
-      <e lm="জাতীয়"><i>জাতীয়</i><par n="টক__adj" /></e>
-still the output:
-  echo "সময়" | lt-proc -a bn-en.automorf.bin
-  ^সময়/*সময়$
-  echo "জাতীয়" | lt-proc -a bn-en.automorf.bin
-  ^জাতীয়/*জাতীয়$
 to check all of the instances of this problem i ran this:
   cat apertium-bn-en.bn.dix | grep '<e lm' | sed 's/.*lm=\"//g' | sed 's/\".*//g' | lt-proc -a bn-en.automorf.bin | grep '*'
-and surprisingly the output is:
+and the output is:
 <pre>
 ^রেসলিং/*রেসলিং$
@@ Line 31: / Line 24: @@
 ^ওই/*ওই$
 </pre>
-where there is no "সময়" or "জাতীয়" ! And I have cheked the analysis:
-  ^সময়/সময়<n><mf><nn><sg><nom>/সময়<n><mf><nn><sg><obj>$
-  ^জাতীয়/জাতীয়<adj><mf>$
 ==Tagset==
