Difference between revisions of "Bengali and English/Issues"

From Apertium

Revision as of 21:07, 29 June 2011

Morphological Analyzer

  • Words with the enclitic 'টি' are not analyzed: the analyzer maps "বিষয়" to "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$" but cannot analyze "বিষয়টি". Similarly, it can analyze "বস্তু" but not "বস্তুটি".
  • The word "মায়ের" probably has no matching pardef: the pardef "মা__n_f" has no inflection like "-য়ের".
  • Some words have entries in bn.dix, yet they are not being analyzed with "lt-proc -a bn-en.automorf.bin".

To check all instances of this problem, I ran:

 grep '<e lm' apertium-bn-en.bn.dix | sed 's/.*lm="//; s/".*//' | lt-proc -a bn-en.automorf.bin | grep -F '*'

and the output is:

 ^কি/*কি$
 ^ওই/*ওই$
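
The same check can be split into two small shell functions so that each step can be tried on its own. This is only a minimal sketch: the function names extract_lemmas and unknown_forms are hypothetical, and it assumes, as the one-liner above does, that each <e> entry has its lm attribute on a single line and that lt-proc marks unknown words as ^form/*form$.

```shell
# Print the value of the lm="..." attribute of every <e> entry, one per line.
# $1 is the path to a .dix file (for example apertium-bn-en.bn.dix).
extract_lemmas() {
    sed -n 's/.*<e[^>]* lm="\([^"]*\)".*/\1/p' "$1"
}

# Read lt-proc output on stdin, keep only unknown words (marked "/*"),
# and reduce each line to the bare surface form.
unknown_forms() {
    grep -F '/*' | sed 's/^\^\([^/]*\)\/.*/\1/'
}

# Intended use (assumes lt-proc and the compiled bn-en.automorf.bin exist):
#   extract_lemmas apertium-bn-en.bn.dix | lt-proc -a bn-en.automorf.bin | unknown_forms | sort -u
```

Reducing the output to bare surface forms makes it easier to compare the list of failing lemmas between dictionary revisions.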
  • Are "প্রথম" and "দ্বিতীয়" determiners or numerals?

Tagset

  • Confusion about the animacy tag 'elite': what is the exact definition? Are these correct examples of <el>: "ক্রেতা", "বিদ্রোহী", "সহকারী"? And are these definitely not <el>: "মেয়র", "ম্যাজিস্ট্রেট", "উপাচার্য"?
  • Unmatched paradigm: "মামা" and "চাচা" should be <m><hu> with the pardef "ভাই__n_m" or "লোক__n_m", but neither pardef provides the inflections needed for "মামার" or "চাচার".
  • What is the exact difference between <mf><nn> and <nt><nn>? What are the exclusive properties of each?
    <mf> is for a word that has the same form for both masculine and feminine versions, <nt> is for a neuter word. - Francis Tyers 19:56, 27 June 2011 (UTC)