Bengali and English/Issues


Unicode Representation

  • In Bengali Unicode, the character 'য়' can be represented in two ways: 1) directly as '\u09DF', or 2) as '\u09AF' followed by '\u09BC'. The same is true of 'ড়' ('\u09DC' or '\u09A1\u09BC') and 'ঢ়' ('\u09DD' or '\u09A2\u09BC'). As a workaround, there is currently a Python script that replaces each two-code-point sequence with the corresponding single code point. We still need a solution inside the XML (dictionary/analyzer). ACX is not suitable because this is not a one-to-one equivalence.
For the time being, use the spelling fixer (aka the normalizer). - User:Darthxaher
Look at apertium-es-it for what it does with È vs. E' - Francis Tyers 11:05, 5 July 2011 (UTC)
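
The replacement step described above can be sketched in Python as follows. This is only an illustration under assumptions, not the project's actual script: the names `NUKTA_SEQUENCES` and `normalize_bengali` are hypothetical, and the table covers only the three characters listed in this section.

```python
# Hypothetical sketch of the normalizer; the names are illustrative and
# not taken from the project's script.
# Each entry maps a base letter + nukta (U+09BC) sequence to the
# equivalent single code point.
NUKTA_SEQUENCES = {
    "\u09AF\u09BC": "\u09DF",  # YA + nukta   -> YYA
    "\u09A1\u09BC": "\u09DC",  # DDA + nukta  -> RRA
    "\u09A2\u09BC": "\u09DD",  # DDHA + nukta -> RHA
}


def normalize_bengali(text: str) -> str:
    """Replace every two-code-point sequence with its single code point."""
    for sequence, single in NUKTA_SEQUENCES.items():
        text = text.replace(sequence, single)
    return text


if __name__ == "__main__":
    word = "\u09AC\u09BF\u09B7\u09AF\u09BC"  # the word from the analyzer example, spelled with YA + nukta
    print(len(word), len(normalize_bengali(word)))
```

Running the text through such a function before analysis means the dictionary only has to list one spelling of each word.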

Morphological Analyzer

  • Problem analyzing words with the enclitic 'টি': the analyzer can analyze "বিষয়" as "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$" but cannot analyze "বিষয়টি". Similarly, it can analyze "বস্তু" but not "বস্তুটি".
    - Solved by adding the inflections 'টি' and 'টির' to the noun paradigms wherever 'টা' and 'টার' occur. - Ragib Ahsan Mon Jul 4 10:35:17 UTC 2011
  • The word "মায়ের" probably has no matching pardef: the pardef "মা__n_f" has no inflection like "-য়ের".
  • Some words have entries in bn.dix, yet they are not analyzed by "lt-proc -a bn-en.automorf.bin".

To find all instances of this problem, I ran:

 cat apertium-bn-en.bn.dix | grep '<e lm' | sed 's/.*lm=\"//g' | sed 's/\".*//g' | lt-proc -a bn-en.automorf.bin | grep '*'

The output was:

 ^কি/*কি$
 ^ওই/*ওই$
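
The final `grep '*'` step of the pipeline above can also be done in Python, which makes it easy to collect only the surface forms of the unanalyzed words. This is a minimal sketch: the function name `unknown_words` is hypothetical, and it assumes that unknown words are marked with a leading asterisk in the analysis, as in the output shown above.

```python
import re
from typing import List

# One lexical unit in the analyzer output: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def unknown_words(lt_proc_output: str) -> List[str]:
    """Return the surface forms whose analysis starts with '*' (unknown)."""
    unknown = []
    for surface, analyses in LEXICAL_UNIT.findall(lt_proc_output):
        if analyses.startswith("*"):
            unknown.append(surface)
    return unknown


if __name__ == "__main__":
    sample = "^কি/*কি$\n^ওই/*ওই$"
    for word in unknown_words(sample):
        print(word)
```
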
  • "প্রথম", "দ্বিতীয়": are these determiners or numerals?
  • These words probably don't have paradigm definitions that support all of their inflections:
মা -> মায়ের
লোক -> লোকেরা
দেশ -> দেশগুলোতে, দেশটিতে
বই -> বইগুলির, বইগুলি
শক্তি -> শক্তিকে
মানদন্ডসমূহ ?
ভাষা -> ভাষাতেই
ইউরোপীয় -> ইউরোপীয়দের

Tagset

  • Confusion about the animacy tag 'elite' (<el>): what is the exact definition? Are these correct examples of <el>: "ক্রেতা", "বিদ্রোহী", "সহকারী"? And are these definitely not <el>: "মেয়র", "ম্যাজিস্ট্রেট", "উপাচার্য"?
  • Unmatched paradigm: "মামা" and "চাচা" should be <m><hu> with the pardef "ভাই__n_m" or "লোক__n_m", but neither of the two provides the inflections needed for "মামার" or "চাচার".
  • What is the exact difference between <mf><nn> and <nt><nn>? What properties are exclusive to each?
    <mf> is for a word that has the same form for both masculine and feminine versions, <nt> is for a neuter word. - Francis Tyers 19:56, 27 June 2011 (UTC)


Wrong Tagger Output

  Zaher sleeps → ^Zaher<np><ant><m><sg>$  ^sleep<n><pl>$ .<sent>$
  Does Zaher work? → ^Do<vbdo><pri><p3><sg>$ ^Zaher<np><ant><m><sg>$  ^work<n><sg>$ ^?<sent>$^.<sent>$
  Did Zaher work? → ^Do<vbdo><past>$ ^Zaher<np><ant><m><sg>$  ^work<n><sg>$ ^?<sent>$^.<sent>$
  Are you working? → ^Be<vbser><pres>$ ^prpers<prn><obj><p2><mf><sp>$  ^working<adj>$ ^?<sent>$^.<sent>$
  Is Zaher working? → ^Be<vbser><pri><p3><sg>$ ^Zaher<np><ant><m><sg>$  ^working<adj>$ ^?<sent>$^.<sent>$
  Were you working? → ^Be<vbser><past>$ ^prpers<prn><obj><p2><mf><sp>$  ^working<adj>$ ^?<sent>$^.<sent>$
  Was Zaher working? → ^Be<vbser><past><p3><sg>$ ^Zaher<np><ant><m><sg>$  ^working<adj>$ ^?<sent>$^.<sent>$
  Have I worked? →  ^Have<vblex><pres>$  ^prpers<prn><subj><p1><mf><sg>$ ^work<vblex><past>$^?<sent>$^.<sent>$
  Have I been working? → ^Have<vblex><pres>$  ^prpers<prn><subj><p1><mf><sg>$ ^be<vbser><pp>$ ^work<vblex><ger>$^?<sent>$^.<sent>$
  Have you been working? → ^Have<vblex><pres>$ ^prpers<prn><obj><p2><mf><sp>$ ^be<vbser><pp>$ ^work<vblex><ger>$^?<sent>$^.<sent>$
  Are you rich? → ^Be<vbser><pres>$ ^prpers<prn><obj><p2><mf><sp>$ ^rich<adj><sint>$^?<sent>$^.<sent>$
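
To compare outputs like these more easily, the lexical units can be split into a lemma and a list of tags. The sketch below is an illustration only: the function name `parse_tagged` is hypothetical, and it assumes each unit has the form `^lemma<tag>...$` as in the lines above.

```python
import re
from typing import List, Tuple

# One tagged lexical unit: ^lemma<tag1><tag2>...$
TAGGED_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]*>)*)\$")
TAG = re.compile(r"<([^>]*)>")


def parse_tagged(line: str) -> List[Tuple[str, List[str]]]:
    """Split tagger output into (lemma, [tags]) pairs."""
    return [(lemma, TAG.findall(tags)) for lemma, tags in TAGGED_UNIT.findall(line)]


if __name__ == "__main__":
    for lemma, tags in parse_tagged("^Zaher<np><ant><m><sg>$  ^sleep<n><pl>$"):
        print(lemma, tags)
```

For the first example, this shows "sleep" with the tags `n` and `pl`, i.e. the tagger chose a plural noun reading rather than a verb.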