Difference between revisions of "Bengali and English/Issues"

Revision as of 18:45, 4 July 2011

Unicode Representation

  • In Bengali Unicode, the character 'য়' can be represented in two ways: 1) directly as '\u09DF', or 2) as '\u09AF' followed by '\u09BC'. The same is true of the characters 'ড়' ('\u09DC' or '\u09A1\u09BC') and 'ঢ়' ('\u09DD' or '\u09A2\u09BC'). As a workaround, there is currently a Python script (fix-spelling.py, http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-bn-en/dev/tools/fix-spelling.py?revision=31168&view=markup) which replaces each two-character sequence with the corresponding single character. But we need to come up with a solution within the XML (dictionary/analyzer). ACX (http://wiki.apertium.org/wiki/ACX_format) is not suitable, as this is not a one-to-one equivalence.
For the time being, use the spelling fixer (a.k.a. the normalizer). - User:Darthxaher
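
The kind of replacement the spelling fixer performs can be sketched roughly as follows. This is not the actual fix-spelling.py, only a minimal illustration that assumes the goal is to map each two-code-point sequence to its single-code-point form; the names DECOMPOSED_TO_PRECOMPOSED and normalize_bengali are made up for this example.

```python
# Hypothetical table (not taken from fix-spelling.py):
# base letter + nukta (U+09BC) -> single precomposed code point.
DECOMPOSED_TO_PRECOMPOSED = {
    "\u09A1\u09BC": "\u09DC",  # DDA + nukta -> RRA
    "\u09A2\u09BC": "\u09DD",  # DDHA + nukta -> RHA
    "\u09AF\u09BC": "\u09DF",  # YA + nukta -> YYA
}


def normalize_bengali(text):
    """Replace each two-code-point sequence with its single-code-point form."""
    for decomposed, precomposed in DECOMPOSED_TO_PRECOMPOSED.items():
        text = text.replace(decomposed, precomposed)
    return text


if __name__ == "__main__":
    word = "\u09AC\u09BF\u09B7\u09AF\u09BC"  # spelled with YA + nukta
    print(normalize_bengali(word) == "\u09AC\u09BF\u09B7\u09DF")
```

An explicit table is used here because Unicode lists these three precomposed letters as composition exclusions, so unicodedata.normalize("NFC", ...) produces the two-code-point form rather than the single one. Whichever direction is chosen, the dictionary entries and the input text would need to agree on it.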

Morphological Analyzer

  • Problem analyzing words with the enclitic 'টি': the analyzer can analyze "বিষয়" as "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$" but can't analyze "বিষয়টি". Similarly, it can analyze "বস্তু" but not "বস্তুটি".
    - Solved by adding the inflections 'টি', 'টির' to the noun paradigms wherever 'টা', 'টার' are present. - Ragib Ahsan Mon Jul 4 10:35:17 UTC 2011
  • The word "মায়ের" probably has no matching pardef: the pardef "মা__n_f" has no inflection like "-য়ের".
  • Some words have entries in bn.dix, yet they are not being analyzed with "lt-proc -a bn-en.automorf.bin".

To check all instances of this problem, I ran this:

 cat apertium-bn-en.bn.dix | grep '<e lm' | sed 's/.*lm=\"//g' | sed 's/\".*//g' | lt-proc -a bn-en.automorf.bin | grep '*'

and the output is:

 ^কি/*কি$
 ^ওই/*ওই$
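
The same check can also be scripted. The sketch below is only an illustration under stated assumptions: the function names lemmas_from_dix and unknown_words are hypothetical, the lemma regex is slightly broader than the grep '<e lm' above (it accepts lm in any attribute position), and the parser assumes the "^surface/analysis$" output format shown here, where an analysis starting with '*' marks an unknown word.

```python
import re
from typing import List

# lm="..." attribute on an <e> element of the .dix file.
LEMMA = re.compile(r'<e\b[^>]*\blm="([^"]*)"')

# One lexical unit in lt-proc -a output: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def lemmas_from_dix(dix_text: str) -> List[str]:
    """Return every lm attribute value found on <e> elements."""
    return LEMMA.findall(dix_text)


def unknown_words(ltproc_output: str) -> List[str]:
    """Return surface forms whose analysis starts with '*' (unknown)."""
    return [
        surface
        for surface, analyses in LEXICAL_UNIT.findall(ltproc_output)
        if analyses.startswith("*")
    ]


if __name__ == "__main__":
    sample = "^কি/*কি$\n^ওই/*ওই$\n"
    print(unknown_words(sample))
```

In practice the lemmas would be written one per line to lt-proc -a bn-en.automorf.bin, as in the pipeline above, and its output passed to unknown_words.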
  • "প্রথম", "দ্বিতীয়": are these determiners or numerals?
  • These words probably don't have appropriate paradigm definitions that support all of their inflections:
মা -> মায়ের
লোক -> লোকেরা
দেশ -> দেশগুলোতে, দেশটিতে
বই -> বইগুলির, বইগুলি
শক্তি -> শক্তিকে
মানদন্ডসমূহ ?
ভাষা -> ভাষাতেই
ইউরোপীয় -> ইউরোপীয়দের

Tagset

  • Confusion about the animacy tag 'elite': what is the exact definition? Are these correct examples of <el>: "ক্রেতা", "বিদ্রোহী", "সহকারী"? And are these definitely not <el>: "মেয়র", "ম্যাজিস্ট্রেট", "উপাচার্য"?
  • Unmatched paradigm: "মামা", "চাচা" should be <m><hu> with the pardef "ভাই__n_m" or "লোক__n_m", but neither of the two provides the inflections needed for "মামার" or "চাচার".
  • What is the exact difference between <mf><nn> and <nt><nn>? What are the exclusive properties of each?
    <mf> is for a word that has the same form for both masculine and feminine versions, <nt> is for a neuter word. - Francis Tyers 19:56, 27 June 2011 (UTC)