Difference between revisions of "Bengali and English/Issues"
Revision as of 10:33, 4 July 2011
Unicode Representation
- In Bengali Unicode, the character 'য়' can be represented in two ways: 1) directly as '\u09DF', or 2) as the sequence '\u09AF' followed by '\u09BC'. The same holds for 'ড়' ('\u09DC' or '\u09A1\u09BC') and 'ঢ়' ('\u09DD' or '\u09A2\u09BC'). As a workaround, there is currently a python script (http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-bn-en/dev/tools/fix-spelling.py?revision=31168&view=markup) which replaces each two-codepoint sequence with the corresponding single codepoint. However, we need a solution within the XML (dictionary/analyzer). ACX (http://wiki.apertium.org/wiki/ACX_format) is not suitable, as this is not a one-to-one equivalence.
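The replacement described above can be sketched in a few lines of Python. This is not the contents of fix-spelling.py, only an illustration of the idea; the names `DECOMPOSED_TO_SINGLE` and `to_single_codepoint` are hypothetical, and the direction (two-codepoint sequence to single codepoint) is an assumption based on the description above.

```python
# Hypothetical sketch, not the actual fix-spelling.py.
# Each key is a base letter followed by the nukta sign U+09BC;
# each value is the equivalent single codepoint.
DECOMPOSED_TO_SINGLE = {
    "\u09A1\u09BC": "\u09DC",  # DDA + nukta -> RRA
    "\u09A2\u09BC": "\u09DD",  # DDHA + nukta -> RHA
    "\u09AF\u09BC": "\u09DF",  # YA + nukta -> YYA
}


def to_single_codepoint(text):
    """Replace every two-codepoint sequence with its single codepoint."""
    for sequence, single in DECOMPOSED_TO_SINGLE.items():
        text = text.replace(sequence, single)
    return text


# The same word typed with the two-codepoint spelling of the last letter.
word = "\u09AC\u09BF\u09B7\u09AF\u09BC"
print(len(word), len(to_single_codepoint(word)))
```

Text that already uses the single codepoints passes through unchanged, so the function can be applied to both the dictionary and the input text before analysis.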
Morphological Analyzer
- Problem analyzing words with the enclitic 'টি': the analyzer can analyze "বিষয়" as "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$" but cannot analyze "বিষয়টি". Similarly, it can analyze "বস্তু" but not "বস্তুটি".
- - Solved by adding the inflections 'টি' and 'টির' to the noun paradigms wherever 'টা' and 'টার' are present. - Ragib Ahsan Mon Jul 4 10:35:17 UTC 2011
- The word "মায়ের" probably has no matching pardef: the pardef "মা__n_f" has no inflections like "-য়ের".
- Some words have entries in bn.dix, yet they are not being analyzed with "lt-proc -a bn-en.automorf.bin".
To check all instances of this problem, I ran:
cat apertium-bn-en.bn.dix | grep '<e lm' | sed 's/.*lm=\"//g' | sed 's/\".*//g' | lt-proc -a bn-en.automorf.bin | grep '*'
and the output is:
^কি/*কি$ ^ওই/*ওই$
- "প্রথম", "দ্বিতীয়": are these determiners or numerals?
- These words probably don't have appropriate paradigm definitions that support all of their inflections:
মা -> মায়ের
লোক -> লোকেরা
দেশ -> দেশগুলোতে, দেশটিতে
বই -> বইগুলির, বইগুলি
শক্তি -> শক্তিকে
মানদন্ডসমূহ ?
ভাষা -> ভাষাতেই
ইউরোপীয় -> ইউরোপীয়দের
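The check for dictionary entries that are not analyzed (the shell pipeline above) could also be scripted. The following is a minimal Python sketch under stated assumptions: the helper names `lemmas_from_dix` and `unknown_words` are hypothetical, and the only convention taken from this page is that lt-proc marks an unknown word as "^surface/*surface$". Unlike the `grep '<e lm'` step, the pattern here also accepts entries where `lm` is not the first attribute.

```python
import re

# Hypothetical helpers; they only process text and do not call lt-proc.
LM_ATTR = re.compile(r'<e\s[^>]*\blm="([^"]*)"')
UNKNOWN = re.compile(r'\^([^/$]*)/\*')


def lemmas_from_dix(lines):
    """Collect the lm="..." attribute of every <e> entry in the given lines."""
    found = []
    for line in lines:
        found.extend(LM_ATTR.findall(line))
    return found


def unknown_words(lt_proc_output):
    """Return the surface forms that the analyzer marked as unknown with '*'."""
    return UNKNOWN.findall(lt_proc_output)


# Output copied from the run reported above.
print(unknown_words("^কি/*কি$ ^ওই/*ওই$"))
```

The same `unknown_words` helper could be applied to the analyzer output for the inflected forms listed above, to confirm which of them still lack a matching paradigm after a dictionary change.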
Tagset
- Confusion about the animacy tag 'elite': what is the exact definition? Are these correct examples of <el>: "ক্রেতা", "বিদ্রোহী", "সহকারী"? And are these definitely not <el>: "মেয়র", "ম্যাজিস্ট্রেট", "উপাচার্য"?
- Unmatched paradigm: "মামা" and "চাচা" should be <m><hu> with the pardef "ভাই__n_m" or "লোক__n_m", but neither of the two provides the inflections needed for "মামার" or "চাচার".
- What is the exact difference between <mf><nn> and <nt><nn>? What are their distinguishing properties?
- - <mf> is for a word that has the same form for both masculine and feminine versions; <nt> is for a neuter word. - Francis Tyers 19:56, 27 June 2011 (UTC)