Difference between revisions of "Bengali and English/Issues"
Latest revision as of 16:45, 23 August 2011
Unicode Representation
- In Bengali Unicode, the character 'য়' can be represented in two ways: 1) directly by the single code point '\u09DF', or 2) by the sequence '\u09AF' followed by '\u09BC'. The same is true for the characters 'ড়' ('\u09DC' vs. '\u09A1\u09BC') and 'ঢ়' ('\u09DD' vs. '\u09A2\u09BC'). As a workaround, there is currently a python script (dev/tools/fix-spelling.py in apertium-bn-en) which replaces each such sequence with the corresponding single code point. But we still need a solution at the XML (dictionary/analyzer) level. ACX is not suitable, as this is not a one-to-one equivalence.
- For the time being use the spelling fixer aka the normalizer - User:Darthxaher
- Look at apertium-es-it for what it does with È vs. E' - Francis Tyers 11:05, 5 July 2011 (UTC)
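The replacement discussed above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the actual fix-spelling.py: the names NUKTA_SEQUENCES and normalize_bengali are hypothetical, and the sketch assumes the normalizer maps each two-code-point sequence to the single code point.

```python
# Hypothetical stand-in for the spelling fixer; NOT the real fix-spelling.py.
# Assumption: every base letter + nukta (U+09BC) sequence is rewritten to the
# corresponding single code point.
NUKTA_SEQUENCES = {
    "\u09AF\u09BC": "\u09DF",  # YA + nukta   -> YYA
    "\u09A1\u09BC": "\u09DC",  # DDA + nukta  -> RRA
    "\u09A2\u09BC": "\u09DD",  # DDHA + nukta -> RHA
}


def normalize_bengali(text: str) -> str:
    """Replace each two-code-point sequence with its single code point."""
    for sequence, single in NUKTA_SEQUENCES.items():
        text = text.replace(sequence, single)
    return text


if __name__ == "__main__":
    word = "\u09AC\u09BF\u09B7\u09AF\u09BC"  # one spelling of the word used on this page
    print([hex(ord(ch)) for ch in normalize_bengali(word)])
```

Running the text through such a function before analysis means the dictionary only has to list one spelling of each word. Note that standard Unicode NFC normalization is not a drop-in replacement here, since it does not necessarily produce the single-code-point form for these letters.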
Morphological Analyzer
- Problem analyzing words with the enclitic 'টি': the analyzer can analyze "বিষয়" as "^বিষয়/বিষয়<n><nt><nn><sg><nom>/বিষয়<n><nt><nn><sg><obj>$" but can't analyze "বিষয়টি". Similarly, it can analyze "বস্তু" but not "বস্তুটি".
- - Solved by adding the inflections 'টি' and 'টির' to the noun paradigms wherever 'টা' and 'টার' are present. - Ragib Ahsan Mon Jul 4 10:35:17 UTC 2011
- The word "মায়ের" probably has no matching pardef. The pardef "মা__n_f" has no inflections like "-য়ের".
- Some words have entries in bn.dix, yet they are not being analyzed with "lt-proc -a bn-en.automorf.bin".
To check all instances of this problem, I ran this:
cat apertium-bn-en.bn.dix | grep '<e lm' | sed 's/.*lm=\"//g' | sed 's/\".*//g' | lt-proc -a bn-en.automorf.bin | grep '*'
and the output is:
^কি/*কি$
^ওই/*ওই$
- "প্রথম", "দ্বিতীয়" - are these determiners or numerals?
- These words probably don't have appropriate paradigm definitions that support all of their inflections:
মা -> মায়ের
লোক -> লোকেরা
দেশ -> দেশগুলোতে, দেশটিতে
বই -> বইগুলির, বইগুলি
শক্তি -> শক্তিকে
মানদন্ডসমূহ ?
ভাষা -> ভাষাতেই
ইউরোপীয় -> ইউরোপীয়দের
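The dictionary check described in this section (pull every lm attribute out of the .dix file, analyze each lemma, keep whatever comes back marked with '*') can also be sketched in Python. This is a rough illustration only: the function names extract_lemmas and unknown_forms are hypothetical, the sketch works on text that has already been read from the dictionary and from the lt-proc output, and it does not invoke lt-proc itself.

```python
import re

# Matches the lm="..." attribute of an <e> element. Slightly more permissive
# than grep '<e lm', because lm does not have to be the first attribute.
LM_ATTR = re.compile(r'<e\b[^>]*\blm="([^"]*)"')

# Matches an unknown word in the analyzer output, e.g. ^form/*form$
UNKNOWN = re.compile(r"\^([^/$]*)/\*[^$]*\$")


def extract_lemmas(dix_text):
    """Return the lm attribute of every <e> entry in the dictionary text."""
    return LM_ATTR.findall(dix_text)


def unknown_forms(analyzer_output):
    """Return the surface forms that the analyzer marked as unknown."""
    return UNKNOWN.findall(analyzer_output)


if __name__ == "__main__":
    sample_output = "^কি/*কি$\n^ওই/*ওই$"
    print(unknown_forms(sample_output))
```

In practice the lemmas returned by extract_lemmas would be written one per line to lt-proc -a bn-en.automorf.bin, and the resulting output passed to unknown_forms.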
Tagset
- Confusion about the animacy tag 'elite': what is its exact definition? Are these correct examples of <el> - "ক্রেতা", "বিদ্রোহী", "সহকারী"? And are these definitely not <el> - "মেয়র", "ম্যাজিস্ট্রেট", "উপাচার্য"?
- Unmatched paradigm: "মামা", "চাচা" should be <m><hu> with the pardefs "ভাই__n_m" or "লোক__n_m" but neither of the two provides enough inflections for "মামার" or "চাচার"
- What is the exact difference between <mf><nn> and <nt><nn>? What are the distinguishing properties of each?
- <mf> is for a word that has the same form for both masculine and feminine versions, <nt> is for a neuter word. - Francis Tyers 19:56, 27 June 2011 (UTC)
Wrong Tagger Output
In each example, the unit or tag that the tagger got wrong is repeated in square brackets at the end of the line.
Zaher sleeps → ^Zaher<np><ant><m><sg>$ ^sleep<n><pl>$ .<sent>$ [wrong: ^sleep<n><pl>$]
Does Zaher work? → ^Do<vbdo><pri><p3><sg>$ ^Zaher<np><ant><m><sg>$ ^work<n><sg>$ ^?<sent>$^.<sent>$ [wrong: ^work<n><sg>$]
Did Zaher work? → ^Do<vbdo><past>$ ^Zaher<np><ant><m><sg>$ ^work<n><sg>$ ^?<sent>$^.<sent>$ [wrong: ^work<n><sg>$]
Are you working? → ^Be<vbser><pres>$ ^prpers<prn><obj><p2><mf><sp>$ ^working<adj>$ ^?<sent>$^.<sent>$ [wrong: ^working<adj>$]
Is Zaher working? → ^Be<vbser><pri><p3><sg>$ ^Zaher<np><ant><m><sg>$ ^working<adj>$ ^?<sent>$^.<sent>$ [wrong: ^working<adj>$]
Were you working? → ^Be<vbser><past>$ ^prpers<prn><obj><p2><mf><sp>$ ^working<adj>$ ^?<sent>$^.<sent>$ [wrong: ^working<adj>$]
Was Zaher working? → ^Be<vbser><past><p3><sg>$ ^Zaher<np><ant><m><sg>$ ^working<adj>$ ^?<sent>$^.<sent>$ [wrong: ^working<adj>$]
Have I worked? → ^Have<vblex><pres>$ ^prpers<prn><subj><p1><mf><sg>$ ^work<vblex><past>$^?<sent>$^.<sent>$ [wrong: ^Have<vblex><pres>$]
Have I been working? → ^Have<vblex><pres>$ ^prpers<prn><subj><p1><mf><sg>$ ^be<vbser><pp>$ ^work<vblex><ger>$^?<sent>$^.<sent>$ [wrong: ^Have<vblex><pres>$]
Have you been working? → ^Have<vblex><pres>$ ^prpers<prn><obj><p2><mf><sp>$ ^be<vbser><pp>$ ^work<vblex><ger>$^?<sent>$^.<sent>$ [wrong: <obj>]
Are you rich? → ^Be<vbser><pres>$ ^prpers<prn><obj><p2><mf><sp>$ ^rich<adj><sint>$^?<sent>$^.<sent>$ [wrong: <obj>]
Were you rich? → ^Be<vbser><past>$ ^prpers<prn><obj><p2><mf><sp>$ ^rich<adj><sint>$^.<sent>$ [wrong: <obj>]
Walking is good for health → ^Walking<n><sg>$ ^be<vbser><pri><p3><sg>$ ^good<adj><sint>$ ^for<pr>$ ^health<n><sg>$^.<sent>$ [wrong: ^Walking<n><sg>$]
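
Tagger output like the lines above can be inspected programmatically. The following is a minimal Python sketch, assuming each unit has the form ^lemma<tag>...$ as in the examples; the function names parse_units and units_with_tag are hypothetical helpers, not part of any Apertium tool.

```python
import re

# A unit is everything between '^' and the next '$'.
UNIT = re.compile(r"\^([^$]*)\$")
# A tag is the text between '<' and '>'.
TAG = re.compile(r"<([^>]+)>")


def parse_units(stream):
    """Split tagger output into (lemma, [tags]) pairs."""
    units = []
    for body in UNIT.findall(stream):
        lemma = body.split("<", 1)[0]  # text before the first tag
        units.append((lemma, TAG.findall(body)))
    return units


def units_with_tag(stream, tag):
    """Return the lemmas of all units that carry the given tag."""
    return [lemma for lemma, tags in parse_units(stream) if tag in tags]


if __name__ == "__main__":
    output = "^Do<vbdo><past>$ ^Zaher<np><ant><m><sg>$ ^work<n><sg>$ ^?<sent>$"
    # Lists the units tagged as nouns, which makes a verb mistagged as <n>
    # easy to spot in a question such as "Did Zaher work?".
    print(units_with_tag(output, "n"))
```

A helper of this kind makes it straightforward to collect every sentence in a test file where a verb position ends up with a noun or adjective reading.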