Talk:Hindi and Urdu

Message #1

Hi, 

You need to convert the IIIT morphological analyser so that it is compatible with Apertium. Specifically:

1) The encoding needs to be changed from WX to UTF-8
2) The tagset needs to be standardised to follow Apertium conventions

These two tasks are non-negotiable and should be completed as part of your application.
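
As a rough illustration of task 1, a WX to Devanagari converter could start from something like the sketch below. The tables cover only the common letters of the WX scheme, the name `wx_to_devanagari` is made up for this example, and nukta consonants and other details are left out, so treat it as a starting point rather than a complete converter.

```python
# Partial WX -> Devanagari tables (assumption: the usual WX letter values).
CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}
# WX vowel letter -> (independent form, dependent sign used after a consonant)
VOWELS = {
    "a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "q": ("ऋ", "ृ"),
    "e": ("ए", "े"), "E": ("ऐ", "ै"), "o": ("ओ", "ो"), "O": ("औ", "ौ"),
}
MODIFIERS = {"M": "ं", "H": "ः", "z": "ँ"}
VIRAMA = "्"


def wx_to_devanagari(text):
    """Convert WX-encoded text to Devanagari (hypothetical helper, partial)."""
    out = []
    after_consonant = False  # True while a consonant is still waiting for its vowel
    for ch in text:
        if ch in CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)  # two consonants in a row form a conjunct
            out.append(CONSONANTS[ch])
            after_consonant = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            out.append(sign if after_consonant else independent)
            after_consonant = False
        else:
            if after_consonant:
                out.append(VIRAMA)  # consonant with no vowel written after it
            out.append(MODIFIERS.get(ch, ch))  # unknown characters pass through
            after_consonant = False
    if after_consonant:
        out.append(VIRAMA)
    return "".join(out)


print(wx_to_devanagari("Amila kI betI"))  # → आमिल की बेटी
```

WX writes the inherent vowel explicitly as `a`, which is why the sketch emits a virama whenever a consonant is not followed by a vowel letter.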

You can find the incomplete language pair in SVN:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi

The analyser is partially converted (by me) here:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/apertium-ur-hi.hi.dix

Analyse Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin 
^عامل/عامل<np><ant><m><sg><nom>$ ^کی/کا<post><f><sg><nom>$ ^بیٹی/بیٹی<n><f><sg><nom>/بیٹی<n><f><sg><obl>/بیٹی<n><f><sg><voc>$

Analyse Hindi:

$ echo "आमिल की बेटी" | lt-proc hi-ur.automorf.bin
^आमिल/आमिल<np><ant><m><sg><nom>$ ^की/का<post><f><sg><nom>/का<post><f><sg><obl>/का<post><f><pl><nom>/का<post><f><pl><obl>$ ^बेटी/बेटी<n><f><sg><nom>/बेटी<n><f><sg><obl>$
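
When comparing the converted analyser's output against lines like the ones above, it can help to load the stream format into a data structure. The following is a minimal sketch, assuming the simple case shown here (no escaped characters, no superblanks); `parse_stream` is a name invented for this example, not an Apertium API.

```python
import re

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")  # one ^...$ unit
TAG = re.compile(r"<([^>]+)>")           # one <tag>


def parse_stream(line):
    """Return a list of (surface, [(lemma, [tags]), ...]) tuples."""
    units = []
    for match in LEXICAL_UNIT.finditer(line):
        surface, *analyses = match.group(1).split("/")
        readings = []
        for analysis in analyses:
            lemma = analysis.split("<", 1)[0]
            readings.append((lemma, TAG.findall(analysis)))
        units.append((surface, readings))
    return units


line = "^की/का<post><f><sg><nom>/का<post><f><sg><obl>$ ^बेटी/बेटी<n><f><sg><nom>$"
for surface, readings in parse_stream(line):
    print(surface, len(readings))
```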

Tag Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob 
^عامل<np><ant><m><sg><nom>$ ^کا<post><f><sg><nom>$ ^بیٹی<n><f><sg><nom>$

Transfer Urdu->Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin
^आमिल<np><ant><m><sg><nom>$ ^का<post><f><sg><nom>$ ^बेटी<n><f><sg><nom>$

Generate Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin | lt-proc -g ur-hi.autogen.bin
आमिल की बेटी
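
For regression testing it may be handy to drive the same chain of commands from a script. The sketch below simply pipes text through a list of stages; the stage list repeats the commands shown above and assumes the compiled files are in the current directory, and `run_pipeline` is a hypothetical helper name.

```python
import subprocess

# Same stages as the shell pipeline above (file names taken from those commands).
UR_HI_STAGES = [
    ["lt-proc", "ur-hi.automorf.bin"],
    ["apertium-tagger", "-g", "ur-hi.prob"],
    ["apertium-transfer", "apertium-ur-hi.ur-hi.t1x", "ur-hi.t1x.bin", "ur-hi.autobil.bin"],
    ["lt-proc", "-g", "ur-hi.autogen.bin"],
]


def run_pipeline(text, stages=UR_HI_STAGES):
    """Feed text through each stage in turn, like `echo text | a | b | c`."""
    data = (text + "\n").encode("utf-8")
    for argv in stages:
        # check=True raises CalledProcessError if a stage exits with an error
        data = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True).stdout
    return data.decode("utf-8")
```

Calling `run_pipeline("عامل کی بیٹی")` would then correspond to the last command above, provided the Apertium tools are installed.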

Best regards,

Fran

Message #2

<hitesh> Hi
<spectei> hi
<hitesh> We were having a discussion about IIIT morphological parser yesterday
<spectei> cool
<spectei> you and who ? :)
<hitesh> and you said that we need to convert it into unicode
<hitesh> Do you want the output to be in unicode or what as the current output format is SSF
<spectei> output = unicode
<spectei> internal representation = unicode
<spectei> input = unicode
<hitesh> What is the advantage
<spectei> that we don't have to have any special processing for hindi
<spectei> that everything else in the project is in unicode
<spectei> that we don't have to do any special processing to the input/output text
<spectei> that we don't have to include other programs in apertium just for hindi
<hitesh> Basically we will have to build a wrapper as the algorithm they use is dependent on hindi
<spectei> no
<spectei> are going to convert it
<spectei> *we
<spectei> no wrappers
<hitesh> Ok , so we will build the morphological analyzer from scratch
<spectei> you could do it that way if you want
<spectei> the alternative is to generate a full form list
<spectei> convert to unicode
<spectei> and re-infer paradigms from that
<spectei> that's what i did for adjectives and nouns
<spectei> where i more or less understand the inflection
<spectei> and your tagset
<spectei> for verbs i don't
<hitesh> We use FSTs in this , right so how we re-infer the paradigms
<spectei> you generate the possibilities 
<spectei> and then merge them
<spectei> http://wiki.apertium.org/wiki/Speling_format
<spectei> http://wiki.apertium.org/wiki/Speling_tools
<spectei> http://wiki.apertium.org/wiki/Paradigm_chopper
<spectei> and we use FSTs too
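
The "generate the possibilities and then merge them" idea from the log can be sketched roughly as follows. This is a toy illustration, not the algorithm used by the tools linked above: it takes a full-form list as `(lemma, surface form, tags)` tuples, uses the longest common prefix of each lemma's forms as the stem, and merges lemmas whose sets of endings and tags are identical. The dotted tag notation and the name `infer_paradigms` are assumptions made for this example.

```python
from collections import defaultdict
from os.path import commonprefix


def infer_paradigms(full_forms):
    """Group stems by their set of (ending, tags) pairs."""
    by_lemma = defaultdict(list)
    for lemma, surface, tags in full_forms:
        by_lemma[lemma].append((surface, tags))

    paradigms = defaultdict(list)
    for lemma, forms in by_lemma.items():
        # Stem = longest prefix shared by the lemma and all of its forms
        stem = commonprefix([lemma] + [surface for surface, _ in forms])
        endings = frozenset((surface[len(stem):], tags) for surface, tags in forms)
        paradigms[endings].append(stem)  # identical ending sets are merged here
    return dict(paradigms)


FORMS = [
    ("बेटा", "बेटा", "n.m.sg.nom"), ("बेटा", "बेटे", "n.m.sg.obl"),
    ("बेटा", "बेटे", "n.m.pl.nom"), ("बेटा", "बेटों", "n.m.pl.obl"),
    ("कमरा", "कमरा", "n.m.sg.nom"), ("कमरा", "कमरे", "n.m.sg.obl"),
    ("कमरा", "कमरे", "n.m.pl.nom"), ("कमरा", "कमरों", "n.m.pl.obl"),
]

for endings, stems in infer_paradigms(FORMS).items():
    print(sorted(stems), sorted(endings))
```

Both nouns end up in a single paradigm because they share the same four ending and tag pairs.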
