Talk:Hindi and Urdu
Message #1
Hi,

You need to convert the IIIT morphological analyser to be compatible with Apertium. Thus:

1) The encoding needs to be changed from WX to UTF-8.
2) The tagset needs to be standardised along Apertium lines.

These two tasks are non-negotiable and should be completed as part of your application.

You can find the incomplete language pair in SVN:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi

The analyser is partially converted (by me) here:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/apertium-ur-hi.hi.dix

Analyse Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin
^عامل/عامل<np><ant><m><sg><nom>$ ^کی/کا<post><f><sg><nom>$ ^بیٹی/بیٹی<n><f><sg><nom>/بیٹی<n><f><sg><obl>/بیٹی<n><f><sg><voc>$

Analyse Hindi:

$ echo "आमिल की बेटी" | lt-proc hi-ur.automorf.bin
^आमिल/आमिल<np><ant><m><sg><nom>$ ^की/का<post><f><sg><nom>/का<post><f><sg><obl>/का<post><f><pl><nom>/का<post><f><pl><obl>$ ^बेटी/बेटी<n><f><sg><nom>/बेटी<n><f><sg><obl>$

Tag Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob
^عامل<np><ant><m><sg><nom>$ ^کا<post><f><sg><nom>$ ^بیٹی<n><f><sg><nom>$

Transfer Urdu->Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin
^आमिल<np><ant><m><sg><nom>$ ^का<post><f><sg><nom>$ ^बेटी<n><f><sg><nom>$

Generate Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin | lt-proc -g ur-hi.autogen.bin
आमिल की बेटी

Best regards,
Fran
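To make the first task (WX to UTF-8) more concrete, here is a minimal Python sketch of such a conversion. It is not the IIIT or Apertium conversion code. The mapping table is an assumption based on standard WX notation and covers only the core Devanagari letters, the function name `wx_to_devanagari` is hypothetical, and the WX spellings in the usage line are an assumed transliteration of the example phrase above.

```python
# Partial WX -> Devanagari tables (assumption: standard WX notation, subset only).
CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}
# Vowels at the start of a syllable (no preceding consonant).
INDEPENDENT_VOWELS = {
    "a": "अ", "A": "आ", "i": "इ", "I": "ई", "u": "उ", "U": "ऊ",
    "q": "ऋ", "e": "ए", "E": "ऐ", "o": "ओ", "O": "औ",
}
# Vowel signs after a consonant; the inherent "a" is written as nothing.
MATRAS = {
    "a": "", "A": "ा", "i": "ि", "I": "ी", "u": "ु", "U": "ू",
    "q": "ृ", "e": "े", "E": "ै", "o": "ो", "O": "ौ",
}
SIGNS = {"M": "ं", "H": "ः", "z": "ँ"}  # anusvara, visarga, chandrabindu
NUKTA_WX = "Z"
NUKTA = "\u093c"
VIRAMA = "\u094d"


def wx_to_devanagari(text):
    """Convert a WX-encoded string to Devanagari (UTF-8 once encoded)."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            i += 1
            # An optional nukta marker directly follows the consonant.
            if i < n and text[i] == NUKTA_WX:
                out.append(NUKTA)
                i += 1
            # A following vowel becomes a vowel sign; otherwise the
            # consonant carries no vowel and needs an explicit virama.
            if i < n and text[i] in MATRAS:
                out.append(MATRAS[text[i]])
                i += 1
            else:
                out.append(VIRAMA)
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
            i += 1
        elif ch in SIGNS:
            out.append(SIGNS[ch])
            i += 1
        else:
            # Spaces, punctuation and unmapped characters pass through.
            out.append(ch)
            i += 1
    return "".join(out)


print(wx_to_devanagari("Amila kI betI"))  # prints: आमिल की बेटी
```

A one-off conversion like this would be applied to the analyser's lexicon and paradigm data, so that nothing in the running pipeline ever sees WX.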
Message #2
<hitesh> Hi
<spectei> hi
<hitesh> We were having a discussion about IIIT morphological parser yesterday
<spectei> cool
<spectei> you and who ? :)
<hitesh> and you said that we need to convert it into unicode
<hitesh> Do you want the output to be in unicode or what as the current output format is SSF
<spectei> output = unicode
<spectei> internal representation = unicode
<spectei> input = unicode
<hitesh> What is the advantage
<spectei> that we don't have to have any special processing for hindi
<spectei> that everything else in the project is in unicode
<spectei> that we don't have to do any special processing to the input/output text
<spectei> that we don't have to include other programs in apertium just for hindi
<hitesh> Basically we will have to build a wrapper as the algorithm they use is dependent on hindi
<spectei> no
<spectei> we are going to convert it
<spectei> no wrappers
<hitesh> Ok , so we will build the morphological analyzer from scratch
<spectei> you could do it that way if you want
<spectei> the alternative is to generate a full form list
<spectei> convert to unicode
<spectei> and re-infer paradigms from that
<spectei> that's what i did for adjectives and nouns
<spectei> where i more or less understand the inflection
<spectei> and your tagset
<spectei> for verbs i don't
<hitesh> We use FSTs in this , right so how we re-infer the paradigms
<spectei> you generate the possibilities
<spectei> and then merge them
<spectei> http://wiki.apertium.org/wiki/Speling_format
<spectei> http://wiki.apertium.org/wiki/Speling_tools
<spectei> http://wiki.apertium.org/wiki/Paradigm_chopper
<spectei> and we use FSTs too
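The "generate the possibilities and then merge them" route from the chat can be illustrated with a toy Python sketch. It is not the Paradigm chopper tool linked above, only the general idea of grouping lemmas whose forms share the same suffix-and-tag pattern. The four-field `lemma; surface; tags; pos` layout is an assumption modelled on the Speling format, the function name `infer_paradigms` is hypothetical, and the sample Hindi forms and tag strings are illustrative rather than taken from the IIIT analyser.

```python
import os
from collections import defaultdict

# Illustrative full-form list, one "lemma; surface; tags; pos" entry per line.
SAMPLE = """\
बेटी; बेटी; f.sg.nom; n
बेटी; बेटी; f.sg.obl; n
बेटी; बेटियाँ; f.pl.nom; n
बेटी; बेटियों; f.pl.obl; n
लड़की; लड़की; f.sg.nom; n
लड़की; लड़की; f.sg.obl; n
लड़की; लड़कियाँ; f.pl.nom; n
लड़की; लड़कियों; f.pl.obl; n
लड़का; लड़का; m.sg.nom; n
लड़का; लड़के; m.sg.obl; n
लड़का; लड़के; m.pl.nom; n
लड़का; लड़कों; m.pl.obl; n
""".splitlines()


def infer_paradigms(lines):
    """Group lemmas that inflect with the same set of (suffix, tags) pairs."""
    # Step 1: collect all surface forms per (lemma, pos).
    forms = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lemma, surface, tags, pos = [field.strip() for field in line.split(";")]
        forms[(lemma, pos)].append((surface, tags))

    # Step 2: the stem is the longest prefix shared by the lemma and all its
    # forms; what remains of each form is its suffix.
    paradigms = defaultdict(list)
    for (lemma, pos), entries in forms.items():
        stem = os.path.commonprefix([lemma] + [surface for surface, _ in entries])
        signature = frozenset(
            (surface[len(stem):], tags) for surface, tags in entries
        )
        # Step 3: lemmas with an identical signature share one paradigm.
        paradigms[(pos, signature)].append((lemma, stem))
    return paradigms


if __name__ == "__main__":
    for (pos, signature), members in infer_paradigms(SAMPLE).items():
        lemmas = ", ".join(lemma for lemma, _ in members)
        print(pos, sorted(signature), "<-", lemmas)
```

On this sample the two feminine nouns end up in one paradigm and the masculine noun in another. A real run would start from the analyser's generated forms after conversion to Unicode, and the merged groups would then be written out as paradigm definitions in the dictionary.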