Talk:Hindi and Urdu
Message #1
Hi,

You need to convert the IIIT morphological analyser to be compatible with Apertium. Thus:

1) The encoding needs to be changed from WX -> UTF-8
2) The tagset needs to be standardised along Apertium lines

These two tasks are non-negotiable and should be completed as part of your application.

You can find the incomplete language pair in SVN:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi

The analyser is partially converted (by me) here:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/apertium-ur-hi.hi.dix

Analyse Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin
^عامل/عامل<np><ant><m><sg><nom>$ ^کی/کا<post><f><sg><nom>$ ^بیٹی/بیٹی<n><f><sg><nom>/بیٹی<n><f><sg><obl>/بیٹی<n><f><sg><voc>$

Analyse Hindi:

$ echo "आमिल की बेटी" | lt-proc hi-ur.automorf.bin
^आमिल/आमिल<np><ant><m><sg><nom>$ ^की/का<post><f><sg><nom>/का<post><f><sg><obl>/का<post><f><pl><nom>/का<post><f><pl><obl>$ ^बेटी/बेटी<n><f><sg><nom>/बेटी<n><f><sg><obl>$

Tag Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob
^عامل<np><ant><m><sg><nom>$ ^کا<post><f><sg><nom>$ ^بیٹی<n><f><sg><nom>$

Transfer Urdu->Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin
^आमिल<np><ant><m><sg><nom>$ ^का<post><f><sg><nom>$ ^बेटी<n><f><sg><nom>$

Generate Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin | lt-proc -g ur-hi.autogen.bin
आमिल की बेटी

Best regards,
Fran
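Note: the first task (WX -> UTF-8) is essentially a character-level transliteration. Below is a minimal Python sketch of that step, assuming the commonly documented WX values for Devanagari. The table is partial (the nukta `Z` and rarer letters are left out), and the function name `wx_to_devanagari` is invented for this illustration. It is not the conversion that was actually used for the nouns and adjectives in SVN, so check the table against the IIIT documentation before relying on it.

```python
# Partial WX -> Devanagari table. ASSUMPTION: standard WX values; verify and
# extend (e.g. nukta "Z") against the IIIT documentation.
VIRAMA = "\u094d"

CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}

# WX vowel -> (independent letter, dependent sign used after a consonant).
# The inherent vowel "a" has no dependent sign.
VOWELS = {
    "a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "q": ("ऋ", "ृ"),
    "e": ("ए", "े"), "E": ("ऐ", "ै"), "o": ("ओ", "ो"), "O": ("औ", "ौ"),
}

# Anusvara, visarga, chandrabindu.
SIGNS = {"M": "ं", "H": "ः", "z": "ँ"}


def wx_to_devanagari(word):
    """Transliterate one WX-encoded word into Devanagari (UTF-8 text)."""
    out = []
    after_consonant = False  # True while a consonant still awaits its vowel
    for ch in word:
        if ch in CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)  # consonant cluster: no inherent vowel
            out.append(CONSONANTS[ch])
            after_consonant = True
        elif ch in VOWELS:
            independent, dependent = VOWELS[ch]
            out.append(dependent if after_consonant else independent)
            after_consonant = False
        else:
            # Known signs are mapped; anything else is passed through as-is.
            out.append(SIGNS.get(ch, ch))
            after_consonant = False
    if after_consonant:
        out.append(VIRAMA)  # word-final consonant without a written vowel
    return "".join(out)


print(wx_to_devanagari("aBAgA"))  # अभागा
```

The same function would be applied to both the surface form and the lemma of every dictionary entry, so that input, output and internal representation are all Unicode.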
Message #2
<hitesh> Hi
<spectei> hi
<hitesh> We were having a discussion about IIIT morphological parser yesterday
<spectei> cool
<spectei> you and who ? :)
<hitesh> and you said that we need to convert it into unicode
<hitesh> Do you want the output to be in unicode or what as the current output format is SSF
<spectei> output = unicode
<spectei> internal representation = unicode
<spectei> input = unicode
<hitesh> What is the advantage
<spectei> that we don't have to have any special processing for hindi
<spectei> that everything else in the project is in unicode
<spectei> that we don't have to do any special processing to the input/output text
<spectei> that we don't have to include other programs in apertium just for hindi
<hitesh> Basically we will have to build a wrapper as the algorithm they use is dependent on hindi
<spectei> no
<spectei> we are going to convert it
<spectei> no wrappers
<hitesh> Ok , so we will build the morphological analyzer from scratch
<spectei> you could do it that way if you want
<spectei> the alternative is to generate a full form list
<spectei> convert to unicode
<spectei> and re-infer paradigms from that
<spectei> that's what i did for adjectives and nouns
<spectei> where i more or less understand the inflection
<spectei> and your tagset
<spectei> for verbs i don't
<hitesh> We use FSTs in this , right so how we re-infer the paradigms
<spectei> you generate the possibilities
<spectei> and then merge them
<spectei> http://wiki.apertium.org/wiki/Speling_format
<spectei> http://wiki.apertium.org/wiki/Speling_tools
<spectei> http://wiki.apertium.org/wiki/Paradigm_chopper
<spectei> and we use FSTs too
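Note: to illustrate what "generate the possibilities and then merge them" can mean, here is a simplified Python sketch that groups a full form list by lemma, takes the longest common prefix as the stem, and merges lemmas whose suffix/tag sets are identical into one paradigm. This only illustrates the idea and is not a description of how the speling tools or the paradigm chopper work internally. The function name `infer_paradigms` and the `betA` rows are invented for the example; the `aBAgA` rows follow the forms on this page with the tags already converted.

```python
import os
from collections import defaultdict


def infer_paradigms(entries):
    """Group (surface, lemma, pos, tags) rows into shared paradigms.

    Returns a dict mapping (pos, frozenset of (suffix, tags)) to the list of
    (stem, lemma) pairs that inflect with exactly that suffix set.
    """
    forms = defaultdict(list)
    for surface, lemma, pos, tags in entries:
        forms[(lemma, pos)].append((surface, tags))

    paradigms = defaultdict(list)
    for (lemma, pos), analyses in forms.items():
        # The stem is what the lemma and all of its surface forms share.
        stem = os.path.commonprefix([lemma] + [s for s, _ in analyses])
        signature = frozenset((s[len(stem):], tags) for s, tags in analyses)
        paradigms[(pos, signature)].append((stem, lemma))
    return dict(paradigms)


ENTRIES = [
    ("aBAgA", "aBAgA", "n", "m.sg.nom"),
    ("aBAge", "aBAgA", "n", "m.pl.nom"),
    ("aBAge", "aBAgA", "n", "m.sg.obl"),
    ("aBAgoM", "aBAgA", "n", "m.pl.obl"),
    # Hypothetical second noun, added only to show two lemmas being merged.
    ("betA", "betA", "n", "m.sg.nom"),
    ("bete", "betA", "n", "m.pl.nom"),
    ("bete", "betA", "n", "m.sg.obl"),
    ("betoM", "betA", "n", "m.pl.obl"),
]

for (pos, signature), members in infer_paradigms(ENTRIES).items():
    print(pos, sorted(signature), members)
```

Both nouns end up with the stem plus the suffixes `A`, `e`, `e`, `oM`, so they share a single paradigm. The result still has to be checked by hand, especially for verbs.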
Message #3
[16:49] <spectei> NepaliKoChoro_, please document what i say
[16:49] <spectei> on the wiki
[16:49] <spectei> so i don't have to repeat it to the next guy :(
[16:49] <NepaliKoChoro_> yes i read it :)
[16:49] <spectei> $ lt-expand apertium-hi.hi_WX.dix | head
[16:49] <spectei> aBAgA:aBAgA<cat:adj><case:d><num:s><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:o><num:s><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:d><num:p><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:o><num:p><gen:m>
[16:50] <spectei> aBAgA:aBAgA<cat:n><num:s><case:d><gen:m>
[16:50] <spectei> aBAge:aBAgA<cat:n><num:p><case:d><gen:m>
[16:50] <spectei> aBAge:aBAgA<cat:n><num:s><case:o><gen:m>
[16:50] <spectei> aBAgoM:aBAgA<cat:n><num:p><case:o><gen:m>
[16:50] <spectei> aBAgI:aBAgI<cat:adj><case:any><num:any><gen:f>
[16:50] <spectei> aBagna:aBagna<cat:adj><gen:any><num:any><case:any>
[16:50] <spectei>
[16:50] <spectei> this will give you a full form list
[16:50] <NepaliKoChoro_> okay
[16:50] <spectei> $ lt-expand apertium-hi.hi_WX.dix | grep '<cat:v>' | head -3
[16:50] <spectei> acakacAUz:acakacA<cat:v><per:u><num:s><gen:any><tam:subj>
[16:50] <spectei> acakacAye:acakacA<cat:v><per:m><num:s><gen:any><tam:subj>
[16:50] <spectei> acakacAe:acakacA<cat:v><per:m><num:s><gen:any><tam:subj>
[16:50] <spectei>
[16:50] <spectei> this will give you verbs
[16:50] <spectei> you convert the tags here
[16:50] <spectei> to apertium ones
[16:50] <spectei> and you convert the words
[16:50] <spectei> to unicode
[16:50] == rahul [~Rahul@117.211.88.150] has joined #apertium
[16:50] <spectei> then you convert the whole thing to speling format
[16:51] <spectei> then you run speling-paradigms
[16:51] <spectei> then you run paradigm-chopper
[16:51] <spectei> then you check the result
[16:51] <spectei>
[16:51] <spectei> got it ?
[16:51] <NepaliKoChoro_> yes
[16:51] <NepaliKoChoro_> will be onto it
[16:51] <spectei> ok
[16:51] <NepaliKoChoro_> thank you very much
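Note: the "convert the tags" and "convert the whole thing to speling format" steps could be sketched in Python roughly as follows. Several things here are assumptions. The tag table is only inferred from the examples on this page (`case:d` -> `nom`, `case:o` -> `obl`, and so on) and does not yet cover `any` or the verb tags. The output uses the four fields `lemma; surface; symbols; part-of-speech` from the Speling format wiki page, but which tags belong in which field should be checked there. The function names are invented for the illustration.

```python
import re

# ASSUMPTION: IIIT -> Apertium tag mapping, inferred from the examples on this
# page. Values such as "any" and the verb tags (per, tam) are not covered yet.
TAG_MAP = {
    ("cat", "n"): "n", ("cat", "adj"): "adj",
    ("gen", "m"): "m", ("gen", "f"): "f",
    ("num", "s"): "sg", ("num", "p"): "pl",
    ("case", "d"): "nom", ("case", "o"): "obl",
}


def parse_expanded(line):
    """Split one `surface:lemma<key:value>...` line of lt-expand output."""
    surface, rest = line.strip().split(":", 1)
    lemma = rest.split("<", 1)[0]
    feats = dict(re.findall(r"<([^:<>]+):([^<>]+)>", rest))
    return surface, lemma, feats


def to_speling(line):
    """Return `lemma; surface; symbols; pos` with Apertium-style tags.

    Raises KeyError on any tag missing from TAG_MAP, so gaps in the table
    show up instead of passing through silently.
    """
    surface, lemma, feats = parse_expanded(line)
    pos = TAG_MAP[("cat", feats["cat"])]
    # Tag order follows the analyses on this page: gender, number, case.
    symbols = [TAG_MAP[(key, feats[key])] for key in ("gen", "num", "case")]
    return f"{lemma}; {surface}; {'.'.join(symbols)}; {pos}"


print(to_speling("aBAge:aBAgA<cat:n><num:s><case:o><gen:m>"))
# aBAgA; aBAge; m.sg.obl; n
```

The words are still in WX at this point; they would be converted to Unicode before the result is passed on to speling-paradigms and paradigm-chopper.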
Message #4
<Opsrc> :)
<Opsrc> forgot about one thing .. I got a way that morph analyser accepts unicode .. evrythng is fine .. just have to make the pipeline ..
<Opsrc> it's a python scrip I need to add ..
<spectei> no
<spectei> i told you
<spectei> you need to convert the xml
<spectei> :)
<spectei> <hitesh> Basically we will have to build a wrapper as the algorithm they use is dependent on hindi
<spectei> <spectei> no
<spectei> <spectei> we are going to convert it
<spectei> <spectei> no wrappers
<spectei>
<Opsrc> yeah i'd convert wx into hindi with that script ..
<Opsrc> I talked to the one who coded that morph analyser, he also agreed tht ther's no way to internally do that out ..
<spectei> why ?
<spectei> did you tell him what my suggestion was ?
<spectei> and that i already converted nouns and adjectives that way ?
<Opsrc> ohh u did it tht way ??
<spectei> i showed you :(
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/apertium-ur-hi.hi.dix
<Opsrc> no i din tel him abt that , just we discussed about my work and in between we had this discussion about morph analyser ..
<spectei> this was converted automatically
<spectei> from the hi_WX file
<Opsrc> ohh i'm looking to do it that way ..
<Opsrc> heh imself had no suggestion wht to do for any internal modification ..