Difference between revisions of "Talk:Hindi and Urdu"
m (Documented on the wiki as spectei said)
Revision as of 11:19, 1 April 2011
Message #1
Hi, you need to convert the IIIT morphological analyser to be compatible with Apertium. Thus:

1) The encoding needs to be changed from WX -> UTF-8
2) The tagset needs to be standardised along Apertium lines

These two tasks are non-negotiable and should be completed as part of your application.

You can find the incomplete language pair in SVN:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi

The analyser is partially converted (by me) here:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/apertium-ur-hi.hi.dix

Analyse Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin
^عامل/عامل<np><ant><m><sg><nom>$ ^کی/کا<post><f><sg><nom>$ ^بیٹی/بیٹی<n><f><sg><nom>/بیٹی<n><f><sg><obl>/بیٹی<n><f><sg><voc>$

Analyse Hindi:

$ echo "आमिल की बेटी" | lt-proc hi-ur.automorf.bin
^आमिल/आमिल<np><ant><m><sg><nom>$ ^की/का<post><f><sg><nom>/का<post><f><sg><obl>/का<post><f><pl><nom>/का<post><f><pl><obl>$ ^बेटी/बेटी<n><f><sg><nom>/बेटी<n><f><sg><obl>$

Tag Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob
^عامل<np><ant><m><sg><nom>$ ^کا<post><f><sg><nom>$ ^بیٹی<n><f><sg><nom>$

Transfer Urdu->Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin
^आमिल<np><ant><m><sg><nom>$ ^का<post><f><sg><nom>$ ^बेटी<n><f><sg><nom>$

Generate Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin | lt-proc -g ur-hi.autogen.bin
आमिल की बेटी

Best regards,
Fran
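
For task 1 (WX -> UTF-8), a minimal sketch of the character conversion might look like the following. It is not the conversion actually used for the partially converted dictionary: the mapping table is assumed from the usual WX scheme, covers only the basic letters (no nukta or rarer signs), and the function name `wx_to_devanagari` is made up for illustration.

```python
# Partial WX -> Devanagari table, assumed from the usual WX scheme.
# Nukta and rarer signs are deliberately left out of this sketch.
CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}
# vowel -> (independent letter, dependent sign used after a consonant)
VOWELS = {
    "a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "q": ("ऋ", "ृ"),
    "e": ("ए", "े"), "E": ("ऐ", "ै"), "o": ("ओ", "ो"), "O": ("औ", "ौ"),
}
MODIFIERS = {"M": "ं", "H": "ः", "z": "ँ"}
VIRAMA = "्"


def wx_to_devanagari(word):
    """Convert one WX-encoded word to Devanagari (hypothetical helper)."""
    out = []
    after_consonant = False  # True while the last consonant has no vowel yet
    for ch in word:
        if ch in CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)  # two consonants in a row form a cluster
            out.append(CONSONANTS[ch])
            after_consonant = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            # "a" after a consonant is the inherent vowel: nothing is written
            out.append(sign if after_consonant else independent)
            after_consonant = False
        elif ch in MODIFIERS:
            out.append(MODIFIERS[ch])
            after_consonant = False
        else:
            raise ValueError(f"no mapping for {ch!r} in {word!r}")
    if after_consonant:
        out.append(VIRAMA)  # word ends in a bare consonant
    return "".join(out)
```

Unknown characters raise an error rather than passing through, so gaps in the table show up when it is run over a full form list instead of silently leaving WX letters in the output.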
Message #2
<hitesh> Hi
<spectei> hi
<hitesh> We were having a discussion about IIIT morphological parser yesterday
<spectei> cool
<spectei> you and who ? :)
<hitesh> and you said that we need to convert it into unicode
<hitesh> Do you want the output to be in unicode or what as the current output format is SSF
<spectei> output = unicode
<spectei> internal representation = unicode
<spectei> input = unicode
<hitesh> What is the advantage
<spectei> that we don't have to have any special processing for hindi
<spectei> that everything else in the project is in unicode
<spectei> that we don't have to do any special processing to the input/output text
<spectei> that we don't have to include other programs in apertium just for hindi
<hitesh> Basically we will have to build a wrapper as the algorithm they use is dependent on hindi
<spectei> no
<spectei> we are going to convert it
<spectei> no wrappers
<hitesh> Ok , so we will build the morphological analyzer from scratch
<spectei> you could do it that way if you want
<spectei> the alternative is to generate a full form list
<spectei> convert to unicode
<spectei> and re-infer paradigms from that
<spectei> that's what i did for adjectives and nouns
<spectei> where i more or less understand the inflection
<spectei> and your tagset
<spectei> for verbs i don't
<hitesh> We use FSTs in this , right so how we re-infer the paradigms
<spectei> you generate the possibilities
<spectei> and then merge them
<spectei> http://wiki.apertium.org/wiki/Speling_format
<spectei> http://wiki.apertium.org/wiki/Speling_tools
<spectei> http://wiki.apertium.org/wiki/Paradigm_chopper
<spectei> and we use FSTs too
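
To make "generate the possibilities and then merge them" concrete, here is a rough sketch of the idea, not of what the speling tools or paradigm chopper actually do internally. It assumes each full-form entry is a `(surface, lemma, pos, tags)` tuple, takes the longest common prefix of a lemma and its forms as the stem, and merges lemmas whose suffix sets are identical. The function name `infer_paradigms` is made up for illustration.

```python
import os
from collections import defaultdict


def infer_paradigms(entries):
    """Group lemmas that inflect identically (illustrative sketch only).

    entries: iterable of (surface, lemma, pos, tags) tuples.
    Returns a dict mapping a paradigm signature to the list of stems using it.
    """
    # 1. Collect every surface form under its (lemma, pos) pair.
    by_lexeme = defaultdict(list)
    for surface, lemma, pos, tags in entries:
        by_lexeme[(lemma, pos)].append((surface, tags))

    # 2. Cut each lexeme into stem + suffixes and merge equal suffix sets.
    paradigms = defaultdict(list)
    for (lemma, pos), forms in by_lexeme.items():
        # os.path.commonprefix compares character by character, so it
        # works on plain strings as well as on paths.
        stem = os.path.commonprefix([lemma] + [surface for surface, _ in forms])
        cut = len(stem)
        signature = frozenset(
            (surface[cut:], lemma[cut:], pos, tags) for surface, tags in forms
        )
        paradigms[signature].append(stem)
    return dict(paradigms)
```

Two masculine nouns that both alternate -A / -e / -oM end up with the same signature and therefore share one paradigm, which is the "merge" step. The result still needs checking by hand, since a purely prefix-based stem can split a lemma in the wrong place.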
Message #3
[16:49] <spectei> NepaliKoChoro_, please document what i say
[16:49] <spectei> on the wiki
[16:49] <spectei> so i don't have to repeat it to the next guy :(
[16:49] <NepaliKoChoro_> yes i read it :)
[16:49] <spectei> $ lt-expand apertium-hi.hi_WX.dix | head
[16:49] <spectei> aBAgA:aBAgA<cat:adj><case:d><num:s><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:o><num:s><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:d><num:p><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:o><num:p><gen:m>
[16:50] <spectei> aBAgA:aBAgA<cat:n><num:s><case:d><gen:m>
[16:50] <spectei> aBAge:aBAgA<cat:n><num:p><case:d><gen:m>
[16:50] <spectei> aBAge:aBAgA<cat:n><num:s><case:o><gen:m>
[16:50] <spectei> aBAgoM:aBAgA<cat:n><num:p><case:o><gen:m>
[16:50] <spectei> aBAgI:aBAgI<cat:adj><case:any><num:any><gen:f>
[16:50] <spectei> aBagna:aBagna<cat:adj><gen:any><num:any><case:any>
[16:50] <spectei>
[16:50] <spectei> this will give you a full form list
[16:50] <NepaliKoChoro_> okay
[16:50] <spectei> $ lt-expand apertium-hi.hi_WX.dix | grep '<cat:v>' | head -3
[16:50] <spectei> acakacAUz:acakacA<cat:v><per:u><num:s><gen:any><tam:subj>
[16:50] <spectei> acakacAye:acakacA<cat:v><per:m><num:s><gen:any><tam:subj>
[16:50] <spectei> acakacAe:acakacA<cat:v><per:m><num:s><gen:any><tam:subj>
[16:50] <spectei>
[16:50] <spectei> this will give you verbs
[16:50] <spectei> you convert the tags here
[16:50] <spectei> to apertium ones
[16:50] <spectei> and you convert the words
[16:50] <spectei> to unicode
[16:50] == rahul [~Rahul@117.211.88.150] has joined #apertium
[16:50] <spectei> then you convert the whole thing to speling format
[16:51] <spectei> then you run speling-paradigms
[16:51] <spectei> then you run paradigm-chopper
[16:51] <spectei> then you check the result
[16:51] <spectei>
[16:51] <spectei> got it ?
[16:51] <NepaliKoChoro_> yes
[16:51] <NepaliKoChoro_> will be onto it
[16:51] <spectei> ok
[16:51] <NepaliKoChoro_> thank you very much
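
The "convert the tags" and "convert to speling format" steps could be sketched roughly as below, working on one line of `lt-expand` output at a time. Several things here are assumptions rather than anything stated in the log: the tag correspondences in `TAG_MAP` (including `mf` and `sp` for the "any" values), the gender.number.case ordering, and the four-field `lemma; surface; symbols; pos` layout, which should be checked against the Speling format wiki page. Verbs and any feature value not in the table are skipped, and the words are left in WX. The function name `to_speling` is made up for illustration.

```python
import re

# Assumed IIIT -> Apertium correspondences. Only values that appear in
# the sample nouns and adjectives are listed; "mf" and "sp" are guesses.
TAG_MAP = {
    ("cat", "n"): "n",
    ("cat", "adj"): "adj",
    ("gen", "m"): "m",
    ("gen", "f"): "f",
    ("gen", "any"): "mf",
    ("num", "s"): "sg",
    ("num", "p"): "pl",
    ("num", "any"): "sp",
    ("case", "d"): "nom",
    ("case", "o"): "obl",
}
FEATURE = re.compile(r"<([a-z]+):([a-z]+)>")


def to_speling(line):
    """Turn 'surface:lemma<cat:..><..>' into a speling-style line.

    Returns None when a feature has no entry in TAG_MAP (e.g. verbs),
    so unmapped material can be reviewed by hand instead of guessed.
    """
    surface, analysis = line.strip().split(":", 1)
    lemma = analysis.split("<", 1)[0]
    features = dict(FEATURE.findall(analysis))
    try:
        pos = TAG_MAP[("cat", features["cat"])]
        # The IIIT feature order varies between categories, so the
        # symbols are emitted in one fixed order: gender, number, case.
        symbols = [TAG_MAP[(key, features[key])] for key in ("gen", "num", "case")]
    except KeyError:
        return None
    return f"{lemma}; {surface}; {'.'.join(symbols)}; {pos}"
```

Lines that come back as `None` (the verb entries, or adjectives with `case:any`) are the ones where the tag mapping still has to be decided.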