Difference between revisions of "Talk:Hindi and Urdu"

Revision as of 11:19, 1 April 2011

Message #1

Hi, 

You need to convert the IIIT morphological analyser so that it is compatible with Apertium. That means:

1) The text needs to be converted from WX notation to Unicode (UTF-8)
2) The tagset needs to be standardised along Apertium lines

These two tasks are non-negotiable and should be completed as part of your application.
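
As a rough illustration of the first task, here is a minimal Python sketch of a WX to Devanagari converter. The mapping tables are my own reading of the WX scheme and cover only the basic letters, and the function name wx_to_devanagari is made up for this example; it is not part of the IIIT analyser or of Apertium, so check it against the real data before relying on it.

```python
# Minimal WX -> Devanagari sketch. The tables below are an assumed, partial
# mapping of the WX scheme; verify and extend them before real use.
VOWELS = {  # WX letter -> (independent vowel, dependent vowel sign)
    "a": ("अ", ""),  # inherent vowel: no sign is written after a consonant
    "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "q": ("ऋ", "ृ"),
    "e": ("ए", "े"), "E": ("ऐ", "ै"), "o": ("ओ", "ो"), "O": ("औ", "ौ"),
}
CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}
MODIFIERS = {"M": "ं", "H": "ः", "z": "ँ"}  # anusvara, visarga, candrabindu
NUKTA_WX, NUKTA = "Z", "़"
VIRAMA = "्"


def wx_to_devanagari(word):
    """Convert one WX-encoded word to Devanagari (Unicode)."""
    out = []
    pending = False  # True while the last consonant has no vowel yet
    for ch in word:
        if ch in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: join with virama
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            out.append(sign if pending else independent)
            pending = False
        elif ch == NUKTA_WX and pending:
            out.append(NUKTA)  # nukta modifies the consonant; a vowel may follow
        else:
            if pending:
                out.append(VIRAMA)  # bare consonant before a non-vowel symbol
            out.append(MODIFIERS.get(ch, ch))  # unknown characters pass through
            pending = False
    if pending:
        out.append(VIRAMA)  # word ends in a consonant with no written vowel
    return "".join(out)


if __name__ == "__main__":
    for wx in ["aBAgA", "aBAge", "aBAgoM", "aBagna"]:
        print(wx, wx_to_devanagari(wx))
```

The converter assumes that WX writes every schwa explicitly as "a", so two consonants in a row are treated as a conjunct.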

You can find the incomplete language pair in SVN:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi

The analyser is partially converted (by me) here:

https://apertium.svn.sourceforge.net/svnroot/apertium/nursery/apertium-ur-hi/apertium-ur-hi.hi.dix

Analyse Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin 
^عامل/عامل<np><ant><m><sg><nom>$ ^کی/کا<post><f><sg><nom>$ ^بیٹی/بیٹی<n><f><sg><nom>/بیٹی<n><f><sg><obl>/بیٹی<n><f><sg><voc>$

Analyse Hindi:

$ echo "आमिल की बेटी" | lt-proc hi-ur.automorf.bin
^आमिल/आमिल<np><ant><m><sg><nom>$ ^की/का<post><f><sg><nom>/का<post><f><sg><obl>/का<post><f><pl><nom>/का<post><f><pl><obl>$ ^बेटी/बेटी<n><f><sg><nom>/बेटी<n><f><sg><obl>$

Tag Urdu:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob 
^عامل<np><ant><m><sg><nom>$ ^کا<post><f><sg><nom>$ ^بیٹی<n><f><sg><nom>$

Transfer Urdu->Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob  | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin
^आमिल<np><ant><m><sg><nom>$ ^का<post><f><sg><nom>$ ^बेटी<n><f><sg><nom>$

Generate Hindi:

$ echo "عامل کی بیٹی" | lt-proc ur-hi.automorf.bin | apertium-tagger -g ur-hi.prob  | apertium-transfer apertium-ur-hi.ur-hi.t1x ur-hi.t1x.bin ur-hi.autobil.bin | lt-proc -g ur-hi.autogen.bin
आमिल की बेटी

Best regards,

Fran

Message #2

<hitesh> Hi
<spectei> hi
<hitesh> We were having a discussion about IIIT morphological parser yesterday
<spectei> cool
<spectei> you and who ? :)
<hitesh> and you said that we need to convert it into unicode
<hitesh> Do you want the output to be in unicode or what as the current output format is SSF
<spectei> output = unicode
<spectei> internal representation = unicode
<spectei> input = unicode
<hitesh> What is the advantage
<spectei> that we don't have to have any special processing for hindi
<spectei> that everything else in the project is in unicode
<spectei> that we don't have to do any special processing to the input/output text
<spectei> that we don't have to include other programs in apertium just for hindi
<hitesh> Basically we will have to build a wrapper as the algorithm they use is dependent on hindi
<spectei> no
<spectei> we are going to convert it
<spectei> no wrappers
<hitesh> Ok , so we will build the morphological analyzer from scratch
<spectei> you could do it that way if you want
<spectei> the alternative is to generate a full form list
<spectei> convert to unicode
<spectei> and re-infer paradigms from that
<spectei> that's what i did for adjectives and nouns
<spectei> where i more or less understand the inflection
<spectei> and your tagset
<spectei> for verbs i don't
<hitesh> We use FSTs in this , right so how we re-infer the paradigms
<spectei> you generate the possibilities 
<spectei> and then merge them
<spectei> http://wiki.apertium.org/wiki/Speling_format
<spectei> http://wiki.apertium.org/wiki/Speling_tools
<spectei> http://wiki.apertium.org/wiki/Paradigm_chopper
<spectei> and we use FSTs too
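
To make the "generate the possibilities and then merge them" idea concrete, here is a small Python sketch of paradigm re-inference from a full form list. It is only an illustration of the general technique (group forms by lemma, strip the common stem, merge lemmas whose endings and tags match). It is not the algorithm used by the speling tools or the paradigm chopper linked above, and the function name infer_paradigms is made up for this example. The aBAgA forms come from the analyser's WX output; kamarA is a second noun added purely for illustration.

```python
import os
from collections import defaultdict


def infer_paradigms(full_forms):
    """Group lemmas that inflect identically.

    full_forms: iterable of (surface, lemma, tags) triples, where tags is a
    tuple of tag strings whose first element is the part of speech.
    Returns a dict mapping a paradigm signature to the list of stems using it.
    """
    # Step 1: collect all forms of each (lemma, part of speech) entry.
    by_entry = defaultdict(set)
    for surface, lemma, tags in full_forms:
        by_entry[(lemma, tags[0])].add((surface, tags))

    # Step 2: cut off the shared stem and merge entries with equal endings.
    paradigms = defaultdict(list)
    for (lemma, _pos), forms in by_entry.items():
        stem = os.path.commonprefix([lemma] + [s for s, _ in forms])
        endings = frozenset((s[len(stem):], tags) for s, tags in forms)
        signature = (lemma[len(stem):], endings)
        paradigms[signature].append(stem)
    return paradigms


if __name__ == "__main__":
    sg_d = ("cat:n", "num:s", "case:d", "gen:m")
    pl_d = ("cat:n", "num:p", "case:d", "gen:m")
    sg_o = ("cat:n", "num:s", "case:o", "gen:m")
    pl_o = ("cat:n", "num:p", "case:o", "gen:m")
    forms = [
        ("aBAgA", "aBAgA", sg_d), ("aBAge", "aBAgA", pl_d),
        ("aBAge", "aBAgA", sg_o), ("aBAgoM", "aBAgA", pl_o),
        # kamarA is an illustrative extra noun, not from the log above
        ("kamarA", "kamarA", sg_d), ("kamare", "kamarA", pl_d),
        ("kamare", "kamarA", sg_o), ("kamaroM", "kamarA", pl_o),
    ]
    for (lemma_ending, endings), stems in infer_paradigms(forms).items():
        print(sorted(stems), "lemma ending:", lemma_ending, "forms:", len(endings))
```

Both nouns end up under one signature, which is the kind of merge that lets a single paradigm definition serve many dictionary entries.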

Message #3

[16:49] <spectei> NepaliKoChoro_, please document what i say
[16:49] <spectei> on the wiki
[16:49] <spectei> so i don't have to repeat it to the next guy :(
[16:49] <NepaliKoChoro_> yes i read it :)
[16:49] <spectei> $ lt-expand apertium-hi.hi_WX.dix | head
[16:49] <spectei> aBAgA:aBAgA<cat:adj><case:d><num:s><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:o><num:s><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:d><num:p><gen:m>
[16:49] <spectei> aBAge:aBAgA<cat:adj><case:o><num:p><gen:m>
[16:50] <spectei> aBAgA:aBAgA<cat:n><num:s><case:d><gen:m>
[16:50] <spectei> aBAge:aBAgA<cat:n><num:p><case:d><gen:m>
[16:50] <spectei> aBAge:aBAgA<cat:n><num:s><case:o><gen:m>
[16:50] <spectei> aBAgoM:aBAgA<cat:n><num:p><case:o><gen:m>
[16:50] <spectei> aBAgI:aBAgI<cat:adj><case:any><num:any><gen:f>
[16:50] <spectei> aBagna:aBagna<cat:adj><gen:any><num:any><case:any>
[16:50] <spectei>  
[16:50] <spectei> this will give you a full form list
[16:50] <NepaliKoChoro_> okay
[16:50] <spectei> $ lt-expand apertium-hi.hi_WX.dix | grep '<cat:v>' | head -3
[16:50] <spectei> acakacAUz:acakacA<cat:v><per:u><num:s><gen:any><tam:subj>
[16:50] <spectei> acakacAye:acakacA<cat:v><per:m><num:s><gen:any><tam:subj>
[16:50] <spectei> acakacAe:acakacA<cat:v><per:m><num:s><gen:any><tam:subj>
[16:50] <spectei>  
[16:50] <spectei> this will give you verbs
[16:50] <spectei> you convert the tags here
[16:50] <spectei> to apertium ones
[16:50] <spectei> and you convert the words
[16:50] <spectei> to unicode
[16:50] == rahul [~Rahul@117.211.88.150] has joined #apertium
[16:50] <spectei> then you convert the whole thing to speling format
[16:51] <spectei> then you run speling-paradigms
[16:51] <spectei> then you run paradigm-chopper
[16:51] <spectei> then you check the result
[16:51] <spectei>  
[16:51] <spectei> got it ?
[16:51] <NepaliKoChoro_> yes
[16:51] <NepaliKoChoro_> will be onto it
[16:51] <spectei> ok
[16:51] <NepaliKoChoro_> thank you very much
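
The tag conversion step described in this log could be sketched in Python roughly as follows. The regular expression matches only the plain surface:lemma<tags> lines shown above. The tag correspondences in TAG_MAP (for example case:d to nom and case:o to obl) and the output order (part of speech, gender, number, case) are assumptions based on the Apertium analyses quoted in Message #1; they are not a confirmed mapping. The function name convert_line is made up for this example.

```python
import re

# Assumed IIIT -> Apertium tag correspondences; extend for verbs and others.
TAG_MAP = {
    ("cat", "n"): "n", ("cat", "adj"): "adj",
    ("gen", "m"): "m", ("gen", "f"): "f",
    ("num", "s"): "sg", ("num", "p"): "pl",
    ("case", "d"): "nom", ("case", "o"): "obl",
}
# Assumed output order, modelled on analyses such as <n><f><sg><nom>.
ORDER = ["cat", "gen", "num", "case"]

LINE_RE = re.compile(r"^(?P<surface>[^:]+):(?P<lemma>[^<]+)(?P<tags>(?:<[^>]+>)+)$")


def convert_line(line):
    """Rewrite one lt-expand line with Apertium-style tags."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised lt-expand line: %r" % line)
    # Turn "<cat:n><num:p>..." into {"cat": "n", "num": "p", ...}.
    features = dict(t.split(":", 1) for t in re.findall(r"<([^>]+)>", match["tags"]))
    tags = []
    for key in ORDER:
        if key not in features:
            continue
        pair = (key, features[key])
        if pair not in TAG_MAP:
            # Values such as "any" need a decision (e.g. expanding to every
            # concrete value); this sketch just reports them.
            raise ValueError("no mapping for <%s:%s>" % pair)
        tags.append("<%s>" % TAG_MAP[pair])
    return "%s:%s%s" % (match["surface"], match["lemma"], "".join(tags))


if __name__ == "__main__":
    print(convert_line("aBAge:aBAgA<cat:n><num:p><case:d><gen:m>"))
    # prints: aBAge:aBAgA<n><m><pl><nom>
```

Entries carrying "any" values, such as the aBAgI adjective line, raise an error here on purpose so that they are handled explicitly rather than silently dropped.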
