Talk:Apertium New Language Pair HOWTO
Possibly, at the very end, one could mention the possibility of typing
make sh-en.t1x.bin
etc. instead of all those different commands, for the language pairs privileged enough to have fancy makefiles. --Unhammer
Files
apertium-sh-en.sh-en.dix
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
    <sdef n="vblex"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>
    <e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e>
  </section>
</dictionary>
apertium-sh-en.sh.dix
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>ABCCCDDzZEFGHIJKLLjMNNjOPRSŠTUVZŽabcćčddžefghijklljmnnjoprsštuvzž</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
    <sdef n="vblex"/>
    <sdef n="p1"/>
    <sdef n="pri"/>
  </sdefs>
  <pardefs>
    <pardef n="gramofon__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
      <e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
    <pardef n="vid/eti__vblex">
      <e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e>
      <e><p><l>imo</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
    <e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e>
  </section>
</dictionary>
apertium-sh-en.sh-en.t1x
<?xml version="1.0" encoding="UTF-8"?>
<transfer>
  <section-def-cats>
    <def-cat n="nom">
      <cat-item tags="n.*"/>
    </def-cat>
    <def-cat n="vrb">
      <cat-item tags="vblex.*"/>
    </def-cat>
    <def-cat n="prpers">
      <cat-item lemma="prpers" tags="prn.*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
    <def-attr n="a_nom">
      <attr-item tags="n"/>
    </def-attr>
    <def-attr n="temps">
      <attr-item tags="pri"/>
    </def-attr>
    <def-attr n="pers">
      <attr-item tags="p1"/>
    </def-attr>
    <def-attr n="a_verb">
      <attr-item tags="vblex"/>
    </def-attr>
    <def-attr n="tipus_prn">
      <attr-item tags="prn.subj"/>
      <attr-item tags="prn.obj"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="number"/>
  </section-def-vars>
  <section-rules>
    <rule>
      <pattern>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <lu>
            <clip pos="1" side="tl" part="lem"/>
            <clip pos="1" side="tl" part="a_nom"/>
            <clip pos="1" side="tl" part="nbr"/>
          </lu>
        </out>
      </action>
    </rule>
    <rule>
      <pattern>
        <pattern-item n="vrb"/>
      </pattern>
      <action>
        <out>
          <lu>
            <lit v="prpers"/>
            <lit-tag v="prn"/>
            <lit-tag v="subj"/>
            <clip pos="1" side="tl" part="pers"/>
            <clip pos="1" side="tl" part="nbr"/>
          </lu>
          <b/>
          <lu>
            <clip pos="1" side="tl" part="lem"/>
            <clip pos="1" side="tl" part="a_verb"/>
            <clip pos="1" side="tl" part="temps"/>
          </lu>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>
apertium-sh-en.en.dix
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>ABCCCDDzZEFGHIJKLLjMNNjOPRSŠTUVZŽabcćčddžefghijklljmnnjoprsštuvzž</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
    <sdef n="vblex"/>
    <sdef n="p1"/>
    <sdef n="pri"/>
    <sdef n="prn"/>
    <sdef n="subj"/>
  </sdefs>
  <pardefs>
    <pardef n="gramophone__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
    <pardef n="s/ee__vblex">
      <e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e>
    </pardef>
    <pardef n="prsubj__prn">
      <e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e>
    <e lm="see"><i>s</i><par n="s/ee__vblex"/></e>
    <e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e>
    <e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e>
  </section>
</dictionary>
buildall.sh:
lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin
lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
lt-comp rl apertium-sh-en.en.dix en-sh.autogen.bin
lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin
apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin
test.sh:
echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \
  gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
  apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \
  lt-proc -g sh-en.autogen.bin
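As far as can be read from the gawk one-liner in test.sh, it stands in for the missing tagger: for every lexical unit it drops the surface form and keeps only the first analysis. The following is a rough Python sketch of the same idea, not a replacement for the script. The helper name `first_analysis` is made up for illustration, the sample stream is written by hand from the dictionaries above, and the regular expression assumes the stream contains no escaped `^`, `/` or `$`.

```python
import re

# One lexical unit as printed by lt-proc: ^surface/analysis1/analysis2$
# Assumption: no escaped '^', '/' or '$' characters inside the unit.
LEXICAL_UNIT = re.compile(r"\^[^/$^]*/([^/$]*)(?:/[^$]*)?\$")


def first_analysis(stream: str) -> str:
    """Keep only the first analysis of each lexical unit (hypothetical helper)."""
    return LEXICAL_UNIT.sub(lambda m: "^" + m.group(1) + "$", stream)


if __name__ == "__main__":
    # Hand-written sample of analyser output for "vidim gramofone".
    analysed = "^vidim/videti<vblex><pri><p1><sg>$ ^gramofone/gramofon<n><pl>$"
    print(first_analysis(analysed))
```

Text between lexical units (here the single space) is passed through unchanged, which is what the `printf COMPONENTS[i]` loop in the gawk version appears to do.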
Discussion
output:
echo "gramofoni" | lt-proc sh-en.automorf.bin | \
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \
> lt-proc -g sh-en.autogen.bin
#gramophone
echo "vidim" | lt-proc sh-en.automorf.bin | \
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \
> lt-proc -g sh-en.autogen.bin
#prpers #see
echo "vidim gramofoni" | lt-proc sh-en.automorf.bin | \
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \
lt-proc -g sh-en.autogen.bin
#prpers #see
- Can you show the output of each stage ? - Francis Tyers 09:20, 30 April 2009 (UTC)
- Ok, I see the error. Please change your buildall.sh to:
lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin
lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin
lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin
apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin
- Francis Tyers 09:50, 30 April 2009 (UTC)
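The rule behind the corrected script seems to be: for the sh-en direction the analyser is compiled from the sh dictionary read left to right (lr), while the generator is compiled from the en dictionary read right to left (rl), and the reverse for en-sh. The earlier script had the two generator dictionaries swapped. A small Python sketch that only prints the command lines for a pair may make the symmetry easier to see; `build_commands` is a hypothetical helper, and the file naming simply follows the files on this page.

```python
from typing import List


def build_commands(src: str, trg: str) -> List[str]:
    """Return the compile commands for one pair (illustrative only, runs nothing)."""
    prefix = f"apertium-{src}-{trg}"
    pair, rev = f"{src}-{trg}", f"{trg}-{src}"
    return [
        # src -> trg: analyse src (lr), generate trg (rl)
        f"lt-comp lr {prefix}.{src}.dix {pair}.automorf.bin",
        f"lt-comp rl {prefix}.{trg}.dix {pair}.autogen.bin",
        # trg -> src: analyse trg (lr), generate src (rl)
        f"lt-comp lr {prefix}.{trg}.dix {rev}.automorf.bin",
        f"lt-comp rl {prefix}.{src}.dix {rev}.autogen.bin",
        # bilingual dictionary, compiled once per direction
        f"lt-comp lr {prefix}.{pair}.dix {pair}.autobil.bin",
        f"lt-comp rl {prefix}.{pair}.dix {rev}.autobil.bin",
        # transfer rules for src -> trg
        f"apertium-preprocess-transfer {prefix}.{pair}.t1x {pair}.t1x.bin",
    ]


if __name__ == "__main__":
    print("\n".join(build_commands("sh", "en")))
```

Called with "sh" and "en", this prints the same seven lines as the corrected buildall.sh.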
en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6/en-sh$ echo "vidim gramofoni" | lt-proc sh-en.automorf.bin | \
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \
> lt-proc -g sh-en.autogen.bin
I see gramophones
Muki987 10:57, 30 April 2009 (UTC)
- Perfect! :) Is that an error in the HOWTO ? Or was it in your script? - Francis Tyers 11:50, 30 April 2009 (UTC)
- It was an error in the build script. IMHO it would be best to add the complete sources, including the build script, to the web page to help beginners. It is almost impossible to get it working with the present description, since building is, as far as I can see, not documented. Thanks for your help. Muki987 12:41, 30 April 2009 (UTC)
- Every stage is documented. It might be worth giving a link to a "finished piece", but part of the aim of the HOWTO is to get people starting to make the files themselves. Even just typing out something helps in my experience. And I think all of the build steps are given on the page, just not in order. What might be a good idea is to put in all the steps in one place in the end? - Francis Tyers 13:08, 30 April 2009 (UTC)
- For the sake of precision, I checked:
- necessary to eliminate line <clip pos="1" side="tl" part="lem"/> from t1x file
- necessary to add
<e> <p> <l>e</l> <r><s n="n"/><s n="pl"/></r> </p> </e>
to sh-en.sh.dix file
- not necessary to modify gramofon__n to gramophone__n in sh-en.en.dix
Muki987 13:09, 30 April 2009 (UTC)
- The build thing was an error in the page here. - Francis Tyers 13:10, 30 April 2009 (UTC)
- Yes, building was the third error in my original setup (in fact, a build script was not included, but as we see here, it is necessary). Muki987 14:13, 30 April 2009 (UTC)
Regarding the second problem, there is a paragraph of text that mentions it:
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too, no need to add the case information for now though, we just add it as another option for plural. So, just copy the 'e' block for 'i' and change the 'i' to 'e' there.
Although perhaps I should put in the XML. As far as I can tell the transfer rule is fine, and the error was caused by a typo no? - Francis Tyers 15:03, 30 April 2009 (UTC)
- Sloppiness on my side. If all the data and the build script are available to the user, this kind of error disappears. Muki987 19:03, 30 April 2009 (UTC)
Documentation ideas
- How about if I just put a pointer to the talk page for copies of the "complete" files? It's just that I think a step-by-step approach really can help people get a grip on the format... or I wouldn't have done it like this :s - Francis Tyers 21:08, 30 April 2009 (UTC)
- That's fine. Also, a bit more explanation of what <clip> does would be good. What does <clip pos="1" side="tl" part="whatever"/> mean? I still do not understand that 100%. Muki987 20:39, 1 May 2009 (UTC)
Yes, I wrote the HOWTO when I didn't really understand either ;) ... Basically you define your "attributes" in def-attr, and <clip ... "pulls" these out of the lexical unit. So for example:
<section-def-attrs>
  <def-attr n="a_nom">
    <attr-item tags="n"/>
    <attr-item tags="np"/>
    <attr-item tags="np.top"/>
    <attr-item tags="np.al"/>
  </def-attr>
  <def-attr n="nbr">
    <attr-item tags="sg"/>
    <attr-item tags="pl"/>
  </def-attr>
  <def-attr n="gen">
    <attr-item tags="f"/>
    <attr-item tags="m"/>
    <attr-item tags="nt"/>
  </def-attr>
</section-def-attrs>
...
<clip pos="1" side="tl" part="a_nom"/>
<!-- can be 'n', 'np', 'np.al' or 'np.top'; if input stream is ^foo<np>$ it will be <np>, and if input stream is ^foo<np><top><m><sg>$ it will be <np><top> -->
<clip pos="1" side="tl" part="gen"/>
<!-- can be 'f', 'm' or 'nt'; if input stream is ^foo<n><m><sg>$ it will be <m> -->
<clip pos="1" side="tl" part="nbr"/>
<!-- can be 'sg' or 'pl'; if input stream is ^foo<n><sg>$ it will be <sg> -->
It is basically for extracting _parts_ of the information out of the lexical unit. - Francis Tyers 21:06, 1 May 2009 (UTC)
- As usual. One cannot understand everything immediately. I think the description is very helpful; at least it helped me a lot. Without it one is completely lost. No practice, no sense, at least for me. Is pos always "1" and side always "tl"? If so, what is their purpose? Muki987 09:49, 2 May 2009 (UTC)
- side can be "tl" (target language) or "sl" (source language), pos = position, so if you have a rule which matches 4 lexical units, pos="1" will be the first, pos="2" will be the second etc. - Francis Tyers 10:57, 2 May 2009 (UTC)
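To make the behaviour described in this thread concrete, here is a toy Python model of what a clip appears to do: pick the lexical unit at position pos, then return whichever attr-item of the named def-attr it finds among that unit's tags, preferring the longest one. This is only a sketch under those assumptions and not how Apertium implements it; `DEF_ATTRS` and `clip` are illustrative names, and the side ("sl" or "tl") is modelled simply by passing in the list of units for that side.

```python
from typing import Dict, List

# Attribute definitions copied from the def-attr example in this thread.
DEF_ATTRS: Dict[str, List[str]] = {
    "a_nom": ["n", "np", "np.top", "np.al"],
    "nbr": ["sg", "pl"],
    "gen": ["f", "m", "nt"],
}


def clip(units: List[str], pos: int, part: str) -> str:
    """Toy model of <clip pos=".." part=".."/> for the units of one side."""
    unit = units[pos - 1].strip("^$")  # pos is 1-based, as in the rules
    lemma, sep, rest = unit.partition("<")
    tags = sep + rest  # e.g. "<np><top><m><sg>", or "" if there are no tags
    if part == "lem":
        return lemma
    # Try longer attr-items first so that "np.top" wins over "np".
    for item in sorted(DEF_ATTRS[part], key=len, reverse=True):
        wanted = "".join(f"<{tag}>" for tag in item.split("."))
        if wanted in tags:
            return wanted
    return ""  # nothing matched: the clip contributes nothing


if __name__ == "__main__":
    matched = ["^foo<np><top><m><sg>$", "^bar<n><pl>$"]
    print(clip(matched, 1, "a_nom"))
    print(clip(matched, 1, "gen"))
    print(clip(matched, 2, "nbr"))
```

With a rule that matches two lexical units, pos="2" in this model reads from the second element of the list, which mirrors the explanation of pos given above.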
faking the tagger (gawk vs wget)
Little suggestion: the gawk script in the Transfer part looks confusing, and newbies won't ever use it again once they have a tagger. How about instead saying
We don't have a POS tagger yet, but we can yank one from another language pair for now. It'll perform badly but we can make our own tagger later. First do
$ wget http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-en-es/en-es.prob -O sh-en.prob
(alternatively, follow that link and save the file as "sh-en.prob" in your working folder). You can try the tagger with the following command:
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | apertium-tagger -p sh-en.prob
The 'number' var is not used
The global variable 'number' defined in the transfer rule file doesn't seem to be useful for this tutorial. --Grégoire 11:48, 17 August 2012 (UTC)
- The DTD requires the section "section-def-vars", which is why it is included. Perhaps a comment could note that the variable is unused? - Francis Tyers 16:48, 17 August 2012 (UTC)
the '@' sign
The '@' sign in the output is never explained. And according to this page, Apertium stream format, it means that there is an untranslated lemma. But that doesn't seem to be the case in any example... --Grégoire 11:50, 17 August 2012 (UTC)
- I added a comment :) - Francis Tyers 19:04, 17 August 2012 (UTC)
Modes
Given this is a key wikipage, it ought to mention them. Or maybe now be rewritten to use them, and introduce the direct use of the tools as a gentle suggestion? Big job, though.