Apertium - User contributions [en], 2024-03-28T10:50:08Z

Minimal installation from SVN, 2014-07-09T09:34:02Z (https://wiki.apertium.org/w/index.php?title=Minimal_installation_from_SVN&diff=49298)
<p>Grégoire: /* Configure, build and install */</p>
<hr />
<div>{{TOCD}}<br />
This guide shows you how to download, configure, compile and install core apertium packages and language data. It assumes you've already installed the '''prerequisites''' for your system – if you have not, see the system-specific guides under [[Installation]]. If you run into trouble, see [[Installation troubleshooting]].<br />
<br />
''Note: some pairs require more than the four packages described here. See the bottom of this page if your language pair complains about lacking CG, HFST or language data like <code>apertium-rus</code>.''<br />
<br />
==Installing apertium and a language pair==<br />
<br />
===Download===<br />
For most language pairs, these are the packages you need:<br />
<br />
* lttoolbox<br />
* apertium<br />
* apertium-lex-tools<br />
* the language pair(s) you are interested in<br />
<br />
Here are the commands if you would like the Esperanto-English pair:<br />
<pre><br />
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/lttoolbox<br />
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium<br />
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools<br />
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-eo-en<br />
</pre><br />
<br />
''Note: please make sure that the directory where you put these files (i.e. where you run the svn command) doesn't contain spaces or other special characters. These may cause errors while compiling/linking.''<br />
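<br />
A quick way to check a directory before checking anything out is sketched below. The function name <code>path_is_build_safe</code> and the set of "safe" characters are assumptions made for illustration; they are not part of Apertium.<br />

```shell
# Hypothetical helper (not part of Apertium): succeed only if the given
# path is made up of letters, digits and the characters / . _ + -
path_is_build_safe() {
    case "$1" in
        *[!A-Za-z0-9/._+-]*) return 1 ;;  # a space or other unusual character was found
        *) return 0 ;;
    esac
}

# Usage: warn before checking out sources into the current directory.
if ! path_is_build_safe "$PWD"; then
    echo "Warning: '$PWD' contains spaces or special characters" >&2
fi
```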
<br />
If you want a pair other than eo-en, only the last line needs changing. To see the available 'released' language pairs, go to https://svn.code.sf.net/p/apertium/svn/trunk/ (pairs that are in development are in the incubator/nursery/staging subdirectories of https://svn.code.sf.net/p/apertium/svn/).<br />
<br />
If a language pair has more dependencies than the three shown above, the <code>README</code> should mention it (and the <code>autogen.sh</code> step should fail with a message about what is missing). The bottom of this page has pointers on how to install other possible dependencies.<br />
<br />
===Set up environment===<br />
By default, Apertium is installed under the directory <code>/usr/local</code>, which requires root (sudo) access when installing. If that's fine with you, begin by pasting these lines into your terminal:<br />
<pre><br />
LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}<br />
export LD_LIBRARY_PATH<br />
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:${PKG_CONFIG_PATH}<br />
export PKG_CONFIG_PATH<br />
</pre><br />
You should also put those lines in your <code>~/.bashrc</code> so you don't have to paste them into every terminal you open.<br />
<br />
However, if you want it installed somewhere else or don't want to install it as root, instead paste these lines into your terminal:<br />
<pre><br />
PREFIX=$HOME/local # or wherever you want apertium stuff installed<br />
LD_LIBRARY_PATH=$PREFIX/lib:${LD_LIBRARY_PATH}<br />
export LD_LIBRARY_PATH<br />
PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig:${PKG_CONFIG_PATH}<br />
export PKG_CONFIG_PATH<br />
</pre><br />
You should also put those lines in your <code>~/.bashrc</code> so you don't have to paste them into every terminal you open.<br />
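<br />
One detail of the lines above: if a variable is empty to begin with, the result ends in a colon (e.g. <code>/usr/local/lib:</code>). On glibc-based systems an empty entry in <code>LD_LIBRARY_PATH</code> is read as the current directory, which is usually not intended. The sketch below avoids the trailing colon; <code>prepend_path</code> is a hypothetical helper name, and the fallback to <code>/usr/local</code> is an assumption matching the default described above.<br />

```shell
# Hypothetical helper: print "DIR:OLD", or just "DIR" when OLD is empty,
# so that no empty entry ends up in the search path.
prepend_path() {
    printf '%s%s\n' "$1" "${2:+:$2}"
}

# Assumption: fall back to /usr/local when PREFIX has not been set.
PREFIX=${PREFIX:-/usr/local}
LD_LIBRARY_PATH=$(prepend_path "$PREFIX/lib" "${LD_LIBRARY_PATH:-}")
PKG_CONFIG_PATH=$(prepend_path "$PREFIX/lib/pkgconfig" "${PKG_CONFIG_PATH:-}")
export LD_LIBRARY_PATH PKG_CONFIG_PATH
```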
<br />
===Configure, build and install===<br />
The next step is to configure, build and install each of the modules you checked out, in this order:<br />
# <code>lttoolbox</code><br />
# <code>apertium</code><br />
# <code>apertium-lex-tools</code><br />
# the language pair (e.g. <code>apertium-eo-en</code>)<br />
<br />
<code>cd</code> to each of the directories before you run the commands shown below.<br />
<br />
If you didn't specify <code>$PREFIX</code> above, or don't know what this means, then do this in each directory:<br />
<pre><br />
./autogen.sh<br />
make<br />
</pre><br />
<br />
Then, for all programs '''apart from the language pair''', do:<br />
<br />
<pre><br />
sudo make install<br />
sudo ldconfig<br />
</pre><br />
<br />
If you specified a <code>$PREFIX</code> (e.g. to avoid installing as root), then do this in each directory:<br />
<pre><br />
./autogen.sh --prefix=$PREFIX<br />
make<br />
</pre><br />
<br />
Then, for all programs '''apart from the language pair''', do:<br />
<br />
<pre><br />
make install<br />
ldconfig -n $PREFIX/lib<br />
</pre><br />
<br />
<br />
(If you're on a Mac, you don't need to run ldconfig; don't worry if it fails.)<br />
<br />
<br />
If you had any trouble, see [[Installation troubleshooting]].<br />
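<br />
The whole sequence can be summarised as a dry run that only prints the commands for each module, in order. This is a sketch: <code>print_build_steps</code> is a hypothetical name, and the module list assumes the Esperanto-English pair used earlier on this page.<br />

```shell
# Hypothetical helper: print (not run) the build commands for each module
# in dependency order. Pass a prefix as $1 for a non-root install.
print_build_steps() {
    prefix=${1:-}
    for module in lttoolbox apertium apertium-lex-tools apertium-eo-en; do
        echo "cd $module"
        echo "./autogen.sh${prefix:+ --prefix=$prefix}"
        echo "make"
        # The language pair itself is built but not installed.
        if [ "$module" != "apertium-eo-en" ]; then
            if [ -n "$prefix" ]; then
                echo "make install"
                echo "ldconfig -n $prefix/lib"
            else
                echo "sudo make install"
                echo "sudo ldconfig"
            fi
        fi
        echo "cd .."
    done
}

# Usage: print the steps for the default (root) install.
print_build_steps
```

Call it as <code>print_build_steps "$HOME/local"</code> to see the steps for a prefix install, then copy the printed commands into a terminal.<br />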
<br />
===Test===<br />
Now test that it works. The command <code>apertium -l</code> should show a list of translation directions, of the form "from-to". Pick one, and do<br />
<pre><br />
echo 'This is a test sentence.' | apertium from-to<br />
</pre><br />
replacing from-to with the direction you want.<br />
<br />
You can see development translation modes if you do <code>ls modes</code> from the language pair directory. If you're in the language pair directory, and there is e.g. a file <code>modes/eo-en-tagger.mode</code>, you can run the translator up until the tagger by typing<br />
<pre><br />
echo 'This is a test sentence.' | apertium -d . eo-en-tagger<br />
</pre><br />
<br />
The <code>-d .</code> means "use the language data in this directory", and is useful if you don't want to type <code>make install</code> all the time.<br />
<br />
==For language pairs that depend on monolingual packages (apertium-XYZ) ==<br />
Many language pairs now have their monolingual data in separate packages (so that when several pairs have one language in common, we don't have to duplicate the data). If a pair depends on a monolingual package, the README should say so, and also the <code>autogen.sh</code> step should fail with a message like <pre>No package 'apertium-XYZ' found</pre> (where XYZ is some language code).<br />
<br />
Monolingual packages are typically kept in https://svn.code.sf.net/p/apertium/svn/languages/ (more info at [[Languages]]) and compiled like the other packages.<br />
If a monolingual package installs a dictionary, the language pair uses that installed dictionary when compiling. However, to avoid having to type <code>make install</code> in the monolingual directory after every change there, you can tell the language pair the exact location of the monolingual package, and it will use the dictionary from that directory instead of the installed one. This is recommended for developers.<br />
<br />
Imagine the language pair is called apertium-fie-bar, and it depends on the monolingual packages apertium-fie and apertium-bar. Assuming we have already installed lttoolbox, apertium and apertium-lex-tools as shown above, these would be the steps to download, configure and install apertium-fie-bar:<br />
<pre><br />
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-fie-bar<br />
svn checkout https://svn.code.sf.net/p/apertium/svn/languages/apertium-fie<br />
svn checkout https://svn.code.sf.net/p/apertium/svn/languages/apertium-bar<br />
<br />
cd apertium-fie<br />
./autogen.sh<br />
make<br />
cd ..<br />
<br />
cd apertium-bar<br />
./autogen.sh<br />
make<br />
cd ..<br />
<br />
cd apertium-fie-bar<br />
./autogen.sh --with-lang1=../apertium-fie --with-lang2=../apertium-bar<br />
make<br />
</pre><br />
The <code>--with-lang1</code> is used to give the path to where you checked out apertium-fie. If you do <code>./autogen.sh --help</code>, it will tell you the possible <code>--with-langN</code> options and what they correspond to.<br />
<br />
The process is similar for other language pairs that use monolingual packages.<br />
<br />
==For language pairs that use CG (vislcg3 / cg-proc / cg-comp) ==<br />
Many language pairs now use [[Constraint Grammar]] (e.g. Macedonian→English, Breton→French, Nynorsk-Bokmål, …). For these, you need <code>vislcg3</code> beforehand. See [[Vislcg3#Installing_VISL_CG3]] for installation (use <code>./cmake.sh -DCMAKE_INSTALL_PREFIX=<prefix></code> if you're installing to a prefix).<br />
<br />
Note that you have to have [http://site.icu-project.org/ ICU] installed beforehand (available through most GNU/Linux package managers, in Arch Linux as <code>icu</code>, in Debian/Ubuntu as <code>libicu-dev</code>, in Macports as <code>icu</code>).<br />
<br />
<br />
==For language pairs that use HFST (hfst-proc / hfst-lexc / hfst-twolc)==<br />
Many language pairs now use HFST (e.g. the Turkic and Saami ones). For these, you need <code>hfst</code> and <code>foma</code> beforehand. Follow the installation guides first for [[Foma]], then [[HFST]].<br />
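<br />
To see which of the tools named in the headings above are still missing, a small check like the following may help. It is a sketch: <code>missing_tools</code> is a hypothetical helper, and it only tests whether a command is on your <code>PATH</code>, not whether it works.<br />

```shell
# Hypothetical helper: print the name of every given command
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: list the core, CG and HFST tools that are not installed yet.
missing_tools lt-proc apertium cg-proc cg-comp hfst-proc hfst-lexc hfst-twolc
```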
<br />
<br />
==See also==<br />
* [[Installation]] – prerequisites and specific info for many different operating systems<br />
* [[Installation troubleshooting]]<br />
<br />
<br />
[[Category:Documentation]]<br />
[[Category:Installation]]<br />
[[Category:Documentation in English]]</div>
Grégoire

Talk:Apertium New Language Pair HOWTO, 2012-08-17T11:50:43Z (https://wiki.apertium.org/w/index.php?title=Talk:Apertium_New_Language_Pair_HOWTO&diff=35839)
<p>Grégoire: /* the '@' sign */ new section</p>
<hr />
<div>Possibly, at the very end, one could mention the possibility of typing <br />
<br />
<code>make sh-en.t1x.bin</code><br />
<br />
etc. instead of all those different commands, for the language pairs privileged enough to have fancy makefiles.<br />
--[[User:Unhammer|Unhammer]]<br />
<br />
==Files==<br />
apertium-sh-en.sh-en.dix<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
<sdef n="vblex"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</section><br />
</dictionary><br />
</pre><br />
<br />
apertium-sh-en.sh.dix<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCCCDDzZEFGHIJKLLjMNNjOPRSŠTUVZŽabcćčddžefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</sdefs><br />
<br />
<pardefs><br />
<pardef n="gramofon__n"><br />
<e><br />
<p><br />
<l/><br />
<r><s n="n"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>i</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>e</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pardef><br />
<pardef n="vid/eti__vblex"><br />
<e><br />
<p><br />
<l>im</l><br />
<r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>imo</l><br />
<r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pardef><br />
<br />
</pardefs><br />
<br />
<section id="main" type="standard"><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</section><br />
<br />
<br />
</dictionary><br />
</pre><br />
<br />
apertium-sh-en.sh-en.t1x<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<section-def-cats><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</section-def-attrs><br />
<br />
<section-def-vars><br />
<def-var n="number"/><br />
</section-def-vars><br />
<br />
<section-rules><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</section-rules><br />
<br />
</transfer><br />
</pre><br />
<br />
<br />
apertium-sh-en.en.dix<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCCCDDzZEFGHIJKLLjMNNjOPRSŠTUVZŽabcćčddžefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</sdefs><br />
<br />
<pardefs><br />
<pardef n="gramophone__n"><br />
<e><br />
<p><br />
<l/><br />
<r><s n="n"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>s</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pardef><br />
<pardef n="s/ee__vblex"><br />
<e><br />
<p><br />
<l>ee</l><br />
<r>ee<s n="vblex"/><s n="pri"/></r><br />
</p><br />
</e><br />
</pardef><br />
<pardef n="prsubj__prn"><br />
<e><br />
<p><br />
<l>I</l><br />
<r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r><br />
</p><br />
</e><br />
</pardef><br />
<br />
</pardefs><br />
<br />
<section id="main" type="standard"><br />
<e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</section><br />
<br />
<br />
</dictionary><br />
</pre><br />
<br />
<br />
buildall.sh:<br />
<pre><br />
lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
lt-comp rl apertium-sh-en.en.dix en-sh.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin <br />
<br />
apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
<br />
test.sh:<br />
<pre><br />
echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
</pre><br />
<br />
==Discussion==<br />
<br />
output:<br />
<br />
<pre><br />
echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
> lt-proc -g sh-en.autogen.bin<br />
#gramophone<br />
<br />
echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
> lt-proc -g sh-en.autogen.bin<br />
#prpers #see<br />
<br />
echo "vidim gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#prpers #see<br />
</pre><br />
<br />
::Can you show the output of each stage ? - [[User:Francis Tyers|Francis Tyers]] 09:20, 30 April 2009 (UTC)<br />
<br />
:::Ok, I see the error. Please change your <code>buildall.sh</code> to:<br />
<br />
<pre><br />
lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin <br />
<br />
apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
<br />
- [[User:Francis Tyers|Francis Tyers]] 09:50, 30 April 2009 (UTC)<br />
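<br />
The pattern behind the corrected script: for a direction SRC-TGT, the analyser is compiled left-to-right from the source-language dictionary, and the generator right-to-left from the target-language dictionary. A sketch that prints the two commands for one direction is shown below; <code>print_lt_comp</code> is a hypothetical helper, and the file naming follows the scripts on this page.<br />

```shell
# Hypothetical helper: print the lt-comp commands for the direction SRC-TGT
# of a pair whose files are named apertium-PAIR.LANG.dix.
print_lt_comp() {
    pair=$1
    src=$2
    tgt=$3
    # Analyser: compiled from the SOURCE language dictionary, left to right.
    echo "lt-comp lr apertium-$pair.$src.dix $src-$tgt.automorf.bin"
    # Generator: compiled from the TARGET language dictionary, right to left.
    echo "lt-comp rl apertium-$pair.$tgt.dix $src-$tgt.autogen.bin"
}

# Usage: both directions of the sh-en pair.
print_lt_comp sh-en sh en
print_lt_comp sh-en en sh
```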
<br />
<pre><br />
en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6/en-sh$ echo "vidim gramofoni" | lt-proc sh-en.automorf.bin | \<br />
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
> lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
[[User:Muki987|Muki987]] 10:57, 30 April 2009 (UTC)<br />
<br />
::Perfect! :) Is that an error in the HOWTO ? Or was it in your script? - [[User:Francis Tyers|Francis Tyers]] 11:50, 30 April 2009 (UTC)<br />
<br />
::: It was an error in the build script. IMHO it would be best to add the complete sources, including the build script, to the web page to help beginners. It is almost impossible to get it working with the present description, since building is, as far as I can see, not documented. Thanks for your help. [[User:Muki987|Muki987]] 12:41, 30 April 2009 (UTC)<br />
<br />
:::: Every stage is documented. It might be worth giving a link to a "finished piece", but part of the aim of the HOWTO is to get people starting to make the files themselves. Even just typing out something helps in my experience. And I think all of the build steps are given on the page, just not in order. What might be a good idea is to put in all the steps in one place in the end? - [[User:Francis Tyers|Francis Tyers]] 13:08, 30 April 2009 (UTC)<br />
<br />
::: For the sake of precision, I checked:<br />
*necessary to eliminate line <clip pos="1" side="tl" part="lem"/> from t1x file<br />
* necessary to add <br />
<pre><br />
<e><br />
<p><br />
<l>e</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pre><br />
to sh-en.sh.dix file<br />
* not necessary to modify gramofon__n to gramophone__n in sh-en.en.dix<br />
<br />
[[User:Muki987|Muki987]] 13:09, 30 April 2009 (UTC)<br />
<br />
::::The build thing was an error in the page [http://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=12290&oldid=11933 here]. - [[User:Francis Tyers|Francis Tyers]] 13:10, 30 April 2009 (UTC)<br />
<br />
::: Yes, building was the third error in my original setup (in fact, a build script was not included, but as we see here, it is necessary). [[User:Muki987|Muki987]] 14:13, 30 April 2009 (UTC)<br />
<br />
Regarding the second problem, there is a paragraph of text that mentions it:<br />
<br />
<blockquote><br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too, no need to add the case information for now though, we just add it as another option for plural. So, just copy the 'e' block for 'i' and change the 'i' to 'e' there. <br />
</blockquote><br />
<br />
Although perhaps I should put in the XML. As far as I can tell the transfer rule is fine, and the error was caused by a typo, no? - [[User:Francis Tyers|Francis Tyers]] 15:03, 30 April 2009 (UTC)<br />
<br />
:Sloppiness on my side. If all the data and the build script are available to the user, these kinds of errors disappear. [[User:Muki987|Muki987]] 19:03, 30 April 2009 (UTC)<br />
<br />
==Documentation ideas==<br />
:::How about if I just put a pointer to the talk page for copies of the "complete" files? It's just that I think a step-by-step approach really can help people get a grip on the format... or I wouldn't have done it like this :s - [[User:Francis Tyers|Francis Tyers]] 21:08, 30 April 2009 (UTC)<br />
::::That's fine. Also a bit more explanation about what <clip> does would be good. What do the attributes after <clip (pos="1" side="tl" part="whatever") mean? I still do not understand that 100%. [[User:Muki987|Muki987]] 20:39, 1 May 2009 (UTC)<br />
<br />
Yes, I wrote the HOWTO when I didn't really understand either ;) ... Basically, you define your "attributes" in <code>def-attr</code>, and <code><clip ...</code> "pulls" these out of the lexical unit. So for example:<br />
<br />
<pre><br />
<section-def-attrs><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
<attr-item tags="np"/><br />
<attr-item tags="np.top"/><br />
<attr-item tags="np.al"/><br />
</def-attr><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
<def-attr n="gen"><br />
<attr-item tags="f"/><br />
<attr-item tags="m"/><br />
<attr-item tags="nt"/><br />
</def-attr><br />
</section-def-attrs><br />
<br />
<br />
... <br />
<br />
<br />
<clip pos="1" side="tl" part="a_nom"/> <!-- can be 'n', 'np', 'np.al' or 'np.top' , if input stream is ^foo<np>$<br />
it will be <np> and if input stream is ^foo<np><top><m><sg>$ it will be <np><top> --><br />
<clip pos="1" side="tl" part="gen"/> <!-- can be 'f', 'm' or 'nt', if input stream is ^foo<n><m><sg>$ it will be <m> --><br />
<clip pos="1" side="tl" part="nbr"/> <!-- can be 'sg' or 'pl', if input stream is ^foo<n><sg>$ it will be <sg> --><br />
<br />
<br />
</pre><br />
<br />
It is basically for extracting _parts_ of the information out of the lexical unit. - [[User:Francis Tyers|Francis Tyers]] 21:06, 1 May 2009 (UTC)<br />
<br />
:As usual. One can not understand everything immediately. I think, the description is very helpful, at least it helped me a lot. Without it one is completely lost. No practice, no sense, at least for me. Is pos always "1" and side always "tl"? if yes, what is their purpose? [[User:Muki987|Muki987]] 09:49, 2 May 2009 (UTC)<br />
<br />
::side can be "tl" (target language) or "sl" (source language), pos = position, so if you have a rule which matches 4 lexical units, pos="1" will be the first, pos="2" will be the second etc. - [[User:Francis Tyers|Francis Tyers]] 10:57, 2 May 2009 (UTC)<br />
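<br />
As a rough mental model of what <code>part</code> selects, the sketch below pulls the first tag belonging to an attribute out of a lexical unit on the command line. It is only an illustration using <code>grep</code>, not how <code>apertium-transfer</code> is implemented: <code>clip_part</code> is a hypothetical name, the attribute's alternatives are passed by hand, and multi-tag items such as <code>np.top</code> are not handled.<br />

```shell
# Hypothetical illustration: print the first tag in the lexical unit $1
# that matches one of the alternatives in $2 (written as "a|b|c").
clip_part() {
    printf '%s\n' "$1" | grep -oE "<($2)>" | head -n 1
}

clip_part '^foo<n><m><sg>$' 'sg|pl'    # nbr: prints <sg>
clip_part '^foo<n><m><sg>$' 'f|m|nt'   # gen: prints <m>
```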
<br />
== faking the tagger (gawk vs wget) ==<br />
<br />
Little suggestion: the gawk script in the Transfer part looks confusing, and newbies won't ever use it again once they have a tagger. How about instead saying<br />
<br />
<br />
We don't have a POS tagger yet, but we can yank one from another language pair for now. It'll perform badly but we can make our own tagger later. First do<br />
<br />
$ wget http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-en-es/en-es.prob -O sh-en.prob<br />
<br />
(alternatively, follow that link and save the file as "sh-en.prob" in your working folder). You can try the tagger with the following command:<br />
<br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | apertium-tagger -p sh-en.prob<br />
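<br />
For reference, the gawk script stands in for a tagger by keeping only the first analysis of each lexical unit. A shorter filter that does roughly the same with <code>sed</code> is sketched below. The name <code>first_analysis_only</code> is hypothetical, and the sketch assumes the stream contains no escaped <code>/</code>, <code>^</code> or <code>$</code> characters.<br />

```shell
# Hypothetical stand-in for the gawk filter: rewrite every
# ^surface/analysis1/analysis2/...$ as ^analysis1$ and leave
# the text between lexical units untouched.
first_analysis_only() {
    sed -E 's|\^[^/$]*/([^/$]*)(/[^$]*)?\$|^\1$|g'
}

# Usage with analyser-style input typed by hand:
echo '^vidim/videti<vblex><pri><p1><sg>$ ^gramofone/gramofon<n><pl>$' | first_analysis_only
```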
<br />
== The 'number' var is not used ==<br />
<br />
The global variable 'number' defined in the transfer rule file doesn't seem to be useful for this tutorial.<br />
--[[User:Grégoire|Grégoire]] 11:48, 17 August 2012 (UTC)<br />
<br />
== the '@' sign ==<br />
<br />
The '@' sign in the output is never explained. And according to this page, [[Apertium stream format]], it means that there is an untranslated lemma. But that doesn't seem to be the case in any example... --[[User:Grégoire|Grégoire]] 11:50, 17 August 2012 (UTC)</div>
Grégoire

Talk:Apertium New Language Pair HOWTO, 2012-08-17T11:48:01Z (https://wiki.apertium.org/w/index.php?title=Talk:Apertium_New_Language_Pair_HOWTO&diff=35838)
<p>Grégoire: /* The 'number' var is not used */ new section</p>
<hr />
<div>Possibly, at the very end, one could mention the possibility of typing <br />
<br />
<code>make sh-en.t1x.bin</code><br />
<br />
etc. instead of all those different commands, for the language pairs privileged enough to have fancy makefiles.<br />
--[[User:Unhammer|Unhammer]]<br />
<br />
==Files==<br />
apertium-sh-en.sh-en.dix<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
<sdef n="vblex"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</section><br />
</dictionary><br />
</pre><br />
<br />
apertium-sh-en.sh.dix<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCCCDDzZEFGHIJKLLjMNNjOPRSŠTUVZŽabcćčddžefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</sdefs><br />
<br />
<pardefs><br />
<pardef n="gramofon__n"><br />
<e><br />
<p><br />
<l/><br />
<r><s n="n"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>i</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>e</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pardef><br />
<pardef n="vid/eti__vblex"><br />
<e><br />
<p><br />
<l>im</l><br />
<r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>imo</l><br />
<r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pardef><br />
<br />
</pardefs><br />
<br />
<section id="main" type="standard"><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</section><br />
<br />
<br />
</dictionary><br />
</pre><br />
<br />
apertium-sh-en.sh-en.t1x<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<section-def-cats><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</section-def-attrs><br />
<br />
<section-def-vars><br />
<def-var n="number"/><br />
</section-def-vars><br />
<br />
<section-rules><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</section-rules><br />
<br />
</transfer><br />
</pre><br />
<br />
<br />
apertium-sh-en.en.dix<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCCCDDzZEFGHIJKLLjMNNjOPRSŠTUVZŽabcćčddžefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</sdefs><br />
<br />
<pardefs><br />
<pardef n="gramophone__n"><br />
<e><br />
<p><br />
<l/><br />
<r><s n="n"/><s n="sg"/></r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>s</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pardef><br />
<pardef n="s/ee__vblex"><br />
<e><br />
<p><br />
<l>ee</l><br />
<r>ee<s n="vblex"/><s n="pri"/></r><br />
</p><br />
</e><br />
</pardef><br />
<pardef n="prsubj__prn"><br />
<e><br />
<p><br />
<l>I</l><br />
<r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r><br />
</p><br />
</e><br />
</pardef><br />
<br />
</pardefs><br />
<br />
<section id="main" type="standard"><br />
<e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</section><br />
<br />
<br />
</dictionary><br />
</pre><br />
<br />
<br />
buildall.sh:<br />
<pre><br />
lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
lt-comp rl apertium-sh-en.en.dix en-sh.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin <br />
<br />
apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
<br />
test.sh:<br />
<pre><br />
echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
</pre><br />
<br />
==Discussion==<br />
<br />
output:<br />
<br />
<pre><br />
echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
> lt-proc -g sh-en.autogen.bin<br />
#gramophone<br />
<br />
echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
> lt-proc -g sh-en.autogen.bin<br />
#prpers #see<br />
<br />
echo "vidim gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#prpers #see<br />
</pre><br />
<br />
::Can you show the output of each stage ? - [[User:Francis Tyers|Francis Tyers]] 09:20, 30 April 2009 (UTC)<br />
<br />
:::Ok, I see the error. Please change your <code>buildall.sh</code> to:<br />
<br />
<pre><br />
lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin <br />
<br />
apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
<br />
- [[User:Francis Tyers|Francis Tyers]] 09:50, 30 April 2009 (UTC)<br />
<br />
<pre><br />
en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6/en-sh$ echo "vidim gramofoni" | lt-proc sh-en.automorf.bin | \<br />
> gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
> apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
> lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
[[User:Muki987|Muki987]] 10:57, 30 April 2009 (UTC)<br />
<br />
::Perfect! :) Is that an error in the HOWTO ? Or was it in your script? - [[User:Francis Tyers|Francis Tyers]] 11:50, 30 April 2009 (UTC)<br />
<br />
::: It was the error in the build script. IMHO the best thing would be to add the complete sources, including the build script, to the web page, to help beginners. It is almost impossible to get it working with the present description, since building is, as far as I can see, not documented. Thanks for your help. [[User:Muki987|Muki987]] 12:41, 30 April 2009 (UTC)<br />
<br />
:::: Every stage is documented. It might be worth giving a link to a "finished piece", but part of the aim of the HOWTO is to get people starting to make the files themselves. Even just typing out something helps in my experience. And I think all of the build steps are given on the page, just not in order. What might be a good idea is to put in all the steps in one place in the end? - [[User:Francis Tyers|Francis Tyers]] 13:08, 30 April 2009 (UTC)<br />
<br />
::: For the sake of precision I checked:<br />
*necessary to eliminate line <clip pos="1" side="tl" part="lem"/> from t1x file<br />
* necessary to add <br />
<pre><br />
<e><br />
<p><br />
<l>e</l><br />
<r><s n="n"/><s n="pl"/></r><br />
</p><br />
</e><br />
</pre><br />
to sh-en.sh.dix file<br />
* not necessary to modify gramofon__n to gramophone__n in sh-en.en.dix<br />
<br />
[[User:Muki987|Muki987]] 13:09, 30 April 2009 (UTC)<br />
<br />
::::The build thing was an error in the page [http://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=12290&oldid=11933 here]. - [[User:Francis Tyers|Francis Tyers]] 13:10, 30 April 2009 (UTC)<br />
<br />
::: Yes, building was the third error in my original setup (in fact, a build script was not included, but as we see here, it is necessary). [[User:Muki987|Muki987]] 14:13, 30 April 2009 (UTC)<br />
<br />
Regarding the second problem, there is a paragraph of text that mentions it:<br />
<br />
<blockquote><br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too, no need to add the case information for now though, we just add it as another option for plural. So, just copy the 'e' block for 'i' and change the 'i' to 'e' there. <br />
</blockquote><br />
<br />
Although perhaps I should put in the XML. As far as I can tell the transfer rule is fine, and the error was caused by a typo no? - [[User:Francis Tyers|Francis Tyers]] 15:03, 30 April 2009 (UTC)<br />
<br />
:sloppiness on my side. If all data+build script is available for the user, this kind of errors disappear. [[User:Muki987|Muki987]] 19:03, 30 April 2009 (UTC)<br />
<br />
==Documentation ideas==<br />
:::How about if I just put a pointer to the talk page for copies of the "complete" files? It's just that I think a step-by-step approach really can help people get a grip on the format... or I wouldn't have done it like this :s - [[User:Francis Tyers|Francis Tyers]] 21:08, 30 April 2009 (UTC)<br />
::::That's fine. Also, a bit more explanation of what <clip> does would be good. What do the attributes after <clip (pos="1" side="tl" part="whatever") mean? I still do not understand that 100%. [[User:Muki987|Muki987]] 20:39, 1 May 2009 (UTC)<br />
<br />
Yes, I wrote the HOWTO when I didn't really understand either ;) ... Basically you define your "attributes" in <code>def-attr</code> and <code><clip ...</code> "pulls" these out of the lexical unit. So, for example:<br />
<br />
<pre><br />
<section-def-attrs><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
<attr-item tags="np"/><br />
<attr-item tags="np.top"/><br />
<attr-item tags="np.al"/><br />
</def-attr> <br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr> <br />
<def-attr n="gen"><br />
<attr-item tags="f"/><br />
<attr-item tags="m"/><br />
<attr-item tags="nt"/><br />
</def-attr> <br />
</section-def-attrs><br />
<br />
<br />
... <br />
<br />
<br />
<clip pos="1" side="tl" part="a_nom"/> <!-- can be 'n', 'np', 'np.al' or 'np.top' , if input stream is ^foo<np>$<br />
it will be <np> and if input stream is ^foo<np><top><m><sg>$ it will be <np><top> --><br />
<clip pos="1" side="tl" part="gen"/> <!-- can be 'f', 'm' or 'nt', if input stream is ^foo<n><m><sg>$ it will be <m> --><br />
<clip pos="1" side="tl" part="nbr"/> <!-- can be 'sg' or 'pl', if input stream is ^foo<n><sg>$ it will be <sg> --><br />
<br />
<br />
</pre><br />
<br />
It is basically for extracting _parts_ of the information out of the lexical unit. - [[User:Francis Tyers|Francis Tyers]] 21:06, 1 May 2009 (UTC)<br />
<br />
:As usual. One can not understand everything immediately. I think, the description is very helpful, at least it helped me a lot. Without it one is completely lost. No practice, no sense, at least for me. Is pos always "1" and side always "tl"? if yes, what is their purpose? [[User:Muki987|Muki987]] 09:49, 2 May 2009 (UTC)<br />
<br />
::side can be "tl" (target language) or "sl" (source language), pos = position, so if you have a rule which matches 4 lexical units, pos="1" will be the first, pos="2" will be the second etc. - [[User:Francis Tyers|Francis Tyers]] 10:57, 2 May 2009 (UTC)<br />
<br />
== faking the tagger (gawk vs wget) ==<br />
<br />
Little suggestion: the gawk script in the Transfer part looks confusing, and newbies won't ever use it again once they have a tagger. How about instead saying<br />
<br />
<br />
We don't have a POS tagger yet, but we can yank one from another language pair for now. It'll perform badly but we can make our own tagger later. First do<br />
<br />
$ wget http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-en-es/en-es.prob -O sh-en.prob<br />
<br />
(alternatively, follow that link and save the file as "sh-en.prob" in your working folder). You can try the tagger with the following command:<br />
<br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | apertium-tagger -p sh-en.prob<br />
<br />
== The 'number' var is not used ==<br />
<br />
The global variable 'number' defined in the transfer rule file doesn't seem to be useful for this tutorial.<br />
--[[User:Grégoire|Grégoire]] 11:48, 17 August 2012 (UTC)</div>Grégoirehttps://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=35837Apertium New Language Pair HOWTO2012-08-17T10:47:23Z<p>Grégoire: Fixing a typo in some file names: the *generation* dictionary for lang xx should be named yy-xx and not xx-yy</p>
<hr />
<div>{{TOCD}}<br />
Apertium New Language Pair HOWTO<br />
<br />
This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.<br />
<br />
It does not assume any knowledge of linguistics, or machine translation above the level of being able to distinguish nouns from verbs (and prepositions etc.)<br />
<br />
==Introduction==<br />
<br />
Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).<br />
<br />
For a more detailed introduction into how it all works, there are some excellent papers on the [[Publications]] page.<br />
<br />
==You will need==<br />
<br />
* [[lttoolbox]] (>= 3.0.0)<br />
* libxml utils (xmllint etc.)<br />
* apertium (>= 3.0.0)<br />
* a text editor (or a specialised XML editor if you prefer)<br />
<br />
This document does not describe how to install these packages; for more information, please see the documentation section of the Apertium website.<br />
<br />
==What does a language pair consist of?==<br />
<br />
Apertium is a shallow-transfer machine translation system, so it basically works on dictionaries and shallow transfer rules. Shallow transfer differs from deep transfer in that it doesn't do full syntactic parsing: the rules are typically operations on groups of lexical units, rather than operations on parse trees. At a basic level, there are three main dictionaries:<br />
# The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: <code>apertium-sh-en.sh.dix</code><br />
# The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: <code>apertium-sh-en.en.dix</code><br />
# Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: <code>apertium-sh-en.sh-en.dix</code><br />
<br />
In a translation pair, either language can be the source or the target of translation; these are relative terms.<br />
<br />
There are also two files for transfer rules. These are the rules that govern how words are re-ordered in sentences, e.g. ''chat noir'' → ''cat black'' → ''black cat''. They also govern agreement of gender, number etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:<br />
<br />
* language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: <code>apertium-sh-en.sh-en.t1x</code><br />
* language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: <code>apertium-sh-en.en-sh.t1x</code><br />
<br />
Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.<br />
<br />
==Language pair==<br />
<br />
As you may have gathered from the file names, this HOWTO uses the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, since the system works better for more closely related languages, but that shouldn't present a problem for the simple examples given here.<br />
<br />
==A brief note on terms==<br />
<br />
There are a number of terms that will need to be understood before we continue.<br />
<br />
The first is ''lemma''. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word ''cats'' is ''cat''. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of ''to'', e.g. the lemma of ''was'' would be ''be''.<br />
<br />
The second is ''symbol''. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:<br />
<br />
* <code><n></code>; for noun.<br />
* <code><pl></code>; for plural.<br />
<br />
Other examples of symbols are <sg> for singular, <p1> for first person and <pri> for present indicative. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver comes from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <nowiki><s></nowiki> tags.<br />
<br />
The third word is ''paradigm''. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflect. In the morphological dictionary, lemmas (see above) are linked to paradigms that allow us to describe how a given lemma inflects without having to write out all of the endings.<br />
<br />
An example of the utility of this is, if we wanted to store the two adjectives ''happy'' and ''lazy'', instead of storing two lots of the same thing:<br />
<br />
* happy, happ (y, ier, iest)<br />
* lazy, laz (y, ier, iest)<br />
<br />
We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy", etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.<br />
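<br />
As a rough illustration of why paradigms save work, here is a minimal Python sketch. It is not how Apertium stores anything internally; the names <code>HAPPY_PARADIGM</code>, <code>LEXICON</code> and <code>inflect</code> are made up for this example.<br />
<br />
```python
# Hypothetical model of a paradigm: a list of endings shared by many words.
HAPPY_PARADIGM = ["y", "ier", "iest"]

# Each word only needs its stem; the endings come from the shared paradigm.
LEXICON = {"happy": "happ", "lazy": "laz"}


def inflect(stem, endings):
    """Return every surface form obtained by attaching each ending to the stem."""
    return [stem + ending for ending in endings]


forms = {lemma: inflect(stem, HAPPY_PARADIGM) for lemma, stem in LEXICON.items()}
print(forms["happy"])  # ['happy', 'happier', 'happiest']
print(forms["lazy"])   # ['lazy', 'lazier', 'laziest']
```
<br />
Adding another adjective of the same type is then a single new line in the lexicon rather than a new list of endings.<br />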
<br />
==Getting started==<br />
<!-- Ur yezh indezeuropek eo ar brezhoneg --><br />
<br />
===Monolingual dictionaries===<br />
{{see-also|List of dictionaries|Incubator}}<br />
Let's start by making our first source language dictionary, Serbo-Croatian in our example. As mentioned above, this file will be called <code>apertium-sh-en.sh.dix</code>. The dictionary is an XML file. Fire up your text editor and type the following:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<br />
</dictionary><br />
</pre><br />
So far, the file only declares that we want to start a dictionary. For it to be useful, we need to add some more entries. The first is an alphabet, which defines the set of letters that may be used in the dictionary. For Serbo-Croatian it will look something like the following, containing all the letters of the Serbo-Croatian alphabet:<br />
<pre><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
</pre><br />
<br />
Place the alphabet below the <dictionary> tag.<br />
<br />
Next we need to define some symbols. Let's start off with the simple stuff, noun (n) in singular (sg) and plural (pl).<br />
<pre><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
</pre><br />
The symbol names do not have to be so short; in fact, they could just be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.<br />
<br />
Unfortunately, it isn't quite so simple. Nouns in Serbo-Croatian inflect for more than just number, they are also inflected for case, and have a gender. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).<br />
<br />
The next thing is to define a section for the paradigms,<br />
<pre><br />
<pardefs><br />
<br />
</pardefs><br />
</pre><br />
and a dictionary section:<br />
<pre><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</pre><br />
There are two types of sections, the first is a standard section, that contains words, enclitics, etc. The second type is an [[inconditional section]] which typically contains punctuation, and so forth. We don't have an inconditional section here.<br />
<br />
So, our file should now look something like:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<pardefs><br />
<br />
</pardefs><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').<br />
<br />
The first thing we need to do, as we have no prior paradigms, is to define a paradigm.<br />
<br />
Remember, we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.<br />
<br />
This may seem like a rather verbose way of describing it, but there are reasons for this and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,<br />
<br />
* e, is for entry.<br />
* p, is for pair.<br />
* l, is for left.<br />
* r, is for right.<br />
<br />
Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:<br />
<pre><br />
* gramofoni (left to right) gramofon<n><pl> (analysis)<br />
* gramofon<n><pl> (right to left) gramofoni (generation)<br />
</pre><br />
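<br />
To make the two compilation directions concrete, here is a small Python sketch. It is only a toy model: the real lt-comp builds finite-state transducers, not lookup tables, and the function name <code>compile_pairs</code> is invented for this example.<br />
<br />
```python
# Entries of the gramofon__n paradigm as (left, right) pairs:
# left is the surface ending, right is the string of analysis tags.
PARADIGM = [("", "<n><sg>"), ("i", "<n><pl>")]
STEM = "gramofon"


def compile_pairs(stem, paradigm, direction):
    """Build a lookup table: 'lr' maps surface -> analysis, 'rl' the reverse."""
    table = {}
    for left, right in paradigm:
        surface, analysis = stem + left, stem + right
        if direction == "lr":
            table[surface] = analysis
        elif direction == "rl":
            table[analysis] = surface
        else:
            raise ValueError("direction must be 'lr' or 'rl'")
    return table


analyser = compile_pairs(STEM, PARADIGM, "lr")
generator = compile_pairs(STEM, PARADIGM, "rl")
print(analyser["gramofoni"])         # gramofon<n><pl>
print(generator["gramofon<n><pl>"])  # gramofoni
```
<br />
The same list of pairs serves both purposes; only the direction in which it is read changes.<br />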
Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.<br />
<br />
The entry to put in the <section> will look like:<br />
<pre><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
</pre><br />
A quick run down on the abbreviations:<br />
<br />
* lm, is for lemma.<br />
* i, is for identity (the left and the right are the same).<br />
* par, is for paradigm.<br />
<br />
This entry states the lemma of the word, gramofon, the root, gramofon and the paradigm with which it inflects gramofon__n. The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which suffixes are added. This will become clearer later when we show an entry where the two are different.<br />
<br />
We're now ready to test the dictionary. Save it as <code>apertium-sh-en.sh.dix</code>, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc). If you are new to Cygwin, note that you need to save the dictionary file inside your home folder (for example C:\Apertium\home\Username\filename_of_dictionary); otherwise you will not be able to compile it.<br />
<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
</pre><br />
Should produce the output:<br />
<pre><br />
main@standard 12 12<br />
</pre><br />
As we are compiling it left to right, we're producing an analyser. Let's make a generator too.<br />
<pre><br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
</pre><br />
At this stage, the command should produce the same output.<br />
<br />
We can now test these. Run lt-proc on the analyser.<br />
<pre><br />
$ lt-proc sh-en.automorf.bin<br />
</pre><br />
Now try it out, type in gramofoni (gramophones), and see the output:<br />
<pre><br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
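<br />
If the output format looks cryptic, the following Python sketch may help to read it: the unit starts with '^', ends with '$', and the surface form and its analyses are separated by '/'. The function name <code>parse_lexical_unit</code> is made up for this example, and escaped characters in the stream are deliberately ignored.<br />
<br />
```python
import re


def parse_lexical_unit(unit):
    """Split '^surface/analysis1/analysis2$' into the surface form and its analyses."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    surface, *analyses = unit[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]          # text before the first tag
        tags = re.findall(r"<([^>]+)>", analysis)  # every <tag> in order
        parsed.append((lemma, tags))
    return surface, parsed


surface, analyses = parse_lexical_unit("^gramofoni/gramofon<n><pl>$")
print(surface)   # gramofoni
print(analyses)  # [('gramofon', ['n', 'pl'])]
```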
Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. What if you want to use the more correct word 'record player'? Well, we'll explain how to do that later.<br />
<br />
You should now have two files in the directory:<br />
<br />
* apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and<br />
* apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.<br />
<br />
===Bilingual dictionary===<br />
<br />
So we now have two morphological dictionaries, next thing to make is the bilingual dictionary. This describes mappings between words. All dictionaries use the same format (which is specified in the DTD, dix.dtd).<br />
<br />
Create a new file, <code>apertium-sh-en.sh-en.dix</code> and add the basic skeleton:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we need to add an entry to translate between the two words. Something like:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
</pre><br />
Because there are a lot of these entries, they're typically written on one line to facilitate easier reading of the file. Again with the 'l' and 'r' right? Well, we compile it left to right to produce the Serbo-Croatian → English dictionary, and right to left to produce the English → Serbo-Croatian dictionary.<br />
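<br />
A toy Python model of the bilingual lookup, assuming (as a simplification) that the entry replaces the lemma and the tags it names, while any remaining tags are carried over unchanged. The names <code>BIDIX</code> and <code>translate</code> are hypothetical, and returning <code>None</code> for a missing entry is a choice made for this sketch only.<br />
<br />
```python
# Bilingual entries as (left, right) pairs: lemma plus the tags named in the entry.
BIDIX = [("gramofon<n>", "gramophone<n>")]


def translate(lexical_form, direction="lr"):
    """Swap the matching prefix of a lexical form; the remaining tags are kept."""
    for left, right in BIDIX:
        source, target = (left, right) if direction == "lr" else (right, left)
        if lexical_form.startswith(source):
            return target + lexical_form[len(source):]
    return None  # no entry in this toy dictionary


print(translate("gramofon<n><pl>"))          # gramophone<n><pl>
print(translate("gramophone<n><sg>", "rl"))  # gramofon<n><sg>
```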
<br />
So, once this is done, run the following commands:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
$ lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
</pre><br />
These commands generate the morphological analysers (automorf), the morphological generators (autogen) and the word lookups (autobil); the "bil" stands for "bilingual".<br />
<br />
===Transfer rules===<br />
<br />
So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rule files have their own DTD (transfer.dtd) which can be found in the Apertium package. If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages. For example the one described below would be useful for any null-subject language.<br />
<br />
Start out like all the others with a basic skeleton (<code>apertium-sh-en.sh-en.t1x</code>) :<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<br />
</transfer><br />
</pre><br />
At the moment, because we're ignoring case, we just need to make a rule that takes the grammatical symbols input and outputs them again.<br />
<br />
We first need to define categories and attributes. Both allow us to group grammatical symbols. Categories group symbols for the purposes of matching (for example, 'n.*' is all nouns). Attributes group a set of symbols that can be chosen from (for example, 'sg' and 'pl' may be grouped as an attribute 'number').<br />
<br />
Let's add the necessary sections:<br />
<pre><br />
<section-def-cats><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<br />
</section-def-attrs><br />
</pre><br />
As we're only inflecting nouns in singular and plural, we need to add a category for nouns and an attribute for number. Something like the following will suffice:<br />
<br />
Into section-def-cats add:<br />
<pre><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
</pre><br />
This catches all nouns (lemmas followed by <n> then anything) and refers to them as "nom" (we'll see how that's used later).<br />
<br />
Into the section section-def-attrs, add:<br />
<pre><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
</pre><br />
and then<br />
<pre><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
</pre><br />
The first defines the attribute nbr (number), which can be either singular (sg) or plural (pl).<br />
<br />
The second defines the attribute a_nom (attribute noun).<br />
<br />
Next we need to add a section for global variables:<br />
<pre><br />
<section-def-vars><br />
<br />
</section-def-vars><br />
</pre><br />
These variables are used to store or transfer attributes between rules. We need only one for now,<br />
<pre><br />
<def-var n="number"/><br />
</pre><br />
Finally, we need to add a rule, to take in the noun and then output it in the correct form. We'll need a rules section...<br />
<pre><br />
<section-rules><br />
<br />
</section-rules><br />
</pre><br />
Changing the pace from the previous examples, I'll just paste this rule, then go through it, rather than the other way round.<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
The first tag is obvious: it defines a rule. The second tag, pattern, basically says: "apply this rule if this pattern is found". In this example the pattern consists of a single noun (defined by the category item nom). Note that patterns are matched longest-match first. So, say you have three rules: the first catches "<prn><vblex><n>", the second catches "<prn><vblex>" and the third catches "<n>". For the input "<prn><vblex><n>", the pattern matched, and the rule executed, would be the first one.<br />
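<br />
The longest-match behaviour can be sketched in a few lines of Python. This is a simplified model under stated assumptions: words are reduced to a single category name each, the rule names are invented, and unmatched words are simply skipped.<br />
<br />
```python
# Hypothetical rules: each pattern is a sequence of category names.
RULES = {
    "rule1": ["prn", "vblex", "n"],
    "rule2": ["prn", "vblex"],
    "rule3": ["n"],
}


def apply_longest_match(categories):
    """Walk left to right, choosing the longest matching pattern at each position."""
    applied, pos = [], 0
    while pos < len(categories):
        matching = [
            (name, pattern)
            for name, pattern in RULES.items()
            if categories[pos:pos + len(pattern)] == pattern
        ]
        if matching:
            # max() keeps the earliest rule when two patterns have equal length.
            name, pattern = max(matching, key=lambda item: len(item[1]))
            applied.append(name)
            pos += len(pattern)
        else:
            pos += 1  # no rule matches here; move past this word
    return applied


print(apply_longest_match(["prn", "vblex", "n"]))  # ['rule1']
print(apply_longest_match(["prn", "vblex"]))       # ['rule2']
```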
<br />
For each pattern, there is an associated action, which produces an associated output, out. The output, is a lexical unit (lu).<br />
<br />
The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item.<br />
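<br />
One way to picture what a clip does is the following Python sketch. It assumes that an attribute value is found by searching the tags of the lexical unit for one of the items listed in the attribute, preferring the longest item that matches; that preference is an assumption of this sketch, not a statement about the real implementation. The names <code>ATTRS</code> and <code>clip</code> are only illustrative.<br />
<br />
```python
# Attributes as in section-def-attrs; each item is a dot-separated tag sequence.
ATTRS = {"a_nom": ["n"], "nbr": ["sg", "pl"]}


def clip(lexical_form, part):
    """Return the lemma for part='lem', else the matching value of the attribute."""
    lemma, _, rest = lexical_form.partition("<")
    if part == "lem":
        return lemma
    tags = "<" + rest if rest else ""
    # Turn an item such as 'np.top' into the tag string '<np><top>'.
    candidates = ["".join(f"<{tag}>" for tag in item.split(".")) for item in ATTRS[part]]
    matches = [candidate for candidate in candidates if candidate in tags]
    return max(matches, key=len, default="")  # assumption: longest item wins


unit = "gramophone<n><pl>"
print(clip(unit, "lem"))    # gramophone
print(clip(unit, "a_nom"))  # <n>
print(clip(unit, "nbr"))    # <pl>
```
<br />
Concatenating the three results gives back <code>gramophone<n><pl></code>, which is what the rule above writes inside its <lu>.<br />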
<br />
Let's compile it and test it. Transfer rules are compiled with:<br />
<pre><br />
$ apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
Which will generate a <code>sh-en.t1x.bin</code> file.<br />
<br />
Now we're ready to test our machine translation system. There is one crucial part missing, the part-of-speech (PoS) tagger, but that will be explained shortly. In the meantime we can test it as is:<br />
<br />
First, let's analyse a word, gramofoni:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin <br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, normally here the POS tagger would choose the right version based on the part of speech, but we don't have a POS tagger yet, so we can use this little gawk script (thanks to Sergio) that will just output the first item retrieved.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}'<br />
^gramofon<n><pl>$<br />
</pre><br />
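<br />
For readers who find the gawk one-liner hard to follow, here is a rough Python equivalent of the idea: for every lexical unit, drop the surface form and keep only the first analysis. It is a sketch, not a drop-in replacement; the name <code>first_analysis</code> is made up, and escaped characters in the stream are not handled.<br />
<br />
```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
UNIT = re.compile(r"\^[^/^$]*/([^/^$]*)(?:/[^^$]*)?\$")


def first_analysis(stream):
    """Keep only the first analysis of each lexical unit; other text is untouched."""
    return UNIT.sub(lambda match: "^" + match.group(1) + "$", stream)


print(first_analysis("^gramofoni/gramofon<n><pl>$"))  # ^gramofon<n><pl>$
```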
Now let's process that with the transfer rule:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
</pre><br />
It will output:<br />
<pre><br />
^gramophone<n><pl>$^@<br />
</pre><br />
* 'gramophone' is the target language (side="tl") lemma (lem) at position 1 (pos="1").<br />
* '<n>' is the target language a_nom at position 1.<br />
* '<pl>' is the target language attribute of number (nbr) at position 1.<br />
<br />
Try commenting out one of these clip statements, recompiling and seeing what happens.<br />
<br />
So, now we have the output from the transfer, the only thing that remains is to generate the target-language inflected forms. For this, we use lt-proc, but in generation (-g), not analysis mode.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
<br />
gramophones\@<br />
</pre><br />
And c'est ça. You now have a machine translation system that translates a Serbo-Croatian noun into an English noun. Obviously this isn't very useful, but we'll get on to the more complex stuff soon. Oh, and don't worry about the '@' symbol, I'll explain that soon too.<br />
<br />
Think of a few other words that inflect the same as gramofon. How about adding those? We don't need to add any paradigms, just the entries in the main section of the monolingual and bilingual dictionaries.<br />
<br />
==Bring on the verbs==<br />
<br />
Ok, so we have a system that translates nouns, but that's pretty useless; we want to translate verbs too, and even whole sentences! How about we start with the verb 'to see'? In Serbo-Croatian this is videti. Serbo-Croatian is a null-subject language, which means that it doesn't typically use personal pronouns before the conjugated form of the verb. English is not. So, for example, I see in English would be translated as vidim in Serbo-Croatian.<br />
<br />
* Vidim<br />
* see<p1><sg><br />
* I see<br />
<br />
Note: <code><p1></code> denotes first person<br />
<br />
This will be important when we come to write the transfer rule for verbs. Other examples of null-subject languages include: Spanish, Romanian and Polish. This also has the effect that while we only need to add the verb in the Serbo-Croatian morphological dictionary, we need to add both the verb, and the personal pronouns in the English morphological dictionary. We'll go through both of these.<br />
<br />
The other forms of the verb videti are: vidiš, vidi, vidimo, vidite, and vide; which correspond to: you see (singular), he sees, we see, you see (plural), and they see.<br />
<br />
There are two forms of you see, one is plural and formal singular (vidite) and the other is singular and informal (vidiš).<br />
<br />
We're going to try and translate the sentence: "Vidim gramofoni" into "I see gramophones". In the interests of space, we'll just add enough information to do the translation and will leave filling out the paradigms (adding the other conjugations of the verb) as an exercise to the reader.<br />
<br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too, no need to add the case information for now though, we just add it as another option for plural. So, in the paradigm definition just copy the 'e' block for 'i' and change the 'i' to 'e' there.<br />
<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
<e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
<br />
The first thing we need to do is add some more symbols. We need to first add a symbol for 'verb', which we'll call "vblex" (this means lexical verb, as opposed to modal verbs and other types). Verbs have 'person' and 'tense' along with number, so let's add a couple of those as well. We need to translate "I see", so for person we should add "p1", or 'first person', and for tense "pri", or 'present indicative'.<br />
<pre><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</pre><br />
After we've done this, as with the nouns, we add a paradigm for the verb conjugation. The first line will be:<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
</pre><br />
The '/' marks the boundary between the root ('vid') and the part of the lemma ('eti') that is replaced by the endings given between the <l> </l> tags.<br />
<br />
Then the inflection for first person singular:<br />
<pre><br />
<br />
<e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
<br />
</pre><br />
The 'im' denotes the ending (as in 'vidim'), it is necessary to add 'eti' to the <r> section, as this will be chopped off by the definition. The rest is fairly straightforward, 'vblex' is lexical verb, 'pri' is present indicative tense, 'p1' is first person and 'sg' is singular. We can also add the plural which will be the same, except 'imo' instead of 'im' and 'pl' instead of 'sg'.<br />
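<br />
Putting the two entries together, the finished paradigm might look like the sketch below. It only combines the singular entry shown above with the plural entry just described; the XML comment is added here for illustration.<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
<e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
<!-- plural: 'imo' instead of 'im', 'pl' instead of 'sg' --><br />
<e><p><l>imo</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />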
<br />
After this we need to add a lemma-to-paradigm mapping to the main section:<br />
<pre><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</pre><br />
Note: the content of <nowiki><i> </i></nowiki> is the root, not the lemma.<br />
<br />
That's the work on the Serbo-Croatian dictionary done for now. Let's compile it, then test it.<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
main@standard 23 25<br />
$ echo "vidim" | lt-proc sh-en.automorf.bin<br />
^vidim/videti<vblex><pri><p1><sg>$<br />
$ echo "vidimo" | lt-proc sh-en.automorf.bin<br />
^vidimo/videti<vblex><pri><p1><pl>$<br />
</pre><br />
Ok, so now we do the same for the English dictionary (remember to add the same symbol definitions here as you added to the Serbo-Croatian one).<br />
<br />
The paradigm is:<br />
<pre><br />
<pardef n="s/ee__vblex"><br />
</pre><br />
because the past tense is 'saw'. Now, we can do one of two things, we can add both first and second person, but they are the same form. In fact, all forms (except third person singular) of the verb 'to see' are 'see'. So instead we make one entry for 'see' and give it only the 'pri' symbol.<br />
<pre><br />
<br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e><br />
<br />
</pre><br />
and as always, an entry in the main section:<br />
<pre><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
</pre><br />
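For reference, the pieces above can be assembled into one complete paradigm definition, which goes inside the <pardefs> section of the English dictionary. This is only a restatement of the fragments already shown:<br />
<pre><br />
<pardef n="s/ee__vblex"><br />
<!-- one entry covers every present-tense form that is spelled 'see' --><br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e><br />
</pardef><br />
</pre><br />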
Then let's save, recompile and test:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
main@standard 18 19<br />
<br />
$ echo "see" | lt-proc en-sh.automorf.bin<br />
^see/see<vblex><pri>$<br />
</pre><br />
Now for the obligatory entry in the bilingual dictionary:<br />
<pre><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</pre><br />
(again, don't forget to add the sdefs from earlier)<br />
<br />
And recompile:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
main@standard 18 18<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
main@standard 18 18<br />
</pre><br />
Now to test:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
<br />
^see<vblex><pri><p1><sg>$^@<br />
</pre><br />
We get the analysis passed through correctly, but when we try and generate a surface form from this, we get a '#', like below:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#see\@<br />
</pre><br />
This '#' means that the generator cannot generate the correct surface form because its dictionary does not contain that analysis. Why is this?<br />
<br />
Basically the analyses don't match, the 'see' in the dictionary is see<vblex><pri>, but the see delivered by the transfer is see<vblex><pri><p1><sg>. The Serbo-Croatian side has more information than the English side requires. You can test this by adding the missing symbols to the English dictionary, and then recompiling, and testing again.<br />
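<br />
As a sketch of that experiment, the extra entry below could be added to the English 's/ee__vblex' paradigm so that the generator also knows the fully tagged form. It assumes that 'p1' and 'sg' have been declared as symbols in the English dictionary:<br />
<pre><br />
<!-- experiment only: lets the generator accept see<vblex><pri><p1><sg> --><br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pre><br />
With this approach every combination of person and number would need its own entry, which is why a transfer rule is the tidier fix.<br />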
<br />
However, a more paradigmatic way of taking care of this is by writing a rule. So, we open up the rules file (<code>apertium-sh-en.sh-en.t1x</code> in case you forgot).<br />
<br />
We need to add a new category for 'verb'.<br />
<pre><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
</pre><br />
We also need to add attributes for tense and for person. We'll make it really simple for now, you can add p2 and p3, but I won't in order to save space.<br />
<pre><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
</pre><br />
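If you do want the other persons, a fuller version of the person attribute might look like the sketch below. It assumes that 'p2' and 'p3' have also been declared with <sdef> in the dictionaries, which this HOWTO does not do:<br />
<pre><br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
<!-- assumed additions: second and third person --><br />
<attr-item tags="p2"/><br />
<attr-item tags="p3"/><br />
</def-attr><br />
</pre><br />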
We should also add an attribute for verbs.<br />
<pre><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
</pre><br />
Now onto the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
Remember when you tried commenting out the 'clip' tags in the previous rule example and they disappeared from the transfer? That's pretty much what we're doing here. We take in a verb with a full analysis, but only output a partial analysis (lemma + verb tag + tense tag).<br />
<br />
So now, if we recompile that, we get:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
^see<vblex><pri>$^@<br />
</pre><br />
and:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see\@<br />
</pre><br />
Try it with 'vidimo' (we see) to see if you get the correct output.<br />
<br />
Now try it with "vidim gramofone":<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see gramophones\@<br />
</pre><br />
<br />
==But what about personal pronouns?==<br />
<br />
Well, that's great, but we're still missing the personal pronoun that is necessary in English. In order to add it in, we first need to edit the English morphological dictionary.<br />
<br />
As before, the first thing to do is add the necessary symbols:<br />
<pre><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</pre><br />
Of the two symbols, prn is pronoun, and subj is subject (as in the subject of a sentence).<br />
<br />
Because there is no root, or 'lemma' for personal subject pronouns, we just add the pardef as follows:<br />
<pre><br />
<pardef n="prsubj__prn"><br />
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pardef><br />
</pre><br />
With 'prsubj' being 'personal subject'. The rest of them (You, We etc.) are left as an exercise to the reader.<br />
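<br />
As a partial answer to that exercise, the sketch below adds 'we' alongside 'I'. It reuses only symbols already defined above; forms such as 'you' would additionally need a 'p2' symbol, which is not defined in this HOWTO.<br />
<pre><br />
<pardef n="prsubj__prn"><br />
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
<!-- first person plural, following the same pattern --><br />
<e><p><l>we</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />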
<br />
We can add an entry to the main section as follows:<br />
<pre><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
</pre><br />
So, save, recompile and test, and we should get something like:<br />
<pre><br />
$ echo "I" | lt-proc en-sh.automorf.bin<br />
^I/PRPERS<prn><subj><p1><sg>$<br />
</pre><br />
<br />
(Note: it's in capitals because 'I' is in capitals).<br />
<br />
Now we need to amend the 'verb' rule to output the subject personal pronoun along with the correct verb form.<br />
<br />
First, add a category (this must be getting pretty pedestrian by now):<br />
<pre><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
</pre><br />
Now add the types of pronoun as attributes, we might as well add the 'obj' type as we're at it, although we won't need to use it for now:<br />
<pre><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</pre><br />
And now to input the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
This is pretty much the same rule as before, only we made a couple of small changes.<br />
<br />
We needed to output:<br />
<pre><br />
^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$<br />
</pre><br />
so that the generator could choose the right pronoun and the right form of the verb.<br />
<br />
So, a quick rundown:<br />
<br />
* <code><lit></code>, prints a literal string, in this case "prpers"<br />
* <code><lit-tag></code>, prints a literal tag, because we can't get the tags from the verb, we add these ourselves, "prn" for pronoun, and "subj" for subject.<br />
* <code><b/></code>, prints a blank, a space.<br />
<br />
Note that we retrieved the information for person and number directly from the verb.<br />
<br />
So, now if we recompile and test that again:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
Which, while it isn't exactly prize-winning prose (much like this HOWTO), is a fairly accurate translation.<br />
<br />
==So tell me about the record player (Multiwords)==<br />
<br />
While gramophone is an English word, it isn't the best translation. Gramophone is typically used for the very old kind, you know with the needle instead of the stylus, and no powered amplification. A better translation would be 'record player'. Although this is more than one word, we can treat it as if it is one word by using multiword (multipalabra) constructions.<br />
<br />
We don't need to touch the Serbo-Croatian dictionary, just the English one and the bilingual one, so open those two up.<br />
<br />
The plural of 'record player' is 'record players', so it takes the same paradigm as gramophone (gramophone__n) — in that we just add 's'. All we need to do is add a new element to the main section.<br />
<pre><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</pre><br />
The only thing different about this is the use of the <b/> tag, although this isn't entirely new as we saw it in use in the rules file.<br />
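<br />
The bilingual dictionary needs the matching change. A minimal sketch, assuming the existing 'gramofon' entry is edited in place rather than duplicated (two entries with the same left side would make the translation ambiguous):<br />
<pre><br />
<!-- replaces the earlier gramofon / gramophone entry --><br />
<e><p><l>gramofon<s n="n"/></l><r>record<b/>player<s n="n"/></r></p></e><br />
</pre><br />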
<br />
So, recompile and test in the orthodox fashion:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see record players<br />
</pre><br />
Perfect. A big benefit of using multiwords is that you can translate idiomatic expressions verbatim, without having to do word-by-word translation. For example the English phrase, "at the moment" would be translated into Serbo-Croatian as "trenutno" (trenutak = ''moment'', trenutno being adverb of that) &mdash; it would not be possible to translate this English phrase word-by-word into Serbo-Croatian.<br />
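<br />
A rough sketch of how such an idiom could be entered is given below. The 'adv' (adverb) symbol is an assumption: it is not defined anywhere in this HOWTO and would need an <sdef> in each dictionary, and the Serbo-Croatian dictionary would need a corresponding entry for 'trenutno'.<br />
<pre><br />
<!-- English morphological dictionary, main section (assumes <sdef n="adv"/>) --><br />
<e lm="at the moment"><p><l>at<b/>the<b/>moment</l><r>at<b/>the<b/>moment<s n="adv"/></r></p></e><br />
<br />
<!-- bilingual dictionary --><br />
<e><p><l>trenutno<s n="adv"/></l><r>at<b/>the<b/>moment<s n="adv"/></r></p></e><br />
</pre><br />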
<br />
==Dealing with minor variation==<br />
<br />
Serbo-Croatian is an umbrella term for several standard languages, so there are differences in pronunciation and orthography. There is a cool phonetic writing system, so you write how you speak. A notable example is the pronunciation of the proto-Slavic vowel ''yat''. The word for dictionary can for instance be either "rječnik" (called Ijekavian), or "rečnik" (called Ekavian).<br />
<br />
===Analysis===<br />
<br />
There should be a fairly easy way of dealing with this, and there is, using paradigms again. Paradigms aren't only used for adding grammatical symbols, but they can also be used to replace any character/symbol with another. For example, here is a paradigm for accepting both "e" and "je" in the analysis. The paradigm should, as with the others go into the monolingual dictionary for Serbo-Croatian.<br />
<br />
<pre><br />
<pardef n="e_je__yat"><br />
<e><br />
<p><br />
<l>e</l><br />
<r>e</r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>je</l><br />
<r>e</r><br />
</p><br />
</e><br />
</pardef><br />
</pre><br />
<br />
Then in the "main section":<br />
<br />
<pre><br />
<e lm="rečnik"><i>r</i><par n="e_je__yat"/><i>čni</i><par n="rečni/k__n"/></e><br />
</pre><br />
<br />
This only allows us to analyse both forms however... more work is necessary if we want to generate both forms.<br />
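<br />
One possible starting point, offered only as a sketch, is to mark the "je" entry with the direction restriction <code>r="LR"</code>, so that it is used when compiling the analyser but skipped when compiling the generator. The generator would then always produce the Ekavian form; this is an illustrative choice and does not by itself generate both variants.<br />
<pre><br />
<pardef n="e_je__yat"><br />
<e><p><l>e</l><r>e</r></p></e><br />
<!-- r="LR": analysis only, so "je" is accepted but never generated --><br />
<e r="LR"><p><l>je</l><r>e</r></p></e><br />
</pardef><br />
</pre><br />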
<br />
===Generation===<br />
<br />
==See also==<br />
<br />
*[[Building dictionaries]]<br />
*[[Cookbook]] <br />
*[[Chunking]]<br />
*[[Contributing to an existing pair]]<br />
<br />
[[Category:Documentation in English]]<br />
[[Category:HOWTO]]<br />
[[Category:Writing dictionaries]]<br />
[[Category:Quickstart]]</div>Grégoirehttps://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=35836Apertium New Language Pair HOWTO2012-08-17T09:54:44Z<p>Grégoire: /* Transfer rules */</p>
<hr />
<div>{{TOCD}}<br />
Apertium New Language Pair HOWTO<br />
<br />
This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.<br />
<br />
It does not assume any knowledge of linguistics, or machine translation above the level of being able to distinguish nouns from verbs (and prepositions etc.)<br />
<br />
==Introduction==<br />
<br />
Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).<br />
<br />
For a more detailed introduction into how it all works, there are some excellent papers on the [[Publications]] page.<br />
<br />
==You will need==<br />
<br />
* [[lttoolbox]] (>= 3.0.0)<br />
* libxml utils (xmllint etc.)<br />
* apertium (>= 3.0.0)<br />
* a text editor (or a specialised XML editor if you prefer)<br />
<br />
This document will not describe how to install these packages, for more information please see the documentation section of the Apertium website.<br />
<br />
==What does a language pair consist of?==<br />
<br />
Apertium is a shallow-transfer type machine translation system. Thus, it basically works on dictionaries and shallow transfer rules. In operation, shallow-transfer is distinguished from deep-transfer in that it doesn't do full syntactic parsing, the rules are typically operations on groups of lexical units, rather than operations on parse trees. At a basic level, there are three main dictionaries:<br />
# The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: <code>apertium-sh-en.sh.dix</code><br />
# The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: <code>apertium-sh-en.en.dix</code><br />
# Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: <code>apertium-sh-en.sh-en.dix</code><br />
<br />
In a translation pair, both languages can be either source or target for translation, these are relative terms.<br />
<br />
There are also two files for transfer rules. These are the rules that govern how words are re-ordered in sentences, e.g. ''chat noir'' → ''cat black'' → ''black cat''. It also governs agreement of gender, number etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:<br />
<br />
* language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: <code>apertium-sh-en.sh-en.t1x</code><br />
* language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: <code>apertium-sh-en.en-sh.t1x</code><br />
<br />
Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.<br />
<br />
==Language pair==<br />
<br />
As you may have gathered from the file names, this HOWTO will use the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, since the system works better for more closely related languages. This shouldn't present a problem for the simple examples given here.<br />
<br />
==A brief note on terms==<br />
<br />
There are a number of terms that will need to be understood before we continue.<br />
<br />
The first is ''lemma''. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word cats is ''cat''. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of to, e.g. the lemma of ''was'' would be ''be''.<br />
<br />
The second is ''symbol''. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:<br />
<br />
* <code><n></code>; for noun.<br />
* <code><pl></code>; for plural.<br />
<br />
Other examples of symbols are <sg>; singular, <p1> first person, <pri> present indicative, etc. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver — from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <nowiki><s></nowiki> tags.<br />
<br />
The third word is ''paradigm''. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflect. In the morphological dictionary, lemmas (see above) are linked to paradigms that allow us to describe how a given lemma inflects without having to write out all of the endings.<br />
<br />
An example of the utility of this is, if we wanted to store the two adjectives ''happy'' and ''lazy'', instead of storing two lots of the same thing:<br />
<br />
* happy, happ (y, ier, iest)<br />
* lazy, laz (y, ier, iest)<br />
<br />
We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy", etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.<br />
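<br />
As a preview of the notation explained in the sections below, the happy/lazy example could be written roughly as follows. The symbol names 'adj', 'comp' and 'sup' (adjective, comparative, superlative) are assumptions made for this sketch and would each need an <sdef> declaration:<br />
<pre><br />
<pardef n="happ/y__adj"><br />
<e><p><l>y</l><r>y<s n="adj"/></r></p></e><br />
<e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e><br />
<e><p><l>iest</l><r>y<s n="adj"/><s n="sup"/></r></p></e><br />
</pardef><br />
<br />
<!-- both lemmas share the one paradigm --><br />
<e lm="happy"><i>happ</i><par n="happ/y__adj"/></e><br />
<e lm="lazy"><i>laz</i><par n="happ/y__adj"/></e><br />
</pre><br />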
<br />
==Getting started==<br />
<!-- Ur yezh indezeuropek eo ar brezhoneg --><br />
<br />
===Monolingual dictionaries===<br />
{{see-also|List of dictionaries|Incubator}}<br />
Let's start by making our first source language dictionary, Serbo-Croatian in our example. As mentioned above, this file will be called <code>apertium-sh-en.sh.dix</code>. The dictionary is an XML file. Fire up your text editor and type the following:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<br />
</dictionary><br />
</pre><br />
So, the file so far defines that we want to start a dictionary. In order for it to be useful, we need to add some more entries, the first is an alphabet. This defines the set of letters that may be used in the dictionary, for Serbo-Croatian. It will look something like the following, containing all the letters of the Serbo-Croatian alphabet:<br />
<pre><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
</pre><br />
<br />
Place the alphabet below the <dictionary> tag.<br />
<br />
Next we need to define some symbols. Let's start off with the simple stuff, noun (n) in singular (sg) and plural (pl).<br />
<pre><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
</pre><br />
The symbol names do not have to be so short; in fact, they could just be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.<br />
<br />
Unfortunately, it isn't quite so simple. Nouns in Serbo-Croatian inflect for more than just number, they are also inflected for case, and have a gender. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).<br />
<br />
The next thing is to define a section for the paradigms,<br />
<pre><br />
<pardefs><br />
<br />
</pardefs><br />
</pre><br />
and a dictionary section:<br />
<pre><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</pre><br />
There are two types of sections, the first is a standard section, that contains words, enclitics, etc. The second type is an [[inconditional section]] which typically contains punctuation, and so forth. We don't have an inconditional section here.<br />
<br />
So, our file should now look something like:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<pardefs><br />
<br />
</pardefs><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').<br />
<br />
The first thing we need to do, as we have no prior paradigms, is to define a paradigm.<br />
<br />
Remember, we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.<br />
<br />
This may seem like a rather verbose way of describing it, but there are reasons for this and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,<br />
<br />
* e, is for entry.<br />
* p, is for pair.<br />
* l, is for left.<br />
* r, is for right.<br />
<br />
Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:<br />
<pre><br />
* gramofoni (left to right) gramofon<n><pl> (analysis)<br />
* gramofon<n><pl> (right to left) gramofoni (generation)<br />
</pre><br />
Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.<br />
<br />
The entry to put in the <section> will look like:<br />
<pre><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
</pre><br />
A quick run down on the abbreviations:<br />
<br />
* lm, is for lemma.<br />
* i, is for identity (the left and the right are the same).<br />
* par, is for paradigm.<br />
<br />
This entry states the lemma of the word, gramofon, the root, gramofon and the paradigm with which it inflects gramofon__n. The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which suffixes are added. This will become clearer later when we show an entry where the two are different.<br />
<br />
We're now ready to test the dictionary. Save it under <code>apertium-sh-en.sh.dix</code>, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc). For those who are new to cygwin just take note that you need to save the dictionary file inside the home folder (for example C:\Apertium\home\Username\filename_of_dictionary). Otherwise you will not be able to compile.<br />
<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
</pre><br />
Should produce the output:<br />
<pre><br />
main@standard 12 12<br />
</pre><br />
As we are compiling it left to right, we're producing an analyser. Let's make a generator too.<br />
<pre><br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
</pre><br />
At this stage, the command should produce the same output.<br />
<br />
We can now test these. Run lt-proc on the analyser.<br />
<pre><br />
$ lt-proc sh-en.automorf.bin<br />
</pre><br />
Now try it out, type in gramofoni (gramophones), and see the output:<br />
<pre><br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. What if you want to use the more correct word 'record player'? Well, we'll explain how to do that later.<br />
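<br />
For comparison, a complete English dictionary along those lines might look like the sketch below. The alphabet line is an assumption (plain A to Z); the paradigm simply adds 's' for the plural.<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<pardefs><br />
<pardef n="gramophone__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pardefs><br />
<section id="main" type="standard"><br />
<e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e><br />
</section><br />
</dictionary><br />
</pre><br />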
<br />
You should now have two files in the directory:<br />
<br />
* apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and<br />
* apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.<br />
<br />
===Bilingual dictionary===<br />
<br />
So we now have two morphological dictionaries, next thing to make is the bilingual dictionary. This describes mappings between words. All dictionaries use the same format (which is specified in the DTD, dix.dtd).<br />
<br />
Create a new file, <code>apertium-sh-en.sh-en.dix</code> and add the basic skeleton:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we need to add an entry to translate between the two words. Something like:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
</pre><br />
Because there are a lot of these entries, they're typically written on one line to facilitate easier reading of the file. Again with the 'l' and 'r' right? Well, we compile it left to right to produce the Serbo-Croatian → English dictionary, and right to left to produce the English → Serbo-Croatian dictionary.<br />
<br />
So, once this is done, run the following commands:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
$ lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
</pre><br />
These commands generate the morphological analysers (automorf), the morphological generators (autogen) and the word lookups (autobil, where 'bil' is for "bilingual").<br />
<br />
===Transfer rules===<br />
<br />
So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rule files have their own DTD (transfer.dtd) which can be found in the Apertium package. If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages. For example the one described below would be useful for any null-subject language.<br />
<br />
Start out like all the others with a basic skeleton (<code>apertium-sh-en.sh-en.t1x</code>) :<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<br />
</transfer><br />
</pre><br />
At the moment, because we're ignoring case, we just need to make a rule that takes the grammatical symbols input and outputs them again.<br />
<br />
We first need to define categories and attributes. Categories and attributes both allow us to group grammatical symbols. Categories allow us to group symbols for the purposes of matching (for example 'n.*' is all nouns). Attributes allow us to group a set of symbols that can be chosen from (for example, 'sg' and 'pl' may be grouped as an attribute 'number').<br />
<br />
Let's add the necessary sections:<br />
<pre><br />
<section-def-cats><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<br />
</section-def-attrs><br />
</pre><br />
As we're only inflecting nouns in singular and plural, we need to add a category for nouns, along with an attribute for number. Something like the following will suffice:<br />
<br />
Into section-def-cats add:<br />
<pre><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
</pre><br />
This catches all nouns (lemmas followed by <n> then anything) and refers to them as "nom" (we'll see how that's used later).<br />
<br />
Into the section section-def-attrs, add:<br />
<pre><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
</pre><br />
and then<br />
<pre><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
</pre><br />
The first defines the attribute nbr (number), which can be either singular (sg) or plural (pl).<br />
<br />
The second defines the attribute a_nom (attribute noun).<br />
<br />
Next we need to add a section for global variables:<br />
<pre><br />
<section-def-vars><br />
<br />
</section-def-vars><br />
</pre><br />
These variables are used to store or transfer attributes between rules. We need only one for now,<br />
<pre><br />
<def-var n="number"/><br />
</pre><br />
Finally, we need to add a rule, to take in the noun and then output it in the correct form. We'll need a rules section...<br />
<pre><br />
<section-rules><br />
<br />
</section-rules><br />
</pre><br />
Changing the pace from the previous examples, I'll just paste this rule, then go through it, rather than the other way round.<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
The first tag is obvious: it defines a rule. The second tag, pattern, basically says: "apply this rule if this pattern is found". In this example the pattern consists of a single noun (defined by the category item nom). Note that patterns are matched longest-match first. So, say you have three rules: the first catches "<prn><vblex><n>", the second catches "<prn><vblex>" and the third catches "<n>". For input that all three could match, the pattern matched, and the rule executed, would be the first one.<br />
<br />
For each pattern, there is an associated action, which produces an associated output, out. The output is a lexical unit (lu).<br />
<br />
The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item.<br />
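<br />
Putting the fragments above together, the complete <code>apertium-sh-en.sh-en.t1x</code> might look roughly like the sketch below. Nothing in it is new; it only assumes that the sections appear in the order in which they were introduced (categories, attributes, variables, rules), and the indentation is purely cosmetic:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
  <!-- categories: what a pattern can match --><br />
  <section-def-cats><br />
    <def-cat n="nom"><br />
      <cat-item tags="n.*"/><br />
    </def-cat><br />
  </section-def-cats><br />
  <!-- attributes: groups of symbols that clip can pick out --><br />
  <section-def-attrs><br />
    <def-attr n="nbr"><br />
      <attr-item tags="sg"/><br />
      <attr-item tags="pl"/><br />
    </def-attr><br />
    <def-attr n="a_nom"><br />
      <attr-item tags="n"/><br />
    </def-attr><br />
  </section-def-attrs><br />
  <!-- global variables --><br />
  <section-def-vars><br />
    <def-var n="number"/><br />
  </section-def-vars><br />
  <!-- the single noun rule --><br />
  <section-rules><br />
    <rule><br />
      <pattern><br />
        <pattern-item n="nom"/><br />
      </pattern><br />
      <action><br />
        <out><br />
          <lu><br />
            <clip pos="1" side="tl" part="lem"/><br />
            <clip pos="1" side="tl" part="a_nom"/><br />
            <clip pos="1" side="tl" part="nbr"/><br />
          </lu><br />
        </out><br />
      </action><br />
    </rule><br />
  </section-rules><br />
</transfer><br />
</pre><br />
If your copy of the file differs from this in more than whitespace, that is a good place to start looking when the compile step complains.<br />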
<br />
Let's compile it and test it. Transfer rules are compiled with:<br />
<pre><br />
$ apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
Which will generate a <code>sh-en.t1x.bin</code> file.<br />
<br />
Now we're ready to test our machine translation system. There is one crucial part missing, the part-of-speech (PoS) tagger, but that will be explained shortly. In the meantime we can test it as is:<br />
<br />
First, let's analyse a word, gramofoni:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin <br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, normally here the POS tagger would choose the right version based on the part of speech, but we don't have a POS tagger yet, so we can use this little gawk script (thanks to Sergio) that will just output the first item retrieved.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}'<br />
^gramofon<n><pl>$<br />
</pre><br />
Now let's process that with the transfer rule:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
</pre><br />
It will output:<br />
<pre><br />
^gramophone<n><pl>$^@<br />
</pre><br />
* 'gramophone' is the target language (side="tl") lemma (lem) at position 1 (pos="1").<br />
* '<n>' is the target language a_nom at position 1.<br />
* '<pl>' is the target language attribute of number (nbr) at position 1.<br />
<br />
Try commenting out one of these clip statements, recompiling and seeing what happens.<br />
<br />
So, now we have the output from the transfer, the only thing that remains is to generate the target-language inflected forms. For this, we use lt-proc, but in generation (-g), not analysis mode.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
<br />
gramophones\@<br />
</pre><br />
And c'est ça. You now have a machine translation system that translates a Serbo-Croatian noun into an English noun. Obviously this isn't very useful, but we'll get onto the more complex stuff soon. Oh, and don't worry about the '@' symbol, I'll explain that soon too.<br />
<br />
Think of a few other words that inflect the same as gramofon. How about adding those? We don't need to add any paradigms, just the entries in the main section of the monolingual and bilingual dictionaries.<br />
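<br />
As a sketch of what that looks like, take ''telefon'' (plural ''telefoni''), which is my own example rather than one used elsewhere in this HOWTO. Assuming the English noun paradigm is called <code>gramophone__n</code>, the three entries would be something like:<br />
<pre><br />
<!-- apertium-sh-en.sh.dix, main section --><br />
<e lm="telefon"><i>telefon</i><par n="gramofon__n"/></e><br />
<br />
<!-- apertium-sh-en.en.dix, main section --><br />
<e lm="telephone"><i>telephone</i><par n="gramophone__n"/></e><br />
<br />
<!-- apertium-sh-en.sh-en.dix, main section --><br />
<e><p><l>telefon<s n="n"/></l><r>telephone<s n="n"/></r></p></e><br />
</pre><br />
All three dictionaries need recompiling afterwards, and the transfer rule does not need to change because it matches any noun.<br />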
<br />
==Bring on the verbs==<br />
<br />
Ok, so we have a system that translates nouns, but that's pretty useless, we want to translate verbs too, and even whole sentences! How about we start with the verb to see. In Serbo-Croatian this is videti. Serbo-Croatian is a null-subject language, this means that it doesn't typically use personal pronouns before the conjugated form of the verb. English is not. So for example: I see in English would be translated as vidim in Serbo-Croatian.<br />
<br />
* Vidim<br />
* see<p1><sg><br />
* I see<br />
<br />
Note: <code><p1></code> denotes first person<br />
<br />
This will be important when we come to write the transfer rule for verbs. Other examples of null-subject languages include: Spanish, Romanian and Polish. This also has the effect that while we only need to add the verb in the Serbo-Croatian morphological dictionary, we need to add both the verb, and the personal pronouns in the English morphological dictionary. We'll go through both of these.<br />
<br />
The other forms of the verb videti are: vidiš, vidi, vidimo, vidite, and vide; which correspond to: you see (singular), he sees, we see, you see (plural), and they see.<br />
<br />
There are two forms of you see, one is plural and formal singular (vidite) and the other is singular and informal (vidiš).<br />
<br />
We're going to try and translate the sentence: "Vidim gramofoni" into "I see gramophones". In the interests of space, we'll just add enough information to do the translation and will leave filling out the paradigms (adding the other conjugations of the verb) as an exercise for the reader.<br />
<br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too. There's no need to add the case information for now; we just add it as another option for plural. So, in the paradigm definition, copy the <e> entry for 'i' and change the 'i' to 'e' there.<br />
<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
<e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
<br />
The first thing we need to do is add some more symbols. We need to first add a symbol for 'verb', which we'll call "vblex" (this means lexical verb, as opposed to modal verbs and other types). Verbs have 'person' and 'tense' along with number, so let's add a couple of those as well. We need to translate "I see", so for person we should add "p1", or 'first person', and for tense "pri", or 'present indicative'.<br />
<pre><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</pre><br />
After we've done this, as with the nouns, we add a paradigm for the verb conjugation. The first line will be:<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
</pre><br />
The '/' is used to demarcate where the root ends, which is where the endings (the parts between the <l> </l> tags) are added.<br />
<br />
Then the inflection for first person singular:<br />
<pre><br />
<br />
<e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
<br />
</pre><br />
The 'im' denotes the ending (as in 'vidim'), it is necessary to add 'eti' to the <r> section, as this will be chopped off by the definition. The rest is fairly straightforward, 'vblex' is lexical verb, 'pri' is present indicative tense, 'p1' is first person and 'sg' is singular. We can also add the plural which will be the same, except 'imo' instead of 'im' and 'pl' instead of 'sg'.<br />
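<br />
If you want to check your work on the rest of the present tense, the sketch below fills in the paradigm using the six forms listed earlier (vidim, vidiš, vidi, vidimo, vidite, vide). It assumes you have also declared <code>p2</code> and <code>p3</code> as symbols; those two names are my assumption, chosen to follow the pattern of <code>p1</code>:<br />
<pre><br />
<!-- extra symbol definitions, next to the others in sdefs --><br />
<sdef n="p2"/><br />
<sdef n="p3"/><br />
<br />
<pardef n="vid/eti__vblex"><br />
  <e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
  <e><p><l>iš</l><r>eti<s n="vblex"/><s n="pri"/><s n="p2"/><s n="sg"/></r></p></e><br />
  <e><p><l>i</l><r>eti<s n="vblex"/><s n="pri"/><s n="p3"/><s n="sg"/></r></p></e><br />
  <e><p><l>imo</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r></p></e><br />
  <e><p><l>ite</l><r>eti<s n="vblex"/><s n="pri"/><s n="p2"/><s n="pl"/></r></p></e><br />
  <e><p><l>e</l><r>eti<s n="vblex"/><s n="pri"/><s n="p3"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
With the extra entries in place, the numbers that lt-comp prints when compiling will differ from the ones shown in this HOWTO.<br />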
<br />
After this we need to add a lemma, paradigm mapping to the main section:<br />
<pre><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</pre><br />
Note: the content of <nowiki><i> </i></nowiki> is the root, not the lemma.<br />
<br />
That's the work on the Serbo-Croatian dictionary done for now. Let's compile it, then test it.<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
main@standard 23 25<br />
$ echo "vidim" | lt-proc sh-en.automorf.bin<br />
^vidim/videti<vblex><pri><p1><sg>$<br />
$ echo "vidimo" | lt-proc sh-en.automorf.bin<br />
^vidimo/videti<vblex><pri><p1><pl>$<br />
</pre><br />
Ok, so now we do the same for the English dictionary (remember to add the same symbol definitions here as you added to the Serbo-Croatian one).<br />
<br />
The paradigm is:<br />
<pre><br />
<pardef n="s/ee__vblex"><br />
</pre><br />
because the past tense is 'saw'. Now, we can do one of two things, we can add both first and second person, but they are the same form. In fact, all forms (except third person singular) of the verb 'to see' are 'see'. So instead we make one entry for 'see' and give it only the 'pri' symbol.<br />
<pre><br />
<br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e><br />
<br />
</pre><br />
and as always, an entry in the main section:<br />
<pre><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
</pre><br />
Then let's save, recompile and test:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
main@standard 18 19<br />
<br />
$ echo "see" | lt-proc en-sh.automorf.bin<br />
^see/see<vblex><pri>$<br />
</pre><br />
Now for the obligatory entry in the bilingual dictionary:<br />
<pre><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</pre><br />
(again, don't forget to add the sdefs from earlier)<br />
<br />
And recompile:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
main@standard 18 18<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
main@standard 18 18<br />
</pre><br />
Now to test:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
<br />
^see<vblex><pri><p1><sg>$^@<br />
</pre><br />
We get the analysis passed through correctly, but when we try and generate a surface form from this, we get a '#', like below:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#see\@<br />
</pre><br />
This '#' means that the generator cannot generate the surface form because its dictionary does not contain the lexical form it was given. Why is this?<br />
<br />
Basically the analyses don't match, the 'see' in the dictionary is see<vblex><pri>, but the see delivered by the transfer is see<vblex><pri><p1><sg>. The Serbo-Croatian side has more information than the English side requires. You can test this by adding the missing symbols to the English dictionary, and then recompiling, and testing again.<br />
<br />
However, a more paradigmatic way of taking care of this is by writing a rule. So, we open up the rules file (<code>apertium-sh-en.sh-en.t1x</code> in case you forgot).<br />
<br />
We need to add a new category for 'verb'.<br />
<pre><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
</pre><br />
We also need to add attributes for tense and for person. We'll make it really simple for now, you can add p2 and p3, but I won't in order to save space.<br />
<pre><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
</pre><br />
We should also add an attribute for verbs.<br />
<pre><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
</pre><br />
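If you do want the other persons, the <code>pers</code> attribute could be extended as in this sketch. It assumes that <code>p2</code> and <code>p3</code> are the symbol names you used for second and third person in the dictionaries:<br />
<pre><br />
<def-attr n="pers"><br />
  <attr-item tags="p1"/><br />
  <attr-item tags="p2"/><br />
  <attr-item tags="p3"/><br />
</def-attr><br />
</pre><br />
An attribute only matches the symbols it lists, so a clip of <code>pers</code> on a second-person verb would come back empty if only <code>p1</code> were listed.<br />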
Now onto the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
Remember when you tried commenting out the 'clip' tags in the previous rule example and they disappeared from the transfer, well, that's pretty much what we're doing here. We take in a verb with a full analysis, but only output a partial analysis (lemma + verb tag + tense tag).<br />
<br />
So now, if we recompile that, we get:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
^see<vblex><pri>$^@<br />
</pre><br />
and:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see\@<br />
</pre><br />
Try it with 'vidimo' (we see) to see if you get the correct output.<br />
<br />
Now try it with "vidim gramofone":<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see gramophones\@<br />
</pre><br />
<br />
==But what about personal pronouns?==<br />
<br />
Well, that's great, but we're still missing the personal pronoun that is necessary in English. In order to add it in, we first need to edit the English morphological dictionary.<br />
<br />
As before, the first thing to do is add the necessary symbols:<br />
<pre><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</pre><br />
Of the two symbols, prn is pronoun, and subj is subject (as in the subject of a sentence).<br />
<br />
Because there is no root, or 'lemma' for personal subject pronouns, we just add the pardef as follows:<br />
<pre><br />
<pardef n="prsubj__prn"><br />
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pardef><br />
</pre><br />
With 'prsubj' being 'personal subject'. The rest of them (You, We etc.) are left as an exercise for the reader.<br />
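<br />
As a starting point for that exercise, the paradigm might grow into something like the sketch below. It assumes that <code>p2</code> and <code>p3</code> have been declared as symbols in the English dictionary, and it deliberately leaves out he, she and it, since telling those apart needs a gender symbol that we haven't introduced:<br />
<pre><br />
<pardef n="prsubj__prn"><br />
  <e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
  <e><p><l>you</l><r>prpers<s n="prn"/><s n="subj"/><s n="p2"/><s n="sg"/></r></p></e><br />
  <e><p><l>we</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="pl"/></r></p></e><br />
  <e><p><l>you</l><r>prpers<s n="prn"/><s n="subj"/><s n="p2"/><s n="pl"/></r></p></e><br />
  <e><p><l>they</l><r>prpers<s n="prn"/><s n="subj"/><s n="p3"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
Note that 'you' appears twice: the surface form is the same, but the singular and plural analyses are different, so both entries are needed for generation.<br />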
<br />
We can add an entry to the main section as follows:<br />
<pre><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
</pre><br />
So, save, recompile and test, and we should get something like:<br />
<pre><br />
$ echo "I" | lt-proc en-sh.automorf.bin<br />
^I/PRPERS<prn><subj><p1><sg>$<br />
</pre><br />
<br />
(Note: it's in capitals because 'I' is in capitals).<br />
<br />
Now we need to amend the 'verb' rule to output the subject personal pronoun along with the correct verb form.<br />
<br />
First, add a category (this must be getting pretty pedestrian by now):<br />
<pre><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
</pre><br />
Now add the types of pronoun as attributes. We might as well add the 'obj' type while we're at it, although we won't need to use it for now:<br />
<pre><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</pre><br />
And now to input the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
This is pretty much the same rule as before, only we made a couple of small changes.<br />
<br />
We needed to output:<br />
<pre><br />
^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$<br />
</pre><br />
so that the generator could choose the right pronoun and the right form of the verb.<br />
<br />
So, a quick rundown:<br />
<br />
* <code><lit></code>, prints a literal string, in this case "prpers"<br />
* <code><lit-tag></code>, prints a literal tag. Because we can't get these tags from the verb, we add them ourselves: "prn" for pronoun, and "subj" for subject.<br />
* <code><b/></code>, prints a blank, a space.<br />
<br />
Note that we retrieved the information for person and number directly from the verb.<br />
<br />
So, now if we recompile and test that again:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
Which, while it isn't exactly prize-winning prose (much like this HOWTO), is a fairly accurate translation.<br />
<br />
==So tell me about the record player (Multiwords)==<br />
<br />
While gramophone is an English word, it isn't the best translation. Gramophone is typically used for the very old kind, you know with the needle instead of the stylus, and no powered amplification. A better translation would be 'record player'. Although this is more than one word, we can treat it as if it is one word by using multiword (multipalabra) constructions.<br />
<br />
We don't need to touch the Serbo-Croatian dictionary, just the English one and the bilingual one, so open them up.<br />
<br />
The plural of 'record player' is 'record players', so it takes the same paradigm as gramophone (gramophone__n) — in that we just add 's'. All we need to do is add a new element to the main section.<br />
<pre><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</pre><br />
The only thing different about this is the use of the <b/> tag, although this isn't entirely new as we saw it in use in the rules file.<br />
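<br />
The bilingual dictionary needs the matching change, which isn't shown above. A minimal sketch, assuming your existing entry for gramofon follows the same pattern as the one for videti, is to change its right-hand side so that it points at the new multiword:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>record<b/>player<s n="n"/></r></p></e><br />
</pre><br />
Both the English dictionary and the bilingual dictionary have to be recompiled before the new translation shows up.<br />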
<br />
So, recompile and test in the orthodox fashion:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see record players<br />
</pre><br />
Perfect. A big benefit of using multiwords is that you can translate idiomatic expressions verbatim, without having to do word-by-word translation. For example the English phrase, "at the moment" would be translated into Serbo-Croatian as "trenutno" (trenutak = ''moment'', trenutno being adverb of that) &mdash; it would not be possible to translate this English phrase word-by-word into Serbo-Croatian.<br />
<br />
==Dealing with minor variation==<br />
<br />
Serbo-Croatian is an umbrella term for several standard languages, so there are differences in pronunciation and orthography. There is a cool phonetic writing system, so you write how you speak. A notable example is the pronunciation of the Proto-Slavic vowel ''yat''. The word for dictionary can, for instance, be either "rječnik" (called Ijekavian) or "rečnik" (called Ekavian).<br />
<br />
===Analysis===<br />
<br />
There should be a fairly easy way of dealing with this, and there is, using paradigms again. Paradigms aren't only used for adding grammatical symbols, but they can also be used to replace any character/symbol with another. For example, here is a paradigm for accepting both "e" and "je" in the analysis. The paradigm should, as with the others go into the monolingual dictionary for Serbo-Croatian.<br />
<br />
<pre><br />
<pardef n="e_je__yat"><br />
<e><br />
<p><br />
<l>e</l><br />
<r>e</r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>je</l><br />
<r>e</r><br />
</p><br />
</e><br />
</pardef><br />
</pre><br />
<br />
Then in the "main section":<br />
<br />
<pre><br />
<e lm="rečnik"><i>r</i><par n="e_je__yat"/><i>čni</i><par n="rečni/k__n"/></e><br />
</pre><br />
<br />
This only allows us to analyse both forms however... more work is necessary if we want to generate both forms.<br />
<br />
===Generation===<br />
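<br />
One possible first step, sketched here as an assumption rather than as the finished recipe for this section, is to keep both forms for analysis but generate only one of them. Dictionary entries accept an <code>r</code> attribute that restricts the direction in which an entry is compiled. Marking the "je" entry with <code>r="LR"</code> means it is only used when compiling left to right (analysis), so the generator would only ever produce the Ekavian "e":<br />
<pre><br />
<pardef n="e_je__yat"><br />
  <!-- used for both analysis and generation --><br />
  <e><p><l>e</l><r>e</r></p></e><br />
  <!-- analysis only: accepted on input, never generated --><br />
  <e r="LR"><p><l>je</l><r>e</r></p></e><br />
</pardef><br />
</pre><br />
Generating Ijekavian output instead, or choosing between the two at translation time, needs more than this and is not covered by the sketch.<br />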
<br />
==See also==<br />
<br />
*[[Building dictionaries]]<br />
*[[Cookbook]] <br />
*[[Chunking]]<br />
*[[Contributing to an existing pair]]<br />
<br />
[[Category:Documentation in English]]<br />
[[Category:HOWTO]]<br />
[[Category:Writing dictionaries]]<br />
[[Category:Quickstart]]</div>Grégoirehttps://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=35835Apertium New Language Pair HOWTO2012-08-17T09:51:53Z<p>Grégoire: /* Bilingual dictionary */ Following previous convention</p>
<hr />
<div>{{TOCD}}<br />
Apertium New Language Pair HOWTO<br />
<br />
This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.<br />
<br />
It does not assume any knowledge of linguistics, or machine translation above the level of being able to distinguish nouns from verbs (and prepositions etc.)<br />
<br />
==Introduction==<br />
<br />
Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).<br />
<br />
For a more detailed introduction into how it all works, there are some excellent papers on the [[Publications]] page.<br />
<br />
==You will need==<br />
<br />
* [[lttoolbox]] (>= 3.0.0)<br />
* libxml utils (xmllint etc.)<br />
* apertium (>= 3.0.0)<br />
* a text editor (or a specialised XML editor if you prefer)<br />
<br />
This document will not describe how to install these packages, for more information please see the documentation section of the Apertium website.<br />
<br />
==What does a language pair consist of?==<br />
<br />
Apertium is a shallow-transfer type machine translation system. Thus, it basically works on dictionaries and shallow transfer rules. In operation, shallow-transfer is distinguished from deep-transfer in that it doesn't do full syntactic parsing, the rules are typically operations on groups of lexical units, rather than operations on parse trees. At a basic level, there are three main dictionaries:<br />
# The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: <code>apertium-sh-en.sh.dix</code><br />
# The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: <code>apertium-sh-en.en.dix</code><br />
# Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: <code>apertium-sh-en.sh-en.dix</code><br />
<br />
In a translation pair, both languages can be either source or target for translation, these are relative terms.<br />
<br />
There are also two files for transfer rules. These are the rules that govern how words are re-ordered in sentences, e.g. ''chat noir'' → ''cat black'' → ''black cat''. It also governs agreement of gender, number etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:<br />
<br />
* language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: <code>apertium-sh-en.sh-en.t1x</code><br />
* language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: <code>apertium-sh-en.en-sh.t1x</code><br />
<br />
Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.<br />
<br />
==Language pair==<br />
<br />
As you may have gathered from the file names, this HOWTO will use the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, since the system works better for more closely related languages, but that shouldn't present a problem for the simple examples given here.<br />
<br />
==A brief note on terms==<br />
<br />
There are a number of terms that will need to be understood before we continue.<br />
<br />
The first is ''lemma''. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word cats is ''cat''. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of to, e.g. the lemma of ''was'' would be ''be''.<br />
<br />
The second is ''symbol''. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:<br />
<br />
* <code><n></code>; for noun.<br />
* <code><pl></code>; for plural.<br />
<br />
Other examples of symbols are <sg> for singular, <p1> for first person, <pri> for present indicative, etc. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver — from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <nowiki><s></nowiki> tags.<br />
<br />
The third word is ''paradigm''. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflect. In the morphological dictionary, lemmas (see above) are linked to paradigms that allow us to describe how a given lemma inflects without having to write out all of the endings.<br />
<br />
An example of the utility of this is, if we wanted to store the two adjectives ''happy'' and ''lazy'', instead of storing two lots of the same thing:<br />
<br />
* happy, happ (y, ier, iest)<br />
* lazy, laz (y, ier, iest)<br />
<br />
We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy", etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.<br />
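<br />
As a preview, and only as a sketch, this is roughly how the ''happy'' example could be written down. The elements are explained step by step in the sections that follow, and the symbol names <code>adj</code>, <code>comp</code> and <code>sup</code> (adjective, comparative, superlative) are names I have assumed for illustration:<br />
<pre><br />
<!-- the paradigm: endings shared by happy, lazy, ... --><br />
<pardef n="happ/y__adj"><br />
  <e><p><l>y</l><r>y<s n="adj"/></r></p></e><br />
  <e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e><br />
  <e><p><l>iest</l><r>y<s n="adj"/><s n="sup"/></r></p></e><br />
</pardef><br />
<br />
<!-- the words: each one just points at the paradigm --><br />
<e lm="happy"><i>happ</i><par n="happ/y__adj"/></e><br />
<e lm="lazy"><i>laz</i><par n="happ/y__adj"/></e><br />
</pre><br />
Adding ''shy'' or ''naughty'' then costs one line each, which is the whole point of paradigms.<br />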
<br />
==Getting started==<br />
<br />
===Monolingual dictionaries===<br />
{{see-also|List of dictionaries|Incubator}}<br />
Let's start by making our first source language dictionary, Serbo-Croatian in our example. As mentioned above, this file will be called <code>apertium-sh-en.sh.dix</code>. The dictionary is an XML file. Fire up your text editor and type the following:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<br />
</dictionary><br />
</pre><br />
So far, the file only declares that we want to start a dictionary. For it to be useful, we need to add some more entries. The first is an alphabet, which defines the set of letters that may be used in the dictionary. For Serbo-Croatian it will look something like the following, containing all the letters of the Serbo-Croatian alphabet:<br />
<pre><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
</pre><br />
<br />
Place the alphabet below the <dictionary> tag.<br />
<br />
Next we need to define some symbols. Let's start off with the simple stuff, noun (n) in singular (sg) and plural (pl).<br />
<pre><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
</pre><br />
The symbol names do not have to be so small, in fact, they could just be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.<br />
<br />
Unfortunately, it isn't quite so simple. Nouns in Serbo-Croatian inflect for more than just number, they are also inflected for case, and have a gender. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).<br />
<br />
The next thing is to define a section for the paradigms,<br />
<pre><br />
<pardefs><br />
<br />
</pardefs><br />
</pre><br />
and a dictionary section:<br />
<pre><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</pre><br />
There are two types of sections, the first is a standard section, that contains words, enclitics, etc. The second type is an [[inconditional section]] which typically contains punctuation, and so forth. We don't have an inconditional section here.<br />
<br />
So, our file should now look something like:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<pardefs><br />
<br />
</pardefs><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').<br />
<br />
The first thing we need to do, as we have no prior paradigms, is to define a paradigm.<br />
<br />
Remember, we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.<br />
<br />
This may seem like a rather verbose way of describing it, but there are reasons for this and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,<br />
<br />
* e, is for entry.<br />
* p, is for pair.<br />
* l, is for left.<br />
* r, is for right.<br />
<br />
Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:<br />
<pre><br />
* gramofoni (left to right) gramofon<n><pl> (analysis)<br />
* gramofon<n><pl> (right to left) gramofoni (generation)<br />
</pre><br />
Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.<br />
<br />
The entry to put in the <section> will look like:<br />
<pre><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
</pre><br />
A quick run down on the abbreviations:<br />
<br />
* lm, is for lemma.<br />
* i, is for identity (the left and the right are the same).<br />
* par, is for paradigm.<br />
<br />
This entry states the lemma of the word (gramofon), the root (gramofon) and the paradigm with which it inflects (gramofon__n). The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which suffixes are added. This will become clearer later when we show an entry where the two are different.<br />
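<br />
Putting the pieces so far together, the Serbo-Croatian dictionary might now look like the sketch below. It only combines the skeleton, the paradigm and the entry shown above; the indentation and the comments are purely cosmetic additions.<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
  <sdefs><br />
    <sdef n="n"/><br />
    <sdef n="sg"/><br />
    <sdef n="pl"/><br />
  </sdefs><br />
  <pardefs><br />
    <!-- singular adds nothing to the root, plural adds 'i' --><br />
    <pardef n="gramofon__n"><br />
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
    </pardef><br />
  </pardefs><br />
  <section id="main" type="standard"><br />
    <!-- lemma 'gramofon', root 'gramofon', inflected with gramofon__n --><br />
    <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
  </section><br />
</dictionary><br />
</pre><br />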
<br />
We're now ready to test the dictionary. Save it as <code>apertium-sh-en.sh.dix</code>, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc). If you are new to Cygwin, note that you need to save the dictionary file inside the home folder (for example C:\Apertium\home\Username\filename_of_dictionary). Otherwise you will not be able to compile it.<br />
<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
</pre><br />
This should produce the output:<br />
<pre><br />
main@standard 12 12<br />
</pre><br />
As we are compiling it left to right, we're producing an analyser. Let's make a generator too.<br />
<pre><br />
$ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin<br />
</pre><br />
At this stage, the command should produce the same output.<br />
<br />
We can now test these. Run lt-proc on the analyser.<br />
<pre><br />
$ lt-proc sh-en.automorf.bin<br />
</pre><br />
Now try it out, type in gramofoni (gramophones), and see the output:<br />
<pre><br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. What if you want to use the more correct word 'record player'? Well, we'll explain how to do that later.<br />
<br />
You should now have two files in the directory:<br />
<br />
* apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and<br />
* apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.<br />
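<br />
As a rough sketch of what the English dictionary could contain, assuming the same skeleton as the Serbo-Croatian file and a paradigm named gramophone__n (the name this HOWTO uses for it later on), in which the plural simply adds 's':<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
  <sdefs><br />
    <sdef n="n"/><br />
    <sdef n="sg"/><br />
    <sdef n="pl"/><br />
  </sdefs><br />
  <pardefs><br />
    <!-- gramophone (sg), gramophone + s (pl) --><br />
    <pardef n="gramophone__n"><br />
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e><br />
    </pardef><br />
  </pardefs><br />
  <section id="main" type="standard"><br />
    <e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e><br />
  </section><br />
</dictionary><br />
</pre><br />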
<br />
===Bilingual dictionary===<br />
<br />
So we now have two morphological dictionaries; the next thing to make is the bilingual dictionary. This describes mappings between words. All dictionaries use the same format (which is specified in the DTD, dix.dtd).<br />
<br />
Create a new file, <code>apertium-sh-en.sh-en.dix</code> and add the basic skeleton:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we need to add an entry to translate between the two words. Something like:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
</pre><br />
Because there are a lot of these entries, they're typically written on one line to facilitate easier reading of the file. Again with the 'l' and 'r' right? Well, we compile it left to right to produce the Serbo-Croatian → English dictionary, and right to left to produce the English → Serbo-Croatian dictionary.<br />
<br />
So, once this is done, run the following commands:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
$ lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
</pre><br />
These commands generate the morphological analysers (automorf), the morphological generators (autogen) and the word lookups (autobil; the 'bil' is for "bilingual").<br />
<br />
===Transfer rules===<br />
<br />
So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rule files have their own DTD (transfer.dtd) which can be found in the Apertium package. If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages. For example the one described below would be useful for any null-subject language.<br />
<br />
Start out like all the others with a basic skeleton (<code>apertium-sh-en.sh-en.t1x</code>):<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<br />
</transfer><br />
</pre><br />
At the moment, because we're ignoring case, we just need to make a rule that takes the grammatical symbols input and outputs them again.<br />
<br />
We first need to define categories and attributes. Categories and attributes both allow us to group grammatical symbols. Categories allow us to group symbols for the purposes of matching (for example 'n.*' is all nouns). Attributes allow us to group a set of symbols that can be chosen from (for example, 'sg' and 'pl' may be grouped as an attribute 'number').<br />
<br />
Let's add the necessary sections:<br />
<pre><br />
<section-def-cats><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<br />
</section-def-attrs><br />
</pre><br />
As we're only inflecting nouns for singular and plural, we need to add a category for nouns and an attribute for number. Something like the following will suffice:<br />
<br />
Into section-def-cats add:<br />
<pre><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
</pre><br />
This catches all nouns (lemmas followed by <n> then anything) and refers to them as "nom" (we'll see how that's used later).<br />
<br />
Into the section section-def-attrs, add:<br />
<pre><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
</pre><br />
and then<br />
<pre><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
</pre><br />
The first defines the attribute nbr (number), which can be either singular (sg) or plural (pl).<br />
<br />
The second defines the attribute a_nom (attribute noun).<br />
<br />
Next we need to add a section for global variables:<br />
<pre><br />
<section-def-vars><br />
<br />
</section-def-vars><br />
</pre><br />
These variables are used to store or transfer attributes between rules. We need only one for now,<br />
<pre><br />
<def-var n="number"/><br />
</pre><br />
Finally, we need to add a rule, to take in the noun and then output it in the correct form. We'll need a rules section...<br />
<pre><br />
<section-rules><br />
<br />
</section-rules><br />
</pre><br />
Changing the pace from the previous examples, I'll just paste this rule, then go through it, rather than the other way round.<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
The first tag is obvious: it defines a rule. The second tag, pattern, basically says: "apply this rule, if this pattern is found". In this example the pattern consists of a single noun (defined by the category item nom). Note that patterns are matched longest-match first. So, say you have three rules: the first catches "<prn><vblex><n>", the second catches "<prn><vblex>" and the third catches "<n>". The pattern matched, and the rule executed, would be the first one.<br />
<br />
For each pattern, there is an associated action, which produces an associated output, out. The output, is a lexical unit (lu).<br />
<br />
The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item.<br />
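<br />
Assembled into one file, the pieces above might look like the following sketch. Nothing here is new: it is the category, the two attributes, the variable and the rule from this section, placed inside the skeleton in the order in which they were introduced. The indentation is only for readability.<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
  <section-def-cats><br />
    <def-cat n="nom"><br />
      <cat-item tags="n.*"/><br />
    </def-cat><br />
  </section-def-cats><br />
  <section-def-attrs><br />
    <def-attr n="nbr"><br />
      <attr-item tags="sg"/><br />
      <attr-item tags="pl"/><br />
    </def-attr><br />
    <def-attr n="a_nom"><br />
      <attr-item tags="n"/><br />
    </def-attr><br />
  </section-def-attrs><br />
  <section-def-vars><br />
    <def-var n="number"/><br />
  </section-def-vars><br />
  <section-rules><br />
    <!-- match a single noun and output its lemma, noun tag and number --><br />
    <rule><br />
      <pattern><br />
        <pattern-item n="nom"/><br />
      </pattern><br />
      <action><br />
        <out><br />
          <lu><br />
            <clip pos="1" side="tl" part="lem"/><br />
            <clip pos="1" side="tl" part="a_nom"/><br />
            <clip pos="1" side="tl" part="nbr"/><br />
          </lu><br />
        </out><br />
      </action><br />
    </rule><br />
  </section-rules><br />
</transfer><br />
</pre><br />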
<br />
Let's compile it and test it. Transfer rules are compiled with:<br />
<pre><br />
$ apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
This will generate a <code>sh-en.t1x.bin</code> file.<br />
<br />
Now we're ready to test our machine translation system. There is one crucial part missing, the part-of-speech (PoS) tagger, but that will be explained shortly. In the meantime we can test it as is:<br />
<br />
First, let's analyse a word, gramofoni:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin <br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, normally here the POS tagger would choose the right version based on the part of speech, but we don't have a POS tagger yet, so we can use this little gawk script (thanks to Sergio) that will just output the first item retrieved.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}'<br />
^gramofon<n><pl>$<br />
</pre><br />
Now let's process that with the transfer rule:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
</pre><br />
It will output:<br />
<pre><br />
^gramophone<n><pl>$^@<br />
</pre><br />
* 'gramophone' is the target language (side="tl") lemma (lem) at position 1 (pos="1").<br />
* '<n>' is the target language a_nom at position 1.<br />
* '<pl>' is the target language attribute of number (nbr) at position 1.<br />
<br />
Try commenting out one of these clip statements, recompiling and seeing what happens.<br />
<br />
So, now we have the output from the transfer, the only thing that remains is to generate the target-language inflected forms. For this, we use lt-proc, but in generation (-g), not analysis mode.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
<br />
gramophones\@<br />
</pre><br />
And c'est ça. You now have a machine translation system that translates a Serbo-Croatian noun into an English noun. Obviously this isn't very useful, but we'll get onto the more complex stuff soon. Oh, and don't worry about the '@' symbol, I'll explain that soon too.<br />
<br />
Think of a few other words that inflect the same as gramofon. How about adding those? We don't need to add any paradigms, just the entries in the main section of the monolingual and bilingual dictionaries.<br />
<br />
==Bring on the verbs==<br />
<br />
Ok, so we have a system that translates nouns, but that's pretty useless; we want to translate verbs too, and even whole sentences! How about we start with the verb 'to see'? In Serbo-Croatian this is videti. Serbo-Croatian is a null-subject language, which means that it doesn't typically use personal pronouns before the conjugated form of the verb. English is not. So, for example, 'I see' in English would be translated as vidim in Serbo-Croatian.<br />
<br />
* Vidim<br />
* see<p1><sg><br />
* I see<br />
<br />
Note: <code><p1></code> denotes first person<br />
<br />
This will be important when we come to write the transfer rule for verbs. Other examples of null-subject languages include: Spanish, Romanian and Polish. This also has the effect that while we only need to add the verb in the Serbo-Croatian morphological dictionary, we need to add both the verb, and the personal pronouns in the English morphological dictionary. We'll go through both of these.<br />
<br />
The other forms of the verb videti are: vidiš, vidi, vidimo, vidite, and vide; which correspond to: you see (singular), he sees, we see, you see (plural), and they see.<br />
<br />
There are two forms of you see, one is plural and formal singular (vidite) and the other is singular and informal (vidiš).<br />
<br />
We're going to try and translate the sentence: "Vidim gramofoni" into "I see gramophones". In the interests of space, we'll just add enough information to do the translation and will leave filling out the paradigms (adding the other conjugations of the verb) as an exercise to the reader.<br />
<br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too. There is no need to add the case information for now, though; we just add it as another option for the plural. So, in the paradigm definition, just copy the <e> entry for 'i' and change the 'i' to 'e'.<br />
<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
<e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
<br />
The first thing we need to do is add some more symbols. We need to first add a symbol for 'verb', which we'll call "vblex" (this means lexical verb, as opposed to modal verbs and other types). Verbs have 'person' and 'tense' along with number, so let's add a couple of those as well. We need to translate "I see", so for person we should add "p1", or 'first person', and for tense "pri", or 'present indicative'.<br />
<pre><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</pre><br />
After we've done this, as with the nouns, we add a paradigm for the verb conjugation. The first line will be:<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
</pre><br />
The '/' marks the point in the lemma where the endings (the parts between the <l> </l> tags) are attached.<br />
<br />
Then the inflection for first person singular:<br />
<pre><br />
<br />
<e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
<br />
</pre><br />
The 'im' denotes the ending (as in 'vidim'). It is necessary to add 'eti' to the <r> section because the entry in the main section supplies only the root 'vid', so the paradigm has to restore the rest of the lemma. The rest is fairly straightforward: 'vblex' is lexical verb, 'pri' is present indicative tense, 'p1' is first person and 'sg' is singular. We can also add the plural, which will be the same except for 'imo' instead of 'im' and 'pl' instead of 'sg'.<br />
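<br />
With both the singular and the plural entry in place, the whole paradigm could look like this sketch, which only combines the lines described above:<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
  <!-- vid + im  : first person singular, present indicative --><br />
  <e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
  <!-- vid + imo : first person plural, present indicative --><br />
  <e><p><l>imo</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />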
<br />
After this we need to add a lemma, paradigm mapping to the main section:<br />
<pre><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</pre><br />
Note: the content of <nowiki><i> </i></nowiki> is the root, not the lemma.<br />
<br />
That's the work on the Serbo-Croatian dictionary done for now. Let's compile it, then test it.<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
main@standard 23 25<br />
$ echo "vidim" | lt-proc sh-en.automorf.bin<br />
^vidim/videti<vblex><pri><p1><sg>$<br />
$ echo "vidimo" | lt-proc sh-en.automorf.bin<br />
^vidimo/videti<vblex><pri><p1><pl>$<br />
</pre><br />
Ok, so now we do the same for the English dictionary (remember to add the same symbol definitions here as you added to the Serbo-Croatian one).<br />
<br />
The paradigm is:<br />
<pre><br />
<pardef n="s/ee__vblex"><br />
</pre><br />
because the past tense is 'saw'. Now, we can do one of two things, we can add both first and second person, but they are the same form. In fact, all forms (except third person singular) of the verb 'to see' are 'see'. So instead we make one entry for 'see' and give it only the 'pri' symbol.<br />
<pre><br />
<br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e><br />
<br />
</pre><br />
and as always, an entry in the main section:<br />
<pre><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
</pre><br />
Then let's save, recompile and test:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
main@standard 18 19<br />
<br />
$ echo "see" | lt-proc en-sh.automorf.bin<br />
^see/see<vblex><pri>$<br />
</pre><br />
Now for the obligatory entry in the bilingual dictionary:<br />
<pre><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</pre><br />
(again, don't forget to add the sdefs from earlier)<br />
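<br />
For orientation, the bilingual dictionary at this point might look like the sketch below. It is the earlier skeleton with the three verb symbols added to the sdefs and both entries in the main section; the indentation is an addition for readability.<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
  <alphabet/><br />
  <sdefs><br />
    <sdef n="n"/><br />
    <sdef n="sg"/><br />
    <sdef n="pl"/><br />
    <sdef n="vblex"/><br />
    <sdef n="p1"/><br />
    <sdef n="pri"/><br />
  </sdefs><br />
<br />
  <section id="main" type="standard"><br />
    <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
    <e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
  </section><br />
</dictionary><br />
</pre><br />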
<br />
And recompile:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
main@standard 18 18<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
main@standard 18 18<br />
</pre><br />
Now to test:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
<br />
^see<vblex><pri><p1><sg>$^@<br />
</pre><br />
We get the analysis passed through correctly, but when we try and generate a surface form from this, we get a '#', like below:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#see\@<br />
</pre><br />
This '#' means that the generator cannot produce a surface form because its dictionary does not contain the lexical form it was given. Why is this?<br />
<br />
Basically the analyses don't match, the 'see' in the dictionary is see<vblex><pri>, but the see delivered by the transfer is see<vblex><pri><p1><sg>. The Serbo-Croatian side has more information than the English side requires. You can test this by adding the missing symbols to the English dictionary, and then recompiling, and testing again.<br />
<br />
However, a more paradigmatic way of taking care of this is by writing a rule. So, we open up the rules file (<code>apertium-sh-en.sh-en.t1x</code> in case you forgot).<br />
<br />
We need to add a new category for 'verb'.<br />
<pre><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
</pre><br />
We also need to add attributes for tense and for person. We'll make it really simple for now, you can add p2 and p3, but I won't in order to save space.<br />
<pre><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
</pre><br />
We should also add an attribute for verbs.<br />
<pre><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
</pre><br />
Now onto the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
Remember when you tried commenting out the 'clip' tags in the previous rule example and they disappeared from the transfer? Well, that's pretty much what we're doing here. We take in a verb with a full analysis, but only output a partial analysis (lemma + verb tag + tense tag).<br />
<br />
So now, if we recompile that, we get:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
^see<vblex><pri>$^@<br />
</pre><br />
and:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see\@<br />
</pre><br />
Try it with 'vidimo' (we see) to see if you get the correct output.<br />
<br />
Now try it with "vidim gramofone":<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see gramophones\@<br />
</pre><br />
<br />
==But what about personal pronouns?==<br />
<br />
Well, that's great, but we're still missing the personal pronoun that is necessary in English. In order to add it in, we first need to edit the English morphological dictionary.<br />
<br />
As before, the first thing to do is add the necessary symbols:<br />
<pre><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</pre><br />
Of the two symbols, prn is pronoun, and subj is subject (as in the subject of a sentence).<br />
<br />
Because there is no root, or 'lemma' for personal subject pronouns, we just add the pardef as follows:<br />
<pre><br />
<pardef n="prsubj__prn"><br />
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pardef><br />
</pre><br />
With 'prsubj' being 'personal subject'. The rest of them (You, We etc.) are left as an exercise to the reader.<br />
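<br />
As a starting point for that exercise, here is a sketch of the paradigm extended with 'we'. It reuses only symbols that are already defined (p1 and pl); pronouns such as 'you' would additionally need an sdef for a second-person symbol such as p2, which has not been added above.<br />
<pre><br />
<pardef n="prsubj__prn"><br />
  <e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
  <!-- first person plural, matching the analysis of 'vidimo' --><br />
  <e><p><l>we</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />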
<br />
We can add an entry to the main section as follows:<br />
<pre><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
</pre><br />
So, save, recompile and test, and we should get something like:<br />
<pre><br />
$ echo "I" | lt-proc en-sh.automorf.bin<br />
^I/PRPERS<prn><subj><p1><sg>$<br />
</pre><br />
<br />
(Note: it's in capitals because 'I' is in capitals).<br />
<br />
Now we need to amend the 'verb' rule to output the subject personal pronoun along with the correct verb form.<br />
<br />
First, add a category (this must be getting pretty pedestrian by now):<br />
<pre><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
</pre><br />
Now add the types of pronoun as attributes, we might as well add the 'obj' type as we're at it, although we won't need to use it for now:<br />
<pre><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</pre><br />
And now to input the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
This is pretty much the same rule as before, only we made a couple of small changes.<br />
<br />
We needed to output:<br />
<pre><br />
^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$<br />
</pre><br />
so that the generator could choose the right pronoun and the right form of the verb.<br />
<br />
So, a quick rundown:<br />
<br />
* <code><lit></code>, prints a literal string, in this case "prpers"<br />
* <code><lit-tag></code>, prints a literal tag, because we can't get the tags from the verb, we add these ourselves, "prn" for pronoun, and "subj" for subject.<br />
* <code><b/></code>, prints a blank, a space.<br />
<br />
Note that we retrieved the person and number for the pronoun, and the tense for the verb, directly from the verb.<br />
<br />
So, now if we recompile and test that again:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
Which, while it isn't exactly prize-winning prose (much like this HOWTO), is a fairly accurate translation.<br />
<br />
==So tell me about the record player (Multiwords)==<br />
<br />
While gramophone is an English word, it isn't the best translation. Gramophone is typically used for the very old kind, you know with the needle instead of the stylus, and no powered amplification. A better translation would be 'record player'. Although this is more than one word, we can treat it as if it is one word by using multiword (multipalabra) constructions.<br />
<br />
We don't need to touch the Serbo-Croatian dictionary, just the English one and the bilingual one, so open them up.<br />
<br />
The plural of 'record player' is 'record players', so it takes the same paradigm as gramophone (gramophone__n) — in that we just add 's'. All we need to do is add a new element to the main section.<br />
<pre><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</pre><br />
The only thing different about this is the use of the <b/> tag, although this isn't entirely new as we saw it in use in the rules file.<br />
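<br />
The bilingual dictionary needs a matching change so that gramofon maps to the new multiword. A sketch, assuming the entry takes the place of the earlier gramophone one and uses the same <b/> convention for the space:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>record<b/>player<s n="n"/></r></p></e><br />
</pre><br />
Both the English dictionary and the bilingual dictionary then have to be recompiled before testing.<br />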
<br />
So, recompile and test in the orthodox fashion:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see record players<br />
</pre><br />
Perfect. A big benefit of using multiwords is that you can translate idiomatic expressions verbatim, without having to do word-by-word translation. For example the English phrase, "at the moment" would be translated into Serbo-Croatian as "trenutno" (trenutak = ''moment'', trenutno being adverb of that) &mdash; it would not be possible to translate this English phrase word-by-word into Serbo-Croatian.<br />
<br />
==Dealing with minor variation==<br />
<br />
Serbo-Croatian is an umbrella term for several standard languages, so there are differences in pronunciation and orthography. There is a cool phonetic writing system, so you write how you speak. A notable example is the pronunciation of the Proto-Slavic vowel ''yat''. The word for dictionary can, for instance, be either "rječnik" (called Ijekavian) or "rečnik" (called Ekavian).<br />
<br />
===Analysis===<br />
<br />
There should be a fairly easy way of dealing with this, and there is, using paradigms again. Paradigms aren't only used for adding grammatical symbols, but they can also be used to replace any character/symbol with another. For example, here is a paradigm for accepting both "e" and "je" in the analysis. The paradigm should, as with the others go into the monolingual dictionary for Serbo-Croatian.<br />
<br />
<pre><br />
<pardef n="e_je__yat"><br />
<e><br />
<p><br />
<l>e</l><br />
<r>e</r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>je</l><br />
<r>e</r><br />
</p><br />
</e><br />
</pardef><br />
</pre><br />
<br />
Then in the "main section":<br />
<br />
<pre><br />
<e lm="rečnik"><i>r</i><par n="e_je__yat"/><i>čni</i><par n="rečni/k__n"/></e><br />
</pre><br />
<br />
This only allows us to analyse both forms however... more work is necessary if we want to generate both forms.<br />
<br />
===Generation===<br />
<br />
==See also==<br />
<br />
*[[Building dictionaries]]<br />
*[[Cookbook]] <br />
*[[Chunking]]<br />
*[[Contributing to an existing pair]]<br />
<br />
[[Category:Documentation in English]]<br />
[[Category:HOWTO]]<br />
[[Category:Writing dictionaries]]<br />
[[Category:Quickstart]]</div>Grégoirehttps://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=35834Apertium New Language Pair HOWTO2012-08-17T09:38:38Z<p>Grégoire: /* Monolingual dictionaries */ Specify the name of the file in the instructions</p>
<hr />
<div>{{TOCD}}<br />
Apertium New Language Pair HOWTO<br />
<br />
This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.<br />
<br />
It does not assume any knowledge of linguistics, or machine translation above the level of being able to distinguish nouns from verbs (and prepositions etc.)<br />
<br />
==Introduction==<br />
<br />
Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).<br />
<br />
For a more detailed introduction into how it all works, there are some excellent papers on the [[Publications]] page.<br />
<br />
==You will need==<br />
<br />
* [[lttoolbox]] (>= 3.0.0)<br />
* libxml utils (xmllint etc.)<br />
* apertium (>= 3.0.0)<br />
* a text editor (or a specialised XML editor if you prefer)<br />
<br />
This document will not describe how to install these packages. For more information, please see the documentation section of the Apertium website.<br />
<br />
==What does a language pair consist of?==<br />
<br />
Apertium is a shallow-transfer type machine translation system. Thus, it basically works on dictionaries and shallow transfer rules. In operation, shallow-transfer is distinguished from deep-transfer in that it doesn't do full syntactic parsing, the rules are typically operations on groups of lexical units, rather than operations on parse trees. At a basic level, there are three main dictionaries:<br />
# The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: <code>apertium-sh-en.sh.dix</code><br />
# The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: <code>apertium-sh-en.en.dix</code><br />
# Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: <code>apertium-sh-en.sh-en.dix</code><br />
<br />
In a translation pair, both languages can be either source or target for translation, these are relative terms.<br />
<br />
There are also two files for transfer rules. These are the rules that govern how words are re-ordered in sentences, e.g. ''chat noir'' → ''cat black'' → ''black cat''. They also govern agreement of gender, number etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:<br />
<br />
* language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: <code>apertium-sh-en.sh-en.t1x</code><br />
* language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: <code>apertium-sh-en.en-sh.t1x</code><br />
<br />
Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.<br />
<br />
==Language pair==<br />
<br />
As the file names may have suggested, this HOWTO uses the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, since the system works better for more closely related languages, but that shouldn't present a problem for the simple examples given here.<br />
<br />
==A brief note on terms==<br />
<br />
There are a number of terms that need to be understood before we continue.<br />
<br />
The first is ''lemma''. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word cats is ''cat''. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of to, e.g. the lemma of ''was'' would be ''be''.<br />
<br />
The second is ''symbol''. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:<br />
<br />
* <code><n></code>; for noun.<br />
* <code><pl></code>; for plural.<br />
<br />
Other examples of symbols are <code><sg></code> for singular, <code><p1></code> for first person, <code><pri></code> for present indicative, etc. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver comes from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <nowiki><s></nowiki> tags.<br />
<br />
The third word is ''paradigm''. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflect. In the morphological dictionary, lemmas (see above) are linked to paradigms that allow us to describe how a given lemma inflects without having to write out all of the endings.<br />
<br />
As an example of why this is useful, suppose we want to store the two adjectives ''happy'' and ''lazy''. Instead of storing two lots of the same thing:<br />
<br />
* happy, happ (y, ier, iest)<br />
* lazy, laz (y, ier, iest)<br />
<br />
We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy", etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.<br />
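<br />
To make the idea concrete, here is a minimal shell sketch of what a paradigm buys us: one list of endings is reused for every root. The function name <code>expand_happy_paradigm</code> is made up for illustration; it is not part of Apertium, which stores this information in <pardef> elements instead.<br />

```shell
# Hypothetical helper (not an Apertium tool): print every form of an
# adjective that inflects like "happy", given only its root.
expand_happy_paradigm() {
    root=$1
    # The endings are stored once and shared by all roots.
    for ending in y ier iest; do
        printf '%s%s\n' "$root" "$ending"
    done
}

expand_happy_paradigm happ
expand_happy_paradigm laz
```

Adding a new adjective then only means supplying a new root, which is exactly what a dictionary entry pointing at a paradigm does.<br />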
<br />
==Getting started==<br />
<br />
===Monolingual dictionaries===<br />
{{see-also|List of dictionaries|Incubator}}<br />
Let's start by making our first source language dictionary, Serbo-Croatian in our example. As mentioned above, this file will be called <code>apertium-sh-en.sh.dix</code>. The dictionary is an XML file. Fire up your text editor and type the following:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<br />
</dictionary><br />
</pre><br />
So far, the file only declares that we want to start a dictionary. In order for it to be useful, we need to add some more elements; the first is an alphabet. This defines the set of letters that may be used in the dictionary. For Serbo-Croatian, it will look something like the following, containing all the letters of the Serbo-Croatian alphabet:<br />
<pre><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
</pre><br />
<br />
Place the alphabet below the <dictionary> tag.<br />
<br />
Next we need to define some symbols. Let's start off with the simple stuff, noun (n) in singular (sg) and plural (pl).<br />
<pre><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
</pre><br />
The symbol names do not have to be so short; in fact, they could just be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.<br />
<br />
Unfortunately, it isn't quite so simple. Nouns in Serbo-Croatian inflect for more than just number: they are also inflected for case, and have a gender. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).<br />
<br />
The next thing is to define a section for the paradigms,<br />
<pre><br />
<pardefs><br />
<br />
</pardefs><br />
</pre><br />
and a dictionary section:<br />
<pre><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</pre><br />
There are two types of sections. The first is a standard section, which contains words, enclitics, etc. The second is an [[inconditional section]], which typically contains punctuation and so forth. We don't have an inconditional section here.<br />
<br />
So, our file should now look something like:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<pardefs><br />
<br />
</pardefs><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
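Since the dictionary is plain XML, it may be worth checking that it is well-formed before going further. The sketch below assumes that <code>xmllint</code> (from the libxml utils listed above) is on your path; the function name <code>check_dix</code> is just an illustrative choice.<br />

```shell
# Illustrative helper: report whether a dictionary file is well-formed XML.
# --noout suppresses the normal output, so only parse errors are printed.
check_dix() {
    if xmllint --noout "$1"; then
        echo "$1: well-formed"
    else
        echo "$1: not well-formed" >&2
        return 1
    fi
}

# Usage: check_dix apertium-sh-en.sh.dix
```

This only checks the XML syntax (matching tags, quoting and so on); it says nothing about whether the entries themselves make sense.<br />
<br />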
Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').<br />
<br />
The first thing we need to do, as we have no prior paradigms, is to define a paradigm.<br />
<br />
Remember, we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.<br />
<br />
This may seem like a rather verbose way of describing it, but there are reasons for this and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,<br />
<br />
* e, is for entry.<br />
* p, is for pair.<br />
* l, is for left.<br />
* r, is for right.<br />
<br />
Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:<br />
<pre><br />
* gramofoni (left to right) gramofon<n><pl> (analysis)<br />
* gramofon<n><pl> (right to left) gramofoni (generation)<br />
</pre><br />
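The two directions can be imitated with a toy lookup table in the shell. This is only a sketch of the idea, not how lttoolbox works internally (it compiles finite state machines); the function names <code>pairs</code>, <code>analyse</code> and <code>generate</code> are invented for this example.<br />

```shell
# Toy stand-in for a compiled dictionary: left side (surface form) and
# right side (lexical form), separated by a tab.
pairs() {
    printf 'gramofon\tgramofon<n><sg>\n'
    printf 'gramofoni\tgramofon<n><pl>\n'
}

# Left to right: look up a surface form, print its analysis.
analyse() {
    pairs | awk -F '\t' -v word="$1" '$1 == word { print $2 }'
}

# Right to left: look up a lexical form, print the surface form.
generate() {
    pairs | awk -F '\t' -v lexical="$1" '$2 == lexical { print $1 }'
}

analyse gramofoni            # prints gramofon<n><pl>
generate 'gramofon<n><pl>'   # prints gramofoni
```

The same table serves both purposes; only the direction in which it is read changes.<br />
<br />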
Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.<br />
<br />
The entry to put in the <section> will look like:<br />
<pre><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
</pre><br />
A quick run down on the abbreviations:<br />
<br />
* lm, is for lemma.<br />
* i, is for identity (the left and the right are the same).<br />
* par, is for paradigm.<br />
<br />
This entry states the lemma of the word (gramofon), the root (gramofon) and the paradigm with which it inflects (gramofon__n). The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which suffixes are added. This will become clearer later when we show an entry where the two are different.<br />
<br />
We're now ready to test the dictionary. Save it as <code>apertium-sh-en.sh.dix</code>, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc). If you are new to Cygwin, note that you need to save the dictionary file inside your home folder (for example C:\Apertium\home\Username\filename_of_dictionary); otherwise you will not be able to compile it.<br />
<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
</pre><br />
This should produce the output:<br />
<pre><br />
main@standard 12 12<br />
</pre><br />
As we are compiling it left to right, we're producing an analyser. Let's make a generator too.<br />
<pre><br />
$ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin<br />
</pre><br />
At this stage, the command should produce the same output.<br />
<br />
We can now test these. Run lt-proc on the analyser.<br />
<pre><br />
$ lt-proc sh-en.automorf.bin<br />
</pre><br />
Now try it out, type in gramofoni (gramophones), and see the output:<br />
<pre><br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. What if you want to use the more correct word 'record player'? Well, we'll explain how to do that later.<br />
<br />
You should now have two files in the directory:<br />
<br />
* apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and<br />
* apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.<br />
<br />
===Bilingual dictionary===<br />
<br />
So we now have two morphological dictionaries; the next thing to make is the bilingual dictionary. This describes mappings between words. All dictionaries use the same format (which is specified in the DTD, dix.dtd).<br />
<br />
Create a new file, apertium-sh-en.sh-en.dix and add the basic skeleton:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we need to add an entry to translate between the two words. Something like:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
</pre><br />
Because there are a lot of these entries, they're typically written on one line to facilitate easier reading of the file. Again with the 'l' and 'r' right? Well, we compile it left to right to produce the Serbo-Croatian → English dictionary, and right to left to produce the English → Serbo-Croatian dictionary.<br />
<br />
So, once this is done, run the following commands:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
$ lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
</pre><br />
These commands generate the morphological analysers (automorf), the morphological generators (autogen) and the word lookups (autobil); the "bil" is for "bilingual".<br />
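<br />
As these six commands follow a regular naming scheme, they can be produced by a small helper. The sketch below only prints the commands so that they can be reviewed first; <code>print_build_commands</code> is a hypothetical name, and the file naming simply copies the convention used in this HOWTO.<br />

```shell
# Illustrative helper: print the lt-comp commands for a pair of language
# codes, so they can be reviewed before being run.
print_build_commands() {
    src=$1
    trg=$2
    pair="apertium-$src-$trg"
    echo "lt-comp lr $pair.$src.dix $src-$trg.automorf.bin"
    echo "lt-comp rl $pair.$trg.dix $src-$trg.autogen.bin"
    echo "lt-comp lr $pair.$trg.dix $trg-$src.automorf.bin"
    echo "lt-comp rl $pair.$src.dix $trg-$src.autogen.bin"
    echo "lt-comp lr $pair.$src-$trg.dix $src-$trg.autobil.bin"
    echo "lt-comp rl $pair.$src-$trg.dix $trg-$src.autobil.bin"
}

print_build_commands sh en
```

Once the printed commands look right, piping the output to <code>sh</code> runs them.<br />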
<br />
===Transfer rules===<br />
<br />
So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rule files have their own DTD (transfer.dtd) which can be found in the Apertium package. If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages. For example the one described below would be useful for any null-subject language.<br />
<br />
Start out like all the others with a basic skeleton ( apertium-sh-en.sh-en.t1x ) :<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<br />
</transfer><br />
</pre><br />
At the moment, because we're ignoring case, we just need to make a rule that takes the grammatical symbols input and outputs them again.<br />
<br />
We first need to define categories and attributes. Both allow us to group grammatical symbols. Categories group symbols for the purposes of matching (for example, 'n.*' is all nouns). Attributes group a set of symbols that can be chosen from (for example, 'sg' and 'pl' may be grouped as an attribute 'number').<br />
<br />
Let's add the necessary sections:<br />
<pre><br />
<section-def-cats><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<br />
</section-def-attrs><br />
</pre><br />
As we're only inflecting nouns in singular and plural, we need to add a category for nouns and an attribute for number. Something like the following will suffice:<br />
<br />
Into section-def-cats add:<br />
<pre><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
</pre><br />
This catches all nouns (lemmas followed by <n> then anything) and refers to them as "nom" (we'll see how that's used later).<br />
<br />
Into the section section-def-attrs, add:<br />
<pre><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
</pre><br />
and then<br />
<pre><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
</pre><br />
The first defines the attribute nbr (number), which can be either singular (sg) or plural (pl).<br />
<br />
The second defines the attribute a_nom (attribute noun).<br />
<br />
Next we need to add a section for global variables:<br />
<pre><br />
<section-def-vars><br />
<br />
</section-def-vars><br />
</pre><br />
These variables are used to store or transfer attributes between rules. We need only one for now,<br />
<pre><br />
<def-var n="number"/><br />
</pre><br />
Finally, we need to add a rule, to take in the noun and then output it in the correct form. We'll need a rules section...<br />
<pre><br />
<section-rules><br />
<br />
</section-rules><br />
</pre><br />
Changing the pace from the previous examples, I'll just paste this rule, then go through it, rather than the other way round.<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
The first tag is obvious: it defines a rule. The second tag, pattern, basically says: "apply this rule if this pattern is found". In this example the pattern consists of a single noun (defined by the category item nom). Note that patterns are matched longest-match first. So, say you have three rules: the first catches "<prn><vblex><n>", the second catches "<prn><vblex>" and the third catches "<n>". For input containing all three in sequence, the pattern matched and the rule executed would be the first one.<br />
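<br />
The longest-match behaviour can be pictured with a toy shell function. This is a simplified model written for this explanation, not the algorithm used inside apertium-transfer: tag sequences are written as plain space-separated words, and <code>count_words</code> and <code>longest_match</code> are invented names.<br />

```shell
# Count the whitespace-separated words in a string.
count_words() {
    # Unquoted on purpose, so that the shell splits the string into words.
    set -- $1
    echo "$#"
}

# Toy model of longest-match selection: the first argument is the sequence
# of tags in the input, the remaining arguments are the patterns of the
# competing rules. The longest pattern matching the start of the input wins.
longest_match() {
    input=$1
    shift
    best=
    best_len=0
    for pattern in "$@"; do
        case "$input " in
            "$pattern "*)
                len=$(count_words "$pattern")
                if [ "$len" -gt "$best_len" ]; then
                    best=$pattern
                    best_len=$len
                fi
                ;;
        esac
    done
    echo "$best"
}

longest_match 'prn vblex n' 'prn vblex n' 'prn vblex' 'n'   # prints prn vblex n
```

If the input were a pronoun and a verb followed by something other than a noun, the same call would fall back to the two-item pattern.<br />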
<br />
For each pattern, there is an associated action, which produces an associated output, out. The output is a lexical unit (lu).<br />
<br />
The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item.<br />
<br />
Let's compile it and test it. Transfer rules are compiled with:<br />
<pre><br />
$ apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
Which will generate a <code>sh-en.t1x.bin</code> file.<br />
<br />
Now we're ready to test our machine translation system. There is one crucial part missing, the part-of-speech (PoS) tagger, but that will be explained shortly. In the meantime we can test it as is:<br />
<br />
First, let's analyse a word, gramofoni:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin <br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, normally here the POS tagger would choose the right version based on the part of speech, but we don't have a POS tagger yet, so we can use this little gawk script (thanks to Sergio) that will just output the first item retrieved.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}'<br />
^gramofon<n><pl>$<br />
</pre><br />
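Each unit in the analyser output has the shape <code>^surface form/analysis/analysis$</code>, so picking the first analysis amounts to a text substitution. If gawk is not available, something along the following lines may serve as a stand-in; it is a rough sketch that ignores escaped characters, and <code>first_analysis</code> is just a name chosen here.<br />

```shell
# Keep only the first analysis of each lexical unit:
#   ^surface/analysis1/analysis2$  becomes  ^analysis1$
# A rough sketch: it does not handle escaped '/', '^' or '$' characters.
first_analysis() {
    sed 's|\^[^/]*/\([^/$]*\)[^$]*\$|^\1$|g'
}

echo '^gramofoni/gramofon<n><pl>$' | first_analysis   # prints ^gramofon<n><pl>$
```

Like the gawk script, this is only a placeholder for a real tagger: it always takes the first analysis, whether or not it is the right one.<br />
<br />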
Now let's process that with the transfer rule:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
</pre><br />
It will output:<br />
<pre><br />
^gramophone<n><pl>$^@<br />
</pre><br />
* 'gramophone' is the target language (side="tl") lemma (lem) at position 1 (pos="1").<br />
* '<n>' is the target language a_nom at position 1.<br />
* '<pl>' is the target language attribute of number (nbr) at position 1.<br />
<br />
Try commenting out one of these clip statements, recompiling and seeing what happens.<br />
<br />
So, now we have the output from the transfer, the only thing that remains is to generate the target-language inflected forms. For this, we use lt-proc, but in generation (-g), not analysis mode.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
<br />
gramophones\@<br />
</pre><br />
And c'est ça. You now have a machine translation system that translates a Serbo-Croatian noun into an English noun. Obviously this isn't very useful, but we'll get onto the more complex stuff soon. Oh, and don't worry about the '@' symbol, I'll explain that soon too.<br />
<br />
Think of a few other words that inflect the same as gramofon. How about adding those? We don't need to add any paradigms, just the entries in the main section of the monolingual and bilingual dictionaries.<br />
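<br />
When adding several such words, the entries can be generated rather than typed. The following sketch prints a monolingual entry in the format shown earlier; <code>dix_entry</code> is a hypothetical helper, and the word telefon is only assumed here to inflect like gramofon.<br />

```shell
# Illustrative helper: print a monolingual dictionary entry for a word whose
# root is identical to its lemma, as is the case for gramofon.
dix_entry() {
    lemma=$1
    paradigm=$2
    printf '<e lm="%s"><i>%s</i><par n="%s"/></e>\n' "$lemma" "$lemma" "$paradigm"
}

dix_entry telefon gramofon__n
```

The printed line goes into the main section of the dictionary; a matching entry is still needed in the bilingual dictionary.<br />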
<br />
==Bring on the verbs==<br />
<br />
OK, so we have a system that translates nouns, but that's pretty useless on its own. We want to translate verbs too, and even whole sentences! How about we start with the verb ''to see''? In Serbo-Croatian this is ''videti''. Serbo-Croatian is a null-subject language, which means that it doesn't typically use personal pronouns before the conjugated form of the verb; English is not. So, for example, ''I see'' in English would be translated as ''vidim'' in Serbo-Croatian.<br />
<br />
* Vidim<br />
* see<p1><sg><br />
* I see<br />
<br />
Note: <code><p1></code> denotes first person<br />
<br />
This will be important when we come to write the transfer rule for verbs. Other examples of null-subject languages include: Spanish, Romanian and Polish. This also has the effect that while we only need to add the verb in the Serbo-Croatian morphological dictionary, we need to add both the verb, and the personal pronouns in the English morphological dictionary. We'll go through both of these.<br />
<br />
The other forms of the verb videti are: vidiš, vidi, vidimo, vidite, and vide; which correspond to: you see (singular), he sees, we see, you see (plural), and they see.<br />
<br />
There are two forms of you see, one is plural and formal singular (vidite) and the other is singular and informal (vidiš).<br />
<br />
We're going to try and translate the sentence: "Vidim gramofoni" into "I see gramophones". In the interests of space, we'll just add enough information to do the translation and will leave filling out the paradigms (adding the other conjugations of the verb) as an exercise to the reader.<br />
<br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too. There is no need to add the case information for now; we just add it as another option for plural. So, in the paradigm definition, copy the <e> entry for 'i' and change the 'i' to 'e' in the copy.<br />
<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
<e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
<br />
The first thing we need to do is add some more symbols. We need to first add a symbol for 'verb', which we'll call "vblex" (this means lexical verb, as opposed to modal verbs and other types). Verbs have 'person' and 'tense' along with number, so let's add a couple of those as well. We need to translate "I see", so for person we should add "p1", or 'first person', and for tense "pri", or 'present indicative'.<br />
<pre><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</pre><br />
After we've done this, just as with the nouns, we add a paradigm for the verb conjugation. The first line will be:<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
</pre><br />
The '/' marks the point in the lemma where the endings (the parts between the <l> </l> tags) are attached.<br />
<br />
Then the inflection for first person singular:<br />
<pre><br />
<br />
<e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
<br />
</pre><br />
The 'im' denotes the ending (as in 'vidim'). It is necessary to add 'eti' to the <r> section, as this is chopped off the lemma by the paradigm definition. The rest is fairly straightforward: 'vblex' is lexical verb, 'pri' is present indicative tense, 'p1' is first person and 'sg' is singular. We can also add the plural, which will be the same except with 'imo' instead of 'im' and 'pl' instead of 'sg'.<br />
<br />
After this we need to add a lemma-to-paradigm mapping to the main section:<br />
<pre><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</pre><br />
Note: the content of <nowiki><i> </i></nowiki> is the root, not the lemma.<br />
<br />
That's the work on the Serbo-Croatian dictionary done for now. Let's compile it, then test it.<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
main@standard 23 25<br />
$ echo "vidim" | lt-proc sh-en.automorf.bin<br />
^vidim/videti<vblex><pri><p1><sg>$<br />
$ echo "vidimo" | lt-proc sh-en.automorf.bin<br />
^vidimo/videti<vblex><pri><p1><pl>$<br />
</pre><br />
Ok, so now we do the same for the English dictionary (remember to add the same symbol definitions here as you added to the Serbo-Croatian one).<br />
<br />
The paradigm is:<br />
<pre><br />
<pardef n="s/ee__vblex"><br />
</pre><br />
The root is 's' because the past tense is 'saw'. Now, we can do one of two things. We could add entries for both first and second person, but they are the same form; in fact, all present-tense forms of the verb 'to see' (except third person singular) are 'see'. So instead we make one entry for 'see' and give it only the 'pri' symbol.<br />
<pre><br />
<br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e><br />
<br />
</pre><br />
and as always, an entry in the main section:<br />
<pre><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
</pre><br />
Then let's save, recompile and test:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
main@standard 18 19<br />
<br />
$ echo "see" | lt-proc en-sh.automorf.bin<br />
^see/see<vblex><pri>$<br />
</pre><br />
Now for the obligatory entry in the bilingual dictionary:<br />
<pre><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</pre><br />
(again, don't forget to add the sdefs from earlier)<br />
<br />
And recompile:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
main@standard 18 18<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
main@standard 18 18<br />
</pre><br />
Now to test:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
<br />
^see<vblex><pri><p1><sg>$^@<br />
</pre><br />
We get the analysis passed through correctly, but when we try and generate a surface form from this, we get a '#', like below:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#see\@<br />
</pre><br />
This '#' means that the generator cannot generate the correct lexical form because it does not contain it. Why is this?<br />
<br />
Basically, the analyses don't match: the 'see' in the dictionary is see<vblex><pri>, but the 'see' delivered by the transfer is see<vblex><pri><p1><sg>. The Serbo-Croatian side has more information than the English side requires. You can test this by adding the missing symbols to the English dictionary, and then recompiling, and testing again.<br />
<br />
However, a more paradigmatic way of taking care of this is by writing a rule. So, we open up the rules file (<code>apertium-sh-en.sh-en.t1x</code>, in case you forgot).<br />
<br />
We need to add a new category for 'verb'.<br />
<pre><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
</pre><br />
We also need to add attributes for tense and for person. We'll make it really simple for now; you can add p2 and p3, but I won't in order to save space.<br />
<pre><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
</pre><br />
We should also add an attribute for verbs.<br />
<pre><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
</pre><br />
Now onto the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
Remember when you tried commenting out the 'clip' tags in the previous rule example and the corresponding tags disappeared from the transfer output? Well, that's pretty much what we're doing here. We take in a verb with a full analysis, but only output a partial analysis (lemma + verb tag + tense tag).<br />
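<br />
The effect of the rule on a single lexical unit can be imitated with a text substitution. The sketch below is purely an illustration of the input and output; the real work is done by the transfer rule, and <code>drop_person_and_number</code> is a made-up name.<br />

```shell
# Illustration only: mimic the effect of the verb rule on a single lexical
# unit by deleting the person and number tags and keeping everything else.
drop_person_and_number() {
    sed -e 's/<p[123]>//g' -e 's/<sg>//g' -e 's/<pl>//g'
}

echo '^see<vblex><pri><p1><sg>$' | drop_person_and_number   # prints ^see<vblex><pri>$
```

What is left is exactly the analysis that the English dictionary contains, which is why the generator can now find it.<br />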
<br />
So now, if we recompile that, we get:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
^see<vblex><pri>$^@<br />
</pre><br />
and:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see\@<br />
</pre><br />
Try it with 'vidimo' (we see) to see if you get the correct output.<br />
<br />
Now try it with "vidim gramofone":<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see gramophones\@<br />
</pre><br />
<br />
==But what about personal pronouns?==<br />
<br />
Well, that's great, but we're still missing the personal pronoun that is necessary in English. In order to add it in, we first need to edit the English morphological dictionary.<br />
<br />
As before, the first thing to do is add the necessary symbols:<br />
<pre><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</pre><br />
Of the two symbols, prn is pronoun, and subj is subject (as in the subject of a sentence).<br />
<br />
Because there is no root, or 'lemma' for personal subject pronouns, we just add the pardef as follows:<br />
<pre><br />
<pardef n="prsubj__prn"><br />
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pardef><br />
</pre><br />
With 'prsubj' being 'personal subject'. The rest of them (You, We etc.) are left as an exercise to the reader.<br />
<br />
We can add an entry to the main section as follows:<br />
<pre><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
</pre><br />
So, save, recompile and test, and we should get something like:<br />
<pre><br />
$ echo "I" | lt-proc en-sh.automorf.bin<br />
^I/PRPERS<prn><subj><p1><sg>$<br />
</pre><br />
<br />
(Note: it's in capitals because 'I' is in capitals).<br />
<br />
Now we need to amend the 'verb' rule to output the subject personal pronoun along with the correct verb form.<br />
<br />
First, add a category (this must be getting pretty pedestrian by now):<br />
<pre><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
</pre><br />
Now add the types of pronoun as attributes. We might as well add the 'obj' type while we're at it, although we won't need to use it for now:<br />
<pre><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</pre><br />
And now to input the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
This is pretty much the same rule as before, only we made a couple of small changes.<br />
<br />
We needed to output:<br />
<pre><br />
^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$<br />
</pre><br />
so that the generator could choose the right pronoun and the right form of the verb.<br />
<br />
So, a quick rundown:<br />
<br />
* <code><lit></code>, prints a literal string, in this case "prpers"<br />
* <code><lit-tag></code>, prints a literal tag. Because we can't get these tags from the verb, we add them ourselves: "prn" for pronoun, and "subj" for subject.<br />
* <code><b/></code>, prints a blank, a space.<br />
<br />
Note that we retrieved the information for person and number directly from the verb.<br />
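<br />
To see what the rule does to one verb analysis, the same transformation can be written as a text substitution. This is only a sketch for illustration, assuming the tags arrive in the order verb, tense, person, number as in the examples above; <code>add_subject_pronoun</code> is an invented name.<br />

```shell
# Illustration only: rewrite one verb analysis the way the rule above does,
# emitting a subject pronoun that copies the verb's person and number.
# \1 = lemma, \2 = tense, \3 = person, \4 = number.
add_subject_pronoun() {
    sed 's/\^\([^<]*\)<vblex><\([a-z]*\)><\(p[123]\)><\([a-z]*\)>\$/^prpers<prn><subj><\3><\4>$ ^\1<vblex><\2>$/'
}

echo '^see<vblex><pri><p1><sg>$' | add_subject_pronoun
# prints ^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$
```

The person and number tags move to the pronoun, while the verb keeps only its tense, matching the two <lu> elements in the rule.<br />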
<br />
So, now if we recompile and test that again:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
Which, while it isn't exactly prize-winning prose (much like this HOWTO), is a fairly accurate translation.<br />
<br />
==So tell me about the record player (Multiwords)==<br />
<br />
While gramophone is an English word, it isn't the best translation. Gramophone is typically used for the very old kind, you know, with the needle instead of the stylus and no powered amplification. A better translation would be 'record player'. Although this is more than one word, we can treat it as if it were one word by using multiword (multipalabra) constructions.<br />
<br />
We don't need to touch the Serbo-Croatian dictionary, just the English one and the bilingual one, so open them up.<br />
<br />
The plural of 'record player' is 'record players', so it takes the same paradigm as gramophone (gramophone__n) — in that we just add 's'. All we need to do is add a new element to the main section.<br />
<pre><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</pre><br />
The only thing different about this is the use of the <b/> tag, although this isn't entirely new as we saw it in use in the rules file.<br />
<br />
So, recompile and test in the orthodox fashion:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see record players<br />
</pre><br />
Perfect. A big benefit of using multiwords is that you can translate idiomatic expressions as a single unit, without having to do word-by-word translation. For example, the English phrase "at the moment" would be translated into Serbo-Croatian as "trenutno" (trenutak = ''moment'', trenutno being the adverb derived from it); it would not be possible to translate this English phrase word-by-word into Serbo-Croatian.<br />
<br />
==Dealing with minor variation==<br />
<br />
Serbo-Croatian is an umbrella term for several standard languages, so there are differences in pronunciation and orthography. The writing system is phonetic, so you write as you speak, and differences in pronunciation show up in spelling. A notable example is the pronunciation of the Proto-Slavic vowel ''yat''. The word for dictionary can, for instance, be either "rječnik" (called Ijekavian) or "rečnik" (called Ekavian).<br />
<br />
===Analysis===<br />
<br />
There should be a fairly easy way of dealing with this, and there is, using paradigms again. Paradigms aren't only used for adding grammatical symbols; they can also be used to replace any character or symbol with another. For example, here is a paradigm for accepting both "e" and "je" in the analysis. The paradigm should, as with the others, go into the monolingual dictionary for Serbo-Croatian.<br />
<br />
<pre><br />
<pardef n="e_je__yat"><br />
<e><br />
<p><br />
<l>e</l><br />
<r>e</r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>je</l><br />
<r>e</r><br />
</p><br />
</e><br />
</pardef><br />
</pre><br />
<br />
Then in the "main section":<br />
<br />
<pre><br />
<e lm="rečnik"><i>r</i><par n="e_je__yat"/><i>čni</i><par n="rečni/k__n"/></e><br />
</pre><br />
<br />
This only allows us to analyse both forms, however. More work is necessary if we want to generate both forms.<br />
<br />
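To make the effect of the <code>e_je__yat</code> paradigm concrete, here is a small Python sketch that expands the entry by hand. It only models the surface-to-lexical mapping with a dictionary; the function name <code>expand_entry</code> is an assumption made for illustration, and the noun paradigm is left out to keep the example short.<br />

```python
# (left, right) pairs copied from the e_je__yat paradigm above:
# both surface spellings map to the same lexical "e".
E_JE_YAT = [("e", "e"), ("je", "e")]


def expand_entry(prefix, paradigm, suffix):
    """Map every surface spelling to its lexical (dictionary) form."""
    return {
        prefix + left + suffix: prefix + right + suffix
        for left, right in paradigm
    }


forms = expand_entry("r", E_JE_YAT, "čnik")
for surface, lexical in forms.items():
    print(surface, "->", lexical)
# rečnik -> rečnik
# rječnik -> rečnik
```

Read left to right, two spellings collapse into one lemma, which is fine for analysis. Read right to left, the single lemma corresponds to two spellings, which is why generation needs extra work.<br />
<br />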
===Generation===<br />
<br />
==See also==<br />
<br />
*[[Building dictionaries]]<br />
*[[Cookbook]] <br />
*[[Chunking]]<br />
*[[Contributing to an existing pair]]<br />
<br />
[[Category:Documentation in English]]<br />
[[Category:HOWTO]]<br />
[[Category:Writing dictionaries]]<br />
[[Category:Quickstart]]</div>Grégoirehttps://wiki.apertium.org/w/index.php?title=Apertium_New_Language_Pair_HOWTO&diff=35831Apertium New Language Pair HOWTO2012-08-17T08:54:54Z<p>Grégoire: /* What does a language pair consist of? */ Typographics details</p>
<hr />
<div>{{TOCD}}<br />
Apertium New Language Pair HOWTO<br />
<br />
This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.<br />
<br />
It does not assume any knowledge of linguistics or machine translation above the level of being able to distinguish nouns from verbs (and prepositions etc.).<br />
<br />
==Introduction==<br />
<br />
Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).<br />
<br />
For a more detailed introduction to how it all works, there are some excellent papers on the [[Publications]] page.<br />
<br />
==You will need==<br />
<br />
* [[lttoolbox]] (>= 3.0.0)<br />
* libxml utils (xmllint etc.)<br />
* apertium (>= 3.0.0)<br />
* a text editor (or a specialised XML editor if you prefer)<br />
<br />
This document will not describe how to install these packages. For more information, please see the documentation section of the Apertium website.<br />
<br />
==What does a language pair consist of?==<br />
<br />
Apertium is a shallow-transfer type machine translation system. Thus, it basically works on dictionaries and shallow transfer rules. In operation, shallow-transfer is distinguished from deep-transfer in that it doesn't do full syntactic parsing: the rules are typically operations on groups of lexical units, rather than operations on parse trees. At a basic level, there are three main dictionaries:<br />
# The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: <code>apertium-sh-en.sh.dix</code><br />
# The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: <code>apertium-sh-en.en.dix</code><br />
# Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: <code>apertium-sh-en.sh-en.dix</code><br />
<br />
In a translation pair, both languages can be either source or target for translation; these are relative terms.<br />
<br />
There are also two files for transfer rules. These are the rules that govern how words are re-ordered in sentences, e.g. ''chat noir'' → ''cat black'' → ''black cat''. They also govern agreement of gender, number etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:<br />
<br />
* language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: <code>apertium-sh-en.sh-en.t1x</code><br />
* language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: <code>apertium-sh-en.en-sh.t1x</code><br />
<br />
Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.<br />
<br />
==Language pair==<br />
<br />
As you may have gathered from the file names, this HOWTO will use the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, since the system works better for more closely related languages. This shouldn't present a problem for the simple examples given here.<br />
<br />
==A brief note on terms==<br />
<br />
There are a number of terms that will need to be understood before we continue.<br />
<br />
The first is ''lemma''. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word cats is ''cat''. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of to, e.g. the lemma of ''was'' would be ''be''.<br />
<br />
The second is ''symbol''. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:<br />
<br />
* <code><n></code>; for noun.<br />
* <code><pl></code>; for plural.<br />
<br />
Other examples of symbols are <sg> (singular), <p1> (first person), <pri> (present indicative), etc. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver comes from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <nowiki><s></nowiki> tags.<br />
<br />
The third is ''paradigm''. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflect. In the morphological dictionary, lemmas (see above) are linked to paradigms that allow us to describe how a given lemma inflects without having to write out all of the endings.<br />
<br />
An example of the utility of this is, if we wanted to store the two adjectives ''happy'' and ''lazy'', instead of storing two lots of the same thing:<br />
<br />
* happy, happ (y, ier, iest)<br />
* lazy, laz (y, ier, iest)<br />
<br />
We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy", etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.<br />
<br />
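The idea of a paradigm can be sketched in a few lines of Python. This is only an illustration of the concept, not of the dictionary format (which is introduced below); the names <code>HAPPY_PARADIGM</code>, <code>STEMS</code> and <code>inflect</code> are made up for the example.<br />

```python
# One shared list of endings: the "happy" paradigm.
HAPPY_PARADIGM = ["y", "ier", "iest"]

# Each lemma only stores its stem and relies on the shared paradigm.
STEMS = {
    "happy": "happ",
    "lazy": "laz",
    "naughty": "naught",
}


def inflect(lemma):
    """Attach every ending of the paradigm to the lemma's stem."""
    stem = STEMS[lemma]
    return [stem + ending for ending in HAPPY_PARADIGM]


print(inflect("lazy"))  # ['lazy', 'lazier', 'laziest']
```

Adding a new adjective of this type then costs one line in <code>STEMS</code> rather than a full list of forms.<br />
<br />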
==Getting started==<br />
<!-- Ur yezh indezeuropek eo ar brezhoneg --><br />
<br />
===Monolingual dictionaries===<br />
{{see-also|List of dictionaries|Incubator}}<br />
Let's start by making our first source language dictionary. The dictionary is an XML file. Fire up your text editor and type the following:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<br />
</dictionary><br />
</pre><br />
So far, the file only declares that we want to start a dictionary. For it to be useful, we need to add some more entries. The first is an alphabet, which defines the set of letters that may be used in the dictionary. For Serbo-Croatian it will look something like the following, containing all the letters of the Serbo-Croatian alphabet:<br />
<pre><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
</pre><br />
<br />
Place the alphabet below the <dictionary> tag.<br />
<br />
Next we need to define some symbols. Let's start off with the simple stuff, noun (n) in singular (sg) and plural (pl).<br />
<pre><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
</pre><br />
The symbol names do not have to be so short. In fact, they could be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.<br />
<br />
Unfortunately, it isn't quite so simple. Nouns in Serbo-Croatian inflect for more than just number, they are also inflected for case, and have a gender. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).<br />
<br />
The next thing is to define a section for the paradigms,<br />
<pre><br />
<pardefs><br />
<br />
</pardefs><br />
</pre><br />
and a dictionary section:<br />
<pre><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</pre><br />
There are two types of sections, the first is a standard section, that contains words, enclitics, etc. The second type is an [[inconditional section]] which typically contains punctuation, and so forth. We don't have an inconditional section here.<br />
<br />
So, our file should now look something like:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<pardefs><br />
<br />
</pardefs><br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').<br />
<br />
The first thing we need to do, as we have no prior paradigms, is to define a paradigm.<br />
<br />
Remember, we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.<br />
<br />
This may seem like a rather verbose way of describing it, but there are reasons for this and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,<br />
<br />
* e, is for entry.<br />
* p, is for pair.<br />
* l, is for left.<br />
* r, is for right.<br />
<br />
Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:<br />
<pre><br />
* gramofoni (left to right) gramofon<n><pl> (analysis)<br />
* gramofon<n><pl> (right to left) gramofoni (generation)<br />
</pre><br />
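The two compilation directions can be pictured with a toy Python model. The real lt-comp builds finite-state transducers; this sketch only uses plain dictionaries, and <code>compile_entries</code> is a hypothetical helper written for this illustration.<br />

```python
# Toy model of the two lt-comp directions, using dictionaries instead of
# finite-state transducers (a simplification made for this example).
ROOT = "gramofon"
PARADIGM = [("", "<n><sg>"), ("i", "<n><pl>")]  # (left, right) pairs of gramofon__n


def compile_entries(root, paradigm, direction):
    """Build a lookup table for 'lr' (analysis) or 'rl' (generation)."""
    table = {}
    for left, right in paradigm:
        surface = root + left    # what appears in running text
        lexical = root + right   # lemma plus grammatical symbols
        if direction == "lr":
            table[surface] = lexical
        elif direction == "rl":
            table[lexical] = surface
        else:
            raise ValueError("direction must be 'lr' or 'rl'")
    return table


analyser = compile_entries(ROOT, PARADIGM, "lr")
generator = compile_entries(ROOT, PARADIGM, "rl")
print(analyser["gramofoni"])         # gramofon<n><pl>
print(generator["gramofon<n><pl>"])  # gramofoni
```

The same pairs serve both purposes; only the reading direction changes.<br />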
Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.<br />
<br />
The entry to put in the <section> will look like:<br />
<pre><br />
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e><br />
</pre><br />
A quick run down on the abbreviations:<br />
<br />
* lm, is for lemma.<br />
* i, is for identity (the left and the right are the same).<br />
* par, is for paradigm.<br />
<br />
This entry states the lemma of the word (gramofon), the root (gramofon) and the paradigm with which it inflects (gramofon__n). The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which suffixes are added. This will become clearer later when we show an entry where the two are different.<br />
<br />
We're now ready to test the dictionary. Save it, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc). If you are new to Cygwin, note that you need to save the dictionary file inside your home folder (for example C:\Apertium\home\Username\filename_of_dictionary); otherwise you will not be able to compile it.<br />
<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
</pre><br />
Should produce the output:<br />
<pre><br />
main@standard 12 12<br />
</pre><br />
As we are compiling it left to right, we're producing an analyser. Let's make a generator too.<br />
<pre><br />
$ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin<br />
</pre><br />
At this stage, the command should produce the same output.<br />
<br />
We can now test these. Run lt-proc on the analyser.<br />
<pre><br />
$ lt-proc sh-en.automorf.bin<br />
</pre><br />
Now try it out, type in gramofoni (gramophones), and see the output:<br />
<pre><br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. What if you want to use the more correct word 'record player'? Well, we'll explain how to do that later.<br />
<br />
You should now have two files in the directory:<br />
<br />
* apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and<br />
* apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.<br />
<br />
===Bilingual dictionary===<br />
<br />
So we now have two morphological dictionaries; the next thing to make is the bilingual dictionary. This describes mappings between words. All dictionaries use the same format (which is specified in the DTD, dix.dtd).<br />
<br />
Create a new file, apertium-sh-en.sh-en.dix and add the basic skeleton:<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<dictionary><br />
<alphabet/><br />
<sdefs><br />
<sdef n="n"/><br />
<sdef n="sg"/><br />
<sdef n="pl"/><br />
</sdefs><br />
<br />
<section id="main" type="standard"><br />
<br />
</section><br />
</dictionary><br />
</pre><br />
Now we need to add an entry to translate between the two words. Something like:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e><br />
</pre><br />
Because there are a lot of these entries, they're typically written on one line to facilitate easier reading of the file. Again with the 'l' and 'r' right? Well, we compile it left to right to produce the Serbo-Croatian → English dictionary, and right to left to produce the English → Serbo-Croatian dictionary.<br />
<br />
So, once this is done, run the following commands:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
$ lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin<br />
<br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
</pre><br />
These commands generate the morphological analysers (automorf), the morphological generators (autogen) and the word lookups (autobil; the "bil" is for "bilingual").<br />
<br />
===Transfer rules===<br />
<br />
So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rule files have their own DTD (transfer.dtd) which can be found in the Apertium package. If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages. For example the one described below would be useful for any null-subject language.<br />
<br />
Start out like all the others with a basic skeleton ( apertium-sh-en.sh-en.t1x ) :<br />
<pre><br />
<?xml version="1.0" encoding="UTF-8"?><br />
<transfer><br />
<br />
</transfer><br />
</pre><br />
At the moment, because we're ignoring case, we just need to make a rule that takes the grammatical symbols input and outputs them again.<br />
<br />
We first need to define categories and attributes. Categories and attributes both allow us to group grammatical symbols. Categories allow us to group symbols for the purposes of matching (for example 'n.*' is all nouns). Attributes allow us to group a set of symbols that can be chosen from. For example, 'sg' and 'pl' may be grouped as an attribute 'number'.<br />
<br />
Let's add the necessary sections:<br />
<pre><br />
<section-def-cats><br />
<br />
</section-def-cats><br />
<section-def-attrs><br />
<br />
</section-def-attrs><br />
</pre><br />
As we're only inflecting nouns in singular and plural, we need to add a category for nouns and an attribute for number. Something like the following will suffice:<br />
<br />
Into section-def-cats add:<br />
<pre><br />
<def-cat n="nom"><br />
<cat-item tags="n.*"/><br />
</def-cat><br />
</pre><br />
This catches all nouns (lemmas followed by <n> then anything) and refers to them as "nom" (we'll see how that's used later).<br />
<br />
Into the section section-def-attrs, add:<br />
<pre><br />
<def-attr n="nbr"><br />
<attr-item tags="sg"/><br />
<attr-item tags="pl"/><br />
</def-attr><br />
</pre><br />
and then<br />
<pre><br />
<def-attr n="a_nom"><br />
<attr-item tags="n"/><br />
</def-attr><br />
</pre><br />
The first defines the attribute nbr (number), which can be either singular (sg) or plural (pl).<br />
<br />
The second defines the attribute a_nom (attribute noun).<br />
<br />
Next we need to add a section for global variables:<br />
<pre><br />
<section-def-vars><br />
<br />
</section-def-vars><br />
</pre><br />
These variables are used to store or transfer attributes between rules. We need only one for now,<br />
<pre><br />
<def-var n="number"/><br />
</pre><br />
Finally, we need to add a rule, to take in the noun and then output it in the correct form. We'll need a rules section...<br />
<pre><br />
<section-rules><br />
<br />
</section-rules><br />
</pre><br />
Changing the pace from the previous examples, I'll just paste this rule, then go through it, rather than the other way round.<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="nom"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_nom"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
<br />
The first tag is obvious: it defines a rule. The second tag, pattern, basically says: "apply this rule if this pattern is found". In this example the pattern consists of a single noun (defined by the category item nom). Note that patterns are matched longest-match first. So, say you have three rules: the first catches "<prn><vblex><n>", the second catches "<prn><vblex>" and the third catches "<n>". Given the input "<prn><vblex><n>", the pattern matched, and the rule executed, would be the first one.<br />
<br />
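The longest-match behaviour can be sketched as follows. This is a simplified model that compares lists of category names; the real transfer module compiles patterns into a finite-state machine, and the <code>RULES</code> table and <code>select_rule</code> function are assumptions made for the example.<br />

```python
# Three hypothetical rules, each with the sequence of categories it matches.
RULES = {
    "rule 1": ["prn", "vblex", "n"],
    "rule 2": ["prn", "vblex"],
    "rule 3": ["n"],
}


def select_rule(categories, start=0):
    """Return the rule with the longest pattern matching at position start."""
    best_name, best_length = None, 0
    for name, pattern in RULES.items():
        length = len(pattern)
        if categories[start:start + length] == pattern and length > best_length:
            best_name, best_length = name, length
    return best_name


print(select_rule(["prn", "vblex", "n"]))    # rule 1
print(select_rule(["prn", "vblex", "adj"]))  # rule 2
```

Rule 2 also matches the first input, but rule 1 wins because its pattern covers more lexical units.<br />
<br />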
For each pattern, there is an associated action, which produces an associated output, out. The output is a lexical unit (lu).<br />
<br />
The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item.<br />
<br />
Let's compile it and test it. Transfer rules are compiled with:<br />
<pre><br />
$ apertium-preprocess-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin<br />
</pre><br />
Which will generate a <code>sh-en.t1x.bin</code> file.<br />
<br />
Now we're ready to test our machine translation system. There is one crucial part missing, the part-of-speech (PoS) tagger, but that will be explained shortly. In the meantime we can test it as is:<br />
<br />
First, let's analyse a word, gramofoni:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin <br />
^gramofoni/gramofon<n><pl>$<br />
</pre><br />
Now, normally here the POS tagger would choose the right version based on the part of speech, but we don't have a POS tagger yet, so we can use this little gawk script (thanks to Sergio) that will just output the first item retrieved.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}'<br />
^gramofon<n><pl>$<br />
</pre><br />
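If gawk is not to hand, the same "take the first analysis" step can be approximated in Python. This is a sketch that assumes the simple stream format shown above (<code>^surface/analysis1/analysis2$</code>) and does not handle escaped characters or unknown-word markers; <code>pick_first_analysis</code> is a name chosen for this example.<br />

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def pick_first_analysis(stream):
    """Drop the surface form and keep only the first analysis of each unit."""
    def replace(match):
        first = match.group(2).split("/")[0]
        return "^" + first + "$"
    return LEXICAL_UNIT.sub(replace, stream)


print(pick_first_analysis("^gramofoni/gramofon<n><pl>$"))
# ^gramofon<n><pl>$
```

Text between lexical units (spaces, punctuation) is passed through unchanged, as in the gawk version.<br />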
Now let's process that with the transfer rule:<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
</pre><br />
It will output:<br />
<pre><br />
^gramophone<n><pl>$^@<br />
</pre><br />
* 'gramophone' is the target language (side="tl") lemma (lem) at position 1 (pos="1").<br />
* '<n>' is the target language a_nom at position 1.<br />
* '<pl>' is the target language attribute of number (nbr) at position 1.<br />
<br />
Try commenting out one of these clip statements, recompiling and seeing what happens.<br />
<br />
So, now we have the output from the transfer, the only thing that remains is to generate the target-language inflected forms. For this, we use lt-proc, but in generation (-g), not analysis mode.<br />
<pre><br />
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
<br />
gramophones\@<br />
</pre><br />
And c'est ça. You now have a machine translation system that translates a Serbo-Croatian noun into an English noun. Obviously this isn't very useful, but we'll get onto the more complex stuff soon. Oh, and don't worry about the '@' symbol, I'll explain that soon too.<br />
<br />
Think of a few other words that inflect the same as gramofon. How about adding those? We don't need to add any paradigms, just the entries in the main section of the monolingual and bilingual dictionaries.<br />
<br />
==Bring on the verbs==<br />
<br />
OK, so we have a system that translates nouns, but that's pretty useless; we want to translate verbs too, and even whole sentences! How about we start with the verb 'to see'? In Serbo-Croatian this is videti. Serbo-Croatian is a null-subject language, which means that it doesn't typically use personal pronouns before the conjugated form of the verb. English is not. So, for example, 'I see' in English would be translated as 'vidim' in Serbo-Croatian.<br />
<br />
* Vidim<br />
* see<p1><sg><br />
* I see<br />
<br />
Note: <code><p1></code> denotes first person<br />
<br />
This will be important when we come to write the transfer rule for verbs. Other examples of null-subject languages include: Spanish, Romanian and Polish. This also has the effect that while we only need to add the verb in the Serbo-Croatian morphological dictionary, we need to add both the verb, and the personal pronouns in the English morphological dictionary. We'll go through both of these.<br />
<br />
The other forms of the verb videti are: vidiš, vidi, vidimo, vidite, and vide; which correspond to: you see (singular), he sees, we see, you see (plural), and they see.<br />
<br />
There are two forms of you see, one is plural and formal singular (vidite) and the other is singular and informal (vidiš).<br />
<br />
We're going to try and translate the sentence: "Vidim gramofoni" into "I see gramophones". In the interests of space, we'll just add enough information to do the translation and will leave filling out the paradigms (adding the other conjugations of the verb) as an exercise to the reader.<br />
<br />
The astute reader will have realised by this point that we can't just translate vidim gramofoni because it is not a grammatically correct sentence in Serbo-Croatian. The correct sentence would be vidim gramofone, as the noun takes the accusative case. We'll have to add that form too. There is no need to add the case information for now; we just add it as another option for plural. So, in the paradigm definition, just copy the <e> entry for 'i' and change the 'i' to 'e' in the copy.<br />
<br />
<pre><br />
<pardef n="gramofon__n"><br />
<e><p><l/><r><s n="n"/><s n="sg"/></r></p></e><br />
<e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e><br />
<e><p><l>e</l><r><s n="n"/><s n="pl"/></r></p></e><br />
</pardef><br />
</pre><br />
<br />
The first thing we need to do is add some more symbols. We first need a symbol for 'verb', which we'll call "vblex" (this means lexical verb, as opposed to modal verbs and other types). Verbs have 'person' and 'tense' along with number, so let's add a couple of those as well. We need to translate "I see", so for person we should add "p1", or 'first person', and for tense "pri", or 'present indicative'.<br />
<pre><br />
<sdef n="vblex"/><br />
<sdef n="p1"/><br />
<sdef n="pri"/><br />
</pre><br />
After we've done this, as with the nouns, we add a paradigm for the verb conjugation. The first line will be:<br />
<pre><br />
<pardef n="vid/eti__vblex"><br />
</pre><br />
The '/' marks the point in the lemma where the endings (the parts between the <l> </l> tags) are attached.<br />
<br />
Then the inflection for first person singular:<br />
<pre><br />
<br />
<e><p><l>im</l><r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
<br />
</pre><br />
The 'im' denotes the ending (as in 'vidim'), it is necessary to add 'eti' to the <r> section, as this will be chopped off by the definition. The rest is fairly straightforward, 'vblex' is lexical verb, 'pri' is present indicative tense, 'p1' is first person and 'sg' is singular. We can also add the plural which will be the same, except 'imo' instead of 'im' and 'pl' instead of 'sg'.<br />
<br />
After this we need to add a lemma-to-paradigm mapping to the main section:<br />
<pre><br />
<e lm="videti"><i>vid</i><par n="vid/eti__vblex"/></e><br />
</pre><br />
Note: the content of <nowiki><i> </i></nowiki> is the root, not the lemma.<br />
<br />
That's the work on the Serbo-Croatian dictionary done for now. Let's compile it, then test it.<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin<br />
main@standard 23 25<br />
$ echo "vidim" | lt-proc sh-en.automorf.bin<br />
^vidim/videti<vblex><pri><p1><sg>$<br />
$ echo "vidimo" | lt-proc sh-en.automorf.bin<br />
^vidimo/videti<vblex><pri><p1><pl>$<br />
</pre><br />
Ok, so now we do the same for the English dictionary (remember to add the same symbol definitions here as you added to the Serbo-Croatian one).<br />
<br />
The paradigm is:<br />
<pre><br />
<pardef n="s/ee__vblex"><br />
</pre><br />
because the past tense is 'saw'. Now, we can do one of two things. We could add both first and second person, but they are the same form; in fact, all forms (except third person singular) of the verb 'to see' are 'see'. So instead we make one entry for 'see' and give it only the 'pri' symbol.<br />
<pre><br />
<br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/></r></p></e><br />
<br />
</pre><br />
and as always, an entry in the main section:<br />
<pre><br />
<e lm="see"><i>s</i><par n="s/ee__vblex"/></e><br />
</pre><br />
Then let's save, recompile and test:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin<br />
main@standard 18 19<br />
<br />
$ echo "see" | lt-proc en-sh.automorf.bin<br />
^see/see<vblex><pri>$<br />
</pre><br />
Now for the obligatory entry in the bilingual dictionary:<br />
<pre><br />
<e><p><l>videti<s n="vblex"/></l><r>see<s n="vblex"/></r></p></e><br />
</pre><br />
(again, don't forget to add the sdefs from earlier)<br />
<br />
And recompile:<br />
<pre><br />
$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin<br />
main@standard 18 18<br />
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin<br />
main@standard 18 18<br />
</pre><br />
Now to test:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
<br />
^see<vblex><pri><p1><sg>$^@<br />
</pre><br />
We get the analysis passed through correctly, but when we try and generate a surface form from this, we get a '#', like below:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
#see\@<br />
</pre><br />
This '#' means that the generator cannot produce a surface form, because its dictionary does not contain the lexical form it was given. Why is this?<br />
<br />
Basically, the analyses don't match: the 'see' in the dictionary is see<vblex><pri>, but the 'see' delivered by the transfer is see<vblex><pri><p1><sg>. The Serbo-Croatian side carries more information than the English side requires. You can test this by adding the missing symbols to the English dictionary, recompiling, and testing again.<br />
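<br />
As a quick experiment, the fuller entry in the <code>s/ee__vblex</code> paradigm could look something like the sketch below. It assumes that the <code>p1</code> and <code>sg</code> symbols are already declared as sdefs in the English dictionary, and it is only meant to confirm the diagnosis, not to serve as the recommended fix.<br />
<pre><br />
<e><p><l>ee</l><r>ee<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pre><br />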
<br />
However, a more paradigmatic way of taking care of this is to write a rule. So, we open up the rules file (<code>apertium-sh-en.sh-en.t1x</code>, in case you forgot).<br />
<br />
We need to add a new category for 'verb'.<br />
<pre><br />
<def-cat n="vrb"><br />
<cat-item tags="vblex.*"/><br />
</def-cat><br />
</pre><br />
We also need to add attributes for tense and for person. We'll keep it really simple for now; you can add p2 and p3, but I won't, in order to save space.<br />
<pre><br />
<def-attr n="temps"><br />
<attr-item tags="pri"/><br />
</def-attr><br />
<br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
</def-attr><br />
</pre><br />
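For reference, a fuller person attribute might look like the sketch below, which would take the place of the single-item version above. It assumes that the <code>p2</code> and <code>p3</code> symbols are also defined in the dictionaries.<br />
<pre><br />
<def-attr n="pers"><br />
<attr-item tags="p1"/><br />
<attr-item tags="p2"/><br />
<attr-item tags="p3"/><br />
</def-attr><br />
</pre><br />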
We should also add an attribute for verbs.<br />
<pre><br />
<def-attr n="a_verb"><br />
<attr-item tags="vblex"/><br />
</def-attr><br />
</pre><br />
Now onto the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
Remember when you tried commenting out the 'clip' tags in the previous rule example and they disappeared from the transfer output? That's pretty much what we're doing here: we take in a verb with a full analysis, but output only a partial analysis (lemma + verb tag + tense tag).<br />
<br />
So now, if we recompile that, we get:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin<br />
^see<vblex><pri>$^@<br />
</pre><br />
and:<br />
<pre><br />
$ echo "vidim" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see\@<br />
</pre><br />
Try it with 'vidimo' (we see) to see if you get the correct output.<br />
<br />
Now try it with "vidim gramofone":<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
see gramophones\@<br />
</pre><br />
<br />
==But what about personal pronouns?==<br />
<br />
Well, that's great, but we're still missing the personal pronoun that is necessary in English. In order to add it in, we first need to edit the English morphological dictionary.<br />
<br />
As before, the first thing to do is add the necessary symbols:<br />
<pre><br />
<sdef n="prn"/><br />
<sdef n="subj"/><br />
</pre><br />
Of the two symbols, prn is pronoun, and subj is subject (as in the subject of a sentence).<br />
<br />
Because there is no root, or 'lemma' for personal subject pronouns, we just add the pardef as follows:<br />
<pre><br />
<pardef n="prsubj__prn"><br />
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e><br />
</pardef><br />
</pre><br />
Here 'prsubj' stands for 'personal subject'. The rest of them (you, we, etc.) are left as an exercise for the reader.<br />
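<br />
As a hint for that exercise, one further entry inside the same pardef might look like the sketch below. It assumes a <code>pl</code> symbol for plural has been declared alongside <code>sg</code>; the other pronouns would follow the same pattern with their own person and number tags.<br />
<pre><br />
<e><p><l>we</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="pl"/></r></p></e><br />
</pre><br />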
<br />
We can add an entry to the main section as follows:<br />
<pre><br />
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e><br />
</pre><br />
So, save, recompile and test, and we should get something like:<br />
<pre><br />
$ echo "I" | lt-proc en-sh.automorf.bin<br />
^I/PRPERS<prn><subj><p1><sg>$<br />
</pre><br />
<br />
(Note: the lemma comes out in capitals because the input 'I' is in capitals.)<br />
<br />
Now we need to amend the 'verb' rule to output the subject personal pronoun along with the correct verb form.<br />
<br />
First, add a category (this must be getting pretty pedestrian by now):<br />
<pre><br />
<def-cat n="prpers"><br />
<cat-item lemma="prpers" tags="prn.*"/><br />
</def-cat><br />
</pre><br />
Now add the types of pronoun as attributes. We might as well add the 'obj' type while we're at it, although we won't need it for now:<br />
<pre><br />
<def-attr n="tipus_prn"><br />
<attr-item tags="prn.subj"/><br />
<attr-item tags="prn.obj"/><br />
</def-attr><br />
</pre><br />
And now to input the rule:<br />
<pre><br />
<rule><br />
<pattern><br />
<pattern-item n="vrb"/><br />
</pattern><br />
<action><br />
<out><br />
<lu><br />
<lit v="prpers"/><br />
<lit-tag v="prn"/><br />
<lit-tag v="subj"/><br />
<clip pos="1" side="tl" part="pers"/><br />
<clip pos="1" side="tl" part="nbr"/><br />
</lu><br />
<b/><br />
<lu><br />
<clip pos="1" side="tl" part="lem"/><br />
<clip pos="1" side="tl" part="a_verb"/><br />
<clip pos="1" side="tl" part="temps"/><br />
</lu><br />
</out><br />
</action><br />
</rule><br />
</pre><br />
This is pretty much the same rule as before, only we made a couple of small changes.<br />
<br />
We needed to output:<br />
<pre><br />
^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$<br />
</pre><br />
so that the generator could choose the right pronoun and the right form of the verb.<br />
<br />
So, a quick rundown:<br />
<br />
* <code><lit></code> prints a literal string, in this case "prpers".<br />
* <code><lit-tag></code> prints a literal tag. Because we can't get these tags from the verb, we add them ourselves: "prn" for pronoun and "subj" for subject.<br />
* <code><b/></code> prints a blank (a space).<br />
<br />
Note that we retrieved the person and number for the pronoun directly from the verb, just as we do for the verb's tense.<br />
<br />
So, now if we recompile and test that again:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see gramophones<br />
</pre><br />
Which, while it isn't exactly prize-winning prose (much like this HOWTO), is a fairly accurate translation.<br />
<br />
==So tell me about the record player (Multiwords)==<br />
<br />
While 'gramophone' is an English word, it isn't the best translation. 'Gramophone' is typically used for the very old kind, you know, with the needle instead of the stylus, and no powered amplification. A better translation would be 'record player'. Although this is more than one word, we can treat it as if it were one word by using multiword (multipalabra) constructions.<br />
<br />
We don't need to touch the Serbo-Croatian dictionary, just the English one and the bilingual one, so open them up.<br />
<br />
The plural of 'record player' is 'record players', so it takes the same paradigm as gramophone (gramophone__n) — in that we just add 's'. All we need to do is add a new element to the main section.<br />
<pre><br />
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e><br />
</pre><br />
The only thing different about this entry is the use of the <code><b/></code> tag, although this isn't entirely new, as we saw it in use in the rules file.<br />
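<br />
The bilingual dictionary needs a matching change. A minimal sketch, assuming the existing entry pairs the lemma 'gramofon' with 'gramophone' and uses the <code>n</code> symbol for nouns, would switch the right-hand side to the multiword lemma:<br />
<pre><br />
<e><p><l>gramofon<s n="n"/></l><r>record<b/>player<s n="n"/></r></p></e><br />
</pre><br />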
<br />
So, recompile and test in the orthodox fashion:<br />
<pre><br />
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \<br />
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \<br />
apertium-transfer apertium-sh-en.sh-en.t1x sh-en.t1x.bin sh-en.autobil.bin | \<br />
lt-proc -g sh-en.autogen.bin<br />
I see record players<br />
</pre><br />
Perfect. A big benefit of using multiwords is that you can translate idiomatic expressions as a single unit, without having to translate word by word. For example, the English phrase "at the moment" would be translated into Serbo-Croatian as "trenutno" (trenutak = ''moment'', trenutno being the adverb derived from it); it would not be possible to translate this English phrase word-by-word into Serbo-Croatian.<br />
<br />
==Dealing with minor variation==<br />
<br />
Serbo-Croatian is an umbrella term for several standard languages, so there are differences in pronunciation and orthography. The writing system is phonetic, so you write as you speak, and differences in pronunciation show up in the spelling. A notable example is the pronunciation of the Proto-Slavic vowel ''yat''. The word for dictionary can, for instance, be either "rječnik" (called Ijekavian) or "rečnik" (called Ekavian).<br />
<br />
===Analysis===<br />
<br />
There should be a fairly easy way of dealing with this, and there is: paradigms again. Paradigms aren't only used for adding grammatical symbols; they can also be used to replace any character or symbol with another. For example, here is a paradigm for accepting both "e" and "je" in the analysis. The paradigm should, as with the others, go into the monolingual dictionary for Serbo-Croatian.<br />
<br />
<pre><br />
<pardef n="e_je__yat"><br />
<e><br />
<p><br />
<l>e</l><br />
<r>e</r><br />
</p><br />
</e><br />
<e><br />
<p><br />
<l>je</l><br />
<r>e</r><br />
</p><br />
</e><br />
</pardef><br />
</pre><br />
<br />
Then in the "main section":<br />
<br />
<pre><br />
<e lm="rečnik"><i>r</i><par n="e_je__yat"/><i>čni</i><par n="rečni/k__n"/></e><br />
</pre><br />
<br />
This only allows us to analyse both forms, however; more work is necessary if we want to generate both forms.<br />
<br />
===Generation===<br />
<br />
==See also==<br />
<br />
*[[Building dictionaries]]<br />
*[[Cookbook]] <br />
*[[Chunking]]<br />
*[[Contributing to an existing pair]]<br />
<br />
[[Category:Documentation in English]]<br />
[[Category:HOWTO]]<br />
[[Category:Writing dictionaries]]<br />
[[Category:Quickstart]]</div>Grégoire