Difference between revisions of "Compiling dictionaries"

Revision as of 08:12, 8 December 2013

This page gives specific instructions for compiling dictionaries from various language pairs using different build procedures. It is principally for people who want to use the dictionaries as analysers or generators, and not as part of a language pair.

Standard lttoolbox dix compilation

Assuming you want to compile an lttoolbox XML dictionary file apertium-bn-en.bn.dix and save it as bn.analyser.bin:

$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
final@inconditional 8 75
main@standard 6403 13351

See lttoolbox for how to analyse text with this dictionary.
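
To try both compilation directions without a full language pair, the following sketch writes a tiny dictionary and compiles it as an analyser (lr) and as a generator (rl). The file names demo.dix, demo.analyser.bin and demo.generator.bin and the single entry "house" are made up for illustration and do not come from any Apertium pair. The compile step only runs if lt-comp and lt-proc are installed.

```shell
#!/bin/sh
# Write a minimal lttoolbox dictionary: one paradigm, one entry.
# All names in this file are illustrative, not taken from a real pair.
cat > demo.dix <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="house__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
  </section>
</dictionary>
EOF

if command -v lt-comp >/dev/null 2>&1 && command -v lt-proc >/dev/null 2>&1; then
    # lr reads the dictionary left to right: surface form -> analysis.
    lt-comp lr demo.dix demo.analyser.bin
    # rl reads it right to left: analysis -> surface form.
    lt-comp rl demo.dix demo.generator.bin

    # Analyse a surface form (analysis is the default mode of lt-proc).
    echo "houses" | lt-proc demo.analyser.bin
    # Generate a surface form from an analysis (-g selects generation).
    echo '^house<n><pl>$' | lt-proc -g demo.generator.bin
else
    echo "lttoolbox not found; wrote demo.dix only"
fi
```

The same dix file serves both purposes; only the direction argument passed to lt-comp changes.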

Metadix compilation

Some languages use non-standard extensions of the lttoolbox XML format. The term metadix covers any such extended dictionary. These are typically preprocessed with xsltproc, and the output is then compiled like a standard dix file. Since each analyser may use different XSLT scripts, there is no single procedure for compiling all metadix dictionaries.

In general, it is easier to compile a language module or language pair using the regular procedure for that module (see Installation). The examples below show only the essential commands needed to compile certain metadix analysers:

English—Catalan

The English dictionary in English—Catalan, along with the English dictionaries in some other pairs (e.g. English—Spanish and English—Galician), uses a metadix file. This needs to be preprocessed before it can be compiled with lt-comp. Note that lt-comp takes the output file before the optional .acx file; the output name en.analyser.bin below is only an example.


$ xsltproc buscaPar.xsl apertium-en-ca.en.metadix | uniq > tmp1gen.xsl
$ xsltproc tmp1gen.xsl apertium-en-ca.en.metadix > apertium-en-ca.en.dixtmp1
$ rm tmp1gen.xsl
$ apertium-validate-acx apertium-en-ca.en.acx
$ apertium-validate-dictionary apertium-en-ca.en.dixtmp1
$ lt-comp lr apertium-en-ca.en.dixtmp1 en.analyser.bin apertium-en-ca.en.acx
$ rm apertium-en-ca.en.dixtmp1

Breton—French

The French dictionary in Breton—French is a metadix file which needs to be preprocessed before it can be compiled with lt-comp.

$ xsltproc buscaPar.xsl apertium-br-fr.fr.metadix | uniq > tmp1gen.xsl
$ xsltproc tmp1gen.xsl apertium-br-fr.fr.metadix > apertium-br-fr.fr.dix
$ rm tmp1gen.xsl
$ apertium-validate-dictionary apertium-br-fr.fr.dix
$ lt-comp rl apertium-br-fr.fr.dix br-fr.autogen.bin
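
The metadix examples above share one pattern: buscaPar.xsl produces a dictionary-specific stylesheet, that stylesheet expands the metadix into a plain dix, and lt-comp compiles the result. A minimal sketch of that pattern as a shell function follows. The function name compile_metadix, its argument order and the intermediate file name are hypothetical and not part of Apertium. The sketch assumes buscaPar.xsl is in the current directory and leaves out the apertium-validate-acx step, so it may need adapting for a given pair.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Apertium).
# Usage: compile_metadix lr|rl FILE.metadix OUTPUT.bin [FILE.acx]
compile_metadix() {
    direction=$1
    metadix=$2
    output=$3
    acx=${4:-}
    # Intermediate expanded dictionary; this name is an assumption.
    expanded="${metadix%.metadix}.dix.tmp"
    stylesheet=$(mktemp) || return 1

    # Pass 1 builds the stylesheet, pass 2 expands the metadix, then the
    # result is validated and compiled. The .acx argument is passed to
    # lt-comp only when one was given. A failure of the first xsltproc is
    # hidden by the pipe, but it then surfaces as a failure in pass 2.
    xsltproc buscaPar.xsl "$metadix" | uniq > "$stylesheet" &&
        xsltproc "$stylesheet" "$metadix" > "$expanded" &&
        apertium-validate-dictionary "$expanded" &&
        lt-comp "$direction" "$expanded" "$output" ${acx:+"$acx"}
    status=$?

    # Remove temporary files whether or not compilation succeeded.
    rm -f "$stylesheet" "$expanded"
    return $status
}

# Example call mirroring the Breton and French commands above; it runs
# only when the tools and input files are actually present.
if [ -f apertium-br-fr.fr.metadix ] && [ -f buscaPar.xsl ] &&
   command -v xsltproc >/dev/null 2>&1 &&
   command -v apertium-validate-dictionary >/dev/null 2>&1 &&
   command -v lt-comp >/dev/null 2>&1; then
    compile_metadix rl apertium-br-fr.fr.metadix br-fr.autogen.bin
else
    echo "skipping example call: tools or input files not found"
fi
```

Chaining the steps with && stops at the first failing command, so a dictionary that fails validation is never compiled.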

Portuguese—Spanish

Occitan—Catalan

French—Spanish

HFST lexc/twol

HFST-based analysers/generators, like metadix, often have compilation procedures that differ from module to module. In general, it is easier to compile a language module or language pair using the regular procedure for that module (see Installation).

See also HFST.

See also