Compiling dictionaries
Latest revision as of 20:36, 19 March 2021
This page gives specific instructions for compiling dictionaries from various language pairs using different build procedures. It is principally for people who want to use the dictionaries as analysers or generators, rather than as part of a language pair.
==Standard lttoolbox dix compilation==
Assuming you want to compile an lttoolbox XML dictionary file <code>apertium-bn-en.bn.dix</code> into an analyser and save it as <code>bn.analyser.bin</code>:
<pre>
$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
final@inconditional 8 75
main@standard 6403 13351
</pre>
See [[lttoolbox]] or [[Using an lttoolbox dictionary]] for how to analyse text with this dictionary.
The "lr" argument means "compile from left to right"; for monolingual dictionaries, this produces an analyser. If you gave "rl" instead, it would produce a generator:
<pre>
$ lt-comp rl apertium-bn-en.bn.dix bn.generator.bin
final@inconditional 8 76
main@standard 10245 22228
</pre>
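The two invocations above differ only in the direction argument and the output name, so they can be wrapped in one helper. The following is a minimal sketch rather than part of any Apertium build system: the function name <code>compile_monodix</code> and the <code>LT_COMP</code> override variable are hypothetical and exist only so the function can be dry-run without lttoolbox installed.

```shell
#!/bin/sh
# Sketch: build an analyser and a generator from one monolingual dictionary.
# LT_COMP is a hypothetical override (not an lttoolbox feature); when unset,
# the real lt-comp is called.
compile_monodix() {
    dix=$1      # e.g. apertium-bn-en.bn.dix
    prefix=$2   # e.g. bn

    # lr: read entries left to right, giving an analyser.
    "${LT_COMP:-lt-comp}" lr "$dix" "$prefix.analyser.bin" &&
    # rl: read entries right to left, giving a generator.
    "${LT_COMP:-lt-comp}" rl "$dix" "$prefix.generator.bin"
}
```

Calling <code>compile_monodix apertium-bn-en.bn.dix bn</code> should produce <code>bn.analyser.bin</code> and <code>bn.generator.bin</code>. The results can then be used with <code>lt-proc bn.analyser.bin</code> for analysis and <code>lt-proc -g bn.generator.bin</code> for generation.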
==Compilation options and attributes==
Apart from the direction (lr vs rl), <code>lt-comp</code> also has some options to use values of the "alt", "v", "vl" or "vr" dix attributes in compilation:
<pre>
-v: set language variant
-a: set alternative (monodix)
-l: set left language variant (bidix)
-r: set right language variant (bidix)
</pre>
The "alt" attribute is used to specify that a given entry is only used in one language variant (like British vs US English spelling).
The "vr" attribute is used in bidix to say that the <code><r></code> entry is only relevant for that variant, when translating left-to-right (it is included when going the other direction).
The "vl" attribute is used in bidix to say that the <code><l></code> entry is only relevant for that variant, when translating right-to-left (it is included when going the other direction).
- TODO: what is "v"?
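As an illustration of how the variant options might be used, the sketch below compiles one bidix once per right-side variant. It assumes that each option takes the variant name as its value and is given before the direction argument, which is my reading of the <code>lt-comp</code> usage text and should be checked against the installed version. The function name, the <code>LT_COMP</code> override, the <code>xx-yy</code> file names and the variant names are all hypothetical.

```shell
#!/bin/sh
# Sketch: compile a bidix once for each right-side language variant.
# Assumption: "-r VARIANT" selects entries whose "vr" attribute matches.
# LT_COMP is a hypothetical override used only for dry runs.
compile_bidix_variants() {
    bidix=$1    # e.g. apertium-xx-yy.xx-yy.dix (hypothetical)
    prefix=$2   # e.g. xx-yy
    shift 2     # remaining arguments are variant names

    for variant in "$@"; do
        "${LT_COMP:-lt-comp}" -r "$variant" lr "$bidix" \
            "$prefix.$variant.autobil.bin" || return 1
    done
}
```

For example, <code>compile_bidix_variants apertium-xx-yy.xx-yy.dix xx-yy yy_A yy_B</code> would write one binary per variant. The same loop could pass <code>-l</code> when compiling in the <code>rl</code> direction.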
==ATT compilation==
<code>lt-comp</code> can also compile [[ATT format]] files; this includes the output of [[HFST|hfst-fst2txt]]:
<pre>
$ hfst-fst2txt bak.automorf.hfst > bak.att
$ lt-comp lr bak.att bak.automorf.bin
main@standard 8435 14708
final@inconditional 14 34
</pre>
Note that the final@inconditional section is "guessed" (anything starting with punctuation goes into that section).
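The two-step conversion above can be sketched as a small helper. The function name and the <code>HFST_FST2TXT</code> and <code>LT_COMP</code> override variables are hypothetical; the intermediate ATT file name is derived from the input name, which differs slightly from the <code>bak.att</code> used above.

```shell
#!/bin/sh
# Sketch: convert an HFST transducer to ATT text, then compile it with lt-comp.
# HFST_FST2TXT and LT_COMP are hypothetical overrides used only for dry runs.
compile_hfst_via_att() {
    hfst=$1                  # e.g. bak.automorf.hfst
    out=$2                   # e.g. bak.automorf.bin
    att=${hfst%.hfst}.att    # intermediate file: bak.automorf.att

    "${HFST_FST2TXT:-hfst-fst2txt}" "$hfst" > "$att" &&
    "${LT_COMP:-lt-comp}" lr "$att" "$out"
}
```

The intermediate <code>.att</code> file is kept, since it is plain text and can be handy for inspecting the transducer.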
==Metadix compilation==
Some languages use non-standard extensions of the lttoolbox XML format. The term '''[[metadix]]''' covers any such extended dictionary. These are typically processed with xsltproc, and the output of xsltproc is then compiled like a standard dix file. Since each analyser might use different XSLT scripts, there is no single procedure that compiles all metadix dictionaries.
In general, it is easier to compile a language module or language pair using the regular procedure for that module (see [[Installation]]). The examples below show only the essential commands needed to compile certain metadix analysers:
===English—Catalan===
The English dictionary in English—Catalan, along with the English dictionaries in some other pairs (e.g. English—Spanish and English—Galician), uses a [[metadix]] file. This needs to be preprocessed before it can be compiled with <code>lt-comp</code>.
<pre>
$ xsltproc buscaPar.xsl apertium-en-ca.en.metadix | uniq > tmp1gen.xsl
$ xsltproc tmp1gen.xsl apertium-en-ca.en.metadix > apertium-en-ca.en.dixtmp1
$ rm tmp1gen.xsl
$ apertium-validate-acx apertium-en-ca.en.acx
$ apertium-validate-dictionary apertium-en-ca.en.dixtmp1
$ lt-comp lr apertium-en-ca.en.dixtmp1 en-ca.automorf.bin apertium-en-ca.en.acx
$ rm apertium-en-ca.en.dixtmp1
</pre>
Note that <code>lt-comp</code> expects the output file before the optional <code>.acx</code> file. The output name <code>en-ca.automorf.bin</code> is an assumption here, following the usual Apertium naming.
===Breton—French===
The French dictionary in Breton—French is a [[metadix]] file which needs to be preprocessed before it can be compiled with <code>lt-comp</code>.
<pre>
xsltproc buscaPar.xsl apertium-br-fr.fr.metadix | uniq > tmp1gen.xsl
xsltproc tmp1gen.xsl apertium-br-fr.fr.metadix > apertium-br-fr.fr.dix
rm tmp1gen.xsl
apertium-validate-dictionary apertium-br-fr.fr.dix
lt-comp rl apertium-br-fr.fr.dix br-fr.autogen.bin
</pre>
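The metadix recipes on this page share the same shape: two <code>xsltproc</code> passes, validation, then <code>lt-comp</code>. The following is a minimal sketch of that shape as a reusable function, assuming <code>buscaPar.xsl</code> is in the current directory as in the commands above. The function name and the <code>XSLTPROC</code>, <code>VALIDATE</code> and <code>LT_COMP</code> override variables are hypothetical, and other metadix dictionaries may need different stylesheets.

```shell
#!/bin/sh
# Sketch: expand a metadix into a plain dix, validate it and compile it.
# XSLTPROC, VALIDATE and LT_COMP are hypothetical overrides for dry runs.
compile_metadix() {
    metadix=$1     # e.g. apertium-br-fr.fr.metadix
    direction=$2   # lr (analyser) or rl (generator)
    out=$3         # e.g. br-fr.autogen.bin
    dix=${metadix%.metadix}.dix
    tmpxsl=$(mktemp) || return 1

    # Pass 1 generates a stylesheet from the metadix; pass 2 applies it.
    # The exit status of pass 1 is that of uniq, so an xsltproc failure
    # there is only noticed when pass 2 fails.
    "${XSLTPROC:-xsltproc}" buscaPar.xsl "$metadix" | uniq > "$tmpxsl" &&
    "${XSLTPROC:-xsltproc}" "$tmpxsl" "$metadix" > "$dix" &&
    "${VALIDATE:-apertium-validate-dictionary}" "$dix" &&
    "${LT_COMP:-lt-comp}" "$direction" "$dix" "$out"
    status=$?

    rm -f "$tmpxsl"    # always remove the generated stylesheet
    return $status
}
```

For the Breton—French case this would be called as <code>compile_metadix apertium-br-fr.fr.metadix rl br-fr.autogen.bin</code>. Unlike the English—Catalan commands, the sketch keeps the expanded <code>.dix</code> file and does not handle an <code>.acx</code> file.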
===Portuguese—Spanish===
===Occitan—Catalan===
===French—Spanish===
==HFST lexc/twol==
HFST-based analysers/generators, like metadix, often have compilation procedures that differ from module to module. In general, it is easier to compile a language module or language pair using the regular procedure for that module (see Installation).