Modes
Latest revision as of 07:29, 23 December 2024
There are a few ways you can use translation pipelines in Apertium. One of them is Modes files. Modes files (typically called modes.xml) are XML files (see modes.dtd) which specify which programs should be run and in what order. The pipelines specified in the XML are converted to one or more files, each containing an individual shell pipeline. Normally each linguistic package has one of these XML files, which specifies the various ways in which you can use the data to perform translations.
See the modes file from the Spanish-Catalan pair for an example. Modes which do not say install="yes" are only usable with the -d switch to apertium; these are typically used during development (e.g. ca-es-anmor, which only performs morphological analysis on Catalan and nothing else).
See Writing_Makefiles#Modes on how to ensure modes that say install="yes" are installed.
The program apertium-gen-modes turns modes.xml descriptions into individual executable pipelines in modes/*.mode.
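To make the conversion concrete, here is a minimal Python sketch of the idea: read the <mode> elements and join each <program> with its <file> arguments into one shell pipeline string. It is not the apertium-gen-modes implementation; the function name modes_to_pipelines, the data_dir parameter, and the tiny sample modes.xml are assumptions made for illustration, and real .mode files may differ in details such as program paths.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up modes.xml used only for this illustration.
MODES_XML = """
<modes>
  <mode name="sme-nob" install="yes">
    <pipeline>
      <program name="hfst-proc -w">
        <file name="sme-nob.automorf.hfst"/>
      </program>
      <program name="apertium-pretransfer"/>
      <program name="lt-proc $1">
        <file name="sme-nob.autogen.bin"/>
      </program>
    </pipeline>
  </mode>
</modes>
"""


def modes_to_pipelines(xml_text, data_dir="."):
    """Return {mode name: shell pipeline string} for every <mode> element."""
    pipelines = {}
    root = ET.fromstring(xml_text)
    for mode in root.iter("mode"):
        steps = []
        for program in mode.iter("program"):
            # Each <file> becomes an argument, prefixed with the data directory.
            files = [f"{data_dir}/{f.get('name')}" for f in program.iter("file")]
            steps.append(" ".join([program.get("name")] + files))
        # The programs are chained with pipes, in document order.
        pipelines[mode.get("name")] = " | ".join(steps)
    return pipelines


for name, pipeline in modes_to_pipelines(MODES_XML, data_dir="/d").items():
    print(f"{name}.mode: {pipeline}")
```

The point of the sketch is only the mapping from XML structure to pipeline order: one <program> per pipeline step, with its files appended as arguments.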
Naming conventions
The main translation mode is always named "from-to", e.g. "sme-nob". The debug modes each have a suffix, e.g. "sme-nob-morph".
Common debug mode names:
- -anmor or -morph run the morphological analysers
  - these are used equivalently
- -disam runs up until morphological (CG) disambiguation
- -syntax runs up until syntactical (CG) disambiguation
- -tagger runs up until probabilistic (apertium-tagger) disambiguation (or, if no .prob, up until the last disambiguation step)
- -autoseq runs up until separable multiwords
- -biltrans runs up until the bidix
- -lex runs up until lexical selection
- -anaph runs up until anaphora resolution
- -transfer runs up until (1-stage) transfer
- -chunker runs up until the first stage of 3-or-more-stage transfer
- -interchunk runs up until the second stage of 3-stage transfer
  - -interchunk1 and -interchunk2 are used when the pair has 4-stage transfer
- -postchunk runs up until the last stage of transfer
- -dgen runs up until generation (using lt-proc -d to include debug symbols)
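The naming scheme above can be sketched as a small Python helper that splits a mode name into language pair and debug suffix. This is only an illustration: the names split_mode_name, describe_mode and DEBUG_SUFFIXES are hypothetical, and the helper assumes that language codes themselves contain no hyphens.

```python
# Suffix descriptions taken from the list of common debug mode names.
DEBUG_SUFFIXES = {
    "anmor": "morphological analysis",
    "morph": "morphological analysis",
    "disam": "morphological (CG) disambiguation",
    "syntax": "syntactical (CG) disambiguation",
    "tagger": "probabilistic (apertium-tagger) disambiguation",
    "autoseq": "separable multiwords",
    "biltrans": "the bidix",
    "lex": "lexical selection",
    "anaph": "anaphora resolution",
    "transfer": "(1-stage) transfer",
    "chunker": "the first stage of 3-or-more-stage transfer",
    "interchunk": "the second stage of 3-stage transfer",
    "postchunk": "the last stage of transfer",
    "dgen": "generation (with debug symbols)",
}


def split_mode_name(mode_name):
    """Split 'from-to[-suffix]' into (source, target, suffix or None)."""
    parts = mode_name.split("-")
    if len(parts) < 2:
        raise ValueError(f"not a 'from-to' mode name: {mode_name!r}")
    # Assumption: language codes contain no hyphens.
    suffix = "-".join(parts[2:]) or None
    return parts[0], parts[1], suffix


def describe_mode(mode_name):
    """Return a one-line description of what a mode runs."""
    source, target, suffix = split_mode_name(mode_name)
    if suffix is None:
        return f"{source}-{target}: full translation"
    stage = DEBUG_SUFFIXES.get(suffix, "an unknown stage")
    return f"{source}-{target}: runs up until {stage}"


print(describe_mode("sme-nob"))
print(describe_mode("sme-nob-disam"))
```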
Autogenerating debug modes
It's a drag to create all the regular install=no debug modes all the time (e.g. foo-bar-tagger, foo-bar-chunker, etc.).
If you put gendebug="yes" on a <mode> element, debug modes will be created automatically for you! If a mode comes out with the wrong suffix, you can override the guess by adding the debug-suff attribute to the corresponding <program> element.
For example, the following entry
    <mode name="sme-nob" install="yes" gendebug="yes">
      <pipeline>
        <program name="hfst-proc --weight-classes 1 -w -p">
          <file name="sme-nob.automorf.hfst"/>
        </program>
        <program name="cg-proc" debug-suff="disam">
          <file name="sme-nob.mor.rlx.bin"/>
        </program>
        <program name="cg-proc -1 -n -w" debug-suff="syntax">
          <file name="sme-nob.syn.rlx.bin"/>
        </program>
        <program name="apertium-pretransfer"/>
        <program name="lt-proc -b">
          <file name="sme-nob.autobil.bin"/>
        </program>
        <program name="cg-proc" debug-suff="lex">
          <file name="sme-nob.lex.bin"/>
        </program>
        <program name="apertium-transfer -b">
          <file name="apertium-sme-nob.sme-nob.t1x"/>
          <file name="sme-nob.t1x.bin"/>
        </program>
        <program name="apertium-interchunk" debug-suff="interchunk1">
          <file name="apertium-sme-nob.sme-nob.t2x"/>
          <file name="sme-nob.t2x.bin"/>
        </program>
        <program name="apertium-interchunk" debug-suff="interchunk2">
          <file name="apertium-sme-nob.sme-nob.t3x"/>
          <file name="sme-nob.t3x.bin"/>
        </program>
        <program name="lt-proc $1">
          <file name="sme-nob.autogen.bin"/>
        </program>
      </pipeline>
    </mode>
would make all these modes automatically:
- sme-nob.mode
- sme-nob-morph.mode
- sme-nob-disam.mode
- sme-nob-syntax.mode
- sme-nob-pretransfer.mode
- sme-nob-biltrans.mode
- sme-nob-lex.mode
- sme-nob-chunker.mode
- sme-nob-interchunk1.mode
- sme-nob-interchunk2.mode
- sme-nob-postchunk.mode
- sme-nob-dgen.mode
(but only install the sme-nob.mode), where -disam, -syntax, -interchunk1, -interchunk2 are manually specified names, and the rest are default names based on program names.
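The idea behind the generated names can be sketched in Python: each debug mode is the pipeline cut off after one program, named with either the explicit debug-suff or a default guessed from the program. The default table in guess_suffix is an assumption inferred from the names listed above, not the actual rules used by apertium-gen-modes, and the sketch leaves out details such as switching the generator to lt-proc -d for the -dgen mode. The short pipeline at the bottom is a made-up example.

```python
def guess_suffix(command):
    """Guess a debug suffix from a program command (hypothetical defaults)."""
    tokens = command.split()
    program, flags = tokens[0], tokens[1:]
    if program == "lt-proc":
        if "-b" in flags:
            return "biltrans"
        if "$1" in flags:
            return "dgen"
        return "morph"
    defaults = {
        "hfst-proc": "morph",
        "apertium-tagger": "tagger",
        "apertium-pretransfer": "pretransfer",
        "apertium-transfer": "chunker",
        "apertium-postchunk": "postchunk",
    }
    # Fall back to the program name itself when nothing better is known.
    return defaults.get(program, program)


def debug_modes(mode_name, programs):
    """Map each mode name to the list of commands it would run.

    `programs` is a list of (command, debug_suff or None) pairs in
    pipeline order.
    """
    modes = {mode_name: [command for command, _ in programs]}
    for index, (command, debug_suff) in enumerate(programs):
        suffix = debug_suff or guess_suffix(command)
        # A debug mode runs the pipeline up to and including this program.
        modes[f"{mode_name}-{suffix}"] = [c for c, _ in programs[: index + 1]]
    return modes


PROGRAMS = [
    ("hfst-proc -w", None),
    ("cg-proc", "disam"),
    ("apertium-pretransfer", None),
    ("lt-proc -b", None),
    ("lt-proc $1", None),
]

for name, commands in debug_modes("sme-nob", PROGRAMS).items():
    print(f"{name}.mode: {' | '.join(commands)}")
```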
Modes hacks
Statistics mode
In order to get some statistical information about translations made using Apertium, we've hacked the main translation mode, pausing the pipeline just after disambiguation and saving the output into a temp file. After that, the pipeline is resumed with the temp file as stdin.
As an example, here is the broken (that is, split in two) pipeline for ca-es, installed as ca-es-estadistiques.mode:
    /usr/local/bin/lt-proc /usr/local/share/apertium/apertium-es-ca/ca-es.automorf.bin > $LOGSDIR$SEC.tmp;
    /usr/local/bin/apertium-tagger -g /usr/local/share/apertium/apertium-es-ca/ca-es.prob < $LOGSDIR$SEC.tmp \
    |/usr/local/bin/apertium-pretransfer|/usr/local/bin/apertium-transfer /usr/local/share/apertium/apertium-es-ca/apertium-es-ca.trules-ca-es.xml \
    /usr/local/share/apertium/apertium-es-ca/trules-ca-es.bin /usr/local/share/apertium/apertium-es-ca/ca-es.autobil.bin \
    |/usr/local/bin/lt-proc $1 /usr/local/share/apertium/apertium-es-ca/ca-es.autogen.bin \
    |/usr/local/bin/lt-proc -p /usr/local/share/apertium/apertium-es-ca/ca-es.autogen.bin
An example of calling apertium with this mode would be the following:
    LOGSDIR=~/logs/apertium/; SEC=`date +%s`;
    echo "Ara Apertium permet extraure estadístiques" | apertium ca-es-estadistiques
In that example, $LOGSDIR is a folder where the logs will be saved, and $SEC is a unique ID for that log.
When translation is done, we can process the log created in order to get statistics.
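As a rough illustration of such processing, the Python sketch below counts tokens, unknown words and ambiguous tokens in a saved log. It assumes the log holds analyser output in the Apertium stream format (^surface/analysis1/analysis2$, with unknown words marked by a leading *), and it ignores escaped characters. The choice of statistics, the function name analysis_stats and the sample line are assumptions for illustration only.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
# Simplification: escaped characters such as \/ or \$ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def analysis_stats(log_text):
    """Count tokens, unknown words and ambiguous tokens in a log."""
    total = unknown = ambiguous = 0
    for match in LEXICAL_UNIT.finditer(log_text):
        analyses = match.group(2).split("/")
        total += 1
        if analyses[0].startswith("*"):
            # The analyser marks words it does not know with '*'.
            unknown += 1
        elif len(analyses) > 1:
            # More than one analysis means the token is ambiguous.
            ambiguous += 1
    return {"tokens": total, "unknown": unknown, "ambiguous": ambiguous}


# Made-up sample line in the stream format, for illustration only.
SAMPLE_LOG = (
    "^Ara/ara<adv>/ara<n><f><sg>$ "
    "^Apertium/*Apertium$ "
    "^permet/permetre<vblex><pri><p3><sg>$"
)

print(analysis_stats(SAMPLE_LOG))
```

In practice the text would be read from the file saved under $LOGSDIR rather than from a string constant.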
Mixed modes
See Mixed modes
See also
- https://victorio.uit.no/langtech/trunk/tools/CorpusTools/corpustools/modes.py is a Python module that parses modes files into a Pipeline class