Difference between revisions of "How to bootstrap a new pair"

From Apertium
m (extra info about make langs)
 
(18 intermediate revisions by 9 users not shown)
Line 1: Line 1:
 
{{TOCD}}
 
{{TOCD}}
 
 
How to use apertium-init to bootstrap a new language pair (optionally with new monolingual data packages as well).
 
How to use apertium-init to bootstrap a new language pair (optionally with new monolingual data packages as well).
   
Line 7: Line 6:
 
''You need to get this installed first:''
 
''You need to get this installed first:''
   
* apertium/lttoolbox/hfst, see [[Installation]], in particular the ''prerequisites'' parts. (You most likely don't need to go all the way to "minimal installation from svn", since you should get this stuff from Tino's repositories. If you're on Windows, get the virtualbox)
+
* apertium/lttoolbox/hfst, see [[Installation]], in particular the ''prerequisites'' parts. (You most likely don't need to go all the way, since you should get this stuff from Tino's repositories. If you're on Windows, get the [[Apertium VirtualBox]].)
* [[apertium-init]] – put this script in your working directory where you will be downloading language data
+
* [[apertium-init]].py – put this script in your working directory where you will be downloading language data. You can get the script from https://apertium.org/apertium-init
   
==With two existing monolingual packages==
+
==Getting the monolingual packages==
   
  +
For each of the two languages of the pair, if it already exists, you can download and compile it by running
Do the two languagues you're making a pair of already have monolingual modules in https://svn.code.sf.net/p/apertium/svn/languages/ (or perhaps https://svn.code.sf.net/p/apertium/svn/incubator )?
 
   
Then follow this part, replacing LANG1 and LANG2 for the ISO 639-3 codes of your languages:
 
 
 
First compile the monolingual packages:
 
 
<pre>
 
<pre>
  +
apertium-get XXX
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-LANG1
 
cd apertium-LANG1
 
./autogen.sh
 
make -j
 
cd ..
 
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-LANG2
 
cd apertium-LANG2
 
./autogen.sh
 
make -j
 
cd ..
 
 
</pre>
 
</pre>
   
  +
Where XXX is the [http://www-01.sil.org/iso639-3/codes.asp ISO 639-3] code of the language.
Then generate the pair:
 
<pre>
 
python3 apertium-init.py LANG1-LANG2
 
</pre>
 
   
  +
If the module doesn't exist (if it doesn't appear on [https://apertium.github.io/apertium-on-github/source-browser.html this list] or apertium-get can't find it) then you can create it like this:
Then compile the pair:
 
   
 
<pre>
 
<pre>
  +
# bootstrap the module
./autogen.sh --with-lang1=../apertium-LANG1 --with-lang2=../apertium-LANG2
 
  +
python3 apertium-init.py XXX
  +
# enter the directory
  +
cd apertium-XXX
  +
# compile the module
 
make -j
 
make -j
 
</pre>
 
</pre>
   
  +
==Bootstrapping the pair==
And test:
 
<pre>
 
echo house | apertium -d . LANG1-LANG2
 
echo Haus | apertium -d . LANG2-LANG1
 
</pre>
 
   
  +
In what follows, replace XXX and YYY for the [http://www-01.sil.org/iso639-3/codes.asp ISO 639-3] codes of your languages:
Now you can add words to apertium-LANG1-LANG2.LANG1-LANG2.dix, then test again:
 
   
  +
Generate the pair:
 
<pre>
 
<pre>
  +
python3 apertium-init.py XXX-YYY
make -j
 
echo house | apertium -d . LANG1-LANG2
 
echo Haus | apertium -d . LANG2-LANG1
 
</pre>
 
 
If you had to add words to the monolingual dictionaries, you will have to type "make" in those directories first. Alternatively, there is a shortcut from the pair directory: "make langs" should make the monolingual dictionaries even if you're in the pair directory.
 
 
==With one existing monolingual package==
 
 
Does just one of the two languagues you're making a pair of already have a monolingual module in https://svn.code.sf.net/p/apertium/svn/languages/ (or perhaps https://svn.code.sf.net/p/apertium/svn/incubator )?
 
 
Then follow this part, replacing LANG1 and LANG2 for the ISO 639-3 codes of your languages; here we assume LANG1 needs to be made from scratch.
 
 
First make a new monolingual package:
 
<pre>
 
python3 apertium-init.py LANG1
 
cd apertium-LANG1
 
./autogen.sh
 
make -j
 
cd ..
 
</pre>
 
 
Then get and compile the existing monolingual package:
 
<pre>
 
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-LANG2
 
cd apertium-LANG2
 
./autogen.sh
 
make -j
 
cd ..
 
</pre>
 
 
Then generate the pair:
 
<pre>
 
python3 apertium-init.py LANG1-LANG2
 
 
</pre>
 
</pre>
   
Line 91: Line 42:
   
 
<pre>
 
<pre>
  +
cd apertium-XXX-YYY
./autogen.sh --with-lang1=../apertium-LANG1 --with-lang2=../apertium-LANG2
 
  +
./autogen.sh --with-lang1=../apertium-XXX --with-lang2=../apertium-YYY
 
make -j
 
make -j
 
</pre>
 
</pre>
Line 97: Line 49:
 
And test:
 
And test:
 
<pre>
 
<pre>
echo house | apertium -d . LANG1-LANG2
+
echo house | apertium -d . XXX-YYY
echo Haus | apertium -d . LANG2-LANG1
+
echo Haus | apertium -d . YYY-XXX
 
</pre>
 
</pre>
   
Now you can add words to apertium-LANG1-LANG2.LANG1-LANG2.dix, then test again:
+
Now you can add words to apertium-XXX-YYY.XXX-YYY.dix, then test again:
   
 
<pre>
 
<pre>
 
make -j
 
make -j
echo house | apertium -d . LANG1-LANG2
+
echo house | apertium -d . XXX-YYY
echo Haus | apertium -d . LANG2-LANG1
+
echo Haus | apertium -d . YYY-XXX
 
</pre>
 
</pre>
   
 
If you had to add words to the monolingual dictionaries, you will have to type "make" in those directories first. Alternatively, there is a shortcut from the pair directory: "make langs" should make the monolingual dictionaries even if you're in the pair directory.
 
If you had to add words to the monolingual dictionaries, you will have to type "make" in those directories first. Alternatively, there is a shortcut from the pair directory: "make langs" should make the monolingual dictionaries even if you're in the pair directory.
   
  +
Another useful piece of information about "make langs": itʻs also possible to speed the process by "make -j8 langs", where 8 is replaced by number of CPU cores your computer has.
==With no existing monolingual packages==
 
   
  +
==Checking the tagger==
Do none of the two languagues you're making a pair of already have monolingual modules in https://svn.code.sf.net/p/apertium/svn/languages/ or https://svn.code.sf.net/p/apertium/svn/incubator ?
 
  +
Unfortunately, <code>apertium-tagger</code> will crash if given the wrong arguments. This will generally result in no output being produced when you try to translate something. Currently the best way to deal with this is to open the <code>modes.xml</code> files in the two monolingual and checking what options they give in their <code>XXX-tagger</code> and <code>YYY-tagger</code> modes.
   
  +
If you find <code>-x</code>, <code>-u 1</code>, <code>-u 2</code>, or <code>-u 3</code> after <code>apertium-tagger</code>, add it to the corresponding line of the <code>modes.xml</code> file in the bilingual directory.
Then follow this part, replacing LANG1 and LANG2 for the ISO 639-3 codes of your languages:
 
   
First make and compile the new monolingual packages:
+
If you just initialized the monolingual directory, add <code>-u 2</code> in both places.
<pre>
 
python3 apertium-init.py LANG1
 
cd apertium-LANG1
 
./autogen.sh
 
make -j
 
cd ..
 
   
  +
If the monolingual <code>modes.xml</code> does not mention <code>apertium-tagger</code>, bootstrap the pair again but add the argument <code>--no-prob1</code> or <code>--no-prob2</code>. If you have made changes that you don't want overwritten, add the argument <code>--rebuild</code> as well, otherwise it will be simplest to delete the entire directory.
python3 apertium-init.py LANG2
 
cd apertium-LANG2
 
./autogen.sh
 
make -j
 
cd ..
 
</pre>
 
   
  +
==HFST and other alternative setups==
Then generate the pair:
 
  +
If you're making a monolingual module that should use HFST/lexc, pass the option <code>--analyser=hfst</code> to apertium-init.py.
<pre>
 
python3 apertium-init.py LANG1-LANG2
 
</pre>
 
   
  +
If you're making a pair where the "left" side (XXX in the above examples) uses HFST/lexc, pass the option <code>--analyser1=hfst</code> to apertium-init.py.
Then compile the pair:
 
   
  +
If you're making a pair where the "right" side (YYY in the above examples) uses HFST/lexc, pass the option <code>--analyser2=hfst</code> to apertium-init.py.
<pre>
 
./autogen.sh --with-lang1=../apertium-LANG1 --with-lang2=../apertium-LANG2
 
make -j
 
</pre>
 
   
  +
If you're making a pair where the both sides use HFST/lexc, pass the option <code>--analysers=hfst</code> to apertium-init.py.
And test:
 
<pre>
 
echo house | apertium -d . LANG1-LANG2
 
echo Haus | apertium -d . LANG2-LANG1
 
</pre>
 
   
  +
See https://github.com/apertium/apertium-init for more documentation, or run <code>./apertium-init.py --help</code> for all options (you can e.g. also make pairs that don't use a statistical disambiguator, or don't use a Constraint Grammar disambiguator).
Now you can add words to apertium-LANG1-LANG2.LANG1-LANG2.dix, then test again:
 
   
  +
== Choosing an Analyser ==
<pre>
 
make -j
 
echo house | apertium -d . LANG1-LANG2
 
echo Haus | apertium -d . LANG2-LANG1
 
</pre>
 
   
  +
Monolingual dictionaries can be set up to use 1 of 3 available formats: monodix, lexc, and lexd.
If you had to add words to the monolingual dictionaries, you will have to type "make" in those directories first. Alternatively, there is a shortcut from the pair directory: "make langs" should make the monolingual dictionaries even if you're in the pair directory.
 
  +
  +
Monodix is best for languages that have very little conjugation (such as English) or where the conjugation is entirely with suffixes but the stem doesn't change (such as Spanish). Monodix is the default setting for apertium-init or it can be explicitly specified with <code>--analyser=lttoolbox</code>.
  +
  +
Lexc is best for languages where most of the conjugation is done with suffixes, potentially with some stem changes. Apertium-init will generate a Lexc dictionary with the option <code>--analyser=hfst</code>.
   
  +
Lexd should be used for any language with prefixes, infixes, circumfixes, or other morphology that isn't suffixes. Apertium-init will generate a Lexd dictionary with the option <code>--analyser=lexd</code>.
   
  +
If you're unsure which one to pick, Lexd will probably work well and generally involves less typing than either Monodix or Lexc.
   
 
[[Category:Documentation]]
 
[[Category:Documentation]]
  +
[[Category:Installation]]
  +
[[Category:Documentation in English]]

Latest revision as of 15:30, 20 April 2021

How to use apertium-init to bootstrap a new language pair (optionally with new monolingual data packages as well).

Prerequisites

You need to get this installed first:

  • apertium, lttoolbox and hfst: see Installation, in particular the prerequisites sections. (You most likely don't need to follow the installation instructions all the way through, since you can get these packages from Tino's repositories. If you're on Windows, get the Apertium VirtualBox.)
  • apertium-init.py – put this script in your working directory where you will be downloading language data. You can get the script from https://apertium.org/apertium-init
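
Before going further, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not an official check: the helper name missing_tools is made up for this example, and the commands listed (apertium, lt-comp from lttoolbox, hfst-lexc from hfst, python3) are assumed to be representative of a working installation.

```shell
# Hypothetical helper: print the name of every given command that is
# not found on PATH (one per line); print nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: these four commands stand in for the full set of prerequisites.
missing=$(missing_tools apertium lt-comp hfst-lexc python3)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on PATH:" $missing
else
    echo "All prerequisite commands found."
fi
```

If anything is reported as missing, go back to the Installation page before continuing.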

Getting the monolingual packages

For each of the two languages of the pair, if a monolingual module already exists, you can download and compile it by running:

apertium-get XXX

where XXX is the ISO 639-3 code of the language.

If the module doesn't exist (it doesn't appear in the source browser at https://apertium.github.io/apertium-on-github/source-browser.html, or apertium-get can't find it), you can create it like this:

# bootstrap the module
python3 apertium-init.py XXX
# enter the directory
cd apertium-XXX
# generate the build files, then compile the module
./autogen.sh
make -j
# return to the working directory
cd ..

Bootstrapping the pair

In what follows, replace XXX and YYY with the ISO 639-3 codes of your languages.

Generate the pair:

python3 apertium-init.py XXX-YYY

Then compile the pair:

cd apertium-XXX-YYY
./autogen.sh --with-lang1=../apertium-XXX --with-lang2=../apertium-YYY
make -j

And test:

echo house | apertium -d . XXX-YYY
echo Haus | apertium -d . YYY-XXX

Now you can add words to apertium-XXX-YYY.XXX-YYY.dix, then test again:

make -j
echo house | apertium -d . XXX-YYY
echo Haus | apertium -d . YYY-XXX

If you had to add words to the monolingual dictionaries, you will have to type "make" in those directories first. Alternatively, there is a shortcut from the pair directory: "make langs" should make the monolingual dictionaries even if you're in the pair directory.

Another useful note about "make langs": you can speed up the process by running "make -j8 langs", where 8 is replaced with the number of CPU cores your computer has.
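
If you don't know how many cores your machine has, the number can be looked up rather than typed by hand. This is a small sketch under the assumption that getconf _NPROCESSORS_ONLN is available (it is on common Linux and macOS systems); the function name cpu_count is made up for this example, and the sketch only prints the make command instead of running it.

```shell
# Hypothetical helper: print the number of online CPU cores,
# falling back to 1 if it cannot be determined.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) echo 1 ;;   # empty or non-numeric answer: be conservative
        *) echo "$n" ;;
    esac
}

# Show the command you would run from the pair directory.
echo "make -j$(cpu_count) langs"
```

To actually build, run make -j"$(cpu_count)" langs from the pair directory.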

Checking the tagger

Unfortunately, apertium-tagger will crash if given the wrong arguments, which generally results in no output at all when you try to translate something. Currently the best way to deal with this is to open the modes.xml files in the two monolingual directories and check what options they give in their XXX-tagger and YYY-tagger modes.

If you find -x, -u 1, -u 2, or -u 3 after apertium-tagger, add it to the corresponding line of the modes.xml file in the bilingual directory.

If you just initialized the monolingual directory, add -u 2 in both places.

If the monolingual modes.xml does not mention apertium-tagger, bootstrap the pair again but add the argument --no-prob1 or --no-prob2 (for the first or second language of the pair, respectively). If you have made changes that you don't want overwritten, add the argument --rebuild as well; otherwise it will be simplest to delete the entire directory before bootstrapping again.
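
Rather than opening each modes.xml by hand, you can list the relevant lines with grep. The following is a rough sketch: the function name tagger_options is made up for this example, and the directory names assume the layout used above, with the command run from inside the pair directory.

```shell
# Hypothetical helper: print every line of a modes.xml file that
# mentions apertium-tagger, or a short note if there is none.
tagger_options() {
    if [ -f "$1" ]; then
        grep 'apertium-tagger' "$1" || echo "no apertium-tagger in $1"
    else
        echo "no such file: $1"
    fi
}

# Assumption: run from the pair directory, with both monolingual
# directories alongside it. Replace XXX and YYY with your language codes.
for dir in ../apertium-XXX ../apertium-YYY .; do
    tagger_options "$dir/modes.xml"
done
```

Compare the options printed for each monolingual directory with those printed for the pair directory, and edit the pair's modes.xml where they differ.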

HFST and other alternative setups

If you're making a monolingual module that should use HFST/lexc, pass the option --analyser=hfst to apertium-init.py.

If you're making a pair where the "left" side (XXX in the above examples) uses HFST/lexc, pass the option --analyser1=hfst to apertium-init.py.

If you're making a pair where the "right" side (YYY in the above examples) uses HFST/lexc, pass the option --analyser2=hfst to apertium-init.py.

If you're making a pair where both sides use HFST/lexc, pass the option --analysers=hfst to apertium-init.py.
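
The three pair-level options above can be summarised as a small lookup. The sketch below only prints the command line and does not run apertium-init.py; the function name pair_init_command and its argument order are made up for this example, and it distinguishes only "hfst" from everything else.

```shell
# Hypothetical helper: print the apertium-init.py command for a pair.
#   $1, $2  language codes of the left and right side
#   $3, $4  analyser used by each side ("hfst" or anything else)
pair_init_command() {
    case "$3-$4" in
        hfst-hfst) opts=" --analysers=hfst" ;;   # both sides use HFST/lexc
        hfst-*)    opts=" --analyser1=hfst" ;;   # only the left side
        *-hfst)    opts=" --analyser2=hfst" ;;   # only the right side
        *)         opts="" ;;                    # neither side
    esac
    echo "python3 apertium-init.py$opts $1-$2"
}

pair_init_command XXX YYY hfst lttoolbox
# prints: python3 apertium-init.py --analyser1=hfst XXX-YYY
```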

See https://github.com/apertium/apertium-init for more documentation, or run python3 apertium-init.py --help for all options (for example, you can also make pairs that don't use a statistical disambiguator, or don't use a Constraint Grammar disambiguator).

Choosing an Analyser

Monolingual dictionaries can be set up to use one of three available formats: monodix, lexc, and lexd.

Monodix is best for languages that have very little conjugation (such as English) or where the conjugation is entirely with suffixes but the stem doesn't change (such as Spanish). Monodix is the default for apertium-init; it can also be specified explicitly with --analyser=lttoolbox.

Lexc is best for languages where most of the conjugation is done with suffixes, potentially with some stem changes. Apertium-init will generate a Lexc dictionary with the option --analyser=hfst.

Lexd should be used for any language with prefixes, infixes, circumfixes, or other morphology that isn't suffixes. Apertium-init will generate a Lexd dictionary with the option --analyser=lexd.

If you're unsure which one to pick, Lexd will probably work well and generally involves less typing than either Monodix or Lexc.