Difference between revisions of "How to bootstrap a new pair"
Popcorndude (talk | contribs) (remove unnecessary autogen and use apertium-get where possible) |
Popcorndude (talk | contribs) (less redundant instructions, also use apertium-get) |
Revision as of 16:32, 15 January 2021
How to use apertium-init to bootstrap a new language pair (optionally with new monolingual data packages as well).
Prerequisites
You need to install the following first:
- apertium/lttoolbox/hfst, see Installation, in particular the prerequisites parts. (You most likely don't need to go all the way, since you should get this stuff from Tino's repositories. If you're on Windows, get the Apertium VirtualBox.)
- apertium-init.py – put this script in your working directory where you will be downloading language data. You can get the script from https://apertium.org/apertium-init
Getting the monolingual packages
For each of the two languages of the pair, if a monolingual module already exists, you can download and compile it by running:
apertium-get XXX
where XXX is the ISO 639-3 code of the language.
If the module doesn't exist (if it doesn't appear on this list or apertium-get can't find it) then you can create it like this:
# bootstrap the module
python3 apertium-init.py XXX
# enter the directory
cd apertium-XXX
# compile the module
make -j
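The two cases above can be combined into one helper. The following is a minimal sketch, not part of the Apertium tooling: the function names `run` and `get_or_create` and the `DRY_RUN` variable are hypothetical, and it assumes that `apertium-get` exits with a non-zero status when it cannot find a module and that apertium-init.py is in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: fetch a monolingual module, or bootstrap it if none exists.
# With DRY_RUN=1 the commands are only printed, not executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

get_or_create() {
    lang="$1"
    # Assumption: apertium-get returns non-zero when the module is not found.
    if run apertium-get "$lang"; then
        return 0
    fi
    # Fall back to bootstrapping and compiling a new module.
    run python3 apertium-init.py "$lang" &&
        (cd "apertium-$lang" && run make -j)
}
```

For example, `DRY_RUN=1 get_or_create XXX` prints `apertium-get XXX` without running it, which is a cheap way to check the language code before downloading anything.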
Bootstrapping the pair
In what follows, replace XXX and YYY with the ISO 639-3 codes of your languages:
Generate the pair:
python3 apertium-init.py XXX-YYY
Then compile the pair:
cd apertium-XXX-YYY
./autogen.sh --with-lang1=../apertium-XXX --with-lang2=../apertium-YYY
make -j
And test:
echo house | apertium -d . XXX-YYY
echo Haus | apertium -d . YYY-XXX
Now you can add words to apertium-XXX-YYY.XXX-YYY.dix, then test again:
make -j
echo house | apertium -d . XXX-YYY
echo Haus | apertium -d . YYY-XXX
If you also added words to the monolingual dictionaries, run "make" in those directories first. Alternatively, there is a shortcut: running "make langs" from the pair directory should rebuild the monolingual dictionaries without leaving it.
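To see the whole sequence with the placeholders filled in, the steps of this section can be printed for a concrete pair. This is only an illustrative sketch: the function name `pair_commands` is hypothetical, and the example uses eng and deu (the ISO 639-3 codes for English and German) with the test words from above. It prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the bootstrap, compile and test commands for a pair.
# Arguments: code of language 1, code of language 2,
#            a test word in language 1, a test word in language 2.
pair_commands() {
    l1="$1"; l2="$2"; w1="$3"; w2="$4"
    cat <<EOF
python3 apertium-init.py $l1-$l2
cd apertium-$l1-$l2
./autogen.sh --with-lang1=../apertium-$l1 --with-lang2=../apertium-$l2
make -j
echo $w1 | apertium -d . $l1-$l2
echo $w2 | apertium -d . $l2-$l1
EOF
}

pair_commands eng deu house Haus
```

Printing instead of executing keeps the sketch safe to run in any directory; once the output looks right, the lines can be pasted into a terminal one by one.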
HFST and other alternative setups
If you're making a monolingual module that should use HFST/lexc, pass the option --analyser=hfst to apertium-init.py.
If you're making a pair where the "left" side (XXX in the above examples) uses HFST/lexc, pass the option --analyser1=hfst to apertium-init.py.
If you're making a pair where the "right" side (YYY in the above examples) uses HFST/lexc, pass the option --analyser2=hfst to apertium-init.py.
If you're making a pair where both sides use HFST/lexc, pass the option --analysers=hfst to apertium-init.py.
See https://github.com/apertium/apertium-init for more documentation, or run python3 apertium-init.py --help for all options (for example, you can also make pairs that don't use a statistical disambiguator, or that don't use a Constraint Grammar disambiguator).
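The pair options above follow a simple pattern, which the sketch below spells out. The function name `pair_init_command` is hypothetical, and the sketch assumes that a side without an explicit option gets the lttoolbox default. It only prints the command; it does not run apertium-init.py.

```shell
#!/bin/sh
# Hypothetical helper: print the apertium-init.py command for a pair,
# given the analyser of each side ("lttoolbox" or "hfst").
# Assumption: lttoolbox is the default, so it needs no option.
pair_init_command() {
    l1="$1"; l2="$2"
    a1="${3:-lttoolbox}"; a2="${4:-lttoolbox}"
    flags=""
    if [ "$a1" = "hfst" ] && [ "$a2" = "hfst" ]; then
        # Both sides use HFST/lexc.
        flags=" --analysers=hfst"
    elif [ "$a1" = "hfst" ]; then
        # Only the "left" side uses HFST/lexc.
        flags=" --analyser1=hfst"
    elif [ "$a2" = "hfst" ]; then
        # Only the "right" side uses HFST/lexc.
        flags=" --analyser2=hfst"
    fi
    echo "python3 apertium-init.py $l1-$l2$flags"
}

pair_init_command XXX YYY hfst lttoolbox
pair_init_command XXX YYY hfst hfst
```

The first call prints the command with --analyser1=hfst and the second with --analysers=hfst, matching the cases listed above.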
Choosing an Analyser
Monolingual dictionaries can be set up to use one of three available formats: monodix, lexc, and lexd.
Monodix is best for languages that have very little conjugation (such as English) or where the conjugation is entirely with suffixes but the stem doesn't change (such as Spanish). Monodix is the default setting for apertium-init, or it can be explicitly specified with --analyser=lttoolbox.
Lexc is best for languages where most of the conjugation is done with suffixes, potentially with some stem changes. Apertium-init will generate a Lexc dictionary with the option --analyser=hfst.
Lexd should be used for any language with prefixes, infixes, circumfixes, or other morphology that isn't suffixes. Apertium-init will generate a Lexd dictionary with the option --analyser=lexd.
If you're unsure which one to pick, Lexd will probably work well and generally involves less typing than either Monodix or Lexc.
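The guidance above amounts to a small lookup from the kind of morphology to an apertium-init option. The following sketch makes that explicit; the function name `analyser_flag` and the category labels are made up for illustration and are not recognised by apertium-init itself.

```shell
#!/bin/sh
# Hypothetical lookup: map a rough description of a language's morphology
# to the apertium-init option suggested in the text above.
analyser_flag() {
    case "$1" in
        little-conjugation|suffixes-fixed-stem)
            echo "--analyser=lttoolbox" ;;   # Monodix (the default)
        suffixes-stem-changes)
            echo "--analyser=hfst" ;;        # Lexc
        prefixes|infixes|circumfixes|unsure)
            echo "--analyser=lexd" ;;        # Lexd
        *)
            echo "unknown morphology type: $1" >&2
            return 1 ;;
    esac
}

# Example: print the command for a new module whose language uses prefixes.
echo "python3 apertium-init.py XXX $(analyser_flag prefixes)"
```

Real languages rarely fit one label cleanly, so treat the result as a starting point rather than a rule.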