Difference between revisions of "Hfst"

Latest revision as of 06:25, 27 May 2021

hfst is the Helsinki Finite-State Toolkit. It is formalism-compatible with both lexc and twolc, much as foma is with xfst. It is currently used in apertium-sme-nob, apertium-fin-sme, apertium-kaz-tat and a few other pairs that involve Turkic languages.

The IRC channel is #hfst at irc.oftc.net (you may try irc://irc.oftc.net/#hfst if your browser supports it, or enter #hfst into https://webchat.oftc.net/ if you want a web client). The HFST Wiki has some very good documentation (see especially the page HfstReadme when you run into compilation problems).

HFST is built as a set of wrappers over several possible back-ends (Foma, OpenFST, SFST, …). The latest versions of HFST include the back-ends you need, so there is no reason to install any of them separately.

WARNING

This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.

Building and installing HFST

See Installation first: for most operating systems you can now get pre-built packages of HFST (as well as other core tools) through your regular package manager.


If you wish to hack on the HFST C++ code itself (or you are on some system that doesn't have packages yet), you can follow this procedure:

Install prerequisites

You will need the regular build dependencies:

  • automake, autoconf, libtool, flex, bison, g++, libreadline-dev

If you've already installed apertium/lttoolbox these should be installed already; if not, they should be easily installable with your package manager, e.g.

  • Ubuntu: sudo apt-get install automake autoconf libtool flex bison g++ libreadline-dev
  • Arch Linux: sudo pacman -S base-devel
  • MacOS X users should install the general Prerequisites_for_Mac_OS_X first, then sudo port install bison readline
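
Before configuring, it can help to confirm that the build tools listed above are actually on your PATH. The following is a minimal sketch, not part of HFST: the function name missing_tools is hypothetical, and it assumes the libtool package is detected through its libtoolize command (on some systems that command has a different name).

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: the libtool package is detected via its libtoolize command.
missing=$(missing_tools automake autoconf libtoolize flex bison g++)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined on one line.
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi
```

Anything reported as missing can then be installed with the package manager commands given above.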

Download HFST

Either use the latest release (recommended for users), or go with the bleeding-edge Git version (recommended for developers).

From Git repository

$ git clone https://github.com/hfst/hfst.git
$ cd hfst/
$ ./autogen.sh
$ ./configure
$ make

(The autogen step is only needed when using Git, not with the tarball.)

Released tarball

Download the latest release, named something like hfst-X.Y.Z.tar.gz, from https://github.com/hfst/hfst/releases, then

$ tar -xzf hfst-X.Y.Z.tar.gz
$ cd hfst-X.Y.Z/

(replacing X.Y.Z with the version you downloaded)

Configure

In the configure step, you can turn features and back-ends on or off. The OpenFST back-end is included in the HFST distribution, while foma and SFST are not; they are not recommended, since they typically cause more trouble than they are worth.

For most users, this should work:

$ ./configure --enable-proc --without-foma --enable-lexc --enable-all-tools

The above command will configure it to be installed to /usr/local in the make install step (below).

If you want hfst and back-ends installed somewhere else, you can do

$ ./configure --enable-proc --without-foma --enable-lexc --enable-all-tools --prefix=/home/USERNAME/local/

Note: USERNAME stands for your own username, so replace it accordingly. If you don't know it, you can find out by typing whoami.
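
When HFST is installed under a non-standard prefix, your shell and the dynamic linker usually need to be told where to look. The sketch below is an illustration under stated assumptions: the helper name hfst_env is hypothetical, and it assumes the usual autotools layout (programs in PREFIX/bin, libraries in PREFIX/lib, pkg-config files in PREFIX/lib/pkgconfig).

```shell
#!/bin/sh
# Print the export lines needed to use an HFST installed under prefix $1.
# Assumes the conventional autotools layout: bin/, lib/, lib/pkgconfig/.
hfst_env() {
    prefix=${1%/}   # drop one trailing slash, if any
    echo "export PATH=\"$prefix/bin:\$PATH\""
    # The \${VAR:+:\$VAR} form avoids a stray trailing colon when VAR is unset.
    echo "export LD_LIBRARY_PATH=\"$prefix/lib\${LD_LIBRARY_PATH:+:\$LD_LIBRARY_PATH}\""
    echo "export PKG_CONFIG_PATH=\"$prefix/lib/pkgconfig\${PKG_CONFIG_PATH:+:\$PKG_CONFIG_PATH}\""
}

# Show the lines for the example prefix used above.
hfst_env "${HOME:-/home/USERNAME}/local/"
```

The printed lines can be appended to your shell start-up file, or applied to the current session with eval "$(hfst_env "$HOME/local")".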


You can also add --with-unicode-handler=glib (or --with-unicode-handler=ICU) to the ./configure step if you have glib (or ICU) installed and want better Unicode case folding.

Compile and install

If your automake version is older than 1.14 (check with automake --version), first do:

$ scripts/generate-cc-files.sh
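
The version check can also be scripted. This is a rough sketch rather than anything shipped with HFST: the function name version_older_than is hypothetical, it compares only the major and minor numbers, and it assumes the first line of automake --version ends with the version number.

```shell
#!/bin/sh
# Succeed (exit status 0) if version $1 is older than version $2.
# Only the major and minor components are compared.
version_older_than() {
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
    [ "$a_major" -lt "$b_major" ] ||
        { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -lt "$b_minor" ]; }
}

if command -v automake >/dev/null 2>&1; then
    # Assumption: the version is the last word on the first output line.
    installed=$(automake --version | head -n 1 | awk '{print $NF}')
    if version_older_than "$installed" 1.14; then
        echo "automake $installed is older than 1.14: run scripts/generate-cc-files.sh before make"
    else
        echo "automake $installed: no extra step needed"
    fi
fi
```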

Build by running

$ make


Then install. Use sudo make install if you are installing to /usr/local (the default when no --prefix was given in the configure step); otherwise do not use sudo:

$ make install

And finally, unless you have a Mac, you may need to do:

$ sudo ldconfig

Troubleshooting

If make fails with old autotools (possibly versions before 1.14) with an error like:

make[5]: *** No rule to make target `xre_parse.hh', needed by `xre_lex.ll'.  Stop.

Run scripts/generate-cc-files.sh and then make again.


If, during the ./configure step, you see

checking for GNU libc compatible malloc... no
[…]
checking for GNU libc compatible realloc... no

and then during make a bunch of errors like:

/usr/local/include/sfst/mem.h:37:57: error: 'malloc' was not declared in this scope

try the following:

sudo ldconfig
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

and then ./configure and make.


If, during make, you see errors like

xre_parse.cc:2293:24: error: invalid conversion from 'const char*' to 'char*' [-fpermissive]

try instead

make CXXFLAGS=-fpermissive


If, when compiling a dictionary, you end up in a "foma" prompt where you can type stuff, you should remove anything related to foma or "hfst-xfst" from your system, and build HFST anew as described above.


For more advice on installation problems, have a look at the Hfst Readme page.

See also Foma, OpenFST and SFST for problems regarding the back-ends.

Using

$ svn co https://victorio.uit.no/langtech/trunk/langs/fao
$ cd fao/src
$ make -f Makefile.hfst

$ echo "orð" | hfst-lookup ../bin/fao-morph.hfst
lookup> 
orð	orð+N+Neu+Sg+Nom+Indef
orð	orð+N+Neu+Sg+Acc+Indef
orð	orð+N+Neu+Pl+Nom+Indef
orð	orð+N+Neu+Pl+Acc+Indef

lookup>
$
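
Each result line in the session above holds the input form and one analysis, separated by a tab. If you only want the analyses, for example to feed them to another tool, a small filter is enough. The sketch below uses a hypothetical helper name, extract_analyses, and runs on sample lines copied from the session so that it does not need an HFST installation.

```shell
#!/bin/sh
# Keep only the analysis column (second tab-separated field) of hfst-lookup
# output. Prompt lines and blank lines have fewer than two fields and are dropped.
extract_analyses() {
    awk -F '\t' 'NF >= 2 { print $2 }'
}

# Sample input copied from the session above, so no HFST install is needed here.
printf 'orð\torð+N+Neu+Sg+Nom+Indef\norð\torð+N+Neu+Pl+Acc+Indef\n\n' | extract_analyses
```

With a compiled analyser you would instead pipe real output through it, e.g. echo "orð" | hfst-lookup ../bin/fao-morph.hfst | extract_analyses.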

To compile lexc code, first concatenate all the lexc files:

$ cat fao-lex.txt noun-fao-lex.txt noun-fao-morph.txt adj-fao-lex.txt \
adj-fao-morph.txt verb-fao-lex.txt verb-fao-morph.txt adv-fao-lex.txt \
abbr-fao-lex.txt acro-fao-lex.txt pron-fao-lex.txt punct-fao-lex.txt \
numeral-fao-lex.txt pp-fao-lex.txt cc-fao-lex.txt cs-fao-lex.txt \
interj-fao-lex.txt det-fao-lex.txt > ../tmp/lexc-all.txt

To compile this, use the hfst-lexc program:

$ hfst-lexc < ../tmp/lexc-all.txt > ../bin/lexc-fao.bin

To compile the twol rules, just use the hfst-twolc program,

$ hfst-twolc twol-fao.txt > twol-fao.bin

And then to compose the lexicon and rule file, use hfst-compose-intersect:

$ hfst-compose-intersect -1 lexc-fao.bin -2 twol-fao.bin -o fao-gen.hfst

This will create a generator. If you want an analyser, invert the generator with hfst-invert:

$ hfst-invert fao-gen.hfst -o fao-morph.hfst
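
The four steps (lexc, twolc, compose-intersect, invert) can be chained in one script. The following is a sketch under assumptions, not the project's own build script: the names run and build_fao are hypothetical, all intermediate files are placed in a single output directory, and the tools are called in the HFST 3 style with -i, -o, -1 and -2 options. By default it only prints the commands, so you can check them against your own file layout first.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_fao LEXC_SOURCE TWOL_SOURCE OUTPUT_DIR
# Compiles lexicon and rules, composes them into a generator,
# then inverts the generator to get an analyser.
build_fao() {
    lexc_src=$1 twol_src=$2 outdir=$3
    run hfst-lexc -o "$outdir/lexc-fao.bin" "$lexc_src" &&
    run hfst-twolc -i "$twol_src" -o "$outdir/twol-fao.bin" &&
    run hfst-compose-intersect -1 "$outdir/lexc-fao.bin" -2 "$outdir/twol-fao.bin" -o "$outdir/fao-gen.hfst" &&
    run hfst-invert "$outdir/fao-gen.hfst" -o "$outdir/fao-morph.hfst"
}

# Dry run: print the commands instead of executing them.
DRY_RUN=1
build_fao ../tmp/lexc-all.txt twol-fao.txt ../bin
```

Setting DRY_RUN=0 before the call executes the steps for real; the && chain stops at the first step that fails.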

See also

External links