Difference between revisions of "Hfst"

Revision as of 21:41, 24 November 2011

HFST is the Helsinki finite-state toolkit. It is formalism-compatible with both lexc and twolc, much as foma is with xfst. It is currently used in apertium-sme-nob and apertium-fin-sme.

The IRC channel is #hfst at irc.freenode.net (you may try irc://irc.freenode.net/#hfst if your browser supports it, or enter #hfst into http://webchat.freenode.net/ if you want a web client). The HFST Wiki has some very good documentation (see especially the page HfstReadme when you run into compilation problems).


Prerequisites

Required:

  • automake, autoconf, libtool
  • OpenFST

Semi-Optional Backends:

  • Foma -- used for lexc and xfst (sequential rewrite rules)
    • remember to pass --enable-lexc --with-foma to ./configure to use this
    • IF YOU PLAN ON COMPILING ANY LEXC FILES, THIS IS BASICALLY MANDATORY

Optional Backends:

  • SFST -- makes hfst-substitute a lot faster
    • remember to pass --with-sfst to ./configure to use this

You can also use glib or ICU to handle Unicode operations (configure --with-unicode-handler={glib,ICU}).

Compiling HFST3

Subversion checkout

Mac OS X note: you need Xcode installed on your Mac. It came with your computer, and can also be downloaded from Apple (registration required).

First, check out the code from the Subversion repository:

$ svn co https://hfst.svn.sourceforge.net/svnroot/hfst/trunk/hfst3

Next, change into the directory you just checked out:

$ cd hfst3/

Then run autoreconf -i:

$ autoreconf -i

Now configure the package. The options depend on which back ends you installed earlier. If you installed all of them, use

$ ./configure --enable-lexc --with-foma --with-sfst --prefix=/home/USERNAME/local/

If you've only installed foma and openfst use

$ ./configure --enable-lexc --with-foma --prefix=/home/USERNAME/local/

and if you've only installed sfst and openfst use

$ ./configure --with-sfst --prefix=/home/USERNAME/local/

Note: If you want to install hfst in /usr/local, omit the --prefix option at the end of the configure command.
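The three configure variants above differ only in which back-end flags are present, so the choice can be sketched as a small shell function. This is only an illustration: hfst_configure_args is a hypothetical helper, not something shipped with HFST, and it knows only the flags mentioned on this page.

```shell
#!/bin/sh
# Sketch: build the ./configure argument list from the back ends you installed.
# hfst_configure_args is a hypothetical helper, not part of HFST.
# Usage: hfst_configure_args PREFIX [foma] [sfst]
# Pass an empty PREFIX ("") to install to the default /usr/local.
hfst_configure_args() {
    prefix=$1
    shift
    args=""
    for backend in "$@"; do
        case "$backend" in
            foma) args="$args --enable-lexc --with-foma" ;;
            sfst) args="$args --with-sfst" ;;
            *) echo "unknown back end: $backend" >&2; return 1 ;;
        esac
    done
    if [ -n "$prefix" ]; then
        args="$args --prefix=$prefix"
    fi
    # Strip the leading space left by the first append.
    echo "${args# }"
}

# Example: print (not run) the command for a foma + sfst build into ~/local.
echo "./configure $(hfst_configure_args "$HOME/local" foma sfst)"
```

The example only prints the command, so you can check it before running it by hand.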

Now for the easier part. Build the package by running

$ make

Then install it (Note: put sudo in front of the command if you are installing to /usr/local)

$ make install

Finally, refresh the shared library cache (this might not be necessary on your Mac)

$ sudo ldconfig

Prepackaged tarball

Download the latest version from [1] and unpack it. Then follow the instructions in the README file, i.e.:

$ cd hfst-3.0/
$ sh autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig

Troubleshooting

If, during the ./configure step, you see

checking for GNU libc compatible malloc... no
[…]
checking for GNU libc compatible realloc... no

followed during make by a series of errors like:

/usr/local/include/sfst/mem.h:37:57: error: 'malloc' was not declared in this scope

try the following:

sudo ldconfig
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

and then run ./configure and make again.
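The two export lines above overwrite whatever those variables already contained. If you rely on existing entries, a variant that prepends instead might look like the following sketch; prepend_path is a hypothetical helper introduced here for illustration.

```shell
#!/bin/sh
# prepend_path is a hypothetical helper: it prints DIR followed by the old
# value of a colon-separated path variable, and adds no trailing colon when
# the old value is empty.
prepend_path() {
    dir=$1
    old=$2
    echo "$dir${old:+:$old}"
}

# Put /usr/local first while keeping anything that was already set.
LD_LIBRARY_PATH=$(prepend_path /usr/local/lib "${LD_LIBRARY_PATH:-}")
PKG_CONFIG_PATH=$(prepend_path /usr/local/lib/pkgconfig "${PKG_CONFIG_PATH:-}")
export LD_LIBRARY_PATH PKG_CONFIG_PATH
```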


If, during make, you see errors like

xre_parse.cc:2293:24: error: invalid conversion from 'const char*' to 'char*' [-fpermissive]

try instead

make CXXFLAGS=-fpermissive


For more advice on installation problems, have a look at the Hfst Readme page.

Using

$ svn co https://victorio.uit.no/langtech/trunk/st/fao
$ cd fao/src
$ make -f Makefile.hfst

$ echo "orð" | hfst-lookup ../bin/fao-morph.hfst
lookup> 
orð	orð+N+Neu+Sg+Nom+Indef
orð	orð+N+Neu+Sg+Acc+Indef
orð	orð+N+Neu+Pl+Nom+Indef
orð	orð+N+Neu+Pl+Acc+Indef

lookup>
$
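If you only want the analyses and not the prompts or the repeated surface form, the output can be filtered. The following is a sketch that assumes the layout shown above, i.e. one tab-separated "surface form, analysis" pair per result line; analyses_only is a hypothetical helper, not an HFST tool.

```shell
#!/bin/sh
# analyses_only is a hypothetical helper: it keeps only lines containing a tab
# (the "surface<TAB>analysis" pairs) and prints the second field.
analyses_only() {
    awk -F '\t' 'NF >= 2 { print $2 }'
}

# Example fed with part of the output shown above; in practice, pipe
# hfst-lookup into analyses_only instead of using printf.
printf 'lookup> \norð\torð+N+Neu+Sg+Nom+Indef\norð\torð+N+Neu+Pl+Acc+Indef\n\nlookup>\n' | analyses_only
```

Prompt lines and blank lines contain no tab, so they are dropped.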

To compile lexc code, first concatenate all the lexc files:

$ cat fao-lex.txt noun-fao-lex.txt noun-fao-morph.txt adj-fao-lex.txt \
adj-fao-morph.txt verb-fao-lex.txt verb-fao-morph.txt adv-fao-lex.txt \
abbr-fao-lex.txt acro-fao-lex.txt pron-fao-lex.txt punct-fao-lex.txt \
numeral-fao-lex.txt pp-fao-lex.txt cc-fao-lex.txt cs-fao-lex.txt \
interj-fao-lex.txt det-fao-lex.txt > ../tmp/lexc-all.txt
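With this many input files it is easy to mistype one name. Plain cat reports the missing file but still writes the remaining files to the output, leaving an incomplete lexicon. A sketch of a guard follows; concat_lexc is a hypothetical helper, and it keeps the files in the order you give them.

```shell
#!/bin/sh
# concat_lexc is a hypothetical helper: it concatenates the given files, in the
# order given, into OUT, but refuses to write anything if an input is missing.
# Usage: concat_lexc OUT FILE...
# e.g.   concat_lexc ../tmp/lexc-all.txt fao-lex.txt noun-fao-lex.txt noun-fao-morph.txt
concat_lexc() {
    out=$1
    shift
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing lexc file: $f" >&2
            return 1
        fi
    done
    cat "$@" > "$out"
}
```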

To compile this, use the hfst-lexc program:

$ hfst-lexc < ../tmp/lexc-all.txt > ../bin/lexc-fao.bin

To compile the twol rules, use the hfst-twolc program:

$ hfst-twolc twol-fao.txt > twol-fao.bin

And then to compose the lexicon and rule file, use hfst-compose-intersect:

$ hfst-compose-intersect -l lexc-fao.bin twol-fao.bin -o fao-gen.hfst

This will create a generator. If you want an analyser, invert the generator with hfst-invert:

$ hfst-invert fao-gen.hfst -o fao-morph.hfst

HFST2 vs HFST3

There have been some changes. Notably:

  • In twol files, a / in alphabetic symbols has to be escaped, e.g. %+Der%/st instead of %+Der/st.
  • In twol files, you can no longer have Sets on the left-hand side of a rule, so write Vx:Vy /<= _ ; where Vx in Set1 Vy in Set2 ; where previously you would have written Set1:Set2 /<= _ ;
  • The old -r option to hfst-twolc is now uppercase: -R
  • hfst-lookup-optimize is gone; use hfst-fst2fst -O -i infile.hfst -o outfile.hfst.ol instead
  • hfst-lexc needs the output option to come before the lexc input file, e.g. hfst-lexc -o outfile.hfst mylexicon.lexc
  • hfst-compose-intersect uses -1 (number one) instead of -l (letter L), and -2 for the rule-file. E.g. hfst-compose-intersect -1 lexicon.hfst -2 rules.twol.hfst -o generator.hfst
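Taken together, the option changes listed above suggest an HFST3 build sequence roughly like the sketch below. It uses only the command forms quoted on this page. The file names, the run wrapper and hfst3_build are illustrative assumptions, and the rule file is assumed to have been compiled with hfst-twolc already. By default the script only prints the commands (DRY_RUN=1); set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of an HFST3 build using the option forms from the list above.
# All file names are illustrative; run and hfst3_build are hypothetical helpers.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: hfst3_build LEXC_FILE COMPILED_RULES
hfst3_build() {
    lexc=$1      # concatenated lexc source
    rules=$2     # rule file already compiled with hfst-twolc
    # -o has to come before the lexc input.
    run hfst-lexc -o lexicon.hfst "$lexc"
    # -1 is the lexicon, -2 the rule file.
    run hfst-compose-intersect -1 lexicon.hfst -2 "$rules" -o generator.hfst
    # Invert the generator to get an analyser.
    run hfst-invert generator.hfst -o analyser.hfst
    # Replacement for the old hfst-lookup-optimize.
    run hfst-fst2fst -O -i analyser.hfst -o analyser.hfst.ol
}

hfst3_build lexc-all.txt rules.twol.hfst
```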
