SFST
SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers. It is also one of the possible backends for HFST.
Installation
Prerequisites
On Ubuntu/Debian:
sudo apt-get install libreadline5-dev
Download, compile, install
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.2.tar.gz
tar -xzvf SFST-1.4.2.tar.gz
cd SFST/src
make
sudo make install
sudo make libinstall
Note: on Arch Linux, you may have to run make CFLAGS=-fpic instead of make.
Packaged version with fst-proc
A packaged version, which includes the fst-proc program for processing Apertium input streams, can be downloaded from Apertium SVN:
- this is now superseded by HFST + hfst-proc, right?
$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/branches/sfst
- Compiling
Follow the standard steps:
$ sh autogen.sh
$ ./configure
$ make
$ make install
Mac users: first do
$ sudo port install libtool
$ sudo ln -s /opt/local/bin/glibtoolize /bin/libtoolize # or wherever
(Also, the call to basename(name) in src/fst-proc.C needs to be changed to e.g. name if you want to compile without libiberty sources lying around.)
Usage
To try SFST out, you can start by compiling the German transducer, SMOR, that comes with the package:
$ cd data/SMOR
$ make
Wait some time, and you will have a file called smor.a. You now need to compact this file so that it can be read by fst-proc:
$ fst-compact smor.a smor.ac
Now you can use it:
$ cd ../../src
$ echo "Ich habe ein Bier" | fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ ^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ ^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
It should also work with deformatters and reformatters:
$ echo "Ich habe <em>ein</em> bier" | apertium-deshtml | ./fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ ^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[ <em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][
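In the output above, each lexical unit is delimited by ^ and $, and a unit whose analysis begins with * (here Bier) has no analysis in the transducer. As a rough sketch for spotting such gaps, the following filter prints the surface forms of unanalysed units. The function name unknown_words is hypothetical and not part of SFST or Apertium, and the sketch assumes a grep that supports -o (GNU and BSD grep do).

```shell
# unknown_words (hypothetical helper): read fst-proc output on stdin and
# print the surface form of every unit of the shape ^surface/*surface$.
unknown_words() {
  # Step 1: keep only units whose first analysis starts with '*'.
  # Step 2: strip everything except the surface form before the first '/'.
  grep -o '\^[^/$]*/\*[^$]*\$' | sed 's/^\^\([^/]*\)\/.*/\1/'
}

# Example input: two units taken from the output shown above.
printf '%s\n' '^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ ^Bier/*Bier$' | unknown_words
```

Run on the two units above, this prints Bier only, since Ich has an analysis.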
Note: fst-proc currently does a rather crude tokenisation based on spaces, so multiword expressions cannot be analysed as single units.
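To illustrate why space-based tokenisation rules out multiwords, here is a minimal stand-in. It is not fst-proc's actual code: the function split_on_spaces and the example phrase are assumptions made purely for illustration.

```shell
# split_on_spaces (hypothetical): emit one token per line, splitting on
# runs of spaces, as a stand-in for crude space-based tokenisation.
split_on_spaces() {
  tr -s ' ' '\n'
}

# A three-word expression is split into three separate tokens, so a
# lexicon entry covering the whole expression could never be looked up.
printf '%s\n' 'in spite of' | split_on_spaces
```

Each of the three words comes out on its own line, so the analyser would see them as three independent tokens.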
Morphologies
SFST has the following morphologies available for download:
- Morph-IT! (Italian, 34,968 lemmas, LGPL)
- Omorfi–SFST implementation of word form morphology of Finnish (Finnish, 93,510 lemmas, LGPL)
- For further information see: Omorfi
- SMOR — comes in the SFST distribution (German, 1,096 lemmas, GPL)
- trmorph (Turkish, wide coverage)
- For further information see: Turkish
Performance
The analysers produced are fast: a 1.3 MB analyser (SMOR) processes about 1,100 words per second. For comparison, lttoolbox processes about 5,000 words per second.
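To put these rates in perspective, the following back-of-the-envelope sketch estimates how long a corpus would take at each rate. The one-million-word corpus size and the helper name seconds_for are assumptions chosen for illustration; only the two rates come from the figures above.

```shell
# seconds_for (hypothetical helper): estimate the seconds needed to
# process WORDS words at RATE words per second, rounded to a whole number.
seconds_for() {
  awk -v words="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", words / rate }'
}

seconds_for 1000000 1100   # SFST with SMOR, ~1,100 words per second
seconds_for 1000000 5000   # lttoolbox, ~5,000 words per second
```

Under these assumptions, a million words would take roughly 909 seconds (about 15 minutes) with SFST and 200 seconds (just over 3 minutes) with lttoolbox.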