SFST


SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers. It is also one of the possible backends for HFST.

Installation

Prerequisites

On Ubuntu/Debian:

sudo apt-get install libreadline5-dev

Download, compile, install

 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.2.tar.gz
 tar -xzvf SFST-1.4.2.tar.gz
 cd SFST/src
 make
 sudo make install
 sudo make libinstall

Note: on Arch Linux, you may have to run make CFLAGS=-fpic instead of make.

Packaged version with fst-proc

A packaged version, which includes the fst-proc program for processing Apertium input streams, can be downloaded from Apertium SVN:

$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/branches/sfst

Note: this may now be superseded by HFST with hfst-proc.

Compiling

Follow the standard steps:

$ sh autogen.sh
$ ./configure
$ make
$ make install

Mac users: first do

$ sudo port install libtool 
$ sudo ln -s /opt/local/bin/glibtoolize /bin/libtoolize # or wherever

(Also, if you want to compile without the libiberty sources lying around, change the call to basename(name) in src/fst-proc.C to, e.g., name.)


Usage

To try SFST out, you can start by compiling the German transducer, SMOR, that comes with the package:

$ cd data/SMOR
$ make

After some time you will have a file called smor.a. This needs to be compacted so that it can be read by fst-proc:

$ fst-compact smor.a smor.ac

Now you can use it:

$ cd ../../src
$ echo "Ich habe ein Bier" |  fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ 
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
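
The output above is in the Apertium stream format: each token is written as ^surface/analysis/analysis$, and an analysis that begins with * (as in ^Bier/*Bier$) marks a word the transducer did not recognise. The following is a minimal sketch for listing such unknown words from captured output. The function name unknown_words is hypothetical and not part of SFST, and the sketch assumes the output contains no literal $ characters other than the token delimiters.

```shell
# List surface forms that fst-proc could not analyse.
# Assumption: unknown tokens appear as ^surface/*surface$ in the output.
unknown_words() {
    # Pull out every ^...$ unit, keep the ones containing an analysis
    # that starts with '*', then strip everything but the surface form.
    grep -o '\^[^$]*\$' | grep '/\*' | sed -e 's/^\^//' -e 's/\/.*$//'
}

# Sample shortened from the output shown above.
sample='^ein/ein<+ART><Indef><Masc><Nom><Sg>$ ^Bier/*Bier$'
printf '%s\n' "$sample" | unknown_words
```

Piping the real output of fst-proc into unknown_words gives a quick list of words missing from the lexicon.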

It should also work with deformatters and reformatters:

$ echo "Ich habe <em>ein</em> Bier." | apertium-deshtml | ./fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[ 
<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][

Note: fst-proc currently does only a rather crude tokenisation based on spaces, so multiword expressions cannot be analysed as single units.
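
To see why splitting on spaces rules out multiwords, the behaviour can be imitated with tr. This is only an illustration of space-based tokenisation under that assumption, not the code fst-proc actually uses, and the function name space_tokenise is hypothetical.

```shell
# Imitate whitespace tokenisation: emit one token per line.
# Assumption: this only mimics the behaviour described above.
space_tokenise() {
    tr -s ' ' '\n'
}

# "zum Beispiel" is a multiword expression, but it comes out as two tokens,
# so the analyser never sees it as a single unit.
printf '%s\n' 'Ich trinke zum Beispiel Bier' | space_tokenise
```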

Morphologies

SFST has the following morphologies available for download:

Performance

The analysers produced are reasonably fast: a 1.3 MB analyser (SMOR) processes about 1,100 words per second. For comparison, lttoolbox processes about 5,000 words per second.
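
A words-per-second figure like the ones above can be derived from a word count and an elapsed time. The sketch below shows only the arithmetic. The helper name words_per_second, the file name corpus.txt and the numbers in the example are all assumptions made for illustration, not measurements.

```shell
# Compute whole words per second from a word count and an elapsed time in seconds.
words_per_second() {
    awk -v words="$1" -v secs="$2" 'BEGIN { printf "%d\n", words / secs }'
}

# Example with made-up numbers: 5500 words analysed in 5 seconds.
words_per_second 5500 5

# Hypothetical measurement, assuming corpus.txt exists and the run takes
# at least one second (otherwise the division fails):
#   start=$(date +%s)
#   fst-proc smor.ac < corpus.txt > /dev/null
#   end=$(date +%s)
#   words_per_second "$(wc -w < corpus.txt)" "$((end - start))"
```

Timing with whole seconds is coarse, so a corpus large enough to run for many seconds gives a more stable figure.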

External links