SFST

SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers.

Downloading

A packaged version, which includes the fst-proc program for processing Apertium input streams, can be downloaded from Apertium SVN:

$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/branches/sfst

Compiling

Follow the standard steps:

$ sh autogen.sh
$ ./configure
$ make
$ make install

Mac users should first run:

$ sudo port install libtool 
$ sudo ln -s /opt/local/bin/glibtoolize /bin/libtoolize # or wherever

(Also, if you want to compile without the libiberty sources available, the call to basename(name) in src/fst-proc.C needs to be changed, e.g. to name.)


Usage

To try SFST out, you can start by compiling the German transducer, SMOR, that comes with the package:

$ cd data/SMOR
$ make

After some time this produces a file called smor.a. Compact it so that it can be read by fst-proc:

$ fst-compact smor.a smor.ac

Now you can use it:

$ cd ../../src
$ echo "Ich habe ein Bier" | ./fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ 
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
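
The output above uses the Apertium stream format: each unit is written as ^surface/analysis/analysis$, and an unknown word is marked with an asterisk, as in ^Bier/*Bier$. As a rough sketch of how such output could be post-processed, the following script lists unknown words and counts analyses per word. The function names (sample, list_units, unknown_words, count_analyses) are hypothetical helpers written for this illustration and are not part of SFST or Apertium. The sketch also assumes that analyses contain no "/" characters.

```shell
#!/bin/sh
# sample: print the analyser output shown above (copied from this page).
sample() {
cat <<'EOF'
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
EOF
}

# list_units: print each ^...$ unit found on stdin on its own line.
list_units() {
    grep -o '\^[^$]*\$'
}

# unknown_words: print the surface form of every unit whose first
# analysis starts with '*' (the unknown-word marker seen above).
unknown_words() {
    list_units | sed -n 's|^\^\([^/]*\)/\*.*\$$|\1|p'
}

# count_analyses: print "surface<TAB>number of analyses" for each unit.
# Assumption: analyses never contain '/', so fields can be split on it.
count_analyses() {
    list_units | awk -F'/' '{
        s = $1; sub(/^\^/, "", s)      # strip the leading ^ from the surface form
        n = NF - 1                     # every field after the first is an analysis
        if ($2 ~ /^\*/) n = 0          # unknown words have no real analysis
        print s "\t" n
    }'
}

sample | unknown_words
sample | count_analyses
```

Run on the sample, this prints Bier as the only unknown word, followed by the counts Ich 1, habe 4, ein 3 and Bier 0. In practice the functions would read from the fst-proc pipeline instead of the sample text.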

It should also work with deformatters and reformatters:

$ echo "Ich habe <em>ein</em> Bier." | apertium-deshtml | ./fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[ 
<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][

Note: fst-proc currently does a rather crude tokenisation based on spaces, so multiword expressions cannot be analysed as single units.
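
To illustrate why splitting on spaces rules out multiwords, here is a minimal sketch that only approximates the behaviour described in the note. It is not the code used by fst-proc, and space_tokenise is a hypothetical helper name. Each token would be looked up on its own, so a lexicon entry spanning two words, such as "New York", could never be matched as a whole.

```shell
#!/bin/sh
# space_tokenise: split stdin on spaces, one token per line.
# This only mimics the space-based tokenisation described in the note;
# it is an approximation, not the actual fst-proc implementation.
space_tokenise() {
    tr -s ' ' '\n'
}

# "New York" comes out as two separate tokens, "New" and "York".
printf '%s\n' 'Ich fliege nach New York' | space_tokenise
```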

Morphologies

SFST has the following morphologies available for download:

Performance

The analysers produced are fast. With a 1.3 MB analyser (SMOR), fst-proc processes ~1,100 words per second. For comparison, lttoolbox processes ~5,000 words per second.
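
Figures like these can be checked with a simple throughput measurement. The following is a minimal sketch and not the method used to obtain the numbers above. The helper names wps and measure are hypothetical, and timing relies on whole-second timestamps from date, so the input file should be large enough to keep the analyser busy for several seconds.

```shell
#!/bin/sh
# wps WORDS SECONDS: print words per second, rounded to the nearest integer.
# Prints "n/a" when the elapsed time is too short to measure.
wps() {
    awk -v w="$1" -v s="$2" 'BEGIN {
        if (s <= 0) { print "n/a"; exit }
        printf "%.0f\n", w / s
    }'
}

# measure FILE CMD [ARGS...]: feed FILE to CMD on stdin, discard the
# output, and report the throughput in words per second.
measure() {
    file=$1; shift
    words=$(wc -w < "$file" | tr -d ' ')
    start=$(date +%s)
    "$@" < "$file" > /dev/null
    end=$(date +%s)
    wps "$words" "$((end - start))"
}

wps 11000 10    # prints 1100
```

Assuming a plain-text corpus in a hypothetical file corpus.txt, a run from the src directory would look like: measure corpus.txt ./fst-proc ../data/SMOR/smor.ac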

External links