Difference between revisions of "SFST"

From Apertium
==Packaged version with fst-proc==
 
 
A packaged version, which includes the <code>fst-proc</code> program for processing Apertium input streams, can be downloaded from Apertium SVN:
 
: this is now superseded by [[HFST]] + hfst-proc, right?
 
 
<pre>
$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/branches/sfst
</pre>
 
 
;Compiling:
 
 
Follow the standard steps:
 
 
<pre>
$ sh autogen.sh
$ ./configure
$ make
$ make install
</pre>
 
 
Mac users: first do
 
<pre>
$ sudo port install libtool
$ sudo ln -s /opt/local/bin/glibtoolize /bin/libtoolize # or wherever
</pre>
 
Also, if you want to compile without the libiberty sources lying around, change the call to <code>basename(name)</code> in <code>src/fst-proc.C</code> to, e.g., <code>name</code>.
 
 
   
 

Revision as of 21:37, 29 March 2011

SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers. It is also one of the possible backends for HFST.

Installation

Prerequisites

On Ubuntu/Debian:

sudo apt-get install libreadline5-dev

Download, compile, install

 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.2.tar.gz
 tar -xzvf SFST-1.4.2.tar.gz
 cd SFST/src
 make
 sudo make install
 sudo make libinstall

Note: on Arch Linux, you may need to run make CFLAGS=-fpic instead of make.

Usage

To try SFST out, you can start by compiling the German transducer, SMOR, that comes with the package:

$ cd data/SMOR
$ make

After some time you will have a file called smor.a. Compact it so that fst-proc can read it:

$ fst-compact smor.a smor.ac

Now you can use it:

$ cd ../../src
$ echo "Ich habe ein Bier" |  fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ 
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$

It should also work with deformatters and reformatters:

$ echo "Ich habe <em>ein</em> bier" | apertium-deshtml  |  ./fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[ 
<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][

Note: fst-proc currently does a rather crude tokenisation based on spaces, so multiwords aren't possible.
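
The output shown above is in the Apertium stream format: each lexical unit is written as ^surface/analysis/...$, superblanks from the deformatter appear in square brackets, and an unknown word has a single analysis that starts with *. As a rough sketch of how to post-process this output in the shell, the helper functions below (split_units and unknown_words, hypothetical names that are not part of SFST or Apertium) list the words the analyser did not recognise. It assumes that no unescaped ^ or $ occurs inside superblanks or analyses.

```shell
# Print each lexical unit (^surface/analysis/...$) on its own line.
# Assumption: no unescaped ^ or $ occurs inside superblanks or analyses.
split_units() {
    grep -o '\^[^$]*\$'
}

# Print the surface form of every unit whose first analysis starts with '*',
# which is how unknown words appear in the output above.
unknown_words() {
    split_units | sed -n 's|^\^\([^/]*\)/\*.*\$$|\1|p'
}

# Shortened sample in the same shape as the fst-proc output above.
sample='^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$[ <em>]^ein/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$'
printf '%s\n' "$sample" | unknown_words
# prints: Bier
```

In practice the functions would be placed at the end of the pipeline, for example after fst-proc ../data/SMOR/smor.ac.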

Morphologies

SFST has the following morphologies available for download:

Performance

The analysers produced are fast. With a 1.3 MB analyser (SMOR), it processes ~1,100 words per second. For comparison, lttoolbox processes ~5,000 words per second.
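
To put the two figures in perspective, the sketch below estimates how long each tool would take on a corpus of a given size. The function name estimate_seconds and the one-million-word corpus are illustrative assumptions; only the two rates come from the text above.

```shell
# Estimated seconds needed to analyse N words at a given words-per-second rate.
# Usage: estimate_seconds WORDS RATE
estimate_seconds() {
    awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", n / rate }'
}

estimate_seconds 1000000 1100   # SFST with SMOR: prints 909
estimate_seconds 1000000 5000   # lttoolbox: prints 200
```

On these numbers, a million-word corpus would take roughly 15 minutes with SFST and a little over 3 minutes with lttoolbox.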

External links