Difference between revisions of "SFST"

From Apertium

Revision as of 12:43, 31 March 2009

SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers.

Downloading

A packaged version, including the fst-proc program for processing Apertium input streams, can be downloaded from Apertium SVN:

$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/branches/sfst

Compiling

Follow the standard steps:

$ sh autogen.sh
$ ./configure
$ make
$ make install

Mac users should first run:

$ sudo port install libtool 
$ sudo ln -s /opt/local/bin/glibtoolize /bin/libtoolize # or wherever

Also, the call to basename(name) in src/fst-proc.C needs to be changed to, e.g., name if you want to compile without the libiberty sources lying around.


Usage

To try SFST out, you can start by compiling the German transducer, SMOR, that comes with the package:

$ cd data/SMOR
$ make

After some time you will have a file called morph.a. You now need to compact it so that it can be read by fst-proc:

$ fst-compact morph.a morph.ac

Now you can use it:

$ cd ../../src
$ echo "Ich habe ein Bier" | fst-proc ../data/SMOR/morph.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ 
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
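
The output above follows the Apertium stream format: each lexical unit is wrapped in ^ and $, the surface form comes first, and the analyses follow, separated by /. An unknown word is echoed back with a leading *. As a rough sketch of how this output could be inspected, the hypothetical helper below (count_analyses is not part of SFST or Apertium) prints each surface form with its number of analyses. It assumes that no analysis contains a literal /, ^ or $, and that grep supports -o (GNU and BSD grep do).

```shell
# Hypothetical helper, not part of SFST: read fst-proc output on stdin and
# print "surface<TAB>number of analyses" for every lexical unit.
# Assumption: analyses never contain a literal "/", "^" or "$".
count_analyses() {
    # Put each ^...$ unit on its own line; text outside units is dropped.
    grep -o '\^[^$]*\$' |
    awk '{
        unit = substr($0, 2, length($0) - 2)   # strip the leading ^ and trailing $
        n = split(unit, parts, "/")            # parts[1] is the surface form
        count = n - 1
        # A single entry starting with * marks an unknown word, not an analysis.
        if (count == 1 && substr(parts[2], 1, 1) == "*") count = 0
        print parts[1] "\t" count
    }'
}

printf '%s\n' '^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ ^Bier/*Bier$' | count_analyses
```

For the sample line this reports four analyses for habe and zero for Bier, which makes unknown words easy to spot in longer output.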

It should also work with deformatters and reformatters:

$ echo "Ich habe <em>ein</em> Bier" | apertium-deshtml | ./fst-proc ../data/SMOR/morph.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[ 
<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][

Note: fst-proc currently does a rather crude tokenisation based on spaces, so multiwords are not yet possible.
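
To see why space-based tokenisation rules out multiwords, the sketch below models the splitting step with tr. This is only an illustration of the behaviour described in the note, not fst-proc's actual code, and split_on_spaces is a hypothetical name.

```shell
# Hypothetical model of space-based tokenisation (not fst-proc's own code):
# every run of spaces ends a token, so each word is looked up separately.
split_on_spaces() {
    tr -s ' ' '\n'
}

# "zum Beispiel" arrives as two separate tokens, so a multiword entry
# spanning both words could never be matched against the input.
echo "Ich trinke zum Beispiel Bier" | split_on_spaces
```

Each word ends up on its own line, so the analyser only ever sees one space-delimited token at a time.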

Morphologies

SFST has the following morphologies available for download:

Performance

The analysers produced are fast. With a 1.3 Mb analyser (SMOR), fst-proc processes ~1,100 words per second; for comparison, lttoolbox processes ~5,000 words per second.
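
To put these rates in perspective, the following sketch estimates how long a corpus would take to analyse at each speed. The helper name estimate_seconds and the one-million-word corpus are assumptions made for illustration; only the two rates come from the figures above.

```shell
# Hypothetical helper: estimated seconds = corpus size / words per second.
estimate_seconds() {
    awk -v words="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", words / rate }'
}

# Illustrative corpus of 1,000,000 words (an assumed size).
estimate_seconds 1000000 1100   # SMOR at ~1,100 words per second
estimate_seconds 1000000 5000   # lttoolbox at ~5,000 words per second
```

At these rates the assumed corpus takes roughly 909 seconds with SMOR and 200 seconds with lttoolbox, a difference of about 4.5 times.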

External links