Difference between revisions of "SFST"

From Apertium
Latest revision as of 17:48, 30 March 2012

SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers. It is also one of the possible backends for HFST.

Installation

Prerequisites

On Ubuntu/Debian:

sudo apt-get install libreadline-dev

Download, compile, install

 wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6g.tar.gz
 tar -xzvf SFST-1.4.6g.tar.gz
 cd SFST/src
 make
 sudo make install
 sudo make libinstall

Note: on Arch Linux, you may need to run make clean && make CFLAGS=-fPIC instead of a plain make.

Usage

To try SFST out, start by compiling SMOR, the German transducer that comes with the package. Run this from the top-level SFST directory:

$ cd data/SMOR
$ make

After some time you will have a file called smor.a. Now compact it so that fst-proc can read it:

$ fst-compact smor.a smor.ac

Now you can use it:

$ cd ../../src
$ echo "Ich habe ein Bier" | fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$ 
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
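
The output above uses the Apertium stream format: each token is written as ^surface/analysis/analysis...$. As a minimal sketch for inspecting it, the helper below prints one token per line with the surface form and its analyses separated by tabs. The function name split_analyses is hypothetical (it is not part of SFST or Apertium), and the sketch assumes a grep that supports -o (GNU and BSD grep do). The sample string is a shortened copy of the output shown above, so the snippet runs without fst-proc installed.

```shell
# Hypothetical helper (not shipped with SFST): read Apertium stream format
# on stdin and print one token per line as "surface<TAB>analysis<TAB>...".
split_analyses() {
  # 1. Pull out every ^...$ unit, one per line.
  # 2. Strip the leading ^ and the trailing $.
  # 3. Turn the / separators into tabs.
  grep -o '\^[^$]*\$' | sed -e 's/^\^//' -e 's/\$$//' | tr '/' '\t'
}

# Shortened sample of fst-proc output, so no transducer is needed here.
sample='^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ ^Bier/*Bier$'
printf '%s\n' "$sample" | split_analyses
```

In real use the helper would sit at the end of the pipeline, for example after fst-proc ../data/SMOR/smor.ac.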

It should also work with deformatters and reformatters:

$ echo "Ich habe <em>ein</em> Bier" | apertium-deshtml | fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ 
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[ 
<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][

Note: fst-proc currently tokenises rather crudely on spaces, so multiwords are not yet possible.
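
Words the analyser does not recognise come back with an asterisk in front of the form, as in ^Bier/*Bier$ above. The sketch below lists those unknown surface forms, which can help when checking coverage. The function name unknown_words is hypothetical, and the sketch again assumes a grep with -o. The sample is a shortened copy of the deformatted output above, so the superblanks in square brackets are present and are simply skipped.

```shell
# Hypothetical helper (not shipped with SFST): print the surface form of
# every token whose analysis starts with *, i.e. every unknown word.
unknown_words() {
  # Match ^surface/*...$ units only, then keep the part between ^ and /.
  grep -o '\^[^/$]*/\*[^$]*\$' | sed -e 's/^\^//' -e 's,/.*,,'
}

# Shortened sample of deformatted fst-proc output.
sample='<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>$[<\/em> ]^Bier/*Bier$.[]['
printf '%s\n' "$sample" | unknown_words
```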

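Morphologies

SFST has the following morphologies available for download:

* Morph-IT! (Italian, 34,968 lemmas, LGPL): http://dev.sslmit.unibo.it/linguistics/morph-it.php
* Omorfi-SFST implementation of word form morphology of Finnish (Finnish, 93,510 lemmas, LGPL): https://kitwiki.csc.fi/twiki/bin/view/KitWiki/OMorFiSFSTVersion
** For further information see the Omorfi page.
* SMOR, which comes in the SFST distribution (German, 1,096 lemmas, GPL)
* trmorph (Turkish, wide coverage)
** For further information see the Turkish page.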

Performance

The analysers produced are fast. With a 1.3 MB analyser (SMOR), fst-proc processes ~1,100 words per second. For comparison, lttoolbox processes ~5,000 words per second.
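
To put these rates in perspective, the sketch below estimates how long a corpus would take at a given rate. The function name estimate_seconds and the corpus size of 1,000,000 words are assumptions chosen for illustration; the two rates are the approximate figures quoted above, so the results are rough estimates only.

```shell
# Hypothetical helper: estimated processing time in whole seconds,
# rounded up, for a corpus of $1 words at $2 words per second.
estimate_seconds() {
  words=$1
  rate=$2
  echo $(( (words + rate - 1) / rate ))
}

# Illustrative corpus size (an assumption, not a figure from this page).
corpus=1000000
echo "SFST (SMOR): ~$(estimate_seconds "$corpus" 1100) s"
echo "lttoolbox:   ~$(estimate_seconds "$corpus" 5000) s"
```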

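External links

* SFST - Stuttgart Finite State Transducer Tools: http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html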