Difference between revisions of "SFST"
Latest revision as of 17:48, 30 March 2012
SFST (Stuttgart Finite State Toolkit) is a set of programs that can be used for writing morphological analysers. It is also one of the possible backends for HFST.
Installation
Prerequisites
On Ubuntu/Debian:
sudo apt-get install libreadline-dev
Download, compile, install
wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6g.tar.gz
tar -xzvf SFST-1.4.6g.tar.gz
cd SFST/src
make
sudo make install
sudo make libinstall
Note: on Arch Linux, you may have to do make clean && make CFLAGS=-fPIC instead of make.
Usage
To try SFST out, you can start by compiling the German transducer, SMOR, that comes with the package:
$ cd data/SMOR
$ make
Wait some time, and you will have a file called smor.a. You now need to compact it so that it can be read by fst-proc:
$ fst-compact smor.a smor.ac
Now you can use it:
$ cd ../../src
$ echo "Ich habe ein Bier" | fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$
^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$ ^Bier/*Bier$
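The output above is in the Apertium stream format: each word is wrapped as ^surface/analysis1/analysis2$, and an analysis beginning with * marks a word the analyser did not recognise. As a rough illustration of how such output could be consumed downstream, here is a minimal Python sketch. The function names are hypothetical, and the regular expression is a simplification that ignores escaped characters.

```python
import re

# Matches one lexical unit of the form ^surface/analysis1/analysis2$.
# Simplification (assumption): escaped characters such as \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def parse_lexical_units(stream):
    """Return a list of (surface_form, [analyses]) pairs found in `stream`."""
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface = match.group(1)
        analyses = match.group(2).split("/")
        units.append((surface, analyses))
    return units


def is_unknown(analyses):
    """True if the only analysis is the '*'-prefixed unknown-word marker."""
    return len(analyses) == 1 and analyses[0].startswith("*")


if __name__ == "__main__":
    output = (
        "^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$ "
        "^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>$ "
        "^Bier/*Bier$"
    )
    for surface, analyses in parse_lexical_units(output):
        status = "unknown" if is_unknown(analyses) else f"{len(analyses)} analyses"
        print(surface, status)
```

Counting analyses per word in this way gives a quick view of how ambiguous the analyser's output is and which words it does not cover.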
It should also work with deformatters and reformatters:
$ echo "Ich habe <em>ein</em> bier" | apertium-deshtml | ./fst-proc ../data/SMOR/smor.ac
^Ich/<CAP>ich<+PPRO><pers><1><Sg><NoGend><Nom>$
^habe/haben<+V><1><Sg><Pres><Konj>/haben<+V><3><Sg><Pres><Konj>/haben<+V><Imp><Sg>/haben<+V><1><Sg><Pres><Ind>$[
<em>]^ein/ein<+ART><Indef><Masc><Nom><Sg>/ein<+ART><Indef><Neut><Nom><Sg>/ein<+ART><Indef><Neut><Akk><Sg>$[<\/em> ]^Bier/*Bier$.[][
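In the deformatted output, the HTML markup travels through the analyser inside square brackets (superblanks), so that a reformatter can restore it later. The following Python sketch shows one way to separate superblanks from the analysed text. The function name is hypothetical, and the pattern assumes that superblanks do not contain an unescaped closing bracket.

```python
import re

# A superblank is a bracketed stretch of formatting, e.g. "[ <em>]".
# Assumption: no unescaped "]" occurs inside a superblank.
SUPERBLANK = re.compile(r"\[[^\]]*\]")


def split_superblanks(stream):
    """Return (text_without_superblanks, list_of_superblanks)."""
    blanks = SUPERBLANK.findall(stream)
    text = SUPERBLANK.sub("", stream)
    return text, blanks


if __name__ == "__main__":
    stream = (
        "^habe/haben<+V><Imp><Sg>$[ <em>]"
        "^ein/ein<+ART><Indef><Masc><Nom><Sg>$[<\\/em> ]"
        "^Bier/*Bier$"
    )
    text, blanks = split_superblanks(stream)
    print(text)
    print(blanks)
```

Keeping the superblanks in a separate list, rather than discarding them, mirrors the idea that formatting is set aside during analysis and put back afterwards.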
Note: fst-proc currently does a rather crude tokenisation based on spaces, so multiwords aren't possible.
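To see why space-based tokenisation rules out multiwords, consider the toy Python sketch below. It is not fst-proc's actual code: the lexicon, its tags and the function names are made up for illustration. Because each token is looked up on its own, an entry whose key contains a space can never match.

```python
def tokenise_on_spaces(text):
    """Split text at runs of whitespace (a stand-in for crude tokenisation)."""
    return text.split()


def lookup(tokens, lexicon):
    """Look up each token separately; unknown tokens get a '*' prefix."""
    return [(token, lexicon.get(token, "*" + token)) for token in tokens]


if __name__ == "__main__":
    # Hypothetical toy lexicon with one multiword entry; the tags are invented.
    lexicon = {
        "zum Beispiel": "zum Beispiel<adv>",
        "Bier": "Bier<n>",
    }
    tokens = tokenise_on_spaces("zum Beispiel Bier")
    for token, analysis in lookup(tokens, lexicon):
        print(token, analysis)
```

Here "zum" and "Beispiel" are each reported as unknown even though the lexicon contains "zum Beispiel", because the space that joins them has already been used as a token boundary.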
Morphologies
SFST has the following morphologies available for download:
- Morph-IT! (Italian, 34,968 lemmas, LGPL)
- Omorfi–SFST implementation of word form morphology of Finnish (Finnish, 93,510 lemmas, LGPL)
  - For further information see: Omorfi
- SMOR, included in the SFST distribution (German, 1,096 lemmas, GPL)
- trmorph (Turkish, wide coverage)
  - For further information see: Turkish
Performance
The analysers produced are fast. With a 1.3 MB analyser (SMOR), SFST processes ~1,100 words per second. For comparison, lttoolbox processes ~5,000 words per second.
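To put these rates in perspective, the short Python sketch below estimates how long each tool would need for a corpus of a given size. The corpus size is a hypothetical figure chosen for illustration; only the two words-per-second rates come from the text above.

```python
SFST_WORDS_PER_SECOND = 1100       # approximate rate quoted above for SMOR
LTTOOLBOX_WORDS_PER_SECOND = 5000  # approximate rate quoted above


def seconds_to_process(words, words_per_second):
    """Estimated processing time in seconds at a constant rate."""
    return words / words_per_second


if __name__ == "__main__":
    corpus_words = 1_000_000  # hypothetical corpus size
    sfst = seconds_to_process(corpus_words, SFST_WORDS_PER_SECOND)
    lttoolbox = seconds_to_process(corpus_words, LTTOOLBOX_WORDS_PER_SECOND)
    print(f"SFST:      {sfst / 60:.1f} minutes")
    print(f"lttoolbox: {lttoolbox / 60:.1f} minutes")
    print(f"ratio:     {sfst / lttoolbox:.1f}x")
```

At these rates the difference is a factor of roughly 4.5, which matters little for interactive use but adds up on large corpora.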