Analysing Finnish text
Installation
First make a directory called something like "source" in your home directory. The commands below assume you start in that directory.
Install SFST
$ sudo apt-get install libreadline5-dev
$ wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.2.tar.gz
$ tar -xzvf SFST-1.4.2.tar.gz
$ cd SFST/src
$ make
$ sudo make install
$ sudo make libinstall
$ cd ../..
Install OpenFST
$ wget http://openfst.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.2.6.tar.gz
$ tar -xzvf openfst-1.2.6.tar.gz
$ cd openfst-1.2.6/
$ ./configure
$ make
$ sudo make install
$ cd ..
Install HFST
$ wget http://downloads.sourceforge.net/project/hfst/hfst/hfst-2.4.1.tar.gz
$ tar -xzvf hfst-2.4.1.tar.gz
$ cd hfst-2.4.1/
$ ./configure
$ make
$ sudo make install
$ cd ..
Install Omorfi
$ svn co http://svn.gna.org/svn/omorfi/trunk omorfi
$ cd omorfi
$ sh autogen.sh
$ ./configure
$ make
$ sudo make install
$ cd ..
Install VISLCG
Some hints can be found here:
The main vislcg3 page is at http://beta.visl.sdu.dk/cg3.html.
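Once the packages above are installed, a quick check that the command-line tools are reachable can save some confusion later. The following is only a sketch: the helper name missing_tools is made up for this page, and the two tool names are the ones used in the Usage section.

```shell
# Print the name of every given command that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names taken from the commands used later on this page.
missing_tools hfst-optimized-lookup vislcg3
```

If nothing is printed, both tools were found.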
Usage
Morphological analysis
Testing
$ cd omorfi/src
$ echo "auton" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
auton auto+N+Sg+Gen
$ echo "autojen" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
autojen auto+N+Pl+Gen
Morphological disambiguation
The analysis chain is as follows:
- take text
- preprocess: change it to one token per line (with proper punctuation handling)
- morphological analysis
- change it from omorfi output to vislcg3 input
- run it through vislcg3
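The chain above could be sketched in shell roughly as follows. This is an illustration under several assumptions, not a tested recipe. The function names are made up for this page, and the tokenizer is deliberately crude: it splits on ASCII punctuation only, so abbreviations and decimal numbers get broken up. The lookup output is assumed to be tab-separated ("surface form, TAB, analysis") with a blank line after each token. The vislcg3 input is assumed to be the usual cohort format, with the word form in "<...>" and each reading on an indented line. The grammar file name finnish.cg3 is hypothetical.

```shell
# Step 2: one token per line, with punctuation split off as separate tokens.
tokenize() {
    sed -e 's/\([.,;:!?()"]\)/ \1 /g' | tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Step 4: turn "surface<TAB>lemma+Tag+Tag" lines into vislcg3 cohorts.
# Assumes a blank line separates the analyses of consecutive tokens.
omorfi_to_cg() {
    awk -F '\t' '
        BEGIN  { start = 1 }
        NF < 2 { start = 1; next }            # blank line: next token begins
        start  { printf "\"<%s>\"\n", $1; start = 0 }
        {
            n = split($2, part, "[+]")        # part[1] = lemma, rest = tags
            printf "\t\"%s\"", part[1]
            for (i = 2; i <= n; i++) printf " %s", part[i]
            printf "\n"
        }
    '
}

# Steps 1 to 5 chained together. Needs the tools installed above;
# finnish.cg3 is a hypothetical grammar file name.
disambiguate() {
    tokenize |
        hfst-optimized-lookup mor-omorfi.cg.hfst.ol |
        omorfi_to_cg |
        vislcg3 --grammar "${CG_GRAMMAR:-finnish.cg3}"
}
```

Run from omorfi/src, where the analyser file lives, usage would look like this:

$ echo "Auton ovi on auki." | disambiguate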