Difference between revisions of "Analysing Finnish text"

Revision as of 12:59, 19 January 2011

Installation

First make a directory called something like "source" in your home directory. The commands below assume you start in that directory.

Install SFST

$ sudo apt-get install libreadline5-dev
$ wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.2.tar.gz
$ tar -xzvf SFST-1.4.2.tar.gz
$ cd SFST/src
$ make
$ sudo make install
$ sudo make libinstall
$ cd ../..

Install OpenFST

$ wget http://openfst.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.2.6.tar.gz
$ tar -xzvf openfst-1.2.6.tar.gz 
$ cd openfst-1.2.6/
$ ./configure
$ make
$ sudo make install
$ cd ..

Install HFST

$ wget http://downloads.sourceforge.net/project/hfst/hfst/hfst-2.4.1.tar.gz
$ tar -xzvf hfst-2.4.1.tar.gz 
$ cd hfst-2.4.1/
$ ./configure
$ make 
$ sudo make install
$ cd ..

Install Omorfi

$ svn co http://svn.gna.org/svn/omorfi/trunk omorfi
$ cd omorfi
$ sh autogen.sh
$ ./configure 
$ make
$ sudo make install
$ cd ..

Install VISLCG

Some hints can be found here:

The main vislcg3 page is at http://beta.visl.sdu.dk/cg3.html.
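
Once everything is installed, it can be worth confirming that the commands used in the Usage section are actually on your PATH. The following is a minimal sketch, not part of any of the packages above; the helper name check_tools is made up for illustration, and the tool names are simply the ones used later on this page.

```shell
# Report which of the named commands are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
# check_tools is a hypothetical helper, not shipped with HFST or VISLCG.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names taken from the usage examples on this page.
check_tools hfst-optimized-lookup hfst-proc vislcg3 ||
    echo "Some tools are missing; revisit the installation steps."
```

If a tool is reported missing even though its "make install" step succeeded, the install prefix may not be on your PATH.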

Usage

Morphological analysis

Testing

$ cd omorfi/src
$ echo "auton" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
auton	auto+N+Sg+Gen
$ echo "autojen" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
autojen	auto+N+Pl+Gen

Morphological disambiguation

The analysis chain is as follows:

1. Take the input text.
2. Preprocess it into one token per line, handling punctuation properly.
3. Run morphological analysis.
4. Convert the omorfi output into vislcg3 input format.
5. Run the result through vislcg3.
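
The preprocessing and conversion steps of this chain can be sketched as two small shell functions. This is only an illustration under stated assumptions: the names tokenise and lookup_to_cg are hypothetical, the tokeniser naively splits off a fixed set of punctuation marks (so abbreviations and decimal numbers will be split too), and the converter assumes analyser output in the tab-separated "surface form, lemma+Tag+Tag" shape shown in the Testing section, with blank lines, if any, separating tokens. The example command below uses hfst-proc -C instead, which appears to emit vislcg3-style input directly, so a converter like this would only matter when the analysis comes from hfst-optimized-lookup.

```shell
# Step 2: one token per line, with punctuation split into its own token.
# Naive: every listed punctuation mark becomes a separate token.
tokenise() {
    sed 's/[.,;:!?()"]/ & /g' | awk '{ for (i = 1; i <= NF; i++) print $i }'
}

# Step 4: turn "surface<TAB>lemma+Tag+Tag" lines into CG cohorts:
#   "<surface>"
#   <TAB>"lemma" Tag Tag
lookup_to_cg() {
    awk -F '\t' '
        BEGIN { fresh = 1 }
        NF < 2 { fresh = 1; next }           # blank line: next token starts
        fresh || $1 != prev {                # new token: print cohort header
            printf "\"<%s>\"\n", $1
            prev = $1
            fresh = 0
        }
        {
            n = split($2, parts, "[+]")      # lemma first, then the tags
            printf "\t\"%s\"", parts[1]
            for (i = 2; i <= n; i++) printf " %s", parts[i]
            printf "\n"
        }
    '
}

# Demonstration on the analysis line from the Testing section.
printf 'auton\tauto+N+Sg+Gen\n' | lookup_to_cg
```

With the analyser from the Testing section, the pieces might be chained as: tokenise < input.txt | hfst-optimized-lookup mor-omorfi.cg.hfst.ol | lookup_to_cg | vislcg3 -g grammar.rlx, where input.txt and grammar.rlx are placeholder file names.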

$ echo "Alussa loi Jumala taivaan ja maan." | hfst-proc -C fin-sme.automorf.hfst | vislcg3 --trace -g apertium-sme-fin.fin-sme.rlx
VISL CG-3 Disambiguator version 0.9.7.6378
Codepage: default UTF-8, input UTF-8, output UTF-8, grammar UTF-8
Parsing grammar took 0.06 seconds.
Grammar has 6 sections, 0 templates, 1379 rules, 1312 sets, 748 c-tags, 1299 s-tags.
"<Alussa>"
	"alku" N Sg Ine 
"<loi>"
	"luoda" V Act Ind Prt Sg3 @+FMAINV MAP:2859 
"<Jumala>"
	"Jumala" N Prop Sg Nom @→N MAP:2694 
	"jumala" N Sg Nom @→N MAP:2694 
"<taivaan>"
	"taivas" N Sg Gen @→N MAP:2680 
"<ja>"
	"ja" CC @CNP MAP:2651 
;	"ja" Pcle REMOVE:820 
"<maan>"
	"maa" N Sg Gen @←OBJ MAP:2785 
"<.>"
	"." Punct CLB ADD:793

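In the trace output above, readings that a rule removed are prefixed with ";" and each surviving reading carries the rule that touched it (for example MAP:2859). If only the final readings are wanted, a small filter can drop both. The sketch below is an assumption-laden illustration: strip_trace is a made-up name, and it treats any trailing token of the form UPPERCASE:number as a trace tag, which could in principle also match a genuine tag in some grammar. Running vislcg3 without --trace should make such a filter unnecessary.

```shell
# Drop removed readings (lines starting with ";") and strip rule-trace
# tags such as MAP:2859, REMOVE:820 or ADD:793 from the remaining lines.
# strip_trace is a hypothetical helper, not part of vislcg3.
strip_trace() {
    grep -v '^;' | sed -E 's/ +[A-Z]+:[0-9]+//g; s/ +$//'
}

# Demonstration on the "ja" cohort from the output above.
printf '"<ja>"\n\t"ja" CC @CNP MAP:2651 \n;\t"ja" Pcle REMOVE:820 \n' | strip_trace
```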