Difference between revisions of "Analysing Finnish text"

From Apertium

Revision as of 10:28, 20 August 2011

Installation

First make a directory called something like "source" in your home directory. The commands below assume you start in that directory.

Install SFST

$ sudo apt-get install libreadline5-dev
$ wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6a.tar.gz
$ tar -xzvf SFST-1.4.6a.tar.gz
$ cd SFST/src
$ make
$ sudo make install
$ sudo make libinstall
$ cd ../..

Install OpenFST

$ wget http://openfst.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.2.6.tar.gz
$ tar -xzvf openfst-1.2.6.tar.gz 
$ cd openfst-1.2.6/
$ ./configure
$ make
$ sudo make install
$ cd ..

Install HFST

$ wget http://downloads.sourceforge.net/project/hfst/hfst/hfst-2.4.1.tar.gz
$ tar -xzvf hfst-2.4.1.tar.gz 
$ cd hfst-2.4.1/
$ ./configure
$ make 
$ sudo make install
$ cd ..

Install Omorfi

$ svn co http://svn.gna.org/svn/omorfi/trunk omorfi
$ cd omorfi
$ sh autogen.sh
$ ./configure 
$ make
$ sudo make install
$ cd ..

Install VISLCG

Some hints can be found here:

The main vislcg3 page is at http://beta.visl.sdu.dk/cg3.html.
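
After installation, it may be worth checking that the command-line tools used in the rest of this page are actually on the PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, and the tool names are simply the ones that appear in the usage examples on this page.

```shell
#!/bin/sh
# Report which of the given commands can be found on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
# (check_tools is a hypothetical helper, not part of any package.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names taken from the usage examples on this page.
check_tools hfst-optimized-lookup hfst-proc vislcg3 \
    || echo "some tools are not installed" >&2
```

If a tool is reported missing even though "make install" succeeded, the install prefix (often /usr/local/bin) may not be on the PATH.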

Usage

Morphological analysis

Testing

$ cd omorfi/src
$ echo "auton" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
auton	auto+N+Sg+Gen
$ echo "autojen" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
autojen	auto+N+Pl+Gen

Morphological disambiguation

The analysis chain is as follows:

  1. take text
  2. preprocess: change it to one token per line (with proper punctuation handling)
  3. morphological analysis
  4. change it from omorfi output to vislcg3 input
  5. run it through vislcg3
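
Steps 2 and 4 of the chain can be sketched with standard tools. This is only an illustration under stated assumptions: tokenise and to_cg are hypothetical helper names, the punctuation handling is deliberately naive, and the lookup output is assumed to be one tab-separated "surface, analysis" pair per line (as in the Testing section above), with a blank line between tokens.

```shell
#!/bin/sh
# Step 2 (sketch): one token per line, punctuation split off.
# Naive: splits on whitespace and a fixed set of punctuation marks only.
tokenise() {
    sed 's/\([.,;:!?()"]\)/ \1 /g' | tr -s ' \t' '\n\n' | sed '/^$/d'
}

# Step 4 (sketch): turn lines like "auton<TAB>auto+N+Sg+Gen" into
# CG cohorts:  "<auton>"  followed by  <TAB>"auto" N Sg Gen
# Assumes a blank line separates the analyses of different tokens.
to_cg() {
    awk -F'\t' '
        NF < 2 { prev = ""; next }          # blank line ends a cohort
        ($1 "") != prev {                   # new surface form: open a cohort
            printf "\"<%s>\"\n", $1
            prev = $1
        }
        {
            n = split($2, part, "[+]")      # lemma first, then the tags
            printf "\t\"%s\"", part[1]
            for (i = 2; i <= n; i++) printf " %s", part[i]
            printf "\n"
        }
    '
}

# Possible use of the whole chain (your-grammar.rlx is a placeholder):
# echo "Alussa loi Jumala taivaan ja maan." | tokenise \
#     | hfst-optimized-lookup mor-omorfi.cg.hfst.ol | to_cg \
#     | vislcg3 -g your-grammar.rlx
```

The grammar's tag set has to match what the analyser emits, so the placeholder grammar in the commented pipeline would need to be one written for omorfi-style tags.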

In the following Apertium-style example, steps 2, 3 and 4 above are combined into a single operation, hfst-proc. In the Giellatekno environment they are run separately. The command below is run with the --trace option, which appends to each reading the type of the rule that applied (MAP, REMOVE, etc.) and its line number in the CG file (here apertium-sme-fin.fin-sme.rlx). For example, MAP:2859 indicates that the @+FMAINV tag was added by a mapping rule on line 2859. Running the same command without the --trace option gives cleaner output, but with less information.
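
When a trace is long, it can help to tally the TYPE:LINE references (such as MAP:2859 in the output that follows) to see which rules fired and how often. A minimal sketch, assuming trace references always have the form of an upper-case rule type, a colon and a line number; trace_rules is a hypothetical helper name, and grep -o is a widely available extension rather than strict POSIX.

```shell
#!/bin/sh
# Tally rule references in vislcg3 --trace output read from stdin.
# Prints one "TYPE:LINE count" pair per line, sorted by rule reference.
# (trace_rules is a hypothetical helper, not part of vislcg3.)
trace_rules() {
    grep -oE '[A-Z]+:[0-9]+' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

To use it, append "| trace_rules" to a vislcg3 --trace command; each line of the result can then be looked up by line number in the grammar file.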

$ echo "Alussa loi Jumala taivaan ja maan." | hfst-proc -C fin-sme.automorf.hfst | vislcg3 --trace -g apertium-sme-fin.fin-sme.rlx
VISL CG-3 Disambiguator version 0.9.7.6378
Codepage: default UTF-8, input UTF-8, output UTF-8, grammar UTF-8
Parsing grammar took 0.06 seconds.
Grammar has 6 sections, 0 templates, 1379 rules, 1312 sets, 748 c-tags, 1299 s-tags.
"<Alussa>"
	"alku" N Sg Ine 
"<loi>"
	"luoda" V Act Ind Prt Sg3 @+FMAINV MAP:2859 
"<Jumala>"
	"Jumala" N Prop Sg Nom @→N MAP:2694 
	"jumala" N Sg Nom @→N MAP:2694 
"<taivaan>"
	"taivas" N Sg Gen @→N MAP:2680 
"<ja>"
	"ja" CC @CNP MAP:2651 
;	"ja" Pcle REMOVE:820 
"<maan>"
	"maa" N Sg Gen @←OBJ MAP:2785 
"<.>"
	"." Punct CLB ADD:793