Difference between revisions of "Analysing Finnish text"

From Apertium

Revision as of 14:37, 7 October 2014

Installation

First make a directory called something like "source" in your home directory. The commands below assume you start in that directory.

Install SFST

Note: you may need to uncomment the FPIC line in the Makefile to avoid a relocation error during sudo make libinstall.

$ sudo apt-get install libreadline5-dev
$ wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6a.tar.gz
$ tar -xzvf SFST-1.4.6a.tar.gz
$ cd SFST/src
$ make
$ sudo make install
$ sudo make libinstall
$ cd ../..

Install OpenFST

$ wget http://openfst.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.2.6.tar.gz
$ tar -xzvf openfst-1.2.6.tar.gz 
$ cd openfst-1.2.6/
$ ./configure
$ make
$ sudo make install
$ cd ..

Install HFST

$ wget http://downloads.sourceforge.net/project/hfst/hfst/hfst-2.4.1.tar.gz
$ tar -xzvf hfst-2.4.1.tar.gz 
$ cd hfst-2.4.1/
$ ./configure
$ make 
$ sudo make install
$ cd ..

Install Omorfi

$ svn co http://svn.gna.org/svn/omorfi/trunk omorfi
$ cd omorfi
$ sh autogen.sh
$ ./configure 
$ make
$ sudo make install
$ cd ..

Install VISLCG

Some hints can be found here:

The main vislcg3 page is at http://beta.visl.sdu.dk/cg3.html.

Usage

Morphological analysis

Testing

$ cd omorfi/src
$ echo "auton" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
auton	auto+N+Sg+Gen
$ echo "autojen" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
autojen	auto+N+Pl+Gen

Morphological disambiguation

The analysis chain is as follows:

  1. take text
  2. preprocess: change it to one token per line (with proper punctuation handling)
  3. morphological analysis
  4. change it from omorfi output to vislcg3 input
  5. run it through vislcg3
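
Step 4 can be illustrated with a small filter. The following is a minimal sketch, not the converter used by Apertium or Giellatekno. It assumes lookup output in the two-column form shown under Testing (surface form, a tab, then lemma+Tag+Tag), with blank lines between tokens, and it ignores any further columns. The function name lookup_to_cg is hypothetical.

```shell
# Hypothetical helper: turn "surface<TAB>lemma+Tag+Tag" lines into CG-3 cohorts.
lookup_to_cg() {
  awk -F'\t' '
    # A line without an analysis column ends the current token.
    NF < 2 { open = 0; next }
    # Start a new cohort when a new token begins or the surface form changes.
    (!open) || ($1 != prev) { printf "\"<%s>\"\n", $1; prev = $1; open = 1 }
    {
      # The first "+"-separated part is the lemma, the rest are tags.
      n = split($2, parts, "[+]")
      printf "\t\"%s\"", parts[1]
      for (i = 2; i <= n; i++) printf " %s", parts[i]
      printf "\n"
    }'
}

# Usage with the analysis line from the Testing section:
printf 'auton\tauto+N+Sg+Gen\n' | lookup_to_cg
```

Several analyses of the same token become several readings under one cohort, which is the layout vislcg3 reads in the examples on this page.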

In the following Apertium-style example, steps 2, 3 and 4 above are combined into a single operation, hfst-proc. In the Giellatekno environment they are run separately. The command below is run with the --trace option, which reports the rule type (MAP, REMOVE, etc.) and the line number in the CG file for each rule applied (here apertium-sme-fin.fin-sme.rlx; for example, MAP:2859 indicates that the @+FMAINV tag was added by a mapping rule on line 2859). Running the same command without the --trace option gives cleaner output, but with less information.

$ echo "Alussa loi Jumala taivaan ja maan." | hfst-proc -C fin-sme.automorf.hfst | vislcg3 --trace -g apertium-sme-fin.fin-sme.rlx
VISL CG-3 Disambiguator version 0.9.7.6378
Codepage: default UTF-8, input UTF-8, output UTF-8, grammar UTF-8
Parsing grammar took 0.06 seconds.
Grammar has 6 sections, 0 templates, 1379 rules, 1312 sets, 748 c-tags, 1299 s-tags.
"<Alussa>"
	"alku" N Sg Ine 
"<loi>"
	"luoda" V Act Ind Prt Sg3 @+FMAINV MAP:2859 
"<Jumala>"
	"Jumala" N Prop Sg Nom @→N MAP:2694 
	"jumala" N Sg Nom @→N MAP:2694 
"<taivaan>"
	"taivas" N Sg Gen @→N MAP:2680 
"<ja>"
	"ja" CC @CNP MAP:2651 
;	"ja" Pcle REMOVE:820 
"<maan>"
	"maa" N Sg Gen @←OBJ MAP:2785 
"<.>"
	"." Punct CLB ADD:793 

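When reading a trace such as the one above, it can help to list only the rules that fired. The sketch below is an illustration rather than a feature of vislcg3. It assumes that trace annotations always have the form TYPE:LINE, as in the output above (MAP:2859, REMOVE:820). The function name list_fired_rules is hypothetical.

```shell
# Hypothetical helper: print every TYPE:LINE trace annotation, one per line.
list_fired_rules() {
  # -o prints only the matching part; -E enables extended regular expressions.
  grep -oE '[A-Z]+:[0-9]+'
}

# Usage with two lines copied from the trace output:
printf '\t"luoda" V Act Ind Prt Sg3 @+FMAINV MAP:2859 \n;\t"ja" Pcle REMOVE:820 \n' | list_fired_rules
# MAP:2859
# REMOVE:820
```

In practice the function would read directly from the vislcg3 --trace pipeline. Appending sort | uniq -c would then count how often each rule applied.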


Morphological disambiguation within the Giellatekno framework

Note that the Finnish analysers are being improved (at an admittedly slow pace) in Giellatekno's $GTHOME/langs/fin branch:

$ echo "Alussa loi Jumala taivaan ja maan." | preprocess | ufin | lookup2cg | vislcg3 -g main/langs/fin/src/syntax/disambiguation.cg3
"<Alussa>"
	"alku" N Sg Ine 
"<loi>"
	"luoda" V Act Ind Pst Sg3 
"<Jumala>"
	"Jumala" N Prop Sg Nom @→N 
"<taivaan>"
	"taivas" N Sg Gen @→N 
"<ja>"
	"ja" CC @CNP 
"<maan>"
	"maa" N Sg Gen @←OBJ 
"<.>"
	"." Punct CLB