Difference between revisions of "Analysing Finnish text"

From Apertium
#redirect[[Finnish]]
[[Analyser un texte finnois|En français]]

{{TOCD}}

==Installation==

First make a directory called something like "source" in your home directory. The commands below assume you start in that directory.
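
A minimal sketch of that first step is shown below. The variable name ''SRC_DIR'' is only an illustration and is not used elsewhere on this page; any directory name works as long as the later commands are started from it.

```shell
# Create the working directory (default: ~/source) and move into it.
# SRC_DIR is a hypothetical variable name used only in this sketch.
SRC_DIR="${SRC_DIR:-$HOME/source}"
mkdir -p "$SRC_DIR"
cd "$SRC_DIR"
```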

===Install SFST===

Note: you might have to uncomment the FPIC line in the Makefile to avoid a relocation error when running ''sudo make libinstall''.

<pre>
$ sudo apt-get install libreadline5-dev
$ wget ftp://ftp.ims.uni-stuttgart.de/pub/corpora/SFST/SFST-1.4.6a.tar.gz
$ tar -xzvf SFST-1.4.6a.tar.gz
$ cd SFST/src
$ make
$ sudo make install
$ sudo make libinstall
$ cd ../..
</pre>

===Install OpenFST===

<pre>
$ wget http://openfst.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.2.6.tar.gz
$ tar -xzvf openfst-1.2.6.tar.gz
$ cd openfst-1.2.6/
$ ./configure
$ make
$ sudo make install
$ cd ..
</pre>

===Install HFST===
<!--
<pre>
$ svn co https://hfst.svn.sourceforge.net/svnroot/hfst/trunk hfst
$ cd hfst/hfst3
$ sh autogen.sh
$ ./configure --without-foma
$ make
$ sudo make install
$ cd ..
</pre>
-->

<pre>
$ wget http://downloads.sourceforge.net/project/hfst/hfst/hfst-2.4.1.tar.gz
$ tar -xzvf hfst-2.4.1.tar.gz
$ cd hfst-2.4.1/
$ ./configure
$ make
$ sudo make install
$ cd ..
</pre>

===Install Omorfi===

<pre>
$ svn co http://svn.gna.org/svn/omorfi/trunk omorfi
$ cd omorfi
$ sh autogen.sh
$ ./configure
$ make
$ sudo make install
$ cd ..
</pre>

===Install VISLCG===

Some hints can be found here:
* [http://giellatekno.uit.no/doc/tools/docu-vislcg3.html Giellatekno's vislcg3 installation page]
* [http://giellatekno.uit.no/doc/tools/cg3-usage.html Giellatekno's vislcg3 usage page]

The main vislcg3 page is at [http://beta.visl.sdu.dk/cg3.html http://beta.visl.sdu.dk/cg3.html].
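
Before moving on to the usage examples, it can help to check that the installed tools are actually found on the PATH. The following is a rough sketch only: the function name ''missing_tools'' is made up for this page, and the tool names are the ones used in the Usage section below.

```shell
# Print the name of every listed tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any of the packages.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools used in the Usage section of this page.
missing_tools hfst-optimized-lookup hfst-proc vislcg3
```

If nothing is printed, all three commands were found.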

==Usage==

===Morphological analysis===

====Testing====

<pre>
$ cd omorfi/src
$ echo "auton" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
auton auto+N+Sg+Gen
$ echo "autojen" | hfst-optimized-lookup mor-omorfi.cg.hfst.ol
autojen auto+N+Pl+Gen
</pre>

===Morphological disambiguation===

The analysis chain is as follows:

# take text
# preprocess: change it to one token per line (with proper punctuation handling)
# morphological analysis
# change it from omorfi output to vislcg3 input
# run it through vislcg3

In the following Apertium-style example, steps 2, 3 and 4 above are combined into a single operation, ''hfst-proc''. In [http://giellatekno.uit.no/doc/tools/docu-sme-manual.html the Giellatekno environment] they are run separately. The command below is run with the ''--trace'' option, which appends to each reading the type of the rule that applied (MAP, REMOVE, etc.) and that rule's line number in the CG file (here ''apertium-sme-fin.fin-sme.rlx''). For example, ''MAP:2859'' indicates that the @+FMAINV tag was added by a mapping rule on line 2859. Running the same command without ''--trace'' gives cleaner output, but with less information.

<pre>
$ echo "Alussa loi Jumala taivaan ja maan." | hfst-proc -C fin-sme.automorf.hfst | vislcg3 --trace -g apertium-sme-fin.fin-sme.rlx
VISL CG-3 Disambiguator version 0.9.7.6378
Codepage: default UTF-8, input UTF-8, output UTF-8, grammar UTF-8
Parsing grammar took 0.06 seconds.
Grammar has 6 sections, 0 templates, 1379 rules, 1312 sets, 748 c-tags, 1299 s-tags.
"<Alussa>"
"alku" N Sg Ine
"<loi>"
"luoda" V Act Ind Prt Sg3 @+FMAINV MAP:2859
"<Jumala>"
"Jumala" N Prop Sg Nom @→N MAP:2694
"jumala" N Sg Nom @→N MAP:2694
"<taivaan>"
"taivas" N Sg Gen @→N MAP:2680
"<ja>"
"ja" CC @CNP MAP:2651
; "ja" Pcle REMOVE:820
"<maan>"
"maa" N Sg Gen @←OBJ MAP:2785
"<.>"
"." Punct CLB ADD:793
</pre>
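
To see at a glance which rules fired, the trace tags can be pulled out of the output. The sketch below assumes the tags appear as whitespace-separated tokens of the plain form TYPE:LINE, as in the output above; other tag shapes are not handled. The function name ''trace_rules'' is made up for this example.

```shell
# Read vislcg3 --trace output on stdin and print one "TYPE LINE" pair
# for every token of the form TYPE:LINE (e.g. MAP:2859 -> "MAP 2859").
# trace_rules is a hypothetical helper name.
trace_rules() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[A-Z]+:[0-9]+$/) {
                split($i, part, ":")
                print part[1], part[2]
            }
    }'
}

# Intended use: append "| trace_rules" to the vislcg3 --trace command above.
```

The printed line numbers can then be looked up directly in the ''.rlx'' file.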


===Morphological disambiguation within the Giellatekno framework===

Note that the Finnish analysers are being improved (at an admittedly slow pace) in Giellatekno's $GTHOME/langs/fin branch:

<pre>
$ echo "Alussa loi Jumala taivaan ja maan." | preprocess | ufin | lookup2cg | vislcg3 -g main/langs/fin/src/syntax/disambiguation.cg3
"<Alussa>"
"alku" N Sg Ine
"<loi>"
"luoda" V Act Ind Pst Sg3
"<Jumala>"
"Jumala" N Prop Sg Nom @→N
"<taivaan>"
"taivas" N Sg Gen @→N
"<ja>"
"ja" CC @CNP
"<maan>"
"maa" N Sg Gen @←OBJ
"<.>"
"." Punct CLB
</pre>
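
The ''lookup2cg'' step in the pipeline above turns analyser output into the cohort format that vislcg3 reads. The sketch below only illustrates the idea and is not the actual Giellatekno script. It assumes input lines of the form "surface form, whitespace, lemma+Tag+Tag" as in the Testing section, with blank lines between tokens and no spaces inside lemmas. The function name ''lookup_to_cg'' is made up.

```shell
# Convert lines such as "auton<TAB>auto+N+Sg+Gen" into CG cohorts:
#   "<auton>"
#   <TAB>"auto" N Sg Gen
# lookup_to_cg is a hypothetical name; this is not the real lookup2cg.
lookup_to_cg() {
    awk '
        NF < 2 { prev = ""; next }          # blank line: the next token starts a new cohort
        $1 != prev { printf "\"<%s>\"\n", $1; prev = $1 }
        {
            n = split($2, tag, "[+]")       # tag[1] is the lemma, the rest are tags
            printf "\t\"%s\"", tag[1]
            for (i = 2; i <= n; i++) printf " %s", tag[i]
            printf "\n"
        }
    '
}
```

Consecutive lines with the same surface form are treated as alternative readings of one cohort, which is why the blank line between tokens matters.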


[[Category:Documentation]]
[[Category:Documentation in English]]
[[Category:Finnish]]

Latest revision as of 19:33, 18 April 2017
