Voikkospell

Revision as of 21:40, 17 December 2015 by M5w (talk | contribs) (→‎Superblanks)

Installation

m5w/corevoikko, a fork of corevoikko, adds support for the apertium stream format.

To clone it, execute the following command:

git clone https://github.com/m5w/corevoikko.git corevoikko

First, install libvoikko's dependencies. Next, execute the following commands:

cd corevoikko/libvoikko
./configure
make
sudo make install

If you do not have root privileges, or would like to choose where libvoikko is installed, execute the following instead of the commands above. Otherwise, installation is complete.

cd corevoikko/libvoikko
PREFIX="$HOME/install/corevoikko" # e.g.
./configure --prefix="$PREFIX"
make
make install

Finally, add "$PREFIX/bin" to your "$PATH" by appending the following lines to your .profile:

PREFIX="$HOME/install/corevoikko" # e.g.
if [ -d "$PREFIX" ]; then
        export PATH="$PREFIX/bin:$PATH"
fi
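
After opening a new login shell, it can be worth confirming that the shell actually finds voikkospell. The following is a minimal sketch; check_on_path is a hypothetical helper name, not part of corevoikko.

```shell
# Hypothetical helper: report where a command resolves from,
# or return non-zero if it is not on PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s found at %s\n' "$1" "$(command -v "$1")"
    else
        printf '%s not found on PATH\n' "$1" >&2
        return 1
    fi
}

check_on_path voikkospell || printf 'check PREFIX and .profile\n' >&2
```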

Using voikkospell with apertium Stream Format

Invoke voikkospell with --apertium-stream. voikkospell then expects apertium-stream-formatted input instead of a list of words.

Words in apertium Stream Format

apertium stream format encodes words as lexical units. Each begins with a ^

^ . . .

and ends with a $.

^ . . . $

The word immediately follows the ^,

^word . . .$

and a / immediately follows the word. If the word is unknown, *word follows;

^word/*word$

otherwise, all the word's analyses follow, delimited by /'s.

^word/word<n><sg>/word<vblex><inf>/word<vblex><pres>$
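
The rules above can be sketched as a small shell function that assembles a lexical unit from a word and zero or more analyses. make_lu is a hypothetical helper for illustration only. It does not escape reserved characters, so it assumes the word and the analyses are already safe to embed.

```shell
# Hypothetical helper: build one lexical unit.
# Usage: make_lu WORD [ANALYSIS...]
make_lu() {
    word=$1
    shift
    if [ "$#" -eq 0 ]; then
        # No analyses given: mark the word as unknown with *word.
        printf '^%s/*%s$' "$word" "$word"
    else
        printf '^%s' "$word"
        for analysis in "$@"; do
            # Each analysis is preceded by a / delimiter.
            printf '/%s' "$analysis"
        done
        printf '$'
    fi
}

make_lu word 'word<n><sg>' 'word<vblex><inf>'; echo
make_lu word; echo
```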

Escaping

To use ^, $, /, <, and > as characters, one must escape them. Each escape sequence begins with a \,

\ . . .

and a character follows. voikkospell then interprets that character literally. Note that the character can be any wide character, including a newline.

To use \'s as characters, one must escape them.
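
When plain text has to be passed through as literal characters, the escaping can be automated. The following is a minimal sketch using sed; escape_apertium is a hypothetical name, and the set of escaped characters is only the one this page lists (\, ^, $, /, <, >, plus [ and ], which delimit superblanks).

```shell
# Hypothetical helper: prefix every reserved character on stdin with \.
# In the bracket expression, ] comes first so that it is taken literally;
# | is used as the s-command delimiter so that / needs no special care.
escape_apertium() {
    sed 's|[][\\^$/<>]|\\&|g'
}

printf '%s\n' 'a^b$c/d<e>f\g[h]i' | escape_apertium
```

The function works line by line, so newlines in the input are left untouched.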

Superblanks

Several characters that are not part of a lexical unit can also be protected at once by enclosing them in a superblank. Each superblank begins with a [

[ . . .

and ends with a ].

[ . . . ]

Each ^, $, /, <, and > between the [ and the ] is interpreted literally.

To use [ and ] as characters, one must escape them. A \ must also be escaped, even inside a superblank.
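
Wrapping arbitrary text in a superblank can be sketched as follows. wrap_superblank is a hypothetical helper. It escapes only \, [, and ], on the assumption (taken from this page) that every other reserved character is read literally between the brackets.

```shell
# Hypothetical helper: enclose the first argument in a superblank.
# Trailing newlines in the argument are dropped by the command substitution.
wrap_superblank() {
    printf '[%s]' "$(printf '%s' "$1" | sed 's|[][\\]|\\&|g')"
}

wrap_superblank '^/<>$'; echo
wrap_superblank 'a\b[c]'; echo
```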

Examples

Trailing Newline

$ echo '' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
  what():  1:1: unexpected '\n', '\n' expected to follow '['


^
Aborted

For this reason, when piping text directly to voikkospell --apertium-stream, use echo -n. It is not necessary to do this when piping through tools such as apertium-deshtml, which encapsulate all newlines in superblanks.

One could also escape the newline:

$ echo '\' | voikkospell --apertium-stream
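
Since echo -n is not portable across shells, printf '%s' is a safer way to emit a stream without a trailing newline. A minimal sketch, with emit_stream as a hypothetical helper name:

```shell
# Hypothetical helper: write the argument exactly as given,
# with no trailing newline and no backslash interpretation.
emit_stream() {
    printf '%s' "$1"
}

# wc -c counts only the six bytes of the lexical unit.
emit_stream '^a/*a$' | wc -c
```

Piping emit_stream '^a/*a$' into voikkospell --apertium-stream should then behave like the echo -n examples on this page.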

Unanalysed Word

$ echo -n '^a/*a$' | voikkospell --apertium-stream
W: a

Analysed Words

One Tag
$ echo -n '^b/b<A>$' | voikkospell --apertium-stream
W: b
More Than One Tag
$ echo -n '^c/c<B><C>$' | voikkospell --apertium-stream
W: c

Ambiguous Word

$ echo -n '^d/d<D>/d<E><F>$' | voikkospell --apertium-stream
W: d

Multiwords

One Word with Inner Inflection
$ echo -n '^e f/e<G># f/e<H><I># f$' | voikkospell --apertium-stream
W: e f
More Than One Word
Without Inner Inflection
$ echo -n '^gh/g<J>+h<K><L>/g<M>+h<N>$' | voikkospell --apertium-stream
W: gh
With Inner Inflection
$ echo -n '^i jk/i<O># j+k<P><Q>/i<R># j+k<S>$ ^lm n/l<T>+m<U># n/l<V>+m<W># n$' | \
voikkospell --apertium-stream
W: i jk
W: lm n
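
In each example above, the word that voikkospell reports is the surface form between the ^ and the first /. The following sketch extracts those surface forms without voikkospell, which can help when checking what a stream contains. surface_forms is a hypothetical helper. It relies on grep -o (available in GNU and BSD grep, but not required by POSIX) and does not handle escaped reserved characters.

```shell
# Hypothetical helper: print the surface form of every lexical unit on stdin.
surface_forms() {
    # Match from a ^ up to (not including) the first / or $,
    # then drop the leading ^ from each match.
    grep -o '\^[^/$]*' | cut -c2-
}

printf '%s\n' '^i jk/i<O># j+k<P><Q>$ ^lm n/l<T>+m<U># n$' | surface_forms
```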

Reserved Characters

\, ^, /, <, >, and $ are reserved.

\
$ echo -n '\' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedEndOfFile'
  what():  1:1: unexpected end-of-file following '\', end-of-file expected to follow ']' or '$'
\
^
Aborted
^
$ echo -n '^' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedEndOfFile'
  what():  1:2: unexpected end-of-file following '^', end-of-file expected to follow ']' or '$'
^
^
Aborted
/
$ echo -n '/' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
  what():  1:1: unexpected '/', '/' expected to follow '[', to follow '>' immediately, or to follow '^' or '#' not immediately
/
^
Aborted
<
$ echo -n '<' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
  what():  1:1: unexpected '<', '<' expected to follow '[', to follow '>' immediately, or to follow '/' or '+' not immediately
<
^
Aborted
>
$ echo -n '>' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
  what():  1:1: unexpected '>', '>' expected to follow '[' or to follow '<' not immediately
>
^
Aborted
$
$ echo -n '$' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
  what():  1:1: unexpected '$', '$' expected to follow '[', to follow '>' immediately, or to follow '*' or '#' not immediately
$
^
Aborted
Escape

To avoid these errors, escape all reserved characters.

$ echo -n '\\\^\/\<\>\$' | voikkospell --apertium-stream
Superblank

Alternatively, one can enclose reserved characters in superblanks.

$ echo -n '[^/<>$]' | voikkospell --apertium-stream

However, \ must be escaped.

$ echo -n '[\]' | voikkospell --apertium-stream
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedEndOfFile'
  what():  1:3: unexpected end-of-file following '[', end-of-file expected to follow ']' or '$'
[\]
  ^
Aborted

Putting It All Together

Let's spellcheck a webpage!

voikkospell's webpage has a mixture of English and Finnish words, so the Finnish spellchecker should give us a good mixture of correct (C:) and incorrect (W:) spellings.

Since voikkospell only checks spelling, it doesn't matter which analyser we use. In this example, I use apertium-en-ca's English analyser.

$ curl -s http://voikko.puimula.org/ | apertium-deshtml | \
lt-proc ~/svn.code.sf.net/p/apertium/svn/trunk/apertium-en-ca/en-ca.automorf.bin | \
voikkospell --apertium-stream
W: .
C: Voikko
W: Free
W: linguistic
W: software
W: for
W: Finnish
W: .
W: Free
W: linguistic
W: software
W: and
C: data
W: for
W: Finnish
W: .
C: Käyttäjät
W: Users
W: .
C: Käytä
C: Voikkoa
C: verkossa
W: .
W: Use
C: Voikko
W: online
W: .
C: Lataa
C: Voikon
C: asennuspaketti
W: .
C: Käyttö
C: sovellusohjelmissa
W: .
C: Käyttö
C: Linux
W: -
C: jakeluissa
W: .
C: Kielityökalut
C: LibreOfficessa
W: .
C: Usein
C: kysyttyjä
C: kysymyksiä
W: .
C: Yhteystiedot
W: .
W: Developers
W: .
W: Source
W: code
W: repositories
W: .
W: Development
W: wiki
W: .
W: Using
W: with
W: Java
W: .
W: Contributors
W: .
W: Contributing
W: .
C: Joukahainen
W: (
W: Finnish
W: vocabulary
W: )
W: .
C: Ohjeita
C: testaajille
W: .
W: Additional
W: reading
W: .
C: Jakelijat
W: Distributors
W: .
W: Source
C: file
W: releases
W: .
C: Release
W: notes
W: .
W: Supported
W: platforms
W: .
C: Linux
W: .
W: FreeBSD
W: .
W: Mac
W: OS
C: X
W: .
C: Windows
W: .
W: Architecture
W: and
W: history
W: .
W: Bugs
W: and
W: feature
W: requests
W: .
W: Communication
W: and
W: contact
W: information
W: .
C: Voikko
W: is
C: a
W: spelling
W: and
W: grammar
W: checker
W: ,
W: hyphenator
W: and
W: collection
W: of
W: related
W: linguistic
C: data
W: for
W: Finnish
W: language
W: .
W: Most of
W: the
W: material
C: on
W: this
C: web
W: site
W: is
W: in
W: English
W: .
W: Pages
W: written
W: in
W: Finnish
W: contain
W: information
W: for
W: end
W: users
W: who
W: may
W: not
W: always
W: understand
W: English
W: .
W: .
C: Tämä
C: on
C: Voikko
W: -
C: kielityökalujen
C: kotisivu
W: .
C: Voikko
C: on
C: ohjelmisto
C: suomen
C: kielen
C: oikeinkirjoituksen
C: ja
C: kieliopin
C: tarkistamiseen
W: ,
C: tavutukseen
C: sekä
C: sanojen
C: analysointiin
W: .
C: Tämä
C: sivusto
C: on
C: suurelta
C: osin
C: englanniksi
W: ,
C: koska
C: kaikki
C: Voikon
C: kanssa
C: työskentelevät
C: ohjelmistokehittäjät
C: eivät
C: osaa
C: suomea
W: .
W: .
C: Uutisia
W: News
W: .
C: 2015
W: -
C: 11
W: -
C: 12
W: :
W: Transitioning
W: the
W: Finnish
W: dictionary
W: from
W: Malaga
W: to
W: VFST
W: .
C: 2014
W: -
C: 01
W: -
C: 26
W: :
C: Tilastoja
C: vuodelta
C: 2013
C: ja
C: kehityssuunnitelmia
C: alkuvuodelle
C: 2014
W: .
C: 2013
W: -
C: 10
W: -
C: 07
W: :
C: Käyttäjäkyselyn
C: tulokset
C: ja
C: tilannepäivitystä
W: .
C: 2013
W: -
C: 02
W: -
C: 03
W: :
C: Tilastoja
C: vuodelta
C: 2012
C: ja
C: kehityssuunnitelmia
C: vuodelle
C: 2013
W: .
C: 2012
W: -
C: 08
W: -
C: 23
W: :
C: Voikko
W: for
W: Android
W: available
W: for
W: early
W: preview
W: .
C: 2012
W: -
C: 04
W: -
C: 25
W: :
C: Suomen
C: kielen
W: VFST
W: -
C: morfologian
C: kehitys
C: aloitettu
W: .
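
To see only the words that voikkospell rejected, the output above can be filtered on its prefix. The following is a minimal sketch, assuming the C:/W: line format shown above; misspelled is a hypothetical helper name.

```shell
# Hypothetical helper: keep the W: lines from voikkospell output on stdin,
# strip the prefix, and list each rejected word once.
misspelled() {
    sed -n 's/^W: //p' | LC_ALL=C sort -u
}

# A few lines copied from the output above serve as sample input.
printf 'W: Free\nC: Voikko\nW: linguistic\nW: Free\nC: data\n' | misspelled
```

In the pipeline above, misspelled would be appended after voikkospell --apertium-stream. Punctuation such as . and - is also reported with W:, so it will appear in the list unless it is filtered out separately.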