Difference between revisions of "Voikkospell"

 
To use <code>[</code> and <code>]</code> as characters, one must escape them.
 
   
===Examples===
====Trailing Newline====
 
<pre>
 
$ echo '' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
 
what(): 1:1: unexpected '\n', '\n' expected to follow '['
 
 
 
^
 
Aborted
 
</pre>
 
 
For this reason, when piping text directly to <code>voikkospell --apertium-stream</code>, use <code>echo -n</code> to suppress the trailing newline. This is not necessary when piping through tools such as <code>apertium-deshtml</code>, which wrap all newlines in superblanks.
 
 
One could also escape the newline:
 
 
<pre>
 
$ echo '\' | voikkospell --apertium-stream
 
</pre>
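
Since the behaviour of <code>echo -n</code> varies between shells, <code>printf '%s'</code> is a more portable way to send input without a trailing newline. The following is a minimal sketch; the helper name <code>emit</code> is hypothetical and not part of voikkospell.

```shell
# Hypothetical helper: write a string with no trailing newline.
# printf '%s' prints its argument verbatim and never appends a newline.
emit() {
    printf '%s' "$1"
}

# Usage (requires the voikkospell build described on this page):
#   emit '^a/*a$' | voikkospell --apertium-stream
```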
 
 
====Unanalysed Word====
 
<pre>
 
$ echo -n '^a/*a$' | voikkospell --apertium-stream
 
W: a
 
</pre>
 
 
====Analysed Words====
 
=====One Tag=====
 
<pre>
 
$ echo -n '^b/b<A>$' | voikkospell --apertium-stream
 
W: b
 
</pre>
 
 
=====More Than One Tag=====
 
<pre>
 
$ echo -n '^c/c<B><C>$' | voikkospell --apertium-stream
 
W: c
 
</pre>
 
 
====Ambiguous Word====
 
<pre>
 
$ echo -n '^d/d<D>/d<E><F>$' | voikkospell --apertium-stream
 
W: d
 
</pre>
 
 
====Multiwords====
 
=====One Word with Inner Inflection=====
 
<pre>
 
$ echo -n '^e f/e<G># f/e<H><I># f$' | voikkospell --apertium-stream
 
W: e f
 
</pre>
 
 
=====More Than One Word=====
 
======Without Inner Inflection======
 
<pre>
 
$ echo -n '^gh/g<J>+h<K><L>/g<M>+h<N>$' | voikkospell --apertium-stream
 
W: gh
 
</pre>
 
 
======With Inner Inflection======
 
<pre>
 
$ echo -n '^i jk/i<O># j+k<P><Q>/i<R># j+k<S>$ ^lm n/l<T>+m<U># n/l<V>+m<W># n$' | \
 
voikkospell --apertium-stream
 
W: i jk
 
W: lm n
 
</pre>
 
 
====Reserved Characters====
 
<code>\</code>, <code>^</code>, <code>/</code>, <code><</code>, <code>></code>, and <code>$</code> are reserved.
 
 
=====\=====
 
<pre>
 
$ echo -n '\' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedEndOfFile'
 
what(): 1:1: unexpected end-of-file following '\', end-of-file expected to follow ']' or '$'
 
\
 
^
 
Aborted
 
</pre>
 
 
=====^=====
 
<pre>
 
$ echo -n '^' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedEndOfFile'
 
what(): 1:2: unexpected end-of-file following '^', end-of-file expected to follow ']' or '$'
 
^
 
^
 
Aborted
 
</pre>
 
 
=====/=====
 
<pre>
 
$ echo -n '/' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
 
what(): 1:1: unexpected '/', '/' expected to follow '[', to follow '>' immediately, or to follow '^' or '#' not immediately
 
/
 
^
 
Aborted
 
</pre>
 
 
=====<=====
 
<pre>
 
$ echo -n '<' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
 
what(): 1:1: unexpected '<', '<' expected to follow '[', to follow '>' immediately, or to follow '/' or '+' not immediately
 
<
 
^
 
Aborted
 
</pre>
 
 
=====>=====
 
<pre>
 
$ echo -n '>' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
 
what(): 1:1: unexpected '>', '>' expected to follow '[' or to follow '<' not immediately
 
>
 
^
 
Aborted
 
</pre>
 
 
=====$=====
 
<pre>
 
$ echo -n '$' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedReservedCharacter'
 
what(): 1:1: unexpected '$', '$' expected to follow '[', to follow '>' immediately, or to follow '*' or '#' not immediately
 
$
 
^
 
Aborted
 
</pre>
 
 
=====Escape=====
 
To avoid these errors, escape all reserved characters.
 
 
<pre>
 
$ echo -n '\\\^\/\<\>\$' | voikkospell --apertium-stream
 
</pre>
 
 
=====Superblank=====
 
Alternatively, one can enclose reserved characters in superblanks.
 
 
<pre>
 
$ echo -n '[^/<>$]' | voikkospell --apertium-stream
 
</pre>
 
 
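However, <code>\</code> must still be escaped inside a superblank; otherwise it escapes the closing <code>]</code> and the superblank is never terminated.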
 
 
<pre>
 
$ echo -n '[\]' | voikkospell --apertium-stream
 
terminate called after throwing an instance of 'Apertium::ApertiumStream::UnexpectedEndOfFile'
 
what(): 1:3: unexpected end-of-file following '[', end-of-file expected to follow ']' or '$'
 
[\]
 
^
 
Aborted
 
</pre>
 
 
====Putting It All Together====
 
 
Let's spellcheck a webpage!
 
 
voikkospell's webpage has a mixture of English and Finnish words, so we should get a good mixture of correct and incorrect spellings.
 
 
Since voikkospell only checks spelling, it doesn't matter which analyser we use. In this example, I use <code>apertium-en-ca</code>'s English analyser.
 
   
 
<pre>
 
<pre>
$ curl -s http://voikko.puimula.org/ | apertium-deshtml | \
 
lt-proc ~/svn.code.sf.net/p/apertium/svn/trunk/apertium-en-ca/en-ca.automorf.bin | \
 
voikkospell --apertium-stream
 
W: .
 
C: Voikko
 
W: Free
 
W: linguistic
 
W: software
 
W: for
 
W: Finnish
 
W: .
 
W: Free
 
W: linguistic
 
W: software
 
W: and
 
C: data
 
W: for
 
W: Finnish
 
W: .
 
C: Käyttäjät
 
W: Users
 
W: .
 
C: Käytä
 
C: Voikkoa
 
C: verkossa
 
W: .
 
W: Use
 
C: Voikko
 
W: online
 
W: .
 
C: Lataa
 
C: Voikon
 
C: asennuspaketti
 
W: .
 
C: Käyttö
 
C: sovellusohjelmissa
 
W: .
 
C: Käyttö
 
C: Linux
 
W: -
 
C: jakeluissa
 
W: .
 
C: Kielityökalut
 
C: LibreOfficessa
 
W: .
 
C: Usein
 
C: kysyttyjä
 
C: kysymyksiä
 
W: .
 
C: Yhteystiedot
 
W: .
 
W: Developers
 
W: .
 
W: Source
 
W: code
 
W: repositories
 
W: .
 
W: Development
 
W: wiki
 
W: .
 
W: Using
 
W: with
 
W: Java
 
W: .
 
W: Contributors
 
W: .
 
W: Contributing
 
W: .
 
C: Joukahainen
 
W: (
 
W: Finnish
 
W: vocabulary
 
W: )
 
W: .
 
C: Ohjeita
 
C: testaajille
 
W: .
 
W: Additional
 
W: reading
 
W: .
 
C: Jakelijat
 
W: Distributors
 
W: .
 
W: Source
 
C: file
 
W: releases
 
W: .
 
C: Release
 
W: notes
 
W: .
 
W: Supported
 
W: platforms
 
W: .
 
C: Linux
 
W: .
 
W: FreeBSD
 
W: .
 
W: Mac
 
W: OS
 
C: X
 
W: .
 
C: Windows
 
W: .
 
W: Architecture
 
W: and
 
W: history
 
W: .
 
W: Bugs
 
W: and
 
W: feature
 
W: requests
 
W: .
 
W: Communication
 
W: and
 
W: contact
 
W: information
 
W: .
 
C: Voikko
 
W: is
 
C: a
 
W: spelling
 
W: and
 
W: grammar
 
W: checker
 
W: ,
 
W: hyphenator
 
W: and
 
W: collection
 
W: of
 
W: related
 
W: linguistic
 
C: data
 
W: for
 
W: Finnish
 
W: language
 
W: .
 
W: Most of
 
W: the
 
W: material
 
C: on
 
W: this
 
C: web
 
W: site
 
W: is
 
W: in
 
W: English
 
W: .
 
W: Pages
 
W: written
 
W: in
 
W: Finnish
 
W: contain
 
W: information
 
W: for
 
W: end
 
W: users
 
W: who
 
W: may
 
W: not
 
W: always
 
W: understand
 
W: English
 
W: .
 
W: .
 
C: Tämä
 
C: on
 
C: Voikko
 
W: -
 
C: kielityökalujen
 
C: kotisivu
 
W: .
 
C: Voikko
 
C: on
 
C: ohjelmisto
 
C: suomen
 
C: kielen
 
C: oikeinkirjoituksen
 
C: ja
 
C: kieliopin
 
C: tarkistamiseen
 
W: ,
 
C: tavutukseen
 
C: sekä
 
C: sanojen
 
C: analysointiin
 
W: .
 
C: Tämä
 
C: sivusto
 
C: on
 
C: suurelta
 
C: osin
 
C: englanniksi
 
W: ,
 
C: koska
 
C: kaikki
 
C: Voikon
 
C: kanssa
 
C: työskentelevät
 
C: ohjelmistokehittäjät
 
C: eivät
 
C: osaa
 
C: suomea
 
W: .
 
W: .
 
C: Uutisia
 
W: News
 
W: .
 
C: 2015
 
W: -
 
C: 11
 
W: -
 
C: 12
 
W: :
 
W: Transitioning
 
W: the
 
W: Finnish
 
W: dictionary
 
W: from
 
W: Malaga
 
W: to
 
W: VFST
 
W: .
 
C: 2014
 
W: -
 
C: 01
 
W: -
 
C: 26
 
W: :
 
C: Tilastoja
 
C: vuodelta
 
C: 2013
 
C: ja
 
C: kehityssuunnitelmia
 
C: alkuvuodelle
 
C: 2014
 
W: .
 
C: 2013
 
W: -
 
C: 10
 
W: -
 
C: 07
 
W: :
 
C: Käyttäjäkyselyn
 
C: tulokset
 
C: ja
 
C: tilannepäivitystä
 
W: .
 
C: 2013
 
W: -
 
C: 02
 
W: -
 
C: 03
 
W: :
 
C: Tilastoja
 
C: vuodelta
 
C: 2012
 
C: ja
 
C: kehityssuunnitelmia
 
C: vuodelle
 
C: 2013
 
W: .
 
C: 2012
 
W: -
 
C: 08
 
W: -
 
C: 23
 
W: :
 
C: Voikko
 
W: for
 
W: Android
 
W: available
 
W: for
 
W: early
 
W: preview
 
W: .
 
C: 2012
 
W: -
 
C: 04
 
W: -
 
C: 25
 
W: :
 
C: Suomen
 
C: kielen
 
W: VFST
 
W: -
 
C: morfologian
 
C: kehitys
 
C: aloitettu
 
W: .
 
 
</pre>
 
</pre>
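
Output this long is easier to read once filtered. The sketch below assumes, as the output above suggests, that <code>W:</code> marks a word voikkospell rejects and <code>C:</code> a word it accepts; the function name <code>misspelled</code> is hypothetical.

```shell
# Hypothetical filter: read voikkospell output on stdin and print each
# rejected word once. Assumes rejected words are reported as "W: word".
misspelled() {
    # Keep only the "W: " lines, strip the prefix, then de-duplicate.
    # LC_ALL=C makes the sort order byte-wise and reproducible.
    sed -n 's/^W: //p' | LC_ALL=C sort -u
}

# Usage (requires the voikkospell build described on this page):
#   ... | voikkospell --apertium-stream | misspelled
```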
   

Revision as of 21:48, 17 December 2015

Installation

m5w/corevoikko, a fork of corevoikko, supports apertium stream format.

To clone it, execute the following command:

git clone https://github.com/m5w/corevoikko.git corevoikko

First, install libvoikko's dependencies. Next, execute the following commands:

cd corevoikko/libvoikko
./configure
make
sudo make install

If you do not have root privileges, or would like to choose where libvoikko is installed, execute the following instead. (Otherwise, installation is finished.)

cd corevoikko/libvoikko
PREFIX="$HOME/install/corevoikko" # e.g.
./configure --prefix="$PREFIX"
make
make install

Finally, add "$PREFIX/bin" to your "$PATH" by appending the following lines to your .profile:

PREFIX="$HOME/install/corevoikko" # e.g.
if [ -d "$PREFIX" ]; then
        export PATH="$PREFIX/bin:$PATH"
fi
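
If .profile may be sourced more than once, the same directory can end up in "$PATH" repeatedly. The following is a minimal sketch of a variant that avoids this; the function name prepend_path is hypothetical, and the directory is the example prefix used above.

```shell
# Hypothetical helper: prepend a directory to PATH only if it exists
# and is not already listed.
prepend_path() {
    [ -d "$1" ] || return 0          # ignore directories that do not exist
    case ":$PATH:" in
        *":$1:"*) ;;                  # already present: leave PATH unchanged
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}

prepend_path "$HOME/install/corevoikko/bin"
```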

Using voikkospell with apertium Stream Format

Invoke voikkospell with --apertium-stream. voikkospell then expects apertium-stream-formatted input instead of a list of words.

Words in apertium Stream Format

apertium stream format encodes words as lexical units. Each begins with a ^

^ . . .

and ends with a $.

^ . . . $

The word immediately follows the ^,

^word . . .$

and a / immediately follows the word. If the word is unknown, *word follows;

^word/*word$

otherwise, all the word's analyses follow, delimited by /'s.

^word/word<n><sg>/word<vblex><inf>/word<vblex><pres>$
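
The rules above can be sketched as a small shell function that assembles a lexical unit from a word and its analyses. The name lexical_unit is hypothetical, and the sketch assumes the word and analyses contain no reserved characters that need escaping.

```shell
# Hypothetical helper: build one lexical unit.
# $1 is the surface word; any further arguments are its analyses.
lexical_unit() {
    word=$1
    shift
    if [ "$#" -eq 0 ]; then
        # Unknown word: the only "analysis" is the word prefixed with *.
        printf '^%s/*%s$' "$word" "$word"
        return
    fi
    printf '^%s' "$word"
    for analysis in "$@"; do
        printf '/%s' "$analysis"     # analyses are delimited by /
    done
    printf '$'
}

lexical_unit word 'word<n><sg>' 'word<vblex><inf>' 'word<vblex><pres>'
```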

Escaping

To use ^, $, /, <, and > as characters, one must escape them. Each escape sequence begins with a \,

\ . . .

and a character follows. voikkospell then interprets the character literally. Note that the character can be any wide character, including a newline.

To use \'s as characters, one must escape them.
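
Escaping by hand is error-prone, so a filter can help. The following is a minimal sketch using sed; the function name apertium_escape is hypothetical, and it escapes \, ^, $, /, <, and > as well as [ and ], which the Superblanks section also requires to be escaped.

```shell
# Hypothetical helper: backslash-escape every reserved character in $1.
apertium_escape() {
    # The bracket expression lists ] [ \ ^ $ / < > as literal characters;
    # \\& in the replacement emits a backslash followed by the match.
    printf '%s' "$1" | sed 's|[][\\^$/<>]|\\&|g'
}

apertium_escape '1 < 2 $'
```

The escaped text can then be placed between lexical units without being mistaken for stream syntax.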

Superblanks

One can also escape multiple characters not encoded in lexical units by encoding them as a superblank. Each superblank begins with a [

[ . . .

and ends with a ].

[ . . . ]

Each ^, $, /, <, and > between the [ and the ] is interpreted literally.

To use [ and ] as characters, one must escape them.
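
A superblank can be produced the same way. This sketch escapes only \, [, and ] and wraps the result in brackets, on the assumption (from the rules above) that every other reserved character is taken literally inside a superblank. The function name superblank is hypothetical.

```shell
# Hypothetical helper: wrap $1 in a superblank.
superblank() {
    # Inside a superblank only \ [ ] still need a backslash.
    printf '[%s]' "$(printf '%s' "$1" | sed 's|[][\\]|\\&|g')"
}

superblank '^/<>$'
```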

An HTML Example

Let's spellcheck the following webpage:

<!DOCTYPE html>
<html>
    <head>
        <title>An HTML Example</title>
    </head>
    <body>
        <p>
        This is an HTML example.
        </p>
    </body>
</html>
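
One way to run such a page through voikkospell is the pipeline shown in the earlier revision of this page: strip the HTML with apertium-deshtml, analyse with lt-proc, then spellcheck. The wrapper below is only a sketch; the function name spellcheck_html is hypothetical, and the analyser path is whatever compiled analyser is available locally (the earlier revision used apertium-en-ca's en-ca.automorf.bin).

```shell
# Hypothetical wrapper around the deshtml | lt-proc | voikkospell pipeline.
# $1: path to an HTML file
# $2: path to a compiled analyser, e.g. en-ca.automorf.bin
spellcheck_html() {
    apertium-deshtml < "$1" | lt-proc "$2" | voikkospell --apertium-stream
}

# Usage (file names are placeholders):
#   spellcheck_html example.html en-ca.automorf.bin
```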