Voikkospell
Latest revision as of 23:14, 17 December 2015
==Installation==
[https://github.com/m5w/corevoikko m5w/corevoikko], a fork of [https://github.com/voikko/corevoikko corevoikko], supports apertium stream format.
To clone it, execute the following command.
<pre>
git clone https://github.com/m5w/corevoikko.git corevoikko
</pre>
First, install [https://github.com/voikko/corevoikko/libvoikko libvoikko]'s dependencies. Next, execute the following commands.
<pre>
cd corevoikko/libvoikko
./configure
make
sudo make install
</pre>
If you do not have root privileges or would like to specify where to install libvoikko, execute the following instead. (Otherwise, you are finished with installation.)
<pre>
cd corevoikko/libvoikko
PREFIX="$HOME/install/corevoikko" # e.g.
./configure --prefix="$PREFIX"
make
make install
</pre>
Finally, add <code>"$PREFIX/bin"</code> to your <code>"$PATH"</code> by appending the following lines to your <code>.profile</code>.
<pre>
PREFIX="$HOME/install/corevoikko" # e.g.
if [ -d "$PREFIX" ]; then
    export PATH="$PREFIX/bin:$PATH"
fi
</pre>
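After logging in again (or sourcing <code>.profile</code>), it can be worth confirming that the directory really ended up on <code>"$PATH"</code>. The following is only a minimal sketch: <code>path_contains</code> is a hypothetical helper name that is not part of corevoikko, and the <code>PREFIX</code> value is the same example value as above.

```shell
# Hypothetical helper (not part of corevoikko): succeed if the directory
# given as $1 is one of the colon-separated entries of $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

PREFIX="$HOME/install/corevoikko" # e.g.
if path_contains "$PREFIX/bin"; then
    echo "$PREFIX/bin is on PATH"
else
    echo "$PREFIX/bin is not on PATH yet" >&2
fi
```

If the check succeeds, <code>command -v voikkospell</code> should then print the location of the installed binary.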
==Using voikkospell with apertium Stream Format==
Invoke voikkospell with <code>--apertium-stream</code>. voikkospell then expects apertium-stream-formatted input instead of a list of words.
===Words in apertium Stream Format===
apertium stream format encodes words as '''lexical units'''. Each begins with a <code>^</code>
<pre>
^ . . .
</pre>
and ends with a <code>$</code>.
<pre>
^ . . . $
</pre>
The word immediately follows the <code>^</code>,
<pre>
^word . . .$
</pre>
and a <code>/</code> immediately follows the word. If the word is unknown, <code>*word</code> follows;
<pre>
^word/*word$
</pre>
otherwise, all the word's analyses follow, delimited by <code>/</code>'s.
<pre>
^word/word<n><sg>/word<vblex><inf>/word<vblex><pres>$
</pre>
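To make the layout concrete, the sketch below splits a lexical unit into its fields with standard tools. It assumes that the unit contains no escaped reserved characters; <code>lu_fields</code> and <code>lu_is_unknown</code> are hypothetical names chosen for this illustration and are not part of voikkospell or apertium.

```shell
# Hypothetical helpers for illustration only. They assume the lexical unit
# contains no escaped reserved characters (no "\/", "\$", and so on).

# Print the word on the first line and each analysis on a following line.
lu_fields() {
    printf '%s\n' "$1" | sed -e 's/^\^//' -e 's/\$$//' | tr '/' '\n'
}

# Succeed if the unit marks its word as unknown, i.e. "*" follows the "/".
lu_is_unknown() {
    case "$1" in
        '^'*'/*'*'$') return 0 ;;
        *) return 1 ;;
    esac
}

lu_fields '^word/word<n><sg>/word<vblex><inf>/word<vblex><pres>$'
lu_is_unknown '^word/*word$' && echo 'unknown word'
```

The first call prints the word followed by its three analyses, one per line; the second prints <code>unknown word</code>.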
===Escaping===
To use <code>^</code>, <code>$</code>, <code>/</code>, <code><</code>, and <code>></code> as characters, one must escape them. Each escape sequence begins with a <code>\</code>,
<pre>
\ . . .
</pre>
and a character follows. voikkospell then interprets the character literally. Note that the character can be any wide character, including newlines.
To use <code>\</code>'s as characters, one must escape them.
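As an illustration of the rule above, the following sketch escapes a piece of plain text before it is placed in a stream. <code>escape_reserved</code> is a hypothetical helper name, not something voikkospell provides, and it covers only the six characters named in this section; square brackets are not handled here.

```shell
# Hypothetical helper for illustration: prefix each of \ ^ $ / < > with a
# backslash. Other characters, including newlines, are passed through as-is.
escape_reserved() {
    printf '%s' "$1" | sed 's|[\\^$/<>]|\\&|g'
}

printf '%s\n' "$(escape_reserved 'a<b/c')"
```

The example call prints <code>a\<b\/c</code>.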
===Superblanks===
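One can also escape multiple characters not encoded in lexical units by encoding them as a '''superblank'''. Each superblank begins with a <code>[</code>
<pre>
[ . . .
</pre>
and ends with a <code>]</code>.
<pre>
[ . . . ]
</pre>
Each <code>^</code>, <code>$</code>, <code>/</code>, <code><</code>, and <code>></code> between the <code>[</code> and the <code>]</code> is interpreted literally.
To use <code>[</code> and <code>]</code> as characters, one must escape them.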
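A superblank can be built from arbitrary text along the following lines. This is a sketch under stated assumptions: <code>to_superblank</code> is a hypothetical helper name, and escaping <code>\</code> inside the superblank is an assumption carried over from the escaping rules above rather than something this section states.

```shell
# Hypothetical helper for illustration: wrap text in a superblank, escaping
# only [ ] and \ and leaving ^ $ / < > untouched.
to_superblank() {
    printf '[%s]' "$(printf '%s' "$1" | sed 's/[][\\]/\\&/g')"
}

printf '%s\n' "$(to_superblank '<a href="x[1]">')"
```

The example call prints <code>[<a href="x\[1\]">]</code>: the angle brackets stay as they are, while the inner square brackets are escaped.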
===An HTML Example===
Let's spellcheck the following webpage.
<pre>
<!DOCTYPE html>
<html>
<head>
<title>An HTML Example</title>
</head>
<body>
<p>
This is an HTML example.
</p>
</body>
</html>
</pre>
Running <code>apertium-deshtml</code> on it yields the following.
<pre>
.[][<!DOCTYPE html>
<html>
<head>
<title>]An HTML Example.[][<\/title>
<\/head>
<body>
<p>
]This is an HTML example..[][
<\/p>
<\/body>
<\/html>
]
</pre>
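Note that every <code><</code>, <code>></code>, and <code>/</code> is enclosed in a superblank. In fact, everything except the title and body paragraph is escaped. However, those words are not yet encoded as lexical units. Running <code>lt-proc</code> with a morphological analyser on the output yields the following, suitable for <code>voikkospell --apertium-stream</code>. Since voikkospell only checks spelling, it does not matter which analyser is used; the analyses below come from an English one.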
<pre>
^./.<sent>$[][<!DOCTYPE html>
<html>
<head>
<title>]^An/A<det><ind><sg>$ ^HTML/HTML<n><acr><sp>$ ^Example/Example<n><sg>$^./.<sent>$[][<\/title>
<\/head>
<body>
<p>
]^This/This<det><dem><sg>/This<prn><tn><mf><sg>$ ^is/be<vbser><pri><p3><sg>$ ^an/a<det><ind><sg>$ ^HTML/HTML<n><acr><sp>$ ^example/example<n><sg>$^./.<sent>$^./.<sent>$[][
<\/p>
<\/body>
<\/html>
</pre>
Running <code>voikkospell --apertium-stream</code> on this yields the following final output.
<pre>
^*.$[][<!DOCTYPE html>
<html>
<head>
<title>]^*An/N/Anu/En/Ane/San$ ^HTML/HTML$ ^*Example$^*.$[][<\/title>
<\/head>
<body>
<p>
]^*This$ ^*is/ies/iso/s/isä/i$ ^*an/en/ane/n/a/van$ ^HTML/HTML$ ^*example$^*.$^*.$[][
<\/p>
<\/body>
<\/html>
</pre>
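voikkospell outputs words it accepts as correct Finnish unchanged after the <code>/</code>, like <code>^HTML/HTML$</code>. It marks each word it rejects with a <code>*</code> and lists its suggestions after the word, delimited by <code>/</code>'s, like <code>^*An/N/Anu/En/Ane/San$</code>; a rejected word with no suggestions appears alone, like <code>^*Example$</code>.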
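The three steps above can be chained into one pipeline, and the rejected words can be pulled out of the result with standard tools. The sketch below is illustrative only: <code>require_tools</code>, <code>spellcheck_html</code>, and <code>misspelled_words</code> are hypothetical helper names, the analyser path passed to <code>lt-proc</code> must be supplied by the reader, and the extraction does not parse superblanks or escapes.

```shell
# Hypothetical helpers for illustration only.

# Succeed only if every named command can be found on PATH.
require_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done
}

# Read HTML on standard input and print voikkospell's stream-format output.
# $1 is the path to a compiled analyser for lt-proc (supplied by the reader).
spellcheck_html() {
    require_tools apertium-deshtml lt-proc voikkospell || return 1
    apertium-deshtml | lt-proc "$1" | voikkospell --apertium-stream
}

# Read stream-format output on standard input and print each word that
# voikkospell marked with "*", one per line. Superblanks are not parsed, so
# a literal "^*" inside one would be reported as well.
misspelled_words() {
    grep -o '\^\*[^/$]*' | sed 's/^\^\*//'
}

printf '%s\n' '^*An/N/Anu/En/Ane/San$ ^HTML/HTML$ ^*Example$^*.$' | misspelled_words
```

The final line feeds part of the output shown above to <code>misspelled_words</code>, which prints <code>An</code>, <code>Example</code>, and <code>.</code> on separate lines. With the tools installed, the two functions can be combined by piping the output of <code>spellcheck_html</code> (given HTML on standard input and an analyser file as its argument) into <code>misspelled_words</code>.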