Voikkospell


Installation

m5w/corevoikko, a fork of corevoikko, adds support for the apertium stream format.

To clone it, execute the following command.

git clone https://github.com/m5w/corevoikko.git corevoikko

First, install libvoikko's dependencies. Next, execute the following commands.

cd corevoikko/libvoikko
./configure
make
sudo make install

If you do not have root privileges or would like to specify where to install libvoikko, execute the following instead. (Otherwise, you are finished with installation.)

cd corevoikko/libvoikko
PREFIX="$HOME/install/corevoikko" # example location; change as needed
./configure --prefix="$PREFIX"
make
make install

Finally, add "$PREFIX/bin" to your "$PATH" by appending the following lines to your .profile.

PREFIX="$HOME/install/corevoikko" # example location; use the prefix you installed to
if [ -d "$PREFIX" ]; then
        export PATH="$PREFIX/bin:$PATH"
fi
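
After logging in again (or sourcing .profile), it may help to confirm that the shell can actually find voikkospell. The following is a minimal sketch; the helper name have_tool is made up for this illustration and is not part of corevoikko.

```shell
# Report whether a command is reachable through the current PATH.
# have_tool is a hypothetical helper name, not part of corevoikko.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool voikkospell; then
    echo "voikkospell found at $(command -v voikkospell)"
else
    echo "voikkospell not found; check that \$PREFIX/bin is on your PATH"
fi
```

If the second message appears, the most likely cause is that .profile has not been re-read by the current shell.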

Using voikkospell with apertium Stream Format

Invoke voikkospell with the --apertium-stream option. It then expects input in apertium stream format instead of a list of words.

Words in apertium Stream Format

The apertium stream format encodes words as lexical units. Each lexical unit begins with a ^

^ . . .

and ends with a $.

^ . . . $

The word immediately follows the ^,

^word . . .$

and a / immediately follows the word. If the word is unknown, *word follows;

^word/*word$

otherwise, all the word's analyses follow, delimited by /'s.

^word/word<n><sg>/word<vblex><inf>/word<vblex><pres>$
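
To make the layout of a lexical unit concrete, here is a rough sketch that pulls the words back out of a stream with grep. The function names surface_forms and unknown_words are invented for this example, and the sketch assumes the input contains no escaped characters and no superblanks (both are described below).

```shell
# Print the word of every lexical unit on standard input, one per line.
# Assumes no escaped characters or superblanks in the input.
surface_forms() {
    grep -o '\^[^/$]*' | cut -c 2-
}

# Print only the words marked as unknown, i.e. followed by "/*".
unknown_words() {
    grep -o '\^[^/$]*/\*' | sed 's|^\^||; s|/\*$||'
}

# "wrod" is a made-up unknown word for illustration.
sample='^word/word<n><sg>/word<vblex><inf>$ ^wrod/*wrod$'
printf '%s\n' "$sample" | surface_forms
printf '%s\n' "$sample" | unknown_words
```

The first call prints word and wrod on separate lines; the second prints only wrod.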

Escaping

To use ^, $, /, <, and > as characters, one must escape them. Each escape sequence begins with a \,

\ . . .

and a character follows. voikkospell then interprets that character literally. Note that the character can be any wide character, including a newline.

To use a \ itself as a character, one must escape it as well, writing \\.
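
As an illustration of the rule above, the following sketch escapes plain text with sed so that it can be placed in a stream. The function name escape_stream_text is hypothetical, and the character set is limited to the characters named on this page (^, $, /, <, >, \, [ and ]).

```shell
# Prefix each reserved character on standard input with a backslash.
# escape_stream_text is a hypothetical helper; the set of reserved
# characters is the one listed on this page.
escape_stream_text() {
    sed 's|[][\\^$/<>]|\\&|g'
}

printf '%s\n' 'a/b<c>' | escape_stream_text
```

For the sample input a/b<c> this prints a\/b\<c\>.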

Superblanks

A run of characters outside lexical units can also be escaped as a whole by enclosing it in a superblank. Each superblank begins with a [

[ . . .

and ends with a ].

[ . . . ]

Each ^, $, /, <, and > between the [ and the ] is interpreted literally.

To use [ and ] as characters, one must escape them.
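
A small sketch of the superblank rules: the function below wraps its argument in [ and ], escaping only [, ] and \ on the way. The name wrap_superblank is made up for this example.

```shell
# Wrap the first argument in a superblank. Only [, ] and \ are escaped,
# since ^, $, /, < and > are literal inside a superblank.
# wrap_superblank is a hypothetical helper name.
wrap_superblank() {
    printf '[%s]\n' "$(printf '%s' "$1" | sed 's|[][\\]|\\&|g')"
}

wrap_superblank '<p class="a[1]">'
```

This prints [<p class="a\[1\]">], leaving the <, > and " untouched.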

An HTML Example

Let's spellcheck the following webpage.

<!DOCTYPE html>
<html>
    <head>
        <title>An HTML Example</title>
    </head>
    <body>
        <p>
        This is an HTML example.
        </p>
    </body>
</html>

Running apertium-deshtml on it yields the following.

.[][<!DOCTYPE html>
<html>
    <head>
        <title>]An HTML Example.[][<\/title>
    <\/head>
    <body>
        <p>
        ]This is an HTML example..[][
        <\/p>
    <\/body>
<\/html>
]

Note that all the <'s, >'s, and /'s now sit inside superblanks, and that apertium-deshtml additionally escapes each / with a \. In fact, everything except the text of the title and of the body paragraph is escaped. However, those words are not yet encoded as lexical units. Running lt-proc on the output yields the following, which is suitable input for voikkospell --apertium-stream.

^./.<sent>$[][<!DOCTYPE html>
<html>
    <head>
        <title>]^An/A<det><ind><sg>$ ^HTML/HTML<n><acr><sp>$ ^Example/Example<n><sg>$^./.<sent>$[][<\/title>
    <\/head>
    <body>
        <p>
        ]^This/This<det><dem><sg>/This<prn><tn><mf><sg>$ ^is/be<vbser><pri><p3><sg>$ ^an/a<det><ind><sg>$ ^HTML/HTML<n><acr><sp>$ ^example/example<n><sg>$^./.<sent>$^./.<sent>$[][
        <\/p>
    <\/body>
<\/html>

Running voikkospell --apertium-stream on this yields the following final output.

^*.$[][<!DOCTYPE html>
<html>
    <head>
        <title>]^*An/N/Anu/En/Ane/San$ ^HTML/HTML$ ^*Example$^*.$[][<\/title>
    <\/head>
    <body>
        <p>
        ]^*This$ ^*is/ies/iso/s/isä/i$ ^*an/en/ane/n/a/van$ ^HTML/HTML$ ^*example$^*.$^*.$[][
        <\/p>
    <\/body>
<\/html>

voikkospell outputs words it accepts as correct Finnish in the form ^HTML/HTML$. It marks words it rejects with a *, followed by any suggestions, as in ^*An/N/Anu/En/Ane/San$; rejected words with no suggestions appear as ^*Example$.
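
The three steps of this example can be chained into one pipeline, and the flagged words can then be picked out of the result. The sketch below is only one way to do it: the function names are invented, the analyser path passed to lt-proc (en.automorf.bin in the usage line) is a hypothetical file name, and flagged_words assumes the words themselves contain no escaped characters.

```shell
# Chain the three steps of the example. Nothing runs until the function
# is called. "$1" is the HTML file; "$2" is the path to a compiled
# lttoolbox analyser (supply your own; any name shown is hypothetical).
spellcheck_html() {
    apertium-deshtml "$1" | lt-proc "$2" | voikkospell --apertium-stream
}

# List the words voikkospell flagged (those marked with *), one per line.
flagged_words() {
    grep -o '\^\*[^/$]*' | cut -c 3-
}

# Try flagged_words on a line taken from the output above.
printf '%s\n' '^*An/N/Anu/En/Ane/San$ ^HTML/HTML$ ^*Example$^*.$' | flagged_words
```

On the sample line this prints An, Example and . on separate lines. With the tools installed, a full run would look like spellcheck_html page.html en.automorf.bin | flagged_words.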