Difference between revisions of "User:Ranjan19"

From Apertium

Latest revision as of 19:21, 16 March 2016

This guide shows how to compile a speller for an lttoolbox-based analyser and test it on some text. Morphologies created in the Apertium platform can be used directly as spellcheckers through libvoikko and the libreoffice-voikko extension, and that is what we will do here. Apertium HFST transducers can be compiled into libraries that libvoikko can use to perform spell checking, including providing suggestions.

HFST-based languages such as apertium-kaz have already accomplished this, so we will follow their lead and borrow heavily from their implementation to compile the speller. We will first convert the lttoolbox binary to HFST and then generate a zhfst (to be used as a spell-checker) for an lttoolbox-based language, using Hindi as the sample language.

As a prerequisite we need to install libreoffice-voikko on Ubuntu/Debian. This guide shows how to do that by compiling most of the pieces manually.

You should be able to make do with just packages from apt-get, so try that first: see Using apertium spellers with libreoffice-voikko on Debian. But if one of the packages isn't working, (parts of) this guide may be helpful.


Install prerequisites

First add the repository by following Prerequisites for Debian.

Then install speller-prerequisites:

sudo apt-get install libreoffice python3 git make sed findutils zip unzip pkg-config gettext \
  libxml++2.6-dev libarchive-dev zlib1g-dev automake autoconf libtool flex bison g++ libreadline-dev hfst
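
Before moving on, it can help to confirm that the commands used in the later sections are actually on the PATH. The following is a minimal sketch: the function name check_tools is hypothetical (it is not part of Apertium or HFST), and the tool list is simply collected from the commands this guide runs further down.

```shell
# check_tools: hypothetical helper, not part of Apertium or HFST.
# Prints every command from its argument list that is not on the PATH
# and returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Commands that the remaining sections of this guide invoke.
check_tools lt-print hfst-txt2fst hfst-fst2fst hfst-project hfst-minimise \
    hfst-strings2fst hfst-regexp2fst hfst-concatenate hfst-disjunct \
    hfst-repeat zip || echo "install the tools listed above before continuing"
```

hfst-ospell and voikkospell are left out of the list on purpose, since they are only built in the next section.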

Manually compile other prerequisites

hfst-ospell

wget http://downloads.sourceforge.net/project/hfst/hfst/source/hfstospell-0.4.0.tar.gz
tar xvf hfstospell-0.4.0.tar.gz
cd hfstospell-0.4.0
./configure --enable-zhfst
make -j4
sudo make install

Libvoikko

sudo apt-get install hfst-ospell-dev

wget http://www.puimula.org/voikko-sources/libvoikko/libvoikko-4.0.tar.gz
tar xvf libvoikko-4.0.tar.gz
cd libvoikko-4.0
./autogen.sh
./configure --with-dictionary-path=$HOME/.voikko --enable-hfst
make -j4
sudo make install

You may also have to put

export LD_LIBRARY_PATH=/usr/local/lib

in your ~/.bashrc
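
Note that a plain assignment like the one above replaces any value LD_LIBRARY_PATH already had. If that matters on your system, a sketch along these lines prepends the directory only when it is not listed yet. The function name add_lib_dir is hypothetical and exists only for this illustration.

```shell
# add_lib_dir: hypothetical helper that prepends a directory to
# LD_LIBRARY_PATH unless it is already listed there.
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

add_lib_dir /usr/local/lib
```

Because the function checks for the directory first, sourcing ~/.bashrc repeatedly does not keep growing the variable.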

voikko-fi

wget http://www.puimula.org/voikko-sources/voikko-fi/voikko-fi-2.0.tar.gz
tar xvf voikko-fi-2.0.tar.gz
cd voikko-fi-2.0
PATH=/usr/local/voikko/bin:$PATH make vvfst
sudo make vvfst-install DESTDIR=/usr/local/voikkodict


Install language modules

This is based on Minimal installation from SVN:

  • Get the Kazakh language module. As mentioned earlier, we will reuse parts of its implementation; it does not need to be built.
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz/

  • Get the Hindi language module and build it with ospell support enabled.
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-hin/
cd apertium-hin
./autogen.sh --enable-ospell
make -j4


Converting lttoolbox binary into hfst

lt-print hin.autogen.bin | hfst-txt2fst -e ε > hin.autogen.hfst
cd ..

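Here `lt-print foo.autogen.bin` prints the transducer in AT&T text format (.att), which is piped to hfst-txt2fst as input. The `-e ε` option tells hfst-txt2fst which symbol in that text stands for epsilon.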
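
For orientation, the AT&T text format passed between the two tools is a plain tab-separated listing: an arc line has a source state, a target state, an input symbol and an output symbol (optionally followed by a weight), and a line with only a state number (optionally a weight) marks a final state. The sample below is hand-written for illustration and is not real output from the Hindi analyser; the file name sample.att is hypothetical.

```shell
# Hand-written sample in AT&T format (columns are tab-separated):
#   source-state  target-state  input-symbol  output-symbol
# A line with only a state number marks that state as final.
sample="${TMPDIR:-/tmp}/sample.att"
printf '0\t1\tc\tc\n1\t2\ta\ta\n2\t3\tt\tt\n3\t4\tε\t<n>\n4\n' > "$sample"

# Count arcs (four or more fields) and final states (one or two fields).
arcs=$(awk -F '\t' 'NF >= 4 { n++ } END { print n + 0 }' "$sample")
finals=$(awk -F '\t' 'NF >= 1 && NF <= 2 { n++ } END { print n + 0 }' "$sample")
echo "arcs: $arcs, final states: $finals"   # prints: arcs: 4, final states: 1
```

Running the same two awk counts on the real `lt-print` output is a cheap way to check that the analyser is not empty before converting it.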


Generating a zhfst package from the hfst file generated

This section reproduces the way apertium-kaz generates its zhfst file.

  • Copy “dev/editdist.py” and “speller” from apertium-kaz to apertium-hin.
mkdir -p apertium-hin/dev
cp apertium-kaz/dev/editdist.py  apertium-hin/dev/editdist.py
cp -r apertium-kaz/speller  apertium-hin/speller
  • Run the following commands in the Hindi language module. They are taken from the zhfst conversion steps that run during `make` in apertium-kaz.
cd apertium-hin
cat hin.autogen.hfst | hfst-fst2fst -t | hfst-project --project=lower | hfst-minimise | hfst-fst2fst -f olw -o acceptor.default.hfst
grep -v -e "^#" -e "^$" speller/words.default.txt | hfst-strings2fst -j -o words.default.hfst
echo "?*;" | hfst-regexp2fst -S -o anystar.hfst
grep -v -e "^#" -e "^$" speller/strings.default.txt | hfst-strings2fst -j | hfst-concatenate anystar.hfst - \
 | hfst-concatenate - anystar.hfst -o strings.default.hfst
python dev/editdist.py -v -s -d 1 -e '@0@' -i speller/editdist.default.txt -a acceptor.default.hfst > editdist.default.hfst.txt
hfst-txt2fst -i editdist.default.hfst.txt -e '@0@' -o editdist.default.hfst
rm -f editdist.default.hfst.txt
hfst-disjunct strings.default.hfst editdist.default.hfst | hfst-minimise | hfst-repeat -f 1 -t 2 -o editstrings.default.hfst
hfst-disjunct words.default.hfst editstrings.default.hfst | hfst-fst2fst -f olw -o errmodel.default.hfst
rm -f hin.zhfst
zip -Z store -j hin.zhfst acceptor.default.hfst errmodel.default.hfst speller/index.xml
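
The `grep -v -e "^#" -e "^$"` calls in the commands above drop comment lines and empty lines before the word and string lists are compiled. A small sketch of that filter follows; the list contents are invented placeholders and do not reflect the real format of speller/words.default.txt.

```shell
# Invented sample list: one comment, one blank line, two placeholder entries.
list=$(printf '# a comment line\n\nentry-one\nentry-two\n')

# Same filter as in the pipeline: remove lines starting with '#'
# and lines that are completely empty.
printf '%s\n' "$list" | grep -v -e "^#" -e "^$"
```

Only the two placeholder entries are printed, which is what hfst-strings2fst would receive in the real pipeline.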


At this point hin.zhfst should have been generated in the Hindi language module directory.
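
One way to confirm this is to list the archive members and check that the three files passed to `zip` are all there. This is a minimal sketch: check_zhfst is a hypothetical helper name, and it only checks that the members exist, not that the transducers inside are valid.

```shell
# check_zhfst: hypothetical helper that verifies a zhfst archive contains
# the three files the zip command puts into it.
check_zhfst() {
    archive=$1
    [ -f "$archive" ] || { echo "no such file: $archive"; return 1; }
    # unzip -Z1 lists one archive member name per line.
    names=$(unzip -Z1 "$archive") || return 1
    for member in acceptor.default.hfst errmodel.default.hfst index.xml; do
        printf '%s\n' "$names" | grep -qxF "$member" || {
            echo "missing from archive: $member"
            return 1
        }
    done
    echo "ok: $archive"
}

check_zhfst hin.zhfst || echo "hin.zhfst is incomplete or missing"
```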


Using the zhfst file as a spellchecker and testing it from the command line

There are two ways to go about this.

  • Using hfst-ospell
echo "कौशल" | hfst-ospell -S hin.zhfst
echo "आदमी" | hfst-ospell -S hin.zhfst

Both words are spelled correctly, so the expected output in each case is “The word is in the lexicon”. This does not currently work: hfst-ospell instead reports “word not in the lexicon, cannot provide corrections”.

echo "अगामी" | hfst-ospell -S hin.zhfst

Here the word is misspelled and needs only a trivial correction, so the expected output is a list of suggested corrections. This does not currently work either: the same error, “word not in the lexicon, cannot provide corrections”, is reported.

  • Using voikkospell
cd apertium-hin
mkdir -p ~/.voikko/3
cp hin.zhfst ~/.voikko/3/hin.zhfst
echo "अगामी" | tr ' ' '\n' | voikkospell -d hin.zhfst -s

As before, the word is misspelled and needs only a trivial correction, so the expected output is a list of suggested corrections.