Difference between revisions of "Using an lttoolbox dictionary"
Objectivesea (talk | contribs) (Minor English syntactic improvement in description) |
(refer to installation instead; old text likely to give people problems with PKG_CONFIG_PATH etc etc etc) |
Revision as of 08:16, 8 December 2013
This page is intended as an answer to the question "I've found one of these .dix files; how can I use it to analyse text?" First of all, it is worth explaining what a .dix file is: a finite-state transducer for a language, encoded in XML. More information on this can be found at the page lttoolbox and monodix basics, but this page only concerns how it is used.
Requirements
The most basic requirements are:
- lttoolbox — A finite-state toolkit
- apertium — A machine translation software platform
The second is necessary for the deformatters. The tools in lttoolbox treat a set of characters as special, and these must be escaped in running text (see Apertium stream format).
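To make "escaped in running text" concrete, here is a minimal sketch of backslash-escaping in plain shell. The function name escape_reserved is hypothetical, and the character set it handles is an illustrative assumption rather than the authoritative list (see Apertium stream format for that). In practice the deformatters do this job for you; the sketch only shows the kind of transformation involved.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: put a backslash in front of
# a handful of characters that are ASSUMED here to be special in the stream.
# This is not a replacement for apertium-destxt.
escape_reserved() {
    # Bracket expression contents: ] [ \ ^ $ / < > { } @
    printf '%s\n' "$1" | sed 's|[][\\^$/<>{}@]|\\&|g'
}

escape_reserved 'This is a test ^500'
```

With the input above, the caret comes out preceded by a backslash, so it is treated as ordinary text instead of the start of a lexical unit.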
The page Installation shows how to install lttoolbox and apertium. When you get to the step Minimal installation from SVN, it will assume you're installing apertium-lex-tools and a full language pair, but you can simply skip those two packages.
Using the dictionary
Then, you take the .dix file (e.g. apertium-bn-en.bn.dix) that you have downloaded, and compile it:
Compile
- See also: Compiling dictionaries
$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
final@inconditional 8 75
main@standard 6403 13351
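The arguments to lt-comp are a direction, the dictionary file and the output file. The direction lr compiles the dictionary left to right, which gives an analyser; rl would give a generator. The output name is your choice. As a rough sketch, the hypothetical helper below (compile_cmd is not part of lttoolbox) derives an output name from the dictionary name, following the naming used in the example above, and prints the command instead of running it:

```shell
#!/bin/sh
# Hypothetical helper: print the lt-comp command that would build an
# analyser from a .dix file. The <lang>.analyser.bin naming is an
# assumption copied from the example above, not a requirement.
compile_cmd() {
    dix=$1
    lang=${dix%.dix}     # apertium-bn-en.bn.dix -> apertium-bn-en.bn
    lang=${lang##*.}     # apertium-bn-en.bn     -> bn
    printf 'lt-comp lr %s %s.analyser.bin\n' "$dix" "$lang"
}

compile_cmd apertium-bn-en.bn.dix
```

Printing the command first makes it easy to check the file names before anything is compiled; pipe the output to sh once it looks right.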
Use
Note that the apertium-destxt command is important.
$ echo "উইকিপিডিয়ার বাংলা সংস্করণে স্বাগতম। এই বিশ্বকোষে যে কেউ অবদান রাখতে পারেন। ২১,২৫৫টি ভুক্তির ওপর কাজ চলছে।" | apertium-destxt | lt-proc bn.analyser.bin
^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^বাংলা/বাংলা<adj><mf>/বাংলা<n><mf><nn><sg><nom>/বাংলা<n><mf><nn><sg><obj>$ ^সংস্করণে/*সংস্করণে$ ^স্বাগতম/*স্বাগতম$^।/।<sent>$ ^এই/এই<det><dem>$ ^বিশ্বকোষে/*বিশ্বকোষে$ ^যে/যা<prn><p3><infml><rel><aa><mf><sg><nom>$ ^কেউ/কেউ<prn><p3><aa><mf><sp><nom>$ ^অবদান/অবদান<n><nt><nn><sg><nom>/অবদান<n><nt><nn><sg><obj>$ ^রাখতে/রাখ<vblex><inf>/রাখ<vblex><past><hbtl><p2><fam>$ ^পারেন/পার<vblex><pres><smpl><p3><pol>/পার<vblex><pres><smpl><p2><pol>$^।/।<sent>$ ^২১/২১<num>$, ^২৫৫টি/২৫৫<num>$ ^ভুক্তির/*ভুক্তির$ ^ওপর/ওপর<adv>/ওপর<n><mf><nn><sg><nom>/ওপর<n><mf><nn><sg><obj>$ ^কাজ/কাজ<n><nt><nn><sg><nom>/কাজ<n><nt><nn><sg><obj>$ ^চলছে/চল<vblex><pres><cnt><impers>/চল<vblex><pres><cnt><p3><infml>$^।/।<sent>$^./.<sent>$[][ ]
The apertium-destxt step matters because if unescaped special characters appear in the stream, you will get a std::exception:
$ echo "This is a test ^500" | lt-proc bn.analyser.bin
This is a test std::exception
(on a Mac, you'll typically see a 9Exception)
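In the analyser output, each lexical unit has the form ^surface/analysis/...$, and a word the dictionary does not know is given a single "analysis" that starts with an asterisk, as in ^সংস্করণে/*সংস্করণে$ in the Bengali example. As a small sketch of working with that output, the hypothetical filter below (list_unknown is not an lttoolbox tool) prints the surface form of every unknown word. It assumes a grep that supports -o, which GNU and BSD grep both do, and it does not try to handle escaped special characters inside a lexical unit.

```shell
#!/bin/sh
# Hypothetical filter: read lt-proc analysis output on stdin and print
# each unknown surface form on its own line.
list_unknown() {
    # Match ^surface/*anything$ units, then keep only the surface part.
    grep -o '\^[^/$]*/\*[^$]*\$' | sed 's|^\^\([^/]*\)/.*|\1|'
}

# Made-up sample input in the same shape as real lt-proc output.
printf '%s\n' '^foo/*foo$ ^bar/bar<n><sg>$ ^baz/*baz$' | list_unknown
```

In real use you would put it at the end of the pipeline, after lt-proc, to get a quick list of words missing from the dictionary.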