Using an lttoolbox dictionary
This page is intended as an answer to the question "I've found one of these .dix files; how can I use it to analyse text?" First of all, it is worth explaining what a .dix file is: a dictionary for a language, encoded in XML, which is compiled into a finite-state transducer. More information can be found at the page lttoolbox and monodix basics; this page only covers how to use one.
(If you haven't found a .dix file for your language yet, see List of dictionaries.)
Requirements
The most basic requirements are:
- lttoolbox — A finite-state toolkit
- apertium — A machine translation software platform
The second is needed for its deformatters: the tools in lttoolbox treat a set of characters as special, and these must be escaped in running text (see Apertium stream format).
The page Installation shows how to install lttoolbox and apertium. On most systems, you don't have to install more than the Prerequisites.
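As a rough illustration of what "escaped" means here, the sketch below puts a backslash in front of characters that the stream format treats as special. The function name escape_stream and the exact character set are assumptions made for this example (the Apertium stream format page has the authoritative list), and the real deformatter does more than this, so use apertium-destxt in practice.

```shell
# Hypothetical helper, for illustration only: backslash-escape characters
# assumed to be reserved in the Apertium stream format.
# The real deformatter (apertium-destxt) does more than this.
escape_stream() {
  sed 's|[][\\^$/<>{}@]|\\&|g'
}

# The caret in the input comes out preceded by a backslash
printf '%s\n' 'This is a test ^500' | escape_stream
```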
Using the dictionary
Then take the .dix file that you have downloaded (e.g. apertium-bn-en.bn.dix) and compile it:
Compile
- See also: Compiling dictionaries
This compiles an analyser:
$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
final@inconditional 8 75
main@standard 6403 13351
Analyse
Note that the apertium-destxt command, which escapes special characters in the input before lt-proc sees them, is important.
$ echo "উইকিপিডিয়ার বাংলা সংস্করণে স্বাগতম। এই বিশ্বকোষে যে কেউ অবদান রাখতে পারেন। ২১,২৫৫টি ভুক্তির ওপর কাজ চলছে।" | apertium-destxt | lt-proc bn.analyser.bin
^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^বাংলা/বাংলা<adj><mf>/বাংলা<n><mf><nn><sg><nom>/বাংলা<n><mf><nn><sg><obj>$ ^সংস্করণে/*সংস্করণে$ ^স্বাগতম/*স্বাগতম$^।/।<sent>$ ^এই/এই<det><dem>$ ^বিশ্বকোষে/*বিশ্বকোষে$ ^যে/যা<prn><p3><infml><rel><aa><mf><sg><nom>$ ^কেউ/কেউ<prn><p3><aa><mf><sp><nom>$ ^অবদান/অবদান<n><nt><nn><sg><nom>/অবদান<n><nt><nn><sg><obj>$ ^রাখতে/রাখ<vblex><inf>/রাখ<vblex><past><hbtl><p2><fam>$ ^পারেন/পার<vblex><pres><smpl><p3><pol>/পার<vblex><pres><smpl><p2><pol>$^।/।<sent>$ ^২১/২১<num>$, ^২৫৫টি/২৫৫<num>$ ^ভুক্তির/*ভুক্তির$ ^ওপর/ওপর<adv>/ওপর<n><mf><nn><sg><nom>/ওপর<n><mf><nn><sg><obj>$ ^কাজ/কাজ<n><nt><nn><sg><nom>/কাজ<n><nt><nn><sg><obj>$ ^চলছে/চল<vblex><pres><cnt><impers>/চল<vblex><pres><cnt><p3><infml>$^।/।<sent>$^./.<sent>$[][ ]
This is because, if unescaped special characters appear in the stream, you will get a std::exception:
$ echo "This is a test ^500" | lt-proc bn.analyser.bin
This is a test std::exception
(On a Mac, you'll typically see a 9Exception instead.)
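In the analyser output, a lexical unit whose analysis begins with * (such as ^সংস্করণে/*সংস্করণে$) is a form the dictionary does not recognise. A quick way to list those forms is to filter the output with standard text tools. The sketch below is a simplified illustration: the function name unknown_words is made up for this example, it relies on grep -o (available in GNU and BSD grep), and it assumes no escaped $ occurs inside a lexical unit.

```shell
# Hypothetical helper: print each unknown surface form from lt-proc output,
# one per line. Simplified: assumes no escaped "$" inside a lexical unit.
unknown_words() {
  grep -o '/\*[^$]*\$' | sed 's|^/\*||; s|\$$||'
}

# ASCII sample in the same format as the analyser output
printf '%s\n' '^foo/*foo$ ^cat/cat<n><sg>$ ^bar/*bar$' | unknown_words
```

With a compiled analyser, the same function can be appended to the pipeline, e.g. `... | apertium-destxt | lt-proc bn.analyser.bin | unknown_words`.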
Generate
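When generating, the input is the analyses produced by the analyser, with exactly one analysis per lexical unit. Generation needs a generator binary (bn.generator.bin in the example that follows), which is compiled from the same dictionary with lt-comp rl instead of lt-comp lr. The general input format is: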
^lemma<tag><tag2><tag3>$ ^otherlemma<othertag><tag2>$
For example, to generate surface forms from a few of the analyses in the analysis example above:
$ echo '^বাংলা<adj><mf>$ ^।<sent>$ ^এই<det><dem>$' | lt-proc -g bn.generator.bin
বাংলা । এই
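Since the analyser often returns several analyses per lexical unit and the generator accepts only one, something has to choose between them. The sketch below simply keeps the first analysis of each unit, which is enough for experimenting but is not real disambiguation. The function name first_analysis is made up for this example, and the pattern ignores escaped / and $ characters.

```shell
# Hypothetical helper: keep only the first analysis of every lexical unit,
# turning analyser output into the one-analysis-per-unit generator format.
# Simplified: ignores escaped "/" and "$", and the choice of the first
# analysis is arbitrary (no real disambiguation is done).
first_analysis() {
  sed 's|\^[^/$]*/\([^/$]*\)[^$]*\$|^\1$|g'
}

# ASCII sample in the same format as the analyser output
printf '%s\n' '^cats/cat<n><pl>$ ^run/run<n><sg>/run<vblex><inf>$' | first_analysis
```

The result has the ^lemma<tag>$ shape described above, so it can be piped on to lt-proc -g with a compiled generator.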
See also
- List of dictionaries
- Daemon – using an lttoolbox dictionary as a "server", without re-loading the dictionary for each request