Using an lttoolbox dictionary

From Apertium
Revision as of 16:54, 25 March 2010 by Francis Tyers

This page is intended as an answer to the question "I've found one of these .dix files, how can I use it to analyse text?" First of all, it is worth explaining what a .dix file is: a finite-state transducer for a language, encoded in XML. More information can be found on the page lttoolbox and monodix basics; this page is only concerned with how such a file is used.

Requirements

The most basic requirements are:

  • lttoolbox — A finite-state toolkit
  • apertium — A machine translation software platform

The second is necessary for its deformatters (such as apertium-destxt). The tools in lttoolbox treat a set of characters as reserved, and these must be escaped in running text (see Apertium stream format).
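
To make "escaping" concrete, here is a minimal sketch of the idea, not the real deformatter. The helper name escape_reserved is hypothetical, and the set of reserved characters it uses is an assumption; the Apertium stream format page is the authority on the actual set. In practice apertium-destxt does this for you and handles more than this sketch does.

```shell
# Hypothetical helper, for illustration only: apertium-destxt does the real work.
# Assumption: the reserved set is ^ $ / < > { } [ ] @ and the backslash itself.
escape_reserved() {
  # Prefix every assumed-reserved character with a backslash.
  sed 's|[][\\^$/<>{}@]|\\&|g'
}

printf '%s\n' 'price: 5$ [approx] a/b' | escape_reserved
```

The sed expression uses | as its delimiter so that / can appear in the character class without further quoting.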

If you have a machine running GNU/Linux or Mac OS X, you can probably install both of these programs fairly easily. For lttoolbox:

$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox/
$ sh autogen.sh
$ ./configure
$ make
$ make install

And for apertium:

$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium/
$ sh autogen.sh
$ ./configure
$ make
$ make install
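
Before moving on, it can help to confirm that the tools used on this page are actually on your PATH. The following is a small sketch under that assumption; check_tools is a hypothetical helper name, not part of either package.

```shell
# Hypothetical helper: report which of the given commands are not on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The three programs used in the rest of this page.
check_tools lt-comp lt-proc apertium-destxt || echo "install the packages above first"
```

If nothing is printed, all three programs were found.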

Using the dictionary

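Then take the .dix file (e.g. apertium-bn-en.bn.dix) that you have downloaded and compile it with lt-comp. The lr argument compiles the dictionary in the left-to-right direction, which gives an analyser. To analyse text, pass it through apertium-destxt and then through lt-proc with the compiled file: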

$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
final@inconditional 8 75
main@standard 6403 13351
$ echo "উইকিপিডিয়ার বাংলা সংস্করণে স্বাগতম। এই বিশ্বকোষে যে কেউ অবদান রাখতে পারেন। ২১,২৫৫টি ভুক্তির ওপর কাজ চলছে।" | apertium-destxt | lt-proc bn.analyser.bin 
^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^বাংলা/বাংলা<adj><mf>/বাংলা<n><mf><nn><sg><nom>/বাংলা<n><mf><nn><sg><obj>$ ^সংস্করণে/*সংস্করণে$ ^স্বাগতম/*স্বাগতম$^।/।<sent>$ ^এই/এই<det><dem>$ 
^বিশ্বকোষে/*বিশ্বকোষে$ ^যে/যা<prn><p3><infml><rel><aa><mf><sg><nom>$ ^কেউ/কেউ<prn><p3><aa><mf><sp><nom>$ ^অবদান/অবদান<n><nt><nn><sg><nom>/অবদান<n><nt><nn><sg><obj>$ 
^রাখতে/রাখ<vblex><inf>/রাখ<vblex><past><hbtl><p2><fam>$ ^পারেন/পার<vblex><pres><smpl><p3><pol>/পার<vblex><pres><smpl><p2><pol>$^।/।<sent>$ ^২১/২১<num>$,
^২৫৫টি/২৫৫<num>$ ^ভুক্তির/*ভুক্তির$ ^ওপর/ওপর<adv>/ওপর<n><mf><nn><sg><nom>/ওপর<n><mf><nn><sg><obj>$ ^কাজ/কাজ<n><nt><nn><sg><nom>/কাজ<n><nt><nn><sg><obj>$ 
^চলছে/চল<vblex><pres><cnt><impers>/চল<vblex><pres><cnt><p3><infml>$^।/।<sent>$^./.<sent>$[][
]
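
In the output above, each word is wrapped in ^ and $, with its analyses separated by /. Words that the dictionary does not cover come back with an asterisk, as in ^উইকিপিডিয়ার/*উইকিপিডিয়ার$. As a rough sketch of how to pull those out, the hypothetical helper unknown_words below lists the surface forms of unknown words; it assumes that the units contain no escaped / or $ characters.

```shell
# Hypothetical helper: list surface forms that lt-proc marked as unknown.
# Assumption: unknown units look like ^surface/*surface$ and contain no
# escaped "/" or "$" characters.
unknown_words() {
  # Keep only the units whose first analysis starts with "*",
  # then strip everything from the first "/" and the leading "^".
  grep -o '\^[^/$]*/\*[^$]*\$' | sed 's|/.*||; s|^.||'
}

printf '%s\n' '^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^এই/এই<det><dem>$ ^সংস্করণে/*সংস্করণে$' | unknown_words
```

The same function can be put at the end of the pipeline shown earlier, after lt-proc bn.analyser.bin, to get a list of words that are missing from the dictionary.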