Difference between revisions of "Using an lttoolbox dictionary"

 
(12 intermediate revisions by 4 users not shown)
Line 1: Line 1:
[[Utilisation d'un dictionnaire lttoolbox|En français]]

{{TOCD}}
{{TOCD}}
This page is intended as an answer to the question "I've found one of these <code>.dix</code> files, how can I use it to analyse text?" First of all, it is worth explaining what a <code>.dix</code> file is, it is a finite-state transducer for a language encoded in XML. More information on this can be found at the page [[lttoolbox]] and [[monodix basics]], but this page is only interested in how it is used.
This page is intended as an answer to the question "I've found one of these [[List of dictionaries|<code>.dix</code> files]]; how can I use it to analyse text?" First of all, it is worth explaining what a <code>.dix</code> file is: a finite-state transducer for a language encoded in XML. More information on this can be found at the page [[lttoolbox]] and [[monodix basics]], but this page only concerns how it is used.

(If you haven't found a .dix file for your language yet, see [[List of dictionaries]].)


==Requirements==
==Requirements==
Line 11: Line 15:
The second is necessary for the [[deformatters]]. The tools in [[lttoolbox]] have a set of escaped characters which must be escaped in running text (see [[Apertium stream format]]).
The second is necessary for the [[deformatters]]. The tools in [[lttoolbox]] have a set of escaped characters which must be escaped in running text (see [[Apertium stream format]]).


The page [[Installation]] shows how to install lttoolbox and apertium. On most systems, you don't have to install more than the Prerequisites.
If you have a machine running GNU/Linux or Mac/OS then you can probably install both of these programs fairly easily. For lttoolbox:

<pre>
$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
cd lttoolbox/
sh autogen.sh
./configure
make
make install
</pre>

And for apertium:

<pre>
$ svn co http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
cd apertium/
sh autogen.sh
./configure
make
make install
</pre>

Subversion (<code>svn</code>) is a version control system. If you don't have it installed, on Debian/Ubuntu GNU/Linux you can use <code>apt-get install subversion</code> (or get it through Synaptic). On Mac/OS you can use <code>port install subversion</code>.


==Using the dictionary==
==Using the dictionary==
Line 42: Line 24:
{{see-also|Compiling dictionaries}}
{{see-also|Compiling dictionaries}}


This compiles an analyser:
<pre>
<pre>
$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
Line 48: Line 31:
</pre>
</pre>


===Use===
===Analyse===


Note that the <code>apertium-destxt</code> command is important.
Note that the <code>apertium-destxt</code> command is important.
Line 54: Line 37:
<pre>
<pre>
$ echo "উইকিপিডিয়ার বাংলা সংস্করণে স্বাগতম। এই বিশ্বকোষে যে কেউ অবদান রাখতে পারেন। ২১,২৫৫টি ভুক্তির ওপর কাজ চলছে।" | apertium-destxt | lt-proc bn.analyser.bin
$ echo "উইকিপিডিয়ার বাংলা সংস্করণে স্বাগতম। এই বিশ্বকোষে যে কেউ অবদান রাখতে পারেন। ২১,২৫৫টি ভুক্তির ওপর কাজ চলছে।" | apertium-destxt | lt-proc bn.analyser.bin
^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^বাংলা/বাংলা<adj><mf>/বাংলা<n><mf><nn><sg><nom>/বাংলা<n><mf><nn><sg><obj>$ ^সংস্করণে/*সংস্করণে$ ^স্বাগতম/*স্বাগতম$^।/।<sent>$ ^এই/এই<det><dem>$
^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^বাংলা/বাংলা<adj><mf>/বাংলা<n><mf><nn><sg><nom>/বাংলা<n><mf><nn><sg><obj>$ ^সংস্করণে/*সংস্করণে$ ^স্বাগতম/*স্বাগতম$^।/।<sent>$
^বিশ্বকোষে/*বিশ্বকোষে$ ^যে/যা<prn><p3><infml><rel><aa><mf><sg><nom>$ ^কেউ/কেউ<prn><p3><aa><mf><sp><nom>$ ^অবদান/অবদান<n><nt><nn><sg><nom>/অবদান<n><nt><nn><sg><obj>$
^এই/এই<det><dem>$ ^বিশ্বকোষে/*বিশ্বকোষে$ ^যে/যা<prn><p3><infml><rel><aa><mf><sg><nom>$ ^কেউ/কেউ<prn><p3><aa><mf><sp><nom>$
^রাখতে/রাখ<vblex><inf>/রাখ<vblex><past><hbtl><p2><fam>$ ^পারেন/পার<vblex><pres><smpl><p3><pol>/পার<vblex><pres><smpl><p2><pol>$^।/।<sent>$ ^২১/২১<num>$,
^অবদান/অবদান<n><nt><nn><sg><nom>/অবদান<n><nt><nn><sg><obj>$ ^রাখতে/রাখ<vblex><inf>/রাখ<vblex><past><hbtl><p2><fam>$
^পারেন/পার<vblex><pres><smpl><p3><pol>/পার<vblex><pres><smpl><p2><pol>$^।/।<sent>$ ^২১/২১<num>$, ^২৫৫টি/২৫৫<num>$ ^ভুক্তির/*ভুক্তির$
^২৫৫টি/২৫৫<num>$ ^ভুক্তির/*ভুক্তির$ ^ওপর/ওপর<adv>/ওপর<n><mf><nn><sg><nom>/ওপর<n><mf><nn><sg><obj>$ ^কাজ/কাজ<n><nt><nn><sg><nom>/কাজ<n><nt><nn><sg><obj>$
^ওপর/ওপর<adv>/ওপর<n><mf><nn><sg><nom>/ওপর<n><mf><nn><sg><obj>$ ^কাজ/কাজ<n><nt><nn><sg><nom>/কাজ<n><nt><nn><sg><obj>$
^চলছে/চল<vblex><pres><cnt><impers>/চল<vblex><pres><cnt><p3><infml>$^।/।<sent>$^./.<sent>$[][
^চলছে/চল<vblex><pres><cnt><impers>/চল<vblex><pres><cnt><p3><infml>$^।/।<sent>$^./.<sent>$[][
]
]
</pre>
</pre>


If unescaped special characters appear in the [[Apertium stream format|stream]], the error will be <code>std::exception</code>:
because if unescaped special characters appear in the [[Apertium stream format|stream]], you will get a <code>std::exception</code>:


<pre>
<pre>
$ echo "This is a test ^500" | lt-proc bn.analyser.bin
$ echo "This is a test ^500" | lt-proc bn.analyser.bin
This is a test std::exception
This is a test std::exception
</pre>

(on a Mac, you'll typically see a <code>9Exception</code>)

===Generate===
When generating, you basically input the analyses given by the analyser, but only one analysis per [[lexical unit]]. The general input format is
<pre>^lemma<tag><tag2><tag3>$ ^otherlemma<othertag><tag2>$</pre>

E.g. to generate a couple of the analyses given in the analysis example above:
<pre>
$ echo '^বাংলা<adj><mf>$ ^।<sent>$ ^এই<det><dem>$' | lt-proc -g bn.generator.bin
বাংলা । এই
</pre>
</pre>


Line 72: Line 68:


* [[List of dictionaries]]
* [[List of dictionaries]]
* [[Daemon]] – using an lttoolbox dictionary as a "server", without re-loading the dictionary for each request


[[Category:Documentation]]
[[Category:Documentation in English]]
[[Category:Lttoolbox|*]]
[[Category:Morphological analysers]]

Latest revision as of 07:34, 9 September 2015

En français

This page answers the question "I've found one of these .dix files; how can I use it to analyse text?" First of all, it is worth explaining what a .dix file is: a finite-state transducer for a language, encoded in XML. More information can be found on the pages lttoolbox and monodix basics; this page only covers how a .dix file is used.

(If you haven't found a .dix file for your language yet, see List of dictionaries.)

Requirements

The most basic requirements are:

  • lttoolbox — A finite-state toolkit
  • apertium — A machine translation software platform

The second is necessary for the deformatters. The tools in lttoolbox reserve a set of special characters, which must be escaped in running text (see Apertium stream format).

The page Installation shows how to install lttoolbox and apertium. On most systems, you don't have to install more than the Prerequisites.

Using the dictionary

Take the .dix file that you have downloaded (e.g. apertium-bn-en.bn.dix) and compile it:

Compile

See also: Compiling dictionaries

This compiles an analyser:

$ lt-comp lr apertium-bn-en.bn.dix bn.analyser.bin
final@inconditional 8 75
main@standard 6403 13351
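
The Generate section further down reads a file called bn.generator.bin, which the command above does not produce. lt-comp compiles the same dictionary in the other direction when given rl instead of lr. A minimal sketch, assuming the file names used on this page; compile_dix is a hypothetical helper, not part of lttoolbox:

```shell
# Hypothetical helper: compile both directions of a monolingual .dix file.
# lt-comp takes a direction (lr = analyser, rl = generator),
# the .dix file, and the output path.
compile_dix() {
    dix="$1"      # e.g. apertium-bn-en.bn.dix
    prefix="$2"   # e.g. bn
    if ! command -v lt-comp >/dev/null 2>&1; then
        echo "lt-comp not found; install lttoolbox first" >&2
        return 127
    fi
    lt-comp lr "$dix" "$prefix.analyser.bin" &&
    lt-comp rl "$dix" "$prefix.generator.bin"
}
```

Calling compile_dix apertium-bn-en.bn.dix bn should then leave bn.analyser.bin and bn.generator.bin in the current directory.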

Analyse

Note that the apertium-destxt command is important.

$ echo "উইকিপিডিয়ার বাংলা সংস্করণে স্বাগতম। এই বিশ্বকোষে যে কেউ অবদান রাখতে পারেন। ২১,২৫৫টি ভুক্তির ওপর কাজ চলছে।" | apertium-destxt | lt-proc bn.analyser.bin 
^উইকিপিডিয়ার/*উইকিপিডিয়ার$ ^বাংলা/বাংলা<adj><mf>/বাংলা<n><mf><nn><sg><nom>/বাংলা<n><mf><nn><sg><obj>$ ^সংস্করণে/*সংস্করণে$ ^স্বাগতম/*স্বাগতম$^।/।<sent>$ 
^এই/এই<det><dem>$ ^বিশ্বকোষে/*বিশ্বকোষে$ ^যে/যা<prn><p3><infml><rel><aa><mf><sg><nom>$ ^কেউ/কেউ<prn><p3><aa><mf><sp><nom>$ 
^অবদান/অবদান<n><nt><nn><sg><nom>/অবদান<n><nt><nn><sg><obj>$ ^রাখতে/রাখ<vblex><inf>/রাখ<vblex><past><hbtl><p2><fam>$ 
^পারেন/পার<vblex><pres><smpl><p3><pol>/পার<vblex><pres><smpl><p2><pol>$^।/।<sent>$ ^২১/২১<num>$, ^২৫৫টি/২৫৫<num>$ ^ভুক্তির/*ভুক্তির$ 
^ওপর/ওপর<adv>/ওপর<n><mf><nn><sg><nom>/ওপর<n><mf><nn><sg><obj>$ ^কাজ/কাজ<n><nt><nn><sg><nom>/কাজ<n><nt><nn><sg><obj>$ 
^চলছে/চল<vblex><pres><cnt><impers>/চল<vblex><pres><cnt><p3><infml>$^।/।<sent>$^./.<sent>$[][
]
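
In the output above, forms the dictionary does not know come back with an asterisk in front of the analysis (e.g. ^সংস্করণে/*সংস্করণে$). A rough sketch of how such forms could be pulled out of analyser output with standard tools; unknown_words is a hypothetical helper, and the sample line fed to it is made up for illustration:

```shell
# Hypothetical helper: print the surface form of every lexical unit
# whose analysis starts with '*' (i.e. unknown to the dictionary).
unknown_words() {
    grep -o '\^[^/$]*/\*[^$]*\$' | sed 's|^\^\([^/]*\)/.*|\1|'
}

# Illustrative input in the same format as the analyser output.
printf '%s\n' '^the/the<det>$ ^blarg/*blarg$ ^cat/cat<n><sg>$' | unknown_words
```

In practice the function would sit at the end of the analysis pipeline, after lt-proc.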

The apertium-destxt step is needed because, if unescaped special characters appear in the stream, lt-proc fails with a std::exception:

$ echo "This is a test ^500" | lt-proc bn.analyser.bin 
This is a test std::exception

(On a Mac, you will typically see 9Exception instead.)
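
One way to avoid this failure is to never call lt-proc on raw text directly. The sketch below wraps the two commands from the analysis example into one function; analyse is a hypothetical name, and the function assumes both tools are on the PATH:

```shell
# Hypothetical wrapper: always run the deformatter before the analyser,
# so that reserved characters in the input are escaped first.
analyse() {
    bin="$1"   # compiled analyser, e.g. bn.analyser.bin
    for tool in apertium-destxt lt-proc; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found" >&2
            return 127
        fi
    done
    apertium-destxt | lt-proc "$bin"
}
```

The failing command above would then be written as echo "This is a test ^500" | analyse bn.analyser.bin.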

Generate

When generating, the input is the analyses produced by the analyser, but with only one analysis per lexical unit. The general input format is

^lemma<tag><tag2><tag3>$ ^otherlemma<othertag><tag2>$

E.g. to generate a couple of the analyses given in the analysis example above:

$ echo '^বাংলা<adj><mf>$ ^।<sent>$ ^এই<det><dem>$' | lt-proc -g bn.generator.bin
বাংলা । এই
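
Analyser output usually carries several analyses per lexical unit, so it cannot be fed to the generator as it is. As a rough sketch only, the sed expression below keeps the first analysis of each unit and drops the surface form. Picking the first analysis is a naive stand-in for real disambiguation; first_analysis is a hypothetical helper and the sample input is made up:

```shell
# Hypothetical helper: turn ^surface/analysis1/analysis2$ into ^analysis1$.
# Keeping the first analysis is an arbitrary choice, not disambiguation.
first_analysis() {
    sed 's|\^[^/$]*/\([^/$]*\)[^$]*\$|^\1$|g'
}

# Illustrative input in the analyser's output format.
printf '%s\n' '^cats/cat<n><pl>$ ^run/run<vblex><inf>/run<n><sg>$' | first_analysis
```

The result is in the input format described above and could be piped into lt-proc -g bn.generator.bin.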

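See also

  • List of dictionaries
  • Daemon: using an lttoolbox dictionary as a "server", without re-loading the dictionary for each request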