Difference between revisions of "Talk:Matxin"

==Old instructions (before 2016)==
 
==Prerequisites==

===Debian/Ubuntu===

Install freeling-3.1 from the tarball. Its prerequisites include:
<pre>
sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \
    libboost-program-options-dev libboost-thread-dev
</pre>
Add <code>-lboost_system</code> to the <code>dicc2phon_LDADD</code> line in <code>src/utilities/Makefile.am</code>, so that it looks like:
<pre>dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system</pre>
Then build and install FreeLing:
<pre>
autoreconf -fi
./configure --prefix=$HOME/PREFIX/freeling
make
make install
</pre>

Add the [[Debian|nightly repo]] and run:
<pre>
sudo apt-get install apertium-all-dev foma-bin libfoma0-dev
</pre>
Then build Matxin:
<pre>
git clone https://github.com/matxin/matxin
cd matxin
export PATH="${PATH}:$HOME/PREFIX/freeling/bin"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig"
export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal"
autoreconf -fi
./configure --prefix=$HOME/PREFIX/matxin
make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
</pre>

: Having to pass CPPFLAGS/LDFLAGS to make here seems like an autotools bug?

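The exports in the build instructions append to variables that may be empty. In that case the value starts with a colon, and an empty entry in <code>PATH</code> or <code>LD_LIBRARY_PATH</code> is normally read as the current directory. A minimal sketch of a safer append, assuming the same <code>$HOME/PREFIX/freeling</code> prefix; <code>append_path</code> and <code>FREELING_PREFIX</code> are made-up names, not part of Matxin or FreeLing:

```shell
#!/bin/sh
# Hypothetical helper (not part of Matxin or FreeLing): print OLD with DIR
# appended, without producing a leading colon when OLD is empty.
append_path() {
    old=$1
    dir=$2
    if [ -z "$old" ]; then
        printf '%s\n' "$dir"
    else
        printf '%s\n' "$old:$dir"
    fi
}

# Same install prefix as in the build instructions; the variable name is ours.
FREELING_PREFIX="$HOME/PREFIX/freeling"

PATH=$(append_path "$PATH" "$FREELING_PREFIX/bin")
LD_LIBRARY_PATH=$(append_path "${LD_LIBRARY_PATH:-}" "$FREELING_PREFIX/lib")
PKG_CONFIG_PATH=$(append_path "${PKG_CONFIG_PATH:-}" "$FREELING_PREFIX/share/pkgconfig")
PKG_CONFIG_PATH=$(append_path "$PKG_CONFIG_PATH" "$FREELING_PREFIX/lib/pkgconfig")
ACLOCAL_PATH=$(append_path "${ACLOCAL_PATH:-}" "$FREELING_PREFIX/share/aclocal")
export PATH LD_LIBRARY_PATH PKG_CONFIG_PATH ACLOCAL_PATH
```

Source this in the shell you build from, before running <code>autoreconf -fi</code>.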
===Old prerequisites===

* BerkeleyDB: <code>sudo apt-get install libdb4.6++-dev</code> (or <code>libdb4.8++-dev</code>)
* libpcre3: <code>sudo apt-get install libpcre3-dev</code>

Install the following libraries in <prefix>:

* libcfg+: http://platon.sk/upload/_projects/00003/libcfg+-0.6.2.tar.gz
* libomlet (from SVN): <code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/omlet</code>
* libfries (from SVN): <code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/fries</code>
* FreeLing (from SVN): <code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/freeling</code>
:If you're installing into a prefix, you'll need to set two environment variables when running configure: <code>CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix></code>
* [[lttoolbox]] (from SVN): <code>svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox</code>. Use version 3.1.1 at minimum; 3.1.0 and lower versions cause data errors and error messages in Matxin due to a missing string close.

==Building==

;Checkout

<pre>
$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin
</pre>

Then do the usual:

<pre>
$ ./configure --prefix=<prefix>
$ make
</pre>

After you've got it built, do:

<pre>
$ su
# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# make install
</pre>

===Mac OS X===

If you've installed boost etc. with MacPorts, run the configure step as:

 env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure

(The configure script doesn't complain if it can't find db46 or cfg+.h, but make does.)

Also, comment out any references to <code>{txt,html,rtf}-deformat.cc</code> in <code>src/Makefile.am</code>, and change <code>data/Makefile.am</code> so that it uses <code>gzcat</code> instead of <code>zcat</code>.

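If the same data files have to be unpacked on both Linux and Mac OS X, the choice between <code>gzcat</code> and <code>zcat</code> could be made at run time. This is only a sketch; <code>pick_zcat</code> is a made-up helper, and nothing here comes from Matxin's own build files:

```shell
#!/bin/sh
# Hypothetical helper: print the first available command that decompresses
# gzip data to standard output. gzcat is tried first, as suggested for Mac OS X.
pick_zcat() {
    if command -v gzcat >/dev/null 2>&1; then
        echo "gzcat"
    elif command -v zcat >/dev/null 2>&1; then
        echo "zcat"
    else
        echo "gzip -dc"
    fi
}

ZCAT=$(pick_zcat)
echo "Using: $ZCAT"
```

How the result gets wired into <code>data/Makefile.am</code> is left open here.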
== Executing ==

If you have not specified a prefix, the default for <code>MATXIN_DIR</code> is <code>/usr/local/bin</code>; in that case, <code>cd /usr/local/bin</code> to run the tests.

Bundled with Matxin there's a script called <code>Matxin_translator</code>, which calls all the necessary modules and connects them using UNIX pipes. This is the recommended way of running Matxin for getting translations.

<pre>
$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg
</pre>

The program <code>txt-deformat</code> is a plain-text format processor. Its calling sequence is <code>txt-deformat format-file input-file</code>, and it creates an XML file from a normal txt input file. Data should be passed through it before being piped to <code>./Analyzer</code>.

Calling it with <code>-h</code> or <code>--help</code> displays help information.
You could write the following to show how the word "gener" is analysed:

 echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg

For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules.

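Running the modules one at a time with saved intermediate output can be scripted. The sketch below is an assumption about how one might do it, not something shipped with Matxin: <code>run_stages</code> is a made-up function, and the demonstration uses stand-in commands (<code>tr</code>, <code>sed</code>) so it runs without Matxin installed.

```shell
#!/bin/sh
# Run a list of commands as a chain, saving each stage's output to a file.
# Usage: run_stages WORKDIR CMD1 CMD2 ...   (input is read from stdin)
# Each CMD is a shell command string that reads stdin and writes stdout.
run_stages() {
    workdir=$1
    shift
    cat > "$workdir/step0.out"        # step0 holds the raw input
    n=0
    for cmd in "$@"; do
        next=$((n + 1))
        sh -c "$cmd" < "$workdir/step$n.out" > "$workdir/step$next.out" || return 1
        n=$next
    done
    cat "$workdir/step$n.out"         # final result goes to stdout
}

# Demonstration with stand-in commands; for Matxin the arguments would be
# strings such as "./Analyzer -f $cfg" and "./LT -f $cfg".
workdir=$(mktemp -d)
echo "esto es una prueba" | run_stages "$workdir" "tr a-z A-Z" "sed 's/PRUEBA/TEST/'"
ls "$workdir"
```

Each <code>stepN.out</code> file can then be inspected, or fed by hand into the next module.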
=== Spanish-Basque ===

<prefix> is typically /usr/local

<pre>
$ export MATXIN_DIR=<prefix>
$ echo "Esto es una prueba" | \
./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_verb -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./reFormat

Da proba bat hau
</pre>

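Since every module in the Spanish-Basque chain takes the same <code>-f</code> config argument, the chain can be generated rather than typed out. A rough sketch, with the stage list copied from that chain; <code>matxin_pipeline</code> is a made-up name, and the fallback to <code>/usr/local</code> only reflects the "typically /usr/local" remark:

```shell
#!/bin/sh
# Print the Spanish-Basque module chain for a given config file, one stage per
# line, so the config path is written only once. Hypothetical helper; it only
# prints the command text and does not run any Matxin binary.
matxin_pipeline() {
    cfg=$1
    first=1
    for stage in "Analyzer" "LT" "ST_intra" "ST_inter --inter 1" "ST_prep" \
                 "ST_inter --inter 2" "ST_verb" "ST_inter --inter 3" \
                 "SG_inter" "SG_intra" "MG"; do
        if [ "$first" -eq 1 ]; then
            first=0
        else
            printf ' | \\\n'          # pipe plus line continuation
        fi
        printf './%s -f %s' "$stage" "$cfg"
    done
    printf ' | \\\n./reFormat\n'      # reFormat takes no config argument
}

matxin_pipeline "${MATXIN_DIR:-/usr/local}/share/matxin/config/es-eu.cfg"
```

The printed text could be pasted after <code>echo "Esto es una prueba" |</code>, or passed to <code>eval</code>; config paths containing spaces are not handled.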
=== English-Basque ===

The same pipeline for English-Basque looks like this:

<pre>
$ cat src/matxin_allen.sh
src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_verb -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/reFormat

$ echo "This is a test" | sh src/matxin_allen.sh
Hau proba da

$ echo "How are you?" | sh src/matxin_allen.sh
Nola zu da?

$ echo "Otto plays football and tennis" | sh src/matxin_allen.sh
Otto-ak jokatzen du futbola tenis-a eta
</pre>

==Speed==

Between 25 and 30 words per second.

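The speed figure is easy to re-check on your own machine. The sketch below is an assumption about how to measure it, not how the figure was obtained; both function names are made up, and timing is in whole seconds only:

```shell
#!/bin/sh
# Hypothetical helper: words per second, rounded to a whole number.
words_per_second() {
    awk -v w="$1" -v s="$2" 'BEGIN { if (s > 0) printf "%.0f\n", w / s; else print "n/a" }'
}

# Time any translation command on a text file and report throughput.
# Usage: measure_speed FILE COMMAND [ARGS...]
measure_speed() {
    file=$1
    shift
    words=$(wc -w < "$file" | tr -d ' ')
    start=$(date +%s)
    "$@" < "$file" > /dev/null
    end=$(date +%s)
    words_per_second "$words" "$((end - start))"
}

# Plain arithmetic check: 300 words translated in 10 seconds.
words_per_second 300 10
```

For Matxin, the command would be something like <code>measure_speed input.txt ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg</code>, with an input long enough to take several seconds.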
==Troubleshooting==

===libdb===
<pre>
g++ -g -O2 -ansi -march=i686 -O3 -fno-pic
-fomit-frame-pointer -L/usr/local/lib -L/usr/lib
-o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet
-lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre

/usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**)

[and a lot of similar lines]
</pre>

Try installing libdb4.8++-dev [http://sourceforge.net/mailarchive/forum.php?thread_name=1313552553.4706.7316.camel%40eki.dlsi.ua.es&forum_name=matxin-devel]

===libcfg+===

If you get the following error:

<pre>
ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
</pre>

Delete the directory and start from scratch. This time, call make as <code>make CFLAGS=-fPIC</code>.

===Various errors===

If you get the following error (the log is from a German locale; "Datei oder Verzeichnis nicht gefunden" means "No such file or directory"):

<pre>
g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2 -g -O2 -ansi -march=i686 -O3
-fno-pic -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C

--->Analyzer.C:10:22: error: freeling.h: Datei oder Verzeichnis nicht gefunden
In file included from Analyzer.C:9:
config.h: In constructor 'config::config(char**)':
config.h:413: warning: deprecated conversion from string constant to 'char*'
Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)':
Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined
Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s...
</pre>

Then change the header includes in <code>src/Analyzer.C</code> to:

<pre>
//#include "freeling.h"

#include "util.h"
#include "tokenizer.h"
#include "splitter.h"
#include "maco.h"
#include "nec.h"
#include "senses.h"
#include "tagger.h"
#include "hmm_tagger.h"
#include "relax_tagger.h"
#include "chart_parser.h"
#include "maco_options.h"
#include "dependencies.h"
</pre>

If you run into the following compile errors:

<pre>
Analyzer.C: In function ‘int main(int, char**)’:
Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’
/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool)
/home/fran/local/include/hmm_tagger.h:84: note: hmm_tagger::hmm_tagger(const hmm_tagger&)
Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’
/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool)
/home/fran/local/include/relax_tagger.h:51: note: relax_tagger::relax_tagger(const relax_tagger&)
Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’
/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&)
/home/fran/local/include/senses.h:45: note: senses::senses(const senses&)
</pre>

Make the following changes in the file <code>src/Analyzer.C</code> (lines starting with <code>-</code> are removed, lines starting with <code>+</code> are added):

<pre>
  if (cfg.TAGGER_which == HMM)
-   tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+   tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false);
  else if (cfg.TAGGER_which == RELAX)
-   tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter,
+   tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter,
                              cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon,
-                             cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+                             false);

  if (cfg.NEC_NEClassification)
    neclass = new nec("NP", cfg.NEC_FilePrefix);

  if (cfg.SENSE_SenseAnnotation!=NONE)
-   sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis);
+   sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis);
</pre>

After that, there will probably still be issues with actually running Matxin.

If you get the error:

<pre>
config.h:33:29: error: freeling/traces.h: No such file or directory
</pre>

Then change the include in <code>src/config.h</code> to:

<pre>
//#include "freeling/traces.h"
#include "traces.h"
</pre>

If you get this error:

<pre>
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found.
</pre>

You can switch the tagger from RelaxCG to HMM: edit the file <code><prefix>/share/matxin/config/es-eu.cfg</code> and change the tagger options to:

<pre>
#### Tagger options
#Tagger=relax
Tagger=hmm
</pre>

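The tagger switch can also be scripted. This sketch assumes the config has a line that reads exactly <code>Tagger=relax</code> (inferred from the commented-out line in the tagger options, not checked against a real es-eu.cfg); <code>use_hmm_tagger</code> is a made-up name, and the demonstration works on a throwaway file:

```shell
#!/bin/sh
# Comment out "Tagger=relax" and add "Tagger=hmm" right after it.
# Hypothetical helper; reads a config on stdin and writes the result to stdout.
use_hmm_tagger() {
    awk '$0 == "Tagger=relax" { print "#" $0; print "Tagger=hmm"; next } { print }'
}

# Demonstration on a temporary file rather than the installed es-eu.cfg.
cfg=$(mktemp)
printf '%s\n' '#### Tagger options' 'Tagger=relax' > "$cfg"
use_hmm_tagger < "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"
cat "$cfg"
```

Lines that do not match exactly (extra spaces, a different value) pass through unchanged, so check the result before relying on it.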
Then there might be a problem in the dependency grammar:

<pre>
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto
</pre>

The easiest thing to do here is to remove the lines that reference the functions it complains about:

<pre>
cat <prefix>/share/matxin/freeling/es/dep/dependences.dat | grep -v 'd:grup-sp.lemma' > newdep
cat newdep | grep -v 'd\.class' > newdep2
cat newdep2 | grep -v 'd:sn.tonto' > <prefix>/share/matxin/freeling/es/dep/dependences.dat
</pre>

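The three filtering passes on <code>dependences.dat</code> can be folded into one <code>grep</code> call. A sketch, run here on made-up sample lines instead of the real file; <code>strip_unregistered</code> is a hypothetical name:

```shell
#!/bin/sh
# Drop every line mentioning one of the unregistered functions, in one pass.
# -F treats the patterns as fixed strings, so "." is not a wildcard.
strip_unregistered() {
    grep -v -F -e 'd:grup-sp.lemma' -e 'd.class' -e 'd:sn.tonto'
}

# Demonstration on invented sample lines, not real dependences.dat content.
sample=$(mktemp)
printf '%s\n' 'rule-a d:sn.tonto=x' 'rule-b keep-me' 'rule-c d.class=y' > "$sample"
# Write to a new file first: redirecting straight onto the input would empty it.
strip_unregistered < "$sample" > "$sample.new" && mv "$sample.new" "$sample"
cat "$sample"
```

Note that <code>grep -v</code> exits non-zero when no line survives, in which case the <code>mv</code> is skipped and the original file is left alone.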
===Error in db===

If you get:
*SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db

rebuild senses16.db from source:
*cat senses16.src | indexdict senses16.db
* (remove senses16.db before rebuilding)

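The two rebuild steps for senses16.db can be wrapped so that the old database is only removed when <code>indexdict</code> is actually available. A sketch under that assumption; <code>rebuild_senses_db</code> is a made-up name, and <code>indexdict</code> is invoked exactly as in the rebuild command (database name as argument, source on stdin):

```shell
#!/bin/sh
# Hypothetical wrapper: remove the old database, then rebuild it from source.
# Usage: rebuild_senses_db senses16.src senses16.db
rebuild_senses_db() {
    src=$1
    db=$2
    if ! command -v indexdict >/dev/null 2>&1; then
        echo "indexdict not found in PATH" >&2
        return 1
    fi
    rm -f "$db"                  # remove the stale database before rebuilding
    indexdict "$db" < "$src"     # same effect as: cat SRC | indexdict DB
}
```

Run it from the directory that holds senses16.src, as a user who is allowed to write there.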
===Error when reading XML files===

If reading XML files does not work and you get an error like
<i>ERROR: invalid document: found <corpus i> when <corpus> was expected...</i>,
make the following changes in <code>src/XML_reader.cc</code>:

1. Add the following function after line 43:
<pre>
wstring
mystows(string const &str)
{
  // One wide character per input byte is always enough, plus the terminator.
  wchar_t* result = new wchar_t[str.size()+1];
  size_t retval = mbstowcs(result, str.c_str(), str.size());
  if (retval == static_cast<size_t>(-1))
  {
    retval = 0; // invalid multibyte sequence: return an empty string
  }
  result[retval] = L'\0';
  wstring result2 = result;
  delete[] result;
  return result2;
}
</pre>
2. Replace all occurrences of
<pre>
XMLParseUtil::stows
</pre>

with
<pre>
mystows
</pre>
Version 3.1.1 of lttoolbox does not have this error any more.

==Results of the individual steps==
<pre>
------------- Step 1
en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
  <CHUNK ord='2' alloc='5' type='grup-verb' si='top'>
    <NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'>
    </NODE>
    <CHUNK ord='1' alloc='0' type='sn' si='subj'>
      <NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'>
      </NODE>
    </CHUNK>
    <CHUNK ord='3' alloc='8' type='sn' si='att'>
      <NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'>
        <NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>
</corpus>
</pre>

<pre>
------------- Step 2
[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ref='1' alloc='0'>
  <CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
    <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
    </NODE>
    <CHUNK ref='1' type='is' alloc='0' si='subj'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ref='3' type='is' alloc='8' si='att'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>
</corpus>
</pre>

<pre>
------------- Step 3
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
  <CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
    <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
    </NODE>
    <CHUNK ref='1' type='is' alloc='0' si='subj'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ref='3' type='is' alloc='8' si='att'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 4
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
  <CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
    <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
    </NODE>
    <CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 5
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
  <CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
    <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
    </NODE>
    <CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 6
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
  <CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
    <NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
    </NODE>
    <CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 7
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
  <CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
    <NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
    </NODE>
    <CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 8
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
  <CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
    <NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
    </NODE>
    <CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
      <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
      <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 9
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
  <CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
    <NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
    </NODE>
    <CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
      <NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
      <NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 10
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
  <CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
    <NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
    </NODE>
    <CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
      <NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
      </NODE>
    </CHUNK>
    <CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
      <NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
        <NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
        </NODE>
      </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

</corpus>

------------- Step 11
Hau proba bat da
</pre>

Latest revision as of 12:19, 5 May 2016

Documentation Descripción del sistema de traducción es-eu Matxin page 30, 6.1: image is missing (Diagrama1.dia) Muki987 12:18, 9 April 2009 (UTC)

Please could you email this to the developers of Matxin. I have emailed you their contact details. - Francis Tyers 12:39, 9 April 2009 (UTC)

==New instructions (2012)==

a) install Foma

   svn co http://devel.cpl.upc.edu/freeling/svn/trunk freeling
   svn co http://matxin.svn.sourceforge.net/svnroot/matxin/trunk matxin

If the FreeLing build fails with an error like:

g++ -DPACKAGE_NAME=\"FreeLing\" -DPACKAGE_TARNAME=\"freeling\" -DPACKAGE_VERSION=\"3.0\" -DPACKAGE_STRING=\"FreeLing\ 3.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE=\"freeling\" -DVERSION=\"3.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_BOOST_REGEX_HPP=1 -DHAVE_BOOST_REGEX_ICU_HPP=1 -DHAVE_BOOST_FILESYSTEM_HPP=1 -DHAVE_BOOST_PROGRAM_OPTIONS_HPP=1 -DHAVE_BOOST_THREAD_HPP=1 -DHAVE_STDBOOL_H=1 -DSTDC_HEADERS=1 -I. -I../../src/include -I../../src/libfreeling/corrector   -O3 -Wall  -MT threaded_analyzer.o -MD -MP -MF .deps/threaded_analyzer.Tpo -c -o threaded_analyzer.o `test -f 'sample_analyzer/threaded_analyzer.cc' || echo './'`sample_analyzer/threaded_analyzer.cc
In file included from /usr/include/boost/thread/thread_time.hpp:9,
                 from /usr/include/boost/thread/locks.hpp:11,
                 from /usr/include/boost/thread/pthread/mutex.hpp:11,
                 from /usr/include/boost/thread/mutex.hpp:16,
                 from /usr/include/boost/thread/pthread/thread.hpp:14,
                 from /usr/include/boost/thread/thread.hpp:17,
                 from /usr/include/boost/thread.hpp:12,
                 from sample_analyzer/threaded_processor.h:35,
                 from sample_analyzer/threaded_analyzer.cc:51:
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected identifier before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected `}' before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected unqualified-id before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:70: error: expected unqualified-id before ‘public’
/usr/include/boost/date_time/microsec_time_clock.hpp:79: error: ‘time_type’ does not name a type
/usr/include/boost/date_time/microsec_time_clock.hpp:84: error: expected unqualified-id before ‘private’
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
make[2]: *** [threaded_analyzer.o] Error 1

Then edit <code>freeling/src/main/Makefile.am</code> and remove <code>$(TH_AN)</code> from <code>bin_PROGRAMS</code>.
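The Makefile.am edit can be done with <code>sed</code> instead of by hand. This is a sketch on a made-up line: the real <code>bin_PROGRAMS</code> may list other programs or continue over several lines, which this does not handle, and <code>drop_th_an</code> is a hypothetical name:

```shell
#!/bin/sh
# Remove the literal token $(TH_AN) from lines that start with bin_PROGRAMS.
# Hypothetical helper; reads Makefile.am text on stdin, writes to stdout.
drop_th_an() {
    sed '/^bin_PROGRAMS/ s/ *\$(TH_AN)//'
}

# Demonstration on an invented line, not the actual FreeLing Makefile.am.
printf '%s\n' 'bin_PROGRAMS = analyzer $(TH_AN)' | drop_th_an
```

Afterwards, rerun the usual autotools steps so the change reaches the generated Makefile.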

Old instructions (before 2016)[edit]

Prerequisites[edit]

Debian/buntu[edit]

Install freeling-3.1 from the tarball; prerequisites include

sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \
   libboost-program-options-dev libboost-thread-dev

Add -lboost_system to the dicc2phon_LDADD line in src/utilities/Makefile.am, should look like:

dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system

Then

autoreconf -fi
./configure --prefix=$HOME/PREFIX/freeling
make
make install


Add the nightly repo and do

sudo apt-get install apertium-all-dev foma-bin libfoma0-dev

Then just

git clone https://github.com/matxin/matxin
cd matxin
export PATH="${PATH}:$HOME/PREFIX/freeling/bin
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig"
export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal"
autoreconf -fi
./configure --prefix=$HOME/PREFIX/matxin
make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
having to send CPPFLAGS/LDFLAGS to make here seems like an autotools bug?

old prerequisites[edit]

  • BerkleyDB — sudo apt-get install libdb4.6++-dev (or libdb4.8++-dev)
  • libpcre3 — sudo apt-get install libpcre3-dev

Install the following libraries in <prefix>,

If you're installing into a prefix, you'll need to set two environment variables: CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix>

Building[edit]

Checkout
$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin

Then do the usual:

$ ./configure --prefix=<prefix>
$ make

After you've got it built, do:

$ su
# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# make install

Mac OS X[edit]

If you've installed boost etc. with Macports, for the configure step do:

env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure

(their configure script doesn't complain if it can't find db46 or cfg+.h, but make does)

Also, comment out any references to {txt,html,rtf}-deformat.cc in src/Makefile.am and change data/Makefile.am so that you use gzcat instead of zcat.

Executing[edit]

The default for MATXIN_DIR, if you have not specified a prefix is /usr/local/bin, if you have not specified a prefix, then you should cd /usr/local/bin to make the tests.

Bundled with Matxin there's a script called Matxin_translator which calls all the necessary modules and interconnects them using UNIX pipes. This is the recommended way of running Matxin for getting translations.

$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg

There exists a program txt-deformat calling sequence: txt-deformat format-file input-file. txt-deformat creates an xml file from a normal txt input file. This can be used before ./Analyzer.

txt-deformat is a plain text format processor. Data should be passed through this processor before being piped to /Analyzer.

Calling it with -h or --help displays help information. You could write the following to show how the word "gener" is analysed:

echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg

For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules.

Spanish-Basque[edit]

<prefix> is typically /usr/local

$ export MATXIN_DIR=<prefix>  
$ echo "Esto es una prueba" |  \
./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_verb   -f $MATXIN_DIR/share/matxin/config/es-eu.cfg  | \
./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./reFormat

Da proba bat hau

English-Basque[edit]

Using the above example for English-Basque looks:

$ cat src/matxinallen.sh
src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_verb   -f $MATXIN_DIR/share/matxin/config/en-eu.cfg  | \
src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/reFormat

$ echo "This is a test" |  sh src/matxin_allen.sh
Hau proba da

$ echo "How are you?" |  sh src/matxin_allen.sh
Nola zu da?

$ echo "Otto plays football and tennis" | sh src/matxin_allen.sh
Otto-ak jokatzen du futbola tenis-a eta

Speed[edit]

Between 25--30 words per second.

Troubleshooting[edit]

libdb[edit]

g++  -g -O2 -ansi -march=i686 -O3 -fno-pic     
-fomit-frame-pointer  -L/usr/local/lib -L/usr/lib
-o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet
-lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre
 
/usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**)

            [and a lot of similar lines]

Try installing libdb4.8++-dev[1]

libcfg+[edit]

If you get the following error:

ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC

Delete the directory, and start from scratch, this time when you call make, call it with make CFLAGS=-fPIC


Various errors[edit]

If you get the error:

g++ -DHAVE_CONFIG_H -I. -I..   -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2  -g -O2 -ansi -march=i686 -O3 
-fno-pic              -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C

--->Analyzer.C:10:22: error: freeling.h: Datei oder Verzeichnis nicht gefunden
 In file included from Analyzer.C:9:
 config.h: In constructor 'config::config(char**)':
 config.h:413: warning: deprecated conversion from string constant to 'char*'
 Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)':
 Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined
 Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s...

(the German message means "No such file or directory"), then replace the freeling.h include in src/Analyzer.C with the individual headers:

//#include "freeling.h"

#include "util.h"
#include "tokenizer.h"
#include "splitter.h"
#include "maco.h"
#include "nec.h"
#include "senses.h"
#include "tagger.h"
#include "hmm_tagger.h"
#include "relax_tagger.h"
#include "chart_parser.h"
#include "maco_options.h"
#include "dependencies.h"

If you then get the following compile errors:

Analyzer.C: In function ‘int main(int, char**)’:
Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’
/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool)
/home/fran/local/include/hmm_tagger.h:84: note:                 hmm_tagger::hmm_tagger(const hmm_tagger&)
Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’
/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool)
/home/fran/local/include/relax_tagger.h:51: note:                 relax_tagger::relax_tagger(const relax_tagger&)
Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’
/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&)
/home/fran/local/include/senses.h:45: note:                 senses::senses(const senses&)

Make the following changes in the file src/Analyzer.C:

   if (cfg.TAGGER_which == HMM)
-    tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+    tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false);
   else if (cfg.TAGGER_which == RELAX)
-    tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter, 
+    tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter,
 			      cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon,
-			      cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect); 
+			      false); 
 
   if (cfg.NEC_NEClassification)
     neclass = new nec("NP", cfg.NEC_FilePrefix);
 
   if (cfg.SENSE_SenseAnnotation!=NONE)
-    sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis);
+    sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis);

After this, there will probably still be problems when actually running Matxin.

If you get the error:

config.h:33:29: error: freeling/traces.h: No such file or directory

Then change the include in src/config.h to:

//#include "freeling/traces.h"
#include "traces.h"

If you get this error:

$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg 
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found.

You can switch the tagger from RelaxCG to HMM. Edit the file <prefix>/share/matxin/config/es-eu.cfg and change the tagger options to:

#### Tagger options
#Tagger=relax
Tagger=hmm
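
The same edit can be scripted. This is a sketch, assuming the config contains a line that reads exactly Tagger=relax; use_hmm_tagger is a hypothetical helper that filters a config on stdin and writes the edited version to stdout, so the original file is not touched.

```shell
# use_hmm_tagger: read a Matxin config on stdin, comment out an active
# "Tagger=relax" line and add "Tagger=hmm" right after it.
# All other lines are passed through unchanged.
# (Hypothetical helper, not shipped with Matxin.)
use_hmm_tagger() {
    awk '
        /^Tagger=relax[ \t]*$/ { print "#" $0; print "Tagger=hmm"; next }
        { print }
    '
}
```

For example: use_hmm_tagger < es-eu.cfg > es-eu.cfg.new, then compare the two files before replacing the original.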

Then there might be a problem in the dependency grammar:

$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg 
DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto

The easiest fix is to remove the lines that reference the functions it complains about. The patterns are fixed strings, so grep -F is used, and the result goes to a temporary file before it is copied over the original:

grep -vF -e 'd:grup-sp.lemma' -e 'd.class' -e 'd:sn.tonto' <prefix>/share/matxin/freeling/es/dep/dependences.dat > newdep
cp newdep <prefix>/share/matxin/freeling/es/dep/dependences.dat
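
If more unregistered functions turn up, the filtering can be wrapped in a small function that also keeps a backup. This is a sketch: drop_lines is a hypothetical helper, and it treats each argument as a fixed string rather than a regular expression.

```shell
# drop_lines FILE STRING...
# Removes every line of FILE that contains one of the fixed STRINGs.
# The untouched original is kept as FILE.bak.
# (Hypothetical helper, not shipped with Matxin.)
drop_lines() {
    file=$1
    shift
    cp "$file" "$file.bak" || return 1
    for s in "$@"; do
        # grep exits with 1 when no line is left; only 2 and above is an error
        grep -vF -e "$s" "$file" > "$file.tmp" || [ $? -eq 1 ] || return 1
        # Copy rather than move, so FILE keeps its permissions
        cp "$file.tmp" "$file" || return 1
        rm -f "$file.tmp"
    done
}
```

For example: drop_lines dependences.dat 'd:grup-sp.lemma' 'd.class' 'd:sn.tonto'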

Error in db

If you get:

  • SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db

remove the old senses16.db and rebuild it from source:

  • rm senses16.db
  • cat senses16.src | indexdict senses16.db

Error when reading xml files

If reading XML files fails with an error such as: ERROR: invalid document: found <corpus i> when <corpus> was expected..., make the following two changes in src/XML_reader.cc:

1. Add the following function after line 43:

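wstring 
mystows(string const &str)
{
   // One wchar_t per input byte is always enough, plus one for the terminator.
   wchar_t* result = new wchar_t[str.size()+1];
   size_t retval = mbstowcs(result, str.c_str(), str.size());
   if (retval == (size_t) -1)
   {
      // Invalid multibyte sequence: return an empty string
      // instead of writing outside the buffer.
      retval = 0;
   }
   result[retval] = L'\0';
   wstring result2 = result;
   delete[] result;
   return result2;
}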

2. Replace all occurrences of

XMLParseUtil::stows

with

mystows
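
The replacement in step 2 can be done with sed. This is a sketch: replace_stows is a hypothetical helper that keeps the original file next to the edited one, so the change is easy to undo.

```shell
# replace_stows FILE
# Rewrites every XMLParseUtil::stows in FILE to mystows.
# The untouched original is kept as FILE.orig.
# (Hypothetical helper, not shipped with Matxin.)
replace_stows() {
    file=$1
    cp "$file" "$file.orig" || return 1
    sed 's/XMLParseUtil::stows/mystows/g' "$file.orig" > "$file"
}
```

For example: replace_stows src/XML_reader.cc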

Version 3.1.1 of lttoolbox does not have this error any more.

Results of the individual steps:

-------------STEP1
en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
<CHUNK ord='2' alloc='5' type='grup-verb' si='top'>
  <NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'>
  </NODE>
  <CHUNK ord='1' alloc='0' type='sn' si='subj'>
    <NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'>
    </NODE>
  </CHUNK>
  <CHUNK ord='3' alloc='8' type='sn' si='att'>
    <NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'>
      <NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
-------------STEP2
[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
  <SENTENCE ref='1' alloc='0'>
    <CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
       <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0'  pos='[ADI][SIN]'>
       </NODE>
      <CHUNK ref='1' type='is' alloc='0' si='subj'>
         <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
         </NODE>
      </CHUNK>
      <CHUNK ref='3' type='is' alloc='8' si='att'>
         <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]'  mi='[NUMS]' sem='[BIZ-]'>
           <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
           </NODE>
         </NODE>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>
-------------STEP3
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP4
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP5
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP6
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP7
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP8
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP9
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP10
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP11
Hau proba bat da
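
Per-step dumps like the ones above can be collected by running the stages one at a time and saving each output. This is a sketch under assumptions: run_stages is a hypothetical helper, each stage is passed as one shell command string, and the stage list is left to the caller because it depends on the Matxin version and language pair.

```shell
# run_stages OUTDIR STAGE...
# Feeds stdin through each STAGE (a shell command string) in turn.
# The input is saved as OUTDIR/step0.out and the output of stage N as
# OUTDIR/stepN.out; the final output is also printed on stdout.
# (Hypothetical helper, not shipped with Matxin.)
run_stages() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    cat > "$outdir/step0.out"
    n=0
    for stage in "$@"; do
        next=$((n + 1))
        # Stop at the first stage that fails, leaving earlier dumps in place
        sh -c "$stage" < "$outdir/step$n.out" > "$outdir/step$next.out" || return 1
        n=$next
    done
    cat "$outdir/step$n.out"
}
```

For example: echo "Esto es una prueba" | run_stages /tmp/steps "./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg" "./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg". Two consecutive dumps can then be compared with diff /tmp/steps/step1.out /tmp/steps/step2.out to see which attributes a stage added or changed.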