==Old instructions (before 2016)==


==Prerequisites==

===Debian/Ubuntu===

Install freeling-3.1 from the tarball; prerequisites include
<pre>
sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \
libboost-program-options-dev libboost-thread-dev
</pre>
Add <code>-lboost_system</code> to the <code>dicc2phon_LDADD</code> line in <code>src/utilities/Makefile.am</code>; it should look like this:
<pre>dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system</pre>
Then
<pre>
autoreconf -fi
./configure --prefix=$HOME/PREFIX/freeling
make
make install
</pre>


Add the [[Debian|nightly repo]] and do
<pre>
sudo apt-get install apertium-all-dev foma-bin libfoma0-dev
</pre>
Then just
<pre>
git clone https://github.com/matxin/matxin
cd matxin
export PATH="${PATH}:$HOME/PREFIX/freeling/bin"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig"
export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal"
autoreconf -fi
./configure --prefix=$HOME/PREFIX/matxin
make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
</pre>

: Having to pass CPPFLAGS/LDFLAGS to make here seems like an autotools bug?
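: If so, a possible workaround (an untested sketch, reusing the same paths as above) is to pass the flags to <code>./configure</code> instead, so that a plain <code>make</code> picks them up:
<pre>
./configure --prefix=$HOME/PREFIX/matxin \
  CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" \
  LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
make
</pre>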

=== old prerequisites ===
* BerkeleyDB &mdash; sudo apt-get install libdb4.6++-dev (or libdb4.8++-dev)
* libpcre3 &mdash; sudo apt-get install libpcre3-dev

Install the following libraries into <prefix>:

* libcfg+ &mdash; http://platon.sk/upload/_projects/00003/libcfg+-0.6.2.tar.gz
* libomlet (from SVN) &mdash; (<code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/omlet</code>)
* libfries (from SVN) &mdash; (<code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/fries</code>)
* FreeLing (from SVN) &mdash; (<code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/freeling</code>)
:If you're installing into a prefix, you'll need to set two environment variables when configuring (see the sketch after this list): CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix>
* [[lttoolbox]] (from SVN) &mdash; (<code>svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox</code>) Use at least version 3.1.1; version 3.1.0 and earlier cause data errors and error messages in Matxin due to a missing string close.
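A typical sequence for each of these libraries, when installing into <prefix>, looks roughly like this (a sketch; the directory name depends on which library you are building):
<pre>
cd libcfg+-0.6.2    # or omlet, fries, freeling, lttoolbox
CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix>
make
make install
</pre>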

==Building==

;Checkout

<pre>
$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin
</pre>

Then do the usual:

<pre>
$ ./configure --prefix=<prefix>
$ make
</pre>

After you've got it built, do:

<pre>
$ su
# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# make install
</pre>

===Mac OS X===
If you've installed boost etc. with Macports, for the configure step do:

<pre>
env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure
</pre>

(their configure script doesn't complain if it can't find db46 or cfg+.h, but make does)

Also, comment out any references to {txt,html,rtf}-deformat.cc in src/Makefile.am and change data/Makefile.am so that you use gzcat instead of zcat.
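A rough sketch of those two edits (check the actual contents of the Makefile.am files first, since the exact lines may differ):
<pre>
# src/Makefile.am: comment out the lines mentioning txt-deformat.cc, html-deformat.cc and rtf-deformat.cc by hand
# data/Makefile.am: replace zcat with gzcat
sed -i '' 's/zcat/gzcat/g' data/Makefile.am
</pre>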

== Executing ==

If you have not specified a prefix, the default for <code>MATXIN_DIR</code> is <code>/usr/local/bin</code>; in that case you should <code>cd /usr/local/bin</code> to run the tests.

Bundled with Matxin there's a script called <code>Matxin_translator</code> which calls all the necessary modules and interconnects them using UNIX pipes. This is the recommended way of running Matxin for getting translations.

<pre>
$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg
</pre>

There is a program <code>txt-deformat</code>, a plain text format processor: it creates an XML file from a plain text input file. Its calling sequence is <code>txt-deformat format-file input-file</code>. Data should be passed through this processor before being piped to <code>./Analyzer</code>.

Calling it with -h or --help displays help information.
You could write the following to show how the word "gener" is analysed:

echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg

For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules.
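For example, a minimal sketch of that approach (assuming the es-eu configuration and that you are in the directory containing the Matxin binaries; the /tmp file names are arbitrary):
<pre>
CFG=$MATXIN_DIR/share/matxin/config/es-eu.cfg
echo "Esto es una prueba" | ./Analyzer -f $CFG > /tmp/step1.xml
./LT -f $CFG < /tmp/step1.xml > /tmp/step2.xml
./ST_intra -f $CFG < /tmp/step2.xml > /tmp/step3.xml
# ...and so on through the remaining modules, inspecting each /tmp/stepN.xml as needed
</pre>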

=== Spanish-Basque ===
<prefix> is typically /usr/local

<pre>
$ export MATXIN_DIR=<prefix>
$ echo "Esto es una prueba" | \
./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_verb -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./reFormat

Da proba bat hau
</pre>

=== English-Basque ===

The same example for English-Basque looks like this:

<pre>
$ cat src/matxin_allen.sh
src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_verb -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/reFormat

$ echo "This is a test" | sh src/matxin_allen.sh
Hau proba da

$ echo "How are you?" | sh src/matxin_allen.sh
Nola zu da?

$ echo "Otto plays football and tennis" | sh src/matxin_allen.sh
Otto-ak jokatzen du futbola tenis-a eta
</pre>

==Speed==

Between 25 and 30 words per second.

==Troubleshooting==

===libdb===
<pre>
g++ -g -O2 -ansi -march=i686 -O3 -fno-pic
-fomit-frame-pointer -L/usr/local/lib -L/usr/lib
-o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet
-lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre
/usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**)

[and a lot of similar lines]
</pre>

Try installing libdb4.8++-dev[http://sourceforge.net/mailarchive/forum.php?thread_name=1313552553.4706.7316.camel%40eki.dlsi.ua.es&forum_name=matxin-devel]

===libcfg+===

If you get the following error:

<pre>
ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
</pre>

Delete the directory and start from scratch; this time, when you call make, call it with <code>make CFLAGS=-fPIC</code>.


===Various errors===

If you get the error:

<pre>
g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2 -g -O2 -ansi -march=i686 -O3
-fno-pic -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C

--->Analyzer.C:10:22: error: freeling.h: Datei oder Verzeichnis nicht gefunden
In file included from Analyzer.C:9:
config.h: In constructor 'config::config(char**)':
config.h:413: warning: deprecated conversion from string constant to 'char*'
Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)':
Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined
Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s...
</pre>

Then change the header files in <code>src/Analyzer.C</code> to:

<pre>
//#include "freeling.h"

#include "util.h"
#include "tokenizer.h"
#include "splitter.h"
#include "maco.h"
#include "nec.h"
#include "senses.h"
#include "tagger.h"
#include "hmm_tagger.h"
#include "relax_tagger.h"
#include "chart_parser.h"
#include "maco_options.h"
#include "dependencies.h"
</pre>

Upon finding yourself battling the following compile problem,

<pre>
Analyzer.C: In function ‘int main(int, char**)’:
Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’
/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool)
/home/fran/local/include/hmm_tagger.h:84: note: hmm_tagger::hmm_tagger(const hmm_tagger&)
Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’
/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool)
/home/fran/local/include/relax_tagger.h:51: note: relax_tagger::relax_tagger(const relax_tagger&)
Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’
/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&)
/home/fran/local/include/senses.h:45: note: senses::senses(const senses&)
</pre>

Make the following changes in the file <code>src/Analyzer.C</code>:

<pre>
if (cfg.TAGGER_which == HMM)
- tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+ tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false);
else if (cfg.TAGGER_which == RELAX)
- tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter,
+ tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter,
cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon,
- cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+ false);
if (cfg.NEC_NEClassification)
neclass = new nec("NP", cfg.NEC_FilePrefix);
if (cfg.SENSE_SenseAnnotation!=NONE)
- sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis);
+ sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis);
</pre>

Then probably there will be issues with actually running Matxin.

If you get the error:

<pre>
config.h:33:29: error: freeling/traces.h: No such file or directory
</pre>

Then change the header files in <code>src/config.h</code> to:

<pre>
//#include "freeling/traces.h"
#include "traces.h"
</pre>

If you get this error:

<pre>
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found.
</pre>

You can change the tagger from RelaxCG to HMM: edit the file <code><prefix>/share/matxin/config/es-eu.cfg</code> and change:

<pre>
#### Tagger options
#Tagger=relax
Tagger=hmm
</pre>

Then there might be a problem in the dependency grammar:

<pre>
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto
</pre>

The easiest thing to do here is to just remove references to the stuff it complains about:

<pre>
cat <prefix>/share/matxin/freeling/es/dep/dependences.dat | grep -v d:grup-sp.lemma > newdep
cat newdep | grep -v d\.class > newdep2
cat newdep2 | grep -v d:sn.tonto > <prefix>/share/matxin/freeling/es/dep/dependences.dat
</pre>
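Equivalently, this can be done in a single pass (a sketch; note that the final redirect overwrites the original file, so keep the backup around until you are happy with the result):
<pre>
cp <prefix>/share/matxin/freeling/es/dep/dependences.dat /tmp/dependences.dat.bak
grep -v -e 'd:grup-sp.lemma' -e 'd\.class' -e 'd:sn.tonto' /tmp/dependences.dat.bak > <prefix>/share/matxin/freeling/es/dep/dependences.dat
</pre>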

===Error in db===

If you get:
*SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db

rebuild senses16.db from source:
*cat senses16.src | indexdict senses16.db
* (remove the old senses16.db before rebuilding)
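Put together, a minimal sketch (assuming you run it from the directory that contains senses16.src):
<pre>
rm senses16.db
cat senses16.src | indexdict senses16.db
</pre>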

===Error when reading xml files===

If reading XML files does not work and you get an error like
<i>ERROR: invalid document: found <corpus i> when <corpus> was expected...</i>,
do the following in <code>src/XML_reader.cc</code>:

1. Add the following function after line 43:
<pre>
// convert a narrow (multibyte) string to a wide string
wstring
mystows(string const &str)
{
  wchar_t* result = new wchar_t[str.size()+1];
  size_t retval = mbstowcs(result, str.c_str(), str.size());
  result[retval] = L'\0';
  wstring result2 = result;
  delete[] result;
  return result2;
}
</pre>
2. Replace all occurrences of
<pre>
XMLParseUtil::stows
</pre>

with
<pre>
mystows
</pre>
Version 3.1.1 of lttoolbox does not have this error any more.

==Results of the individual steps==
<pre>
--------------------Step1
en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f
$MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
<CHUNK ord='2' alloc='5' type='grup-verb' si='top'>
<NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'>
</NODE>
<CHUNK ord='1' alloc='0' type='sn' si='subj'>
<NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'>
</NODE>
</CHUNK>
<CHUNK ord='3' alloc='8' type='sn' si='att'>
<NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'>
<NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>

<pre>
---------------------Step2
[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f
$MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>

<pre>
----------- step3
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP4
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP5
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP6
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP7
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP8
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP9
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------- step10
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP11
Hau proba bat da

</pre>

Documentation ''Descripción del sistema de traducción es-eu Matxin'', page 30, section 6.1: an image is missing (Diagrama1.dia). Muki987 12:18, 9 April 2009 (UTC)

:Please could you email this to the developers of Matxin? I have emailed you their contact details. - Francis Tyers 12:39, 9 April 2009 (UTC)

==New instructions (2012)==

a) install Foma, then check out FreeLing and Matxin:

<pre>
svn co http://devel.cpl.upc.edu/freeling/svn/trunk freeling
svn co http://matxin.svn.sourceforge.net/svnroot/matxin/trunk matxin
</pre>

In FreeLing, if you get an error like:

<pre>
g++ -DPACKAGE_NAME=\"FreeLing\" -DPACKAGE_TARNAME=\"freeling\" -DPACKAGE_VERSION=\"3.0\" -DPACKAGE_STRING=\"FreeLing\ 3.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE=\"freeling\" -DVERSION=\"3.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_BOOST_REGEX_HPP=1 -DHAVE_BOOST_REGEX_ICU_HPP=1 -DHAVE_BOOST_FILESYSTEM_HPP=1 -DHAVE_BOOST_PROGRAM_OPTIONS_HPP=1 -DHAVE_BOOST_THREAD_HPP=1 -DHAVE_STDBOOL_H=1 -DSTDC_HEADERS=1 -I. -I../../src/include -I../../src/libfreeling/corrector   -O3 -Wall  -MT threaded_analyzer.o -MD -MP -MF .deps/threaded_analyzer.Tpo -c -o threaded_analyzer.o `test -f 'sample_analyzer/threaded_analyzer.cc' || echo './'`sample_analyzer/threaded_analyzer.cc
In file included from /usr/include/boost/thread/thread_time.hpp:9,
                 from /usr/include/boost/thread/locks.hpp:11,
                 from /usr/include/boost/thread/pthread/mutex.hpp:11,
                 from /usr/include/boost/thread/mutex.hpp:16,
                 from /usr/include/boost/thread/pthread/thread.hpp:14,
                 from /usr/include/boost/thread/thread.hpp:17,
                 from /usr/include/boost/thread.hpp:12,
                 from sample_analyzer/threaded_processor.h:35,
                 from sample_analyzer/threaded_analyzer.cc:51:
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected identifier before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected `}' before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected unqualified-id before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:70: error: expected unqualified-id before ‘public’
/usr/include/boost/date_time/microsec_time_clock.hpp:79: error: ‘time_type’ does not name a type
/usr/include/boost/date_time/microsec_time_clock.hpp:84: error: expected unqualified-id before ‘private’
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
make[2]: *** [threaded_analyzer.o] Error 1

</pre>

Then edit <code>freeling/src/main/Makefile.am</code> and remove <code>$(TH_AN)</code> from <code>bin_PROGRAMS</code>.
