Difference between revisions of "Talk:Matxin"

Latest revision as of 12:19, 5 May 2016

In the documentation "Descripción del sistema de traducción es-eu Matxin", page 30, section 6.1, the image is missing (Diagrama1.dia). Muki987 12:18, 9 April 2009 (UTC)

Please could you email this to the developers of Matxin. I have emailed you their contact details. - Francis Tyers 12:39, 9 April 2009 (UTC)

New instructions (2012)

a) Install Foma, then check out FreeLing and Matxin:

   svn co http://devel.cpl.upc.edu/freeling/svn/trunk freeling
   svn co http://matxin.svn.sourceforge.net/svnroot/matxin/trunk matxin

If building FreeLing fails with an error like:

g++ -DPACKAGE_NAME=\"FreeLing\" -DPACKAGE_TARNAME=\"freeling\" -DPACKAGE_VERSION=\"3.0\" -DPACKAGE_STRING=\"FreeLing\ 3.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE=\"freeling\" -DVERSION=\"3.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_BOOST_REGEX_HPP=1 -DHAVE_BOOST_REGEX_ICU_HPP=1 -DHAVE_BOOST_FILESYSTEM_HPP=1 -DHAVE_BOOST_PROGRAM_OPTIONS_HPP=1 -DHAVE_BOOST_THREAD_HPP=1 -DHAVE_STDBOOL_H=1 -DSTDC_HEADERS=1 -I. -I../../src/include -I../../src/libfreeling/corrector   -O3 -Wall  -MT threaded_analyzer.o -MD -MP -MF .deps/threaded_analyzer.Tpo -c -o threaded_analyzer.o `test -f 'sample_analyzer/threaded_analyzer.cc' || echo './'`sample_analyzer/threaded_analyzer.cc
In file included from /usr/include/boost/thread/thread_time.hpp:9,
                 from /usr/include/boost/thread/locks.hpp:11,
                 from /usr/include/boost/thread/pthread/mutex.hpp:11,
                 from /usr/include/boost/thread/mutex.hpp:16,
                 from /usr/include/boost/thread/pthread/thread.hpp:14,
                 from /usr/include/boost/thread/thread.hpp:17,
                 from /usr/include/boost/thread.hpp:12,
                 from sample_analyzer/threaded_processor.h:35,
                 from sample_analyzer/threaded_analyzer.cc:51:
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected identifier before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected `}' before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected unqualified-id before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:70: error: expected unqualified-id before ‘public’
/usr/include/boost/date_time/microsec_time_clock.hpp:79: error: ‘time_type’ does not name a type
/usr/include/boost/date_time/microsec_time_clock.hpp:84: error: expected unqualified-id before ‘private’
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
make[2]: *** [threaded_analyzer.o] Error 1

Then edit freeling/src/main/Makefile.am and remove $(TH_AN) from the bin_PROGRAMS.
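The Makefile.am edit above can also be scripted. The following is a minimal sketch, assuming $(TH_AN) appears literally in the file. The helper name drop_th_an and the stand-in file contents are made up for illustration; the real Makefile.am will look different.

```shell
# Hypothetical helper: delete every literal $(TH_AN) from the given file.
# sed -i.bak keeps a backup copy next to the file (works with GNU and BSD sed).
drop_th_an() {
    sed -i.bak 's/[[:space:]]*\$(TH_AN)//g' "$1"
}

# Demonstration on a made-up stand-in, not the real freeling/src/main/Makefile.am.
demo_makefile=$(mktemp)
printf 'bin_PROGRAMS = analyzer $(TH_AN)\n' > "$demo_makefile"
drop_th_an "$demo_makefile"
th_an_result=$(cat "$demo_makefile")
echo "$th_an_result"
```

For the real tree the call would be drop_th_an freeling/src/main/Makefile.am, followed by re-running the build.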

Old instructions (before 2016)

Prerequisites

Debian/Ubuntu

Install freeling-3.1 from the tarball; prerequisites include

sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \
   libboost-program-options-dev libboost-thread-dev

Add -lboost_system to the dicc2phon_LDADD line in src/utilities/Makefile.am, so that it looks like this:

dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system

Then build and install FreeLing:

autoreconf -fi
./configure --prefix=$HOME/PREFIX/freeling
make
make install

Add the nightly repository and run:

sudo apt-get install apertium-all-dev foma-bin libfoma0-dev

Then build Matxin:

git clone https://github.com/matxin/matxin
cd matxin
export PATH="${PATH}:$HOME/PREFIX/freeling/bin"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig"
export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal"
autoreconf -fi
./configure --prefix=$HOME/PREFIX/matxin
make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"

having to send CPPFLAGS/LDFLAGS to make here seems like an autotools bug?
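One detail of the export lines above: if a variable such as LD_LIBRARY_PATH is unset, "${LD_LIBRARY_PATH}:..." leaves a leading ":" in the value, and an empty entry in PATH or LD_LIBRARY_PATH is commonly interpreted as the current directory. A minimal sketch of a way to avoid that, using the ${VAR:+...} expansion; the helper name join_path and the /opt/freeling/lib path are made up for illustration:

```shell
# Hypothetical helper: print NEW appended to EXISTING, adding the ":" separator
# only when EXISTING is non-empty.
join_path() {
    printf '%s\n' "${1:+$1:}$2"
}

empty_case=$(join_path "" "/opt/freeling/lib")
set_case=$(join_path "/usr/lib" "/opt/freeling/lib")
echo "$empty_case"
echo "$set_case"

# The same expansion applied directly to one of the variables above:
# export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$HOME/PREFIX/freeling/lib"
```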

Old prerequisites

Berkeley DB: sudo apt-get install libdb4.6++-dev (or libdb4.8++-dev)
libpcre3: sudo apt-get install libpcre3-dev

Install the following libraries in <prefix>:

libcfg+: http://platon.sk/upload/_projects/00003/libcfg+-0.6.2.tar.gz
libomlet (from SVN): svn co http://devel.cpl.upc.edu/freeling/svn/latest/omlet
libfries (from SVN): svn co http://devel.cpl.upc.edu/freeling/svn/latest/fries
FreeLing (from SVN): svn co http://devel.cpl.upc.edu/freeling/svn/latest/freeling

If you're installing into a prefix, you'll need to set two environment variables when running configure: CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix>

lttoolbox (from SVN): svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox (use version 3.1.1 or later; versions 3.1.0 and lower cause data errors and error messages in Matxin because of a missing string terminator)

Building

Checkout

$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin

Then do the usual:

$ ./configure --prefix=<prefix>
$ make

After you've got it built, do:

$ su
# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# make install

Mac OS X

If you've installed boost etc. with MacPorts, run the configure step like this:

env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure

(their configure script doesn't complain if it can't find db46 or cfg+.h, but make does)

Also, comment out any references to {txt,html,rtf}-deformat.cc in src/Makefile.am and change data/Makefile.am so that you use gzcat instead of zcat.

Executing

If you did not specify a prefix, the programs are installed in /usr/local/bin; cd /usr/local/bin to run the tests. MATXIN_DIR should be set to the prefix (/usr/local by default).

Bundled with Matxin there's a script called Matxin_translator which calls all the necessary modules and interconnects them using UNIX pipes. This is the recommended way of running Matxin for getting translations.

$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg

The txt-deformat program is a plain-text format processor: it creates an XML file from a normal text input file. Its calling sequence is txt-deformat format-file input-file, and calling it with -h or --help displays help information. Data should be passed through this processor before being piped to ./Analyzer.

You could write the following to show how the word "gener" is analysed:

echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg

For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules.
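Running the modules one at a time with temporary files could be sketched as follows. This is only an illustration: the helper run_stage and the step*.out file names are made up, and tr is used as a stand-in for the Matxin programs so that the sketch runs without Matxin installed.

```shell
workdir=$(mktemp -d)

# Hypothetical helper: run_stage N COMMAND [ARGS...]
# reads step(N-1).out from $workdir and writes stepN.out.
run_stage() {
    n=$1
    shift
    "$@" < "$workdir/step$((n - 1)).out" > "$workdir/step$n.out"
}

echo "Esto es una prueba" > "$workdir/step0.out"

# Stand-in commands. With Matxin installed the calls would be, for example:
#   run_stage 1 ./Analyzer -f "$MATXIN_DIR/share/matxin/config/es-eu.cfg"
#   run_stage 2 ./LT -f "$MATXIN_DIR/share/matxin/config/es-eu.cfg"
run_stage 1 tr 'a-z' 'A-Z'
run_stage 2 tr -d ' '

final_output=$(cat "$workdir/step2.out")
echo "$final_output"
```

Every intermediate file stays in $workdir, so the output of any single module can be inspected or fed by hand into the next one.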

Spanish-Basque

<prefix> is typically /usr/local

$ export MATXIN_DIR=<prefix>  
$ echo "Esto es una prueba" |  \
./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./ST_verb   -f $MATXIN_DIR/share/matxin/config/es-eu.cfg  | \
./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
./reFormat

Da proba bat hau
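Since every module in the pipeline above takes the same -f argument, the command line can be generated instead of typed. A minimal sketch, assuming the module order shown above; the helper name build_pipeline is made up, and the generated line only works when the configuration path contains no spaces.

```shell
# Hypothetical helper: print the pipeline for a given configuration file.
build_pipeline() {
    cfg=$1
    printf './Analyzer -f %s' "$cfg"
    for stage in 'LT' 'ST_intra' 'ST_inter --inter 1' 'ST_prep' \
                 'ST_inter --inter 2' 'ST_verb' 'ST_inter --inter 3' \
                 'SG_inter' 'SG_intra' 'MG'; do
        printf ' | ./%s -f %s' "$stage" "$cfg"
    done
    printf ' | ./reFormat\n'
}

pipeline=$(build_pipeline 'share/matxin/config/es-eu.cfg')
echo "$pipeline"

# To run it (only with Matxin installed, from the directory holding the programs):
#   echo "Esto es una prueba" | eval "$(build_pipeline "$MATXIN_DIR/share/matxin/config/es-eu.cfg")"
```

Passing the en-eu.cfg path instead gives the corresponding English-Basque command line.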

English-Basque

The same pipeline for English-Basque looks like this:

$ cat src/matxin_allen.sh
src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/ST_verb   -f $MATXIN_DIR/share/matxin/config/en-eu.cfg  | \
src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
src/reFormat

$ echo "This is a test" |  sh src/matxin_allen.sh
Hau proba da

$ echo "How are you?" |  sh src/matxin_allen.sh
Nola zu da?

$ echo "Otto plays football and tennis" | sh src/matxin_allen.sh
Otto-ak jokatzen du futbola tenis-a eta

Speed

Between 25 and 30 words per second.

Troubleshooting

libdb

If linking fails with undefined references like the following:

g++  -g -O2 -ansi -march=i686 -O3 -fno-pic     
-fomit-frame-pointer  -L/usr/local/lib -L/usr/lib
-o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet
-lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre
 
/usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**)

            [and a lot of similar lines]

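Try installing libdb4.8++-dev (see http://sourceforge.net/mailarchive/forum.php?thread_name=1313552553.4706.7316.camel%40eki.dlsi.ua.es&forum_name=matxin-devel).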

libcfg+

If you get the following error:

ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC

Delete the directory and start from scratch, this time calling make as make CFLAGS=-fPIC.

Various errors

If you get the error:

g++ -DHAVE_CONFIG_H -I. -I..   -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2  -g -O2 -ansi -march=i686 -O3 
-fno-pic              -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C

--->Analyzer.C:10:22: error: freeling.h: No such file or directory
 In file included from Analyzer.C:9:
 config.h: In constructor 'config::config(char**)':
 config.h:413: warning: deprecated conversion from string constant to 'char*'
 Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)':
 Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined
 Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s...

Then change the includes in src/Analyzer.C to:

//#include "freeling.h"

#include "util.h"
#include "tokenizer.h"
#include "splitter.h"
#include "maco.h"
#include "nec.h"
#include "senses.h"
#include "tagger.h"
#include "hmm_tagger.h"
#include "relax_tagger.h"
#include "chart_parser.h"
#include "maco_options.h"
#include "dependencies.h"

If you get the following compile errors:

Analyzer.C: In function ‘int main(int, char**)’:
Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’
/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool)
/home/fran/local/include/hmm_tagger.h:84: note:                 hmm_tagger::hmm_tagger(const hmm_tagger&)
Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’
/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool)
/home/fran/local/include/relax_tagger.h:51: note:                 relax_tagger::relax_tagger(const relax_tagger&)
Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’
/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&)
/home/fran/local/include/senses.h:45: note:                 senses::senses(const senses&)

Make the following changes in the file src/Analyzer.C:

   if (cfg.TAGGER_which == HMM)
-    tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+    tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false);
   else if (cfg.TAGGER_which == RELAX)
-    tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter, 
+    tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter,
 			      cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon,
-			      cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect); 
+			      false); 
 
   if (cfg.NEC_NEClassification)
     neclass = new nec("NP", cfg.NEC_FilePrefix);
 
   if (cfg.SENSE_SenseAnnotation!=NONE)
-    sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis);
+    sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis);

After that, there will probably still be issues with actually running Matxin.

If you get the error:

config.h:33:29: error: freeling/traces.h: No such file or directory

Then change the include in src/config.h to:

//#include "freeling/traces.h"
#include "traces.h"

If you get this error:

$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg 
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found.

You can switch the tagger from RelaxCG to HMM: edit the file <prefix>/share/matxin/config/es-eu.cfg and change the tagger options to:

#### Tagger options
#Tagger=relax
Tagger=hmm
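The same configuration change could be scripted with sed. This is a sketch under the assumption that the file contains an active line reading exactly Tagger=relax; the helper name use_hmm_tagger and the stand-in file are made up for illustration.

```shell
# Hypothetical helper: comment out Tagger=relax and add Tagger=hmm below it.
# The backslash-newline inside the replacement inserts a line break.
use_hmm_tagger() {
    sed -i.bak 's/^Tagger=relax$/#Tagger=relax\
Tagger=hmm/' "$1"
}

# Demonstration on a made-up stand-in, not the real es-eu.cfg.
demo_cfg=$(mktemp)
printf '#### Tagger options\nTagger=relax\n' > "$demo_cfg"
use_hmm_tagger "$demo_cfg"
tagger_cfg=$(cat "$demo_cfg")
echo "$tagger_cfg"
```

The original file is kept with a .bak suffix, so the change is easy to undo.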

Then there might be a problem in the dependency grammar:

$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg 
DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto

The easiest thing to do here is to just remove references to the stuff it complains about:

cat <prefix>/share/matxin/freeling/es/dep/dependences.dat | grep -v d:grup-sp.lemma > newdep
cat newdep | grep -v d\.class > newdep2
cat newdep2 | grep -v d:sn.tonto > <prefix>/share/matxin/freeling/es/dep/dependences.dat
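The three filtering steps above could also be done in one pass. A minimal sketch, assuming it is enough to drop every line that contains one of the three names as a literal string; the helper name filter_dependences and the sample lines are made up, and -F (fixed strings) is slightly stricter than the unquoted d\.class pattern above, where the dot matches any character.

```shell
# Hypothetical helper: print FILE without the lines that mention any of the
# three names; -F makes grep treat each pattern as a fixed string.
filter_dependences() {
    grep -F -v -e 'd:grup-sp.lemma' -e 'd.class' -e 'd:sn.tonto' "$1"
}

# Demonstration on made-up lines, not the real dependences.dat.
demo_dep=$(mktemp)
printf '%s\n' 'rule-1 keeps this line' 'rule-2 d:sn.tonto' \
              'rule-3 d.class' 'rule-4 d:grup-sp.lemma' > "$demo_dep"
filtered=$(filter_dependences "$demo_dep")
echo "$filtered"
```

As in the commands above, write the result to a new file first and only then replace dependences.dat; redirecting straight onto the input file would truncate it before grep reads it.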

Error in db

If you get:

SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db

remove the old senses16.db and rebuild it from source:

rm senses16.db
cat senses16.src | indexdict senses16.db

Error when reading xml files

If reading XML files does not work and you get an error like "ERROR: invalid document: found <corpus i> when <corpus> was expected...", make the following changes in src/XML_reader.cc:

1. Add the following subroutine after line 43:

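wstring
mystows(string const &str)
{
   // Worst case: one wide character per input byte, plus the terminator.
   wchar_t* result = new wchar_t[str.size()+1];
   size_t retval = mbstowcs(result, str.c_str(), str.size());
   if (retval == (size_t)-1)
      retval = 0;   // invalid multibyte sequence: return an empty string
   result[retval] = L'\0';
   wstring result2 = result;
   delete[] result;
   return result2;
}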

2. Replace all occurrences of

XMLParseUtil::stows

with

mystows

Version 3.1.1 of lttoolbox does not have this error any more.
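For affected versions, step 2 above can be done with a single sed substitution. A minimal sketch; the helper name use_mystows and the sample source line are made up for illustration.

```shell
# Hypothetical helper: replace every XMLParseUtil::stows with mystows in FILE,
# keeping a .bak copy of the original.
use_mystows() {
    sed -i.bak 's/XMLParseUtil::stows/mystows/g' "$1"
}

# Demonstration on a made-up line, not the real src/XML_reader.cc.
demo_src=$(mktemp)
printf 'wstring name = XMLParseUtil::stows(attr);\n' > "$demo_src"
use_mystows "$demo_src"
patched=$(cat "$demo_src")
echo "$patched"
```

On the real source the call would be use_mystows src/XML_reader.cc, made after the mystows subroutine from step 1 has been added.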

Results of the individual steps

-------------STEP1
en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
<CHUNK ord='2' alloc='5' type='grup-verb' si='top'>
  <NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'>
  </NODE>
  <CHUNK ord='1' alloc='0' type='sn' si='subj'>
    <NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'>
    </NODE>
  </CHUNK>
  <CHUNK ord='3' alloc='8' type='sn' si='att'>
    <NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'>
      <NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>

-------------STEP2
[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
  <SENTENCE ref='1' alloc='0'>
    <CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
       <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0'  pos='[ADI][SIN]'>
       </NODE>
      <CHUNK ref='1' type='is' alloc='0' si='subj'>
         <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
         </NODE>
      </CHUNK>
      <CHUNK ref='3' type='is' alloc='8' si='att'>
         <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]'  mi='[NUMS]' sem='[BIZ-]'>
           <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
           </NODE>
         </NODE>
      </CHUNK>
    </CHUNK>
  </SENTENCE>
</corpus>

-------------STEP3
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP4
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP5
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP6
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP7
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP8
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP9
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP10
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
<NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
<NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

-------------STEP11
Hau proba bat da

@@ Line 41: / Line 41: @@
 ==Old instructions (before 2016)==
+==Prerequisites==
+===Debian/buntu===
+Install freeling-3.1 from the tarball; prerequisites include
+<pre>
+sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \
+   libboost-program-options-dev libboost-thread-dev
+</pre>
+Add -lboost_system to the dicc2phon_LDADD line in src/utilities/Makefile.am, should look like:
+<pre>dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system</pre>
+Then <pre>autoreconf -fi
+./configure --prefix=$HOME/PREFIX/freeling
+make
+make install
+</pre>
+Add the [[Debian|nightly repo]] and do
+<pre>
+sudo apt-get install apertium-all-dev foma-bin libfoma0-dev
+</pre>
+Then just
+<pre>
+git clone https://github.com/matxin/matxin
+cd matxin
+export PATH="${PATH}:$HOME/PREFIX/freeling/bin
+export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib"
+export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig"
+export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal"
+autoreconf -fi
+./configure --prefix=$HOME/PREFIX/matxin
+make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
+</pre>
+: having to send CPPFLAGS/LDFLAGS to make here seems like an autotools bug?
+=== old prerequisites ===
+* BerkleyDB &mdash; sudo apt-get install libdb4.6++-dev (or libdb4.8++-dev)
+* libpcre3 &mdash; sudo apt-get install libpcre3-dev
+Install the following libraries in <prefix>,
+* libcfg+ &mdash; http://platon.sk/upload/_projects/00003/libcfg+-0.6.2.tar.gz
+* libomlet (from SVN) &mdash; (<code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/omlet</code>)
+* libfries (from SVN) &mdash; (<code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/fries</code>)
+* FreeLing (from SVN) &mdash; (<code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/freeling</code>)
+:If you're installing into a prefix, you'll need to set two environment variables: CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix>
+* [[lttoolbox]] (from SVN) &mdash; (<code>svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox</code>) Take as a minimum version 3.1.1; 3.1.0 and lower versions cause data error and error messages in Matxin due to a missing string close.
+==Building==
+;Checkout
+<pre>
+$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin
+</pre>
+Then do the usual:
+<pre>
+$ ./configure --prefix=<prefix>
+$ make
+</pre>
+After you've got it built, do:
+<pre>
+$ su
+# export LD_LIBRARY_PATH=/usr/local/lib
+# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
+# make install
+</pre>
+===Mac OS X===
+If you've installed boost etc. with Macports, for the configure step do:
+ env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure
+(their configure script doesn't complain if it can't find db46 or cfg+.h, but make does)
+Also, comment out any references to {txt,html,rtf}-deformat.cc in src/Makefile.am and change data/Makefile.am so that you use gzcat instead of zcat.
+== Executing ==
+The default for <code>MATXIN_DIR</code>, if you have not specified a prefix is <code>/usr/local/bin</code>, if you have not specified a prefix, then you should <code>cd /usr/local/bin</code> to make the tests.
+Bundled with Matxin there's a script called <code>Matxin_translator</code> which calls all the necessary modules and interconnects them using UNIX pipes. This is the recommended way of running Matxin for getting translations.
+<pre>
+$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg
+</pre>
+There exists a program txt-deformat calling sequence: txt-deformat format-file input-file. txt-deformat creates an xml file from a normal txt input file. This can be used before ./Analyzer.
+txt-deformat is a plain text format processor. Data should be passed through this processor before being piped to /Analyzer.
+Calling it with -h or --help displays help information.
+You could write the following to show how the word "gener" is analysed:
+ echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
+For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules.
+=== Spanish-Basque ===
+<prefix> is typically /usr/local
+<pre>
+$ export MATXIN_DIR=<prefix>
+$ echo "Esto es una prueba" |  \
+./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./ST_verb   -f $MATXIN_DIR/share/matxin/config/es-eu.cfg  | \
+./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \
+./reFormat
+Da proba bat hau
+</pre>
+=== English-Basque ===
+Using the above example for English-Basque looks:
+<pre>
+$ cat src/matxinallen.sh
+src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/ST_verb   -f $MATXIN_DIR/share/matxin/config/en-eu.cfg  | \
+src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \
+src/reFormat
+$ echo "This is a test" |  sh src/matxin_allen.sh
+Hau proba da
+$ echo "How are you?" |  sh src/matxin_allen.sh
+Nola zu da?
+$ echo "Otto plays football and tennis" | sh src/matxin_allen.sh
+Otto-ak jokatzen du futbola tenis-a eta
+</pre>
+==Speed==
+Between 25--30 words per second.
+==Troubleshooting==
+===libdb===
+<pre>
+g++  -g -O2 -ansi -march=i686 -O3 -fno-pic
+-fomit-frame-pointer  -L/usr/local/lib -L/usr/lib
+-o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet
+-lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre
+/usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**)
+            [and a lot of similar lines]
+</pre>
+Try installing libdb4.8++-dev[http://sourceforge.net/mailarchive/forum.php?thread_name=1313552553.4706.7316.camel%40eki.dlsi.ua.es&forum_name=matxin-devel]
+===libcfg+===
+If you get the following error:
+<pre>
+ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
+</pre>
+Delete the directory, and start from scratch, this time when you call make, call it with <code>make CFLAGS=-fPIC</code>
+===Various errors===
+If you get the error:
+<pre>
+g++ -DHAVE_CONFIG_H -I. -I..   -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2  -g -O2 -ansi -march=i686 -O3
+-fno-pic              -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C
+--->Analyzer.C:10:22: error: freeling.h: Datei oder Verzeichnis nicht gefunden
+ In file included from Analyzer.C:9:
+ config.h: In constructor 'config::config(char**)':
+ config.h:413: warning: deprecated conversion from string constant to 'char*'
+ Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)':
+ Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined
+ Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s...
+</pre>
+Then change the header files in <code>src/Analyzer.C</code> to:
+<pre>
+//#include "freeling.h"
+#include "util.h"
+#include "tokenizer.h"
+#include "splitter.h"
+#include "maco.h"
+#include "nec.h"
+#include "senses.h"
+#include "tagger.h"
+#include "hmm_tagger.h"
+#include "relax_tagger.h"
+#include "chart_parser.h"
+#include "maco_options.h"
+#include "dependencies.h"
+</pre>
+If you get the following compile error:
+<pre>
+Analyzer.C: In function ‘int main(int, char**)’:
+Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’
+/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool)
+/home/fran/local/include/hmm_tagger.h:84: note:                 hmm_tagger::hmm_tagger(const hmm_tagger&)
+Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’
+/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool)
+/home/fran/local/include/relax_tagger.h:51: note:                 relax_tagger::relax_tagger(const relax_tagger&)
+Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’
+/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&)
+/home/fran/local/include/senses.h:45: note:                 senses::senses(const senses&)
+</pre>
+Make the following changes in the file <code>src/Analyzer.C</code>:
+<pre>
+   if (cfg.TAGGER_which == HMM)
+-    tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
++    tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false);
+   else if (cfg.TAGGER_which == RELAX)
+-    tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter,
++    tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter,
+ 			      cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon,
+-			      cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
++			      false);
+   if (cfg.NEC_NEClassification)
+     neclass = new nec("NP", cfg.NEC_FilePrefix);
+   if (cfg.SENSE_SenseAnnotation!=NONE)
+-    sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis);
++    sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis);
+</pre>
+After that, there will probably still be problems with actually running Matxin.
+If you get the error:
+<pre>
+config.h:33:29: error: freeling/traces.h: No such file or directory
+</pre>
+Then change the include in <code>src/config.h</code> to:
+<pre>
+//#include "freeling/traces.h"
+#include "traces.h"
+</pre>
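Both header fixes above come down to where the FreeLing headers actually ended up on your system. A minimal sketch for checking that before editing the include lines follows. The function name <code>find_header</code> is made up for this example and the directories you pass are your own install prefixes.

```shell
# find_header NAME DIR...
# Prints the path of every regular file called NAME below the given
# directories. Unreadable or missing directories are silently skipped.
find_header() {
    name=$1
    shift
    find "$@" -name "$name" -type f 2>/dev/null
}

# Usage (paths are examples only):
#   find_header traces.h /usr/local/include "$HOME/local/include"
```

If the header is printed as <code>.../include/traces.h</code> rather than <code>.../include/freeling/traces.h</code>, the plain <code>#include "traces.h"</code> form is the one that matches your installation.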
+If you get this error:
+<pre>
+$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
+Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found.
+Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found.
+Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found.
+</pre>
+You can switch the tagger from RelaxCG to HMM. Edit the file <code><prefix>/share/matxin/config/es-eu.cfg</code> and change the tagger options to:
+<pre>
+#### Tagger options
+#Tagger=relax
+Tagger=hmm
+</pre>
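The same change can be scripted. This is only a sketch: <code>set_tagger</code> is a name invented here, and it assumes the config file contains exactly one active line starting with <code>Tagger=</code>. It rewrites that line in place instead of commenting it out, and keeps a <code>.bak</code> copy of the original file.

```shell
# set_tagger CONFIG_FILE NAME
# Rewrites the active "Tagger=..." line of CONFIG_FILE to "Tagger=NAME".
# The untouched original is kept as CONFIG_FILE.bak.
# NAME must not contain "/" or "&", since it is pasted into a sed expression.
set_tagger() {
    cfg=$1
    name=$2
    cp "$cfg" "$cfg.bak" || return 1
    # Lines starting with "#Tagger=" do not match and stay as they are.
    sed "s/^Tagger=.*/Tagger=$name/" "$cfg.bak" > "$cfg"
}

# Usage (replace <prefix> with your install prefix):
#   set_tagger <prefix>/share/matxin/config/es-eu.cfg hmm
```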
+Then there might be a problem in the dependency grammar:
+<pre>
+$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
+DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto
+</pre>
+The easiest thing to do here is to remove the lines that reference the functions it complains about:
+<pre>
+grep -v 'd:grup-sp.lemma' <prefix>/share/matxin/freeling/es/dep/dependences.dat > newdep
+grep -v 'd\.class' newdep > newdep2
+grep -v 'd:sn.tonto' newdep2 > <prefix>/share/matxin/freeling/es/dep/dependences.dat
+</pre>
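If more unregistered functions turn up, the filtering can be wrapped in a small helper. This is a sketch under some assumptions. The name <code>strip_dep_rules</code> is invented here. It matches the patterns as literal strings (<code>grep -F</code>), so a dot only matches a dot, which is assumed to be what is wanted for names like <code>d.class</code>. It also saves the original file with an <code>.orig</code> suffix before touching it.

```shell
# strip_dep_rules FILE PATTERN...
# Removes every line of FILE that contains any PATTERN as a literal string.
# A copy of the unmodified file is kept as FILE.orig.
strip_dep_rules() {
    file=$1
    shift
    cp "$file" "$file.orig" || return 1
    for pat in "$@"; do
        # -F: fixed-string match. grep exits with 1 when it prints no
        # lines at all, which is not an error here; 2 or more is.
        grep -v -F -e "$pat" "$file" > "$file.tmp" || [ $? -eq 1 ] || return 1
        mv "$file.tmp" "$file"
    done
}

# Usage (replace <prefix> with your install prefix):
#   strip_dep_rules <prefix>/share/matxin/freeling/es/dep/dependences.dat \
#       'd:grup-sp.lemma' 'd.class' 'd:sn.tonto'
```

Run the Analyzer again afterwards. If it reports another unregistered function, add that name to the list and repeat, starting again from the <code>.orig</code> copy if needed.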
+===Error in db===
+If you get:
+*SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db
+rebuild senses16.db from source:
+*cat senses16.src | indexdict senses16.db
+* (remove the old senses16.db before rebuilding)
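The two steps can be combined so that the stale database is always removed first. A minimal sketch follows. The function name <code>rebuild_senses_db</code> is made up, and it assumes that <code>senses16.src</code> sits in the same directory as the database and that <code>indexdict</code> is on your PATH.

```shell
# rebuild_senses_db DIR
# Deletes DIR/senses16.db and rebuilds it from DIR/senses16.src.
rebuild_senses_db() {
    dir=$1
    if [ ! -r "$dir/senses16.src" ]; then
        echo "cannot read $dir/senses16.src" >&2
        return 1
    fi
    rm -f "$dir/senses16.db"
    # Same call as above: the source is fed on stdin, the database
    # file name is the only argument.
    indexdict "$dir/senses16.db" < "$dir/senses16.src"
}

# Usage (directory taken from the error message above; adjust to your prefix):
#   rebuild_senses_db /usr/local/share/matxin/freeling/es
```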
+===Error when reading xml files===
+If reading XML files does not work and you get an error like:
+<i>ERROR: invalid document: found <corpus i> when <corpus> was expected...</i>,
+make the following changes in src/XML_reader.cc:
+. Add the following subroutine after line 43:
+<pre>
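+wstring
+mystows(string const &str)
+{
+   // Convert a multibyte string to a wide string using the current locale.
+   wchar_t* result = new wchar_t[str.size()+1];
+   size_t retval = mbstowcs(result, str.c_str(), str.size());
+   if (retval == (size_t)-1)
+     retval = 0;  // invalid multibyte sequence: return an empty string
+   result[retval] = L'\0';
+   wstring result2 = result;
+   delete[] result;
+   return result2;
+}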
+</pre>
+. Replace all occurrences of
+<pre>
+XMLParseUtil::stows
+</pre>
+with
+<pre>
+mystows
+</pre>
+Version 3.1.1 of lttoolbox does not have this error any more.
+==Results of the individual steps:==
+<pre>
+--------------------Step1
+en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f
+$MATXIN_DIR/share/matxin/config/es-eu.cfg
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus>
+<SENTENCE ord='1' alloc='0'>
+<CHUNK ord='2' alloc='5' type='grup-verb' si='top'>
+  <NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'>
+  </NODE>
+  <CHUNK ord='1' alloc='0' type='sn' si='subj'>
+    <NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'>
+    </NODE>
+  </CHUNK>
+  <CHUNK ord='3' alloc='8' type='sn' si='att'>
+    <NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'>
+      <NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'>
+      </NODE>
+    </NODE>
+  </CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+</pre>
+<pre>
+---------------------Step2
+[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f
+$MATXIN_DIR/share/matxin/config/es-eu.cfg
+<?xml version='1.0' encoding='UTF-8'?>
+<corpus >
+  <SENTENCE ref='1' alloc='0'>
+    <CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
+       <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0'  pos='[ADI][SIN]'>
+       </NODE>
+      <CHUNK ref='1' type='is' alloc='0' si='subj'>
+         <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+         </NODE>
+      </CHUNK>
+      <CHUNK ref='3' type='is' alloc='8' si='att'>
+         <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]'  mi='[NUMS]' sem='[BIZ-]'>
+           <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+           </NODE>
+         </NODE>
+      </CHUNK>
+    </CHUNK>
+  </SENTENCE>
+</corpus>
+</pre>
+<pre>
+----------- step3
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus >
+<SENTENCE ref='1' alloc='0'>
+<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
+<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
+</NODE>
+<CHUNK ref='1' type='is' alloc='0' si='subj'>
+<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ref='3' type='is' alloc='8' si='att'>
+<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP4
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus >
+<SENTENCE ref='1' alloc='0'>
+<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
+<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
+</NODE>
+<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
+<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
+<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP5
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus >
+<SENTENCE ref='1' alloc='0'>
+<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
+<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
+</NODE>
+<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
+<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
+<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP6
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus >
+<SENTENCE ref='1' alloc='0'>
+<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
+<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
+</NODE>
+<CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'>
+<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
+<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP7
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus >
+<SENTENCE ref='1' alloc='0'>
+<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
+<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
+</NODE>
+<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
+<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
+<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP8
+<?xml version='1.0' encoding='UTF-8'?>
+<corpus >
+<SENTENCE ord='1' ref='1' alloc='0'>
+<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
+<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
+</NODE>
+<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
+<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
+<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP9
+<?xml version='1.0' encoding='UTF-8' ?>
+<corpus >
+<SENTENCE ord='1' ref='1' alloc='0'>
+<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
+<NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
+</NODE>
+<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
+<NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
+<NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------- step10
+<?xml version='1.0' encoding='UTF-8'?>
+<corpus >
+<SENTENCE ord='1' ref='1' alloc='0'>
+<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
+<NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
+</NODE>
+<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
+<NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
+</NODE>
+</CHUNK>
+<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
+<NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
+<NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
+</NODE>
+</NODE>
+</CHUNK>
+</CHUNK>
+</SENTENCE>
+</corpus>
+-------------STEP11
+Hau proba bat da
+</pre>
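When comparing your own output with the listings above, it can help to pull out only the generated surface forms instead of reading the whole tree. A small sketch, assuming one <code>NODE</code> element per line with single-quoted attributes as in the listings. The name <code>node_forms</code> is invented for this example.

```shell
# node_forms FILE
# Prints the value of every NODE form='...' attribute, one per line,
# in the order in which the NODE elements appear in FILE.
node_forms() {
    sed -n "s/.*<NODE[^>]* form='\([^']*\)'.*/\1/p" "$1"
}

# Usage, with the output of step 10 saved to a file of your choosing:
#   node_forms step10.xml
```

For the step 10 listing above this prints da, hau, proba and bat, one per line. That is the order of the tree, not of the final sentence: the word order of step 11 is consistent with sorting by the <code>ord</code> attributes of the chunks and nodes.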