Difference between revisions of "Talk:Matxin"
==Old instructions (before 2016)==
==Prerequisites== |
|||
===Debian/Ubuntu===
|||
Install freeling-3.1 from the tarball; prerequisites include |
|||
<pre> |
|||
sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \ |
|||
libboost-program-options-dev libboost-thread-dev |
|||
</pre> |
|||
Add -lboost_system to the dicc2phon_LDADD line in src/utilities/Makefile.am, so that it looks like:
|||
<pre>dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system</pre> |
|||
Then run:
<pre>
autoreconf -fi
./configure --prefix=$HOME/PREFIX/freeling
make
make install
</pre>
|||
Add the [[Debian|nightly repo]] and do |
|||
<pre> |
|||
sudo apt-get install apertium-all-dev foma-bin libfoma0-dev |
|||
</pre> |
|||
Then build Matxin against that FreeLing prefix:
<pre>
git clone https://github.com/matxin/matxin
cd matxin
export PATH="${PATH}:$HOME/PREFIX/freeling/bin"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig"
export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal"
autoreconf -fi
./configure --prefix=$HOME/PREFIX/matxin
make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
</pre>
|||
: having to send CPPFLAGS/LDFLAGS to make here seems like an autotools bug? |
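One thing to watch with the export lines above: when a variable such as LD_LIBRARY_PATH starts out empty, the <code>"${VAR}:dir"</code> form leaves a leading colon, i.e. an empty entry, which PATH-style lookups commonly treat as the current directory. Here is a minimal sketch of a helper that avoids this. Note that <code>append_path</code> and <code>FREELING_PREFIX</code> are hypothetical names used for illustration, not something shipped with Matxin or FreeLing.

```shell
#!/bin/sh
# append_path VAR DIR: append DIR to the colon-separated list stored in VAR.
# Hypothetical helper for illustration; not part of Matxin or FreeLing.
append_path() {
    eval "current=\${$1:-}"          # read the current value of the variable named by $1
    if [ -z "$current" ]; then
        eval "$1=\$2"                # empty or unset: no leading colon
    else
        eval "$1=\$current:\$2"      # otherwise append after a colon
    fi
    export "$1"
}

FREELING_PREFIX="$HOME/PREFIX/freeling"   # same prefix as in the instructions above
append_path PATH "$FREELING_PREFIX/bin"
append_path LD_LIBRARY_PATH "$FREELING_PREFIX/lib"
append_path PKG_CONFIG_PATH "$FREELING_PREFIX/share/pkgconfig"
append_path PKG_CONFIG_PATH "$FREELING_PREFIX/lib/pkgconfig"
append_path ACLOCAL_PATH "$FREELING_PREFIX/share/aclocal"
```

The helper only changes how the lists are assembled; the directories are the ones already used above.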
|||
===Old prerequisites===
* Berkeley DB: <code>sudo apt-get install libdb4.6++-dev</code> (or <code>libdb4.8++-dev</code>)
* libpcre3: <code>sudo apt-get install libpcre3-dev</code>

Install the following libraries in <prefix>:
* libcfg+: http://platon.sk/upload/_projects/00003/libcfg+-0.6.2.tar.gz
* libomlet (from SVN): <code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/omlet</code>
* libfries (from SVN): <code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/fries</code>
* FreeLing (from SVN): <code>svn co http://devel.cpl.upc.edu/freeling/svn/latest/freeling</code>
:If you're installing into a prefix, set two environment variables when running configure: <code>CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix></code>
* [[lttoolbox]] (from SVN): <code>svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox</code>. Use version 3.1.1 or later; versions 3.1.0 and earlier cause data errors and error messages in Matxin due to a missing string terminator.
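To check an installed lttoolbox against the 3.1.1 minimum noted above, something like the following sketch could be used. <code>version_at_least</code> is a hypothetical helper, and it assumes a <code>sort</code> that supports <code>-V</code> (GNU coreutils does). How you obtain the installed version string is left open here; the loop just feeds in sample values.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed when HAVE >= WANT in version order.
# Assumes a sort(1) that supports -V (GNU coreutils does).
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]             # WANT sorts first (or ties) when HAVE is new enough
}

# Sample version strings; replace with the version you actually have installed.
for v in 3.1.0 3.1.1 3.3.0; do
    if version_at_least "$v" 3.1.1; then
        echo "lttoolbox $v: new enough"
    else
        echo "lttoolbox $v: too old"
    fi
done
```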
|||
==Building== |
|||
;Checkout |
|||
<pre> |
|||
$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin |
|||
</pre> |
|||
Then do the usual: |
|||
<pre> |
|||
$ ./configure --prefix=<prefix> |
|||
$ make |
|||
</pre> |
|||
After you've got it built, do: |
|||
<pre> |
|||
$ su |
|||
# export LD_LIBRARY_PATH=/usr/local/lib |
|||
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig |
|||
# make install |
|||
</pre> |
|||
===Mac OS X=== |
|||
If you've installed boost etc. with Macports, for the configure step do: |
|||
env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure |
|||
(their configure script doesn't complain if it can't find db46 or cfg+.h, but make does) |
|||
Also, comment out any references to {txt,html,rtf}-deformat.cc in src/Makefile.am and change data/Makefile.am so that you use gzcat instead of zcat. |
|||
== Executing == |
|||
If you have not specified a prefix, the default for <code>MATXIN_DIR</code> is <code>/usr/local/bin</code>, so <code>cd /usr/local/bin</code> to run the tests.
|||
Bundled with Matxin there's a script called <code>Matxin_translator</code> which calls all the necessary modules and interconnects them using UNIX pipes. This is the recommended way of running Matxin for getting translations. |
|||
<pre> |
|||
$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg |
|||
</pre> |
|||
txt-deformat is a plain-text format processor: it creates an XML file from a plain-text input file, and data should be passed through it before being piped to ./Analyzer. Its calling sequence is: txt-deformat format-file input-file. Calling it with -h or --help displays help information.

For example, to show how the word "gener" is analysed, you could write:
|||
echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg |
|||
For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules. |
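A rough sketch of running the stages one at a time while keeping every intermediate file is shown below. <code>run_stages</code> is a hypothetical helper, not part of Matxin, and the demo uses stand-in filters (<code>tr</code>, <code>sed</code>) instead of the real modules so that it runs without a Matxin install.

```shell
#!/bin/sh
# run_stages INPUT_FILE OUT_DIR CMD...
# Runs each CMD as a stdin-to-stdout filter, keeping every intermediate
# result in OUT_DIR/step1.out, OUT_DIR/step2.out, ...
run_stages() {
    stage_input=$1
    stage_dir=$2
    shift 2
    stage_no=0
    for stage_cmd in "$@"; do
        stage_no=$((stage_no + 1))
        stage_output="$stage_dir/step$stage_no.out"
        # Stop at the first failing stage so its input file can be inspected.
        sh -c "$stage_cmd" < "$stage_input" > "$stage_output" || return 1
        stage_input=$stage_output
    done
    cat "$stage_input"               # final result on stdout
}

# Demo with stand-in filters in place of the Matxin modules.
workdir=$(mktemp -d)
printf 'Esto es una prueba\n' > "$workdir/input.txt"
run_stages "$workdir/input.txt" "$workdir" "tr a-z A-Z" "sed 's/ /_/g'"
```

With Matxin itself, each stage string would presumably be one module invocation, such as <code>./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg</code>.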
|||
=== Spanish-Basque === |
|||
<prefix> is typically /usr/local |
|||
<pre> |
|||
$ export MATXIN_DIR=<prefix> |
|||
$ echo "Esto es una prueba" | \ |
|||
./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./ST_verb -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ |
|||
./reFormat |
|||
Da proba bat hau |
|||
</pre> |
|||
=== English-Basque === |
|||
The same pipeline for English-Basque looks like this:
|||
<pre> |
|||
$ cat src/matxinallen.sh |
|||
src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/ST_verb -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ |
|||
src/reFormat |
|||
$ echo "This is a test" | sh src/matxin_allen.sh |
|||
Hau proba da |
|||
$ echo "How are you?" | sh src/matxin_allen.sh |
|||
Nola zu da? |
|||
$ echo "Otto plays football and tennis" | sh src/matxin_allen.sh |
|||
Otto-ak jokatzen du futbola tenis-a eta |
|||
</pre> |
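Since the Spanish-Basque and English-Basque pipelines differ only in the config file, the command line could be generated from the language pair. This is only a sketch: <code>print_pipeline</code> is a hypothetical helper, the stage list is copied from the examples above, and it prints the command rather than running it.

```shell
#!/bin/sh
# print_pipeline PAIR: print the module pipeline for a language pair such as es-eu.
# Hypothetical helper; the stage order is copied from the examples above.
print_pipeline() {
    # Keep $MATXIN_DIR unexpanded so the printed command stays portable.
    cfg="\$MATXIN_DIR/share/matxin/config/$1.cfg"
    for stage in Analyzer LT ST_intra "ST_inter --inter 1" ST_prep \
                 "ST_inter --inter 2" ST_verb "ST_inter --inter 3" \
                 SG_inter SG_intra MG; do
        printf './%s -f %s | ' "$stage" "$cfg"
    done
    printf './reFormat\n'
}

print_pipeline en-eu
```

The printed line can be pasted after an <code>echo "..." |</code> in the directory that holds the modules.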
|||
==Speed== |
|||
Between 25 and 30 words per second.
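As a back-of-the-envelope illustration of what that rate means, the sketch below estimates the translation time for a text of a given length. The 10000-word figure is just an example, and <code>estimate_seconds</code> is a hypothetical helper.

```shell
#!/bin/sh
# estimate_seconds WORDS WORDS_PER_SECOND: translation time in whole seconds, rounded up.
estimate_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

estimate_seconds 10000 25   # slower end of the quoted range
estimate_seconds 10000 30   # faster end of the quoted range
```

So a 10000-word text would take roughly five and a half to seven minutes at the quoted speed.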
|||
==Troubleshooting== |
|||
===libdb=== |
|||
<pre> |
|||
g++ -g -O2 -ansi -march=i686 -O3 -fno-pic |
|||
-fomit-frame-pointer -L/usr/local/lib -L/usr/lib |
|||
-o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet |
|||
-lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre |
|||
/usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**) |
|||
[and a lot of similar lines] |
|||
</pre> |
|||
Try installing libdb4.8++-dev[http://sourceforge.net/mailarchive/forum.php?thread_name=1313552553.4706.7316.camel%40eki.dlsi.ua.es&forum_name=matxin-devel] |
|||
===libcfg+=== |
|||
If you get the following error: |
|||
<pre> |
|||
ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC |
|||
</pre> |
|||
Delete the directory and start from scratch, this time calling make as <code>make CFLAGS=-fPIC</code>
|||
===Various errors=== |
|||
If you get the error: |
|||
<pre> |
|||
g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2 -g -O2 -ansi -march=i686 -O3 |
|||
-fno-pic -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C |
|||
--->Analyzer.C:10:22: error: freeling.h: Datei oder Verzeichnis nicht gefunden |
|||
In file included from Analyzer.C:9: |
|||
config.h: In constructor 'config::config(char**)': |
|||
config.h:413: warning: deprecated conversion from string constant to 'char*' |
|||
Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)': |
|||
Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined |
|||
Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s... |
|||
</pre> |
|||
Then change the header files in <code>src/Analyzer.C</code> to: |
|||
<pre> |
|||
//#include "freeling.h" |
|||
#include "util.h" |
|||
#include "tokenizer.h" |
|||
#include "splitter.h" |
|||
#include "maco.h" |
|||
#include "nec.h" |
|||
#include "senses.h" |
|||
#include "tagger.h" |
|||
#include "hmm_tagger.h" |
|||
#include "relax_tagger.h" |
|||
#include "chart_parser.h" |
|||
#include "maco_options.h" |
|||
#include "dependencies.h" |
|||
</pre> |
|||
Upon finding yourself battling the following compile problem, |
|||
<pre> |
|||
Analyzer.C: In function ‘int main(int, char**)’: |
|||
Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’ |
|||
/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool) |
|||
/home/fran/local/include/hmm_tagger.h:84: note: hmm_tagger::hmm_tagger(const hmm_tagger&) |
|||
Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’ |
|||
/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool) |
|||
/home/fran/local/include/relax_tagger.h:51: note: relax_tagger::relax_tagger(const relax_tagger&) |
|||
Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’ |
|||
/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&) |
|||
/home/fran/local/include/senses.h:45: note: senses::senses(const senses&) |
|||
</pre> |
|||
Make the following changes in the file <code>src/Analyzer.C</code>: |
|||
<pre> |
|||
if (cfg.TAGGER_which == HMM) |
|||
- tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect); |
|||
+ tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false); |
|||
else if (cfg.TAGGER_which == RELAX) |
|||
- tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter, |
|||
+ tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter, |
|||
cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon, |
|||
- cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect); |
|||
+ false); |
|||
if (cfg.NEC_NEClassification) |
|||
neclass = new nec("NP", cfg.NEC_FilePrefix); |
|||
if (cfg.SENSE_SenseAnnotation!=NONE) |
|||
- sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis); |
|||
+ sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis); |
|||
</pre> |
|||
Then probably there will be issues with actually running Matxin. |
|||
If you get the error: |
|||
<pre> |
|||
config.h:33:29: error: freeling/traces.h: No such file or directory |
|||
</pre> |
|||
Then change the header files in <code>src/config.h</code> to: |
|||
<pre> |
|||
//#include "freeling/traces.h" |
|||
#include "traces.h" |
|||
</pre> |
|||
If you get this error: |
|||
<pre> |
|||
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg |
|||
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found. |
|||
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found. |
|||
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found. |
|||
</pre> |
|||
You can change the tagger from RelaxCG to HMM: edit the file <code><prefix>/share/matxin/config/es-eu.cfg</code> and change the tagger options to:
|||
<pre> |
|||
#### Tagger options |
|||
#Tagger=relax |
|||
Tagger=hmm |
|||
</pre> |
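The same edit could be scripted roughly as follows. This is a sketch that assumes the config contains a line reading exactly <code>Tagger=relax</code>; <code>switch_tagger_to_hmm</code> is a hypothetical helper, and it prints the result instead of editing the file in place.

```shell
#!/bin/sh
# switch_tagger_to_hmm CFG: print CFG with an active "Tagger=relax" line
# commented out and "Tagger=hmm" added right after it.
switch_tagger_to_hmm() {
    sed -e 's/^Tagger=relax$/#Tagger=relax\
Tagger=hmm/' "$1"
}

# Demo on a throwaway file rather than the real es-eu.cfg.
demo_cfg=$(mktemp)
printf '#### Tagger options\nTagger=relax\n' > "$demo_cfg"
switch_tagger_to_hmm "$demo_cfg"
```

Redirect the output to a new file and compare it with the original before replacing anything.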
|||
Then there might be a problem in the dependency grammar: |
|||
<pre> |
|||
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg |
|||
DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto |
|||
</pre> |
|||
The easiest thing to do here is to just remove references to the stuff it complains about: |
|||
<pre> |
|||
cat <prefix>/share/matxin/freeling/es/dep/dependences.dat | grep -v d:grup-sp.lemma > newdep |
|||
cat newdep | grep -v d\.class > newdep2 |
|||
cat newdep2 | grep -v d:sn.tonto > <prefix>/share/matxin/freeling/es/dep/dependences.dat |
|||
</pre> |
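The three filtering steps above could also be done in one pass. The sketch below assumes the three names should be matched literally (hence <code>-F</code>) and writes to a new file so the original stays available; <code>drop_unregistered</code> and the sample rule lines are made up for illustration.

```shell
#!/bin/sh
# drop_unregistered IN OUT: copy IN to OUT without lines mentioning the
# three function names filtered out above. -F matches the names literally.
drop_unregistered() {
    grep -v -F -e 'd:grup-sp.lemma' -e 'd.class' -e 'd:sn.tonto' "$1" > "$2"
}

# Demo on a throwaway file; these rule lines are invented, not real FreeLing rules.
demo_dir=$(mktemp -d)
printf 'rule one d:sn.tonto\nrule two keep-me\nrule three d.class\nrule four d:grup-sp.lemma\n' \
    > "$demo_dir/dependences.dat"
drop_unregistered "$demo_dir/dependences.dat" "$demo_dir/dependences.new"
cat "$demo_dir/dependences.new"
```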
|||
===Error in db=== |
|||
If you get: |
|||
*SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db |
|||
rebuild senses16.db from source (remove the existing senses16.db before rebuilding):

*cat senses16.src | indexdict senses16.db
|||
===Error when reading xml files=== |
|||
If reading XML files does not work and you get an error like:

<i>ERROR: invalid document: found <corpus i> when <corpus> was expected...</i>,

make the following changes in src/XML_reader.cc:

1. Add the following subroutine after line 43:
|||
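<pre>
wstring
mystows(string const &str)
{
  // Worst case: one wide character per input byte, plus the terminator.
  wchar_t* result = new wchar_t[str.size()+1];
  size_t retval = mbstowcs(result, str.c_str(), str.size());
  if (retval == (size_t) -1)
  {
    // Invalid multibyte sequence: return an empty string rather than
    // writing the terminator out of bounds.
    retval = 0;
  }
  result[retval] = L'\0';
  wstring result2 = result;
  delete[] result;
  return result2;
}
</pre>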
|||
2. Replace all occurrences of
|||
<pre> |
|||
XMLParseUtil::stows |
|||
</pre> |
|||
with |
|||
<pre> |
|||
mystows |
|||
</pre> |
|||
Version 3.1.1 of lttoolbox does not have this error any more. |
|||
==Results of the individual steps==
|||
<pre> |
|||
--------------------Step1 |
|||
en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f |
|||
$MATXIN_DIR/share/matxin/config/es-eu.cfg |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus> |
|||
<SENTENCE ord='1' alloc='0'> |
|||
<CHUNK ord='2' alloc='5' type='grup-verb' si='top'> |
|||
<NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'> |
|||
</NODE> |
|||
<CHUNK ord='1' alloc='0' type='sn' si='subj'> |
|||
<NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ord='3' alloc='8' type='sn' si='att'> |
|||
<NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'> |
|||
<NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
</pre> |
|||
<pre> |
|||
---------------------Step2 |
|||
[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f |
|||
$MATXIN_DIR/share/matxin/config/es-eu.cfg |
|||
<?xml version='1.0' encoding='UTF-8'?> |
|||
<corpus > |
|||
<SENTENCE ref='1' alloc='0'> |
|||
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'> |
|||
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'> |
|||
</NODE> |
|||
<CHUNK ref='1' type='is' alloc='0' si='subj'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ref='3' type='is' alloc='8' si='att'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
</pre> |
|||
<pre> |
|||
----------- step3 |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus > |
|||
<SENTENCE ref='1' alloc='0'> |
|||
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'> |
|||
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'> |
|||
</NODE> |
|||
<CHUNK ref='1' type='is' alloc='0' si='subj'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ref='3' type='is' alloc='8' si='att'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP4 |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus > |
|||
<SENTENCE ref='1' alloc='0'> |
|||
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'> |
|||
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'> |
|||
</NODE> |
|||
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP5 |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus > |
|||
<SENTENCE ref='1' alloc='0'> |
|||
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'> |
|||
<NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'> |
|||
</NODE> |
|||
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP6 |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus > |
|||
<SENTENCE ref='1' alloc='0'> |
|||
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'> |
|||
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'> |
|||
</NODE> |
|||
<CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP7 |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus > |
|||
<SENTENCE ref='1' alloc='0'> |
|||
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'> |
|||
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'> |
|||
</NODE> |
|||
<CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP8 |
|||
<?xml version='1.0' encoding='UTF-8'?> |
|||
<corpus > |
|||
<SENTENCE ord='1' ref='1' alloc='0'> |
|||
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'> |
|||
<NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'> |
|||
</NODE> |
|||
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'> |
|||
<NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'> |
|||
<NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP9 |
|||
<?xml version='1.0' encoding='UTF-8' ?> |
|||
<corpus > |
|||
<SENTENCE ord='1' ref='1' alloc='0'> |
|||
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'> |
|||
<NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'> |
|||
</NODE> |
|||
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'> |
|||
<NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'> |
|||
<NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------- step10 |
|||
<?xml version='1.0' encoding='UTF-8'?> |
|||
<corpus > |
|||
<SENTENCE ord='1' ref='1' alloc='0'> |
|||
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'> |
|||
<NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'> |
|||
</NODE> |
|||
<CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'> |
|||
<NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'> |
|||
</NODE> |
|||
</CHUNK> |
|||
<CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'> |
|||
<NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'> |
|||
<NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'> |
|||
</NODE> |
|||
</NODE> |
|||
</CHUNK> |
|||
</CHUNK> |
|||
</SENTENCE> |
|||
</corpus> |
|||
-------------STEP11 |
|||
Hau proba bat da |
|||
</pre> |
Latest revision as of 12:19, 5 May 2016
Documentation Descripción del sistema de traducción es-eu Matxin page 30, 6.1: image is missing (Diagrama1.dia) Muki987 12:18, 9 April 2009 (UTC)
- Please could you email this to the developers of Matxin. I have emailed you their contact details. - Francis Tyers 12:39, 9 April 2009 (UTC)
==New instructions (2012)==
a) install Foma
<pre>
svn co http://devel.cpl.upc.edu/freeling/svn/trunk freeling
svn co http://matxin.svn.sourceforge.net/svnroot/matxin/trunk matxin
</pre>
In FreeLing, if you get an error like:
<pre>
g++ -DPACKAGE_NAME=\"FreeLing\" -DPACKAGE_TARNAME=\"freeling\" -DPACKAGE_VERSION=\"3.0\" -DPACKAGE_STRING=\"FreeLing\ 3.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE=\"freeling\" -DVERSION=\"3.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_BOOST_REGEX_HPP=1 -DHAVE_BOOST_REGEX_ICU_HPP=1 -DHAVE_BOOST_FILESYSTEM_HPP=1 -DHAVE_BOOST_PROGRAM_OPTIONS_HPP=1 -DHAVE_BOOST_THREAD_HPP=1 -DHAVE_STDBOOL_H=1 -DSTDC_HEADERS=1 -I. -I../../src/include -I../../src/libfreeling/corrector -O3 -Wall -MT threaded_analyzer.o -MD -MP -MF .deps/threaded_analyzer.Tpo -c -o threaded_analyzer.o `test -f 'sample_analyzer/threaded_analyzer.cc' || echo './'`sample_analyzer/threaded_analyzer.cc
In file included from /usr/include/boost/thread/thread_time.hpp:9,
                 from /usr/include/boost/thread/locks.hpp:11,
                 from /usr/include/boost/thread/pthread/mutex.hpp:11,
                 from /usr/include/boost/thread/mutex.hpp:16,
                 from /usr/include/boost/thread/pthread/thread.hpp:14,
                 from /usr/include/boost/thread/thread.hpp:17,
                 from /usr/include/boost/thread.hpp:12,
                 from sample_analyzer/threaded_processor.h:35,
                 from sample_analyzer/threaded_analyzer.cc:51:
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected identifier before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected `}' before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:68: error: expected unqualified-id before numeric constant
/usr/include/boost/date_time/microsec_time_clock.hpp:70: error: expected unqualified-id before ‘public’
/usr/include/boost/date_time/microsec_time_clock.hpp:79: error: ‘time_type’ does not name a type
/usr/include/boost/date_time/microsec_time_clock.hpp:84: error: expected unqualified-id before ‘private’
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
sample_analyzer/threaded_analyzer.cc:341: error: expected `}' at end of input
make[2]: *** [threaded_analyzer.o] Error 1
</pre>
Then edit freeling/src/main/Makefile.am and remove $(TH_AN) from bin_PROGRAMS.
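If you would rather script that edit, a sketch along these lines might work. It assumes <code>$(TH_AN)</code> sits on the same line as <code>bin_PROGRAMS</code>, which may not match the real Makefile.am; <code>drop_th_an</code> is a hypothetical helper and the sample file content is made up.

```shell
#!/bin/sh
# drop_th_an FILE: print FILE with $(TH_AN) removed from the bin_PROGRAMS line.
drop_th_an() {
    sed -e '/^bin_PROGRAMS/s/ *\$(TH_AN)//' "$1"
}

# Demo on a made-up Makefile.am fragment, not the real FreeLing file.
demo_am=$(mktemp)
printf 'bin_PROGRAMS = analyzer $(TH_AN)\nanalyzer_SOURCES = analyzer.cc\n' > "$demo_am"
drop_th_an "$demo_am"
```

Check the printed result against the original file before overwriting it.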
Old instructions (before 2016)[edit]
Prerequisites[edit]
Debian/buntu[edit]
Install freeling-3.1 from the tarball; prerequisites include
sudo apt-get install libboost-system-dev libicu-dev libboost-regex-dev \ libboost-program-options-dev libboost-thread-dev
Add -lboost_system to the dicc2phon_LDADD line in src/utilities/Makefile.am, should look like:
dicc2phon_LDADD = -lfreeling $(FREELING_DEPS) -lboost_system
Then
autoreconf -fi ./configure --prefix=$HOME/PREFIX/freeling make make install
Add the nightly repo and do
sudo apt-get install apertium-all-dev foma-bin libfoma0-dev
Then just
git clone https://github.com/matxin/matxin cd matxin export PATH="${PATH}:$HOME/PREFIX/freeling/bin export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HOME/PREFIX/freeling/lib" export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:$HOME/PREFIX/freeling/share/pkgconfig:$HOME/PREFIX/freeling/lib/pkgconfig" export ACLOCAL_PATH="${ACLOCAL_PATH}:$HOME/PREFIX/freeling/share/aclocal" autoreconf -fi ./configure --prefix=$HOME/PREFIX/matxin make CPPFLAGS="-I$HOME/PREFIX/freeling/include -I/usr/include -I/usr/include/lttoolbox-3.3 -I/usr/include/libxml2" LDFLAGS="-L$HOME/PREFIX/freeling/lib -L/usr/lib"
- having to send CPPFLAGS/LDFLAGS to make here seems like an autotools bug?
old prerequisites[edit]
- BerkleyDB — sudo apt-get install libdb4.6++-dev (or libdb4.8++-dev)
- libpcre3 — sudo apt-get install libpcre3-dev
Install the following libraries in <prefix>,
- libcfg+ — http://platon.sk/upload/_projects/00003/libcfg+-0.6.2.tar.gz
- libomlet (from SVN) — (
svn co http://devel.cpl.upc.edu/freeling/svn/latest/omlet
) - libfries (from SVN) — (
svn co http://devel.cpl.upc.edu/freeling/svn/latest/fries
) - FreeLing (from SVN) — (
svn co http://devel.cpl.upc.edu/freeling/svn/latest/freeling
)
- If you're installing into a prefix, you'll need to set two environment variables: CPPFLAGS=-I<prefix>/include LDFLAGS=-L<prefix>/lib ./configure --prefix=<prefix>
- lttoolbox (from SVN) — (
svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
) Take as a minimum version 3.1.1; 3.1.0 and lower versions cause data error and error messages in Matxin due to a missing string close.
Building[edit]
- Checkout
$ svn co http://matxin.svn.sourceforge.net/svnroot/matxin
Then do the usual:
$ ./configure --prefix=<prefix> $ make
After you've got it built, do:
$ su # export LD_LIBRARY_PATH=/usr/local/lib # export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig # make install
Mac OS X[edit]
If you've installed boost etc. with Macports, for the configure step do:
env LDFLAGS="-L/opt/local/lib -L/opt/local/lib/db46" CPPFLAGS="-I/opt/local/include -I/opt/local/include/db46 -I/path/to/freeling/libcfg+" ./configure
(their configure script doesn't complain if it can't find db46 or cfg+.h, but make does)
Also, comment out any references to {txt,html,rtf}-deformat.cc in src/Makefile.am and change data/Makefile.am so that you use gzcat instead of zcat.
Executing[edit]
The default for MATXIN_DIR
, if you have not specified a prefix is /usr/local/bin
, if you have not specified a prefix, then you should cd /usr/local/bin
to make the tests.
Bundled with Matxin there's a script called Matxin_translator
which calls all the necessary modules and interconnects them using UNIX pipes. This is the recommended way of running Matxin for getting translations.
$ echo "Esto es una prueba" | ./Matxin_translator -c $MATXIN_DIR/share/matxin/config/es-eu.cfg
There exists a program txt-deformat calling sequence: txt-deformat format-file input-file. txt-deformat creates an xml file from a normal txt input file. This can be used before ./Analyzer.
txt-deformat is a plain text format processor. Data should be passed through this processor before being piped to /Analyzer.
Calling it with -h or --help displays help information. You could write the following to show how the word "gener" is analysed:
echo "gener" | ./txt-deformat | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
For advanced uses, you can run each part of the pipe separately and save the output to temporary files for feeding the next modules.
Spanish-Basque[edit]
<prefix> is typically /usr/local
$ export MATXIN_DIR=<prefix> $ echo "Esto es una prueba" | \ ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./ST_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./ST_prep -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./ST_verb -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./SG_inter -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./SG_intra -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./MG -f $MATXIN_DIR/share/matxin/config/es-eu.cfg | \ ./reFormat Da proba bat hau
English-Basque[edit]
Using the above example for English-Basque looks:
$ cat src/matxinallen.sh src/Analyzer -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/LT -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/ST_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/ST_inter --inter 1 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/ST_prep -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/ST_inter --inter 2 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/ST_verb -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/ST_inter --inter 3 -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/SG_inter -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/SG_intra -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/MG -f $MATXIN_DIR/share/matxin/config/en-eu.cfg | \ src/reFormat $ echo "This is a test" | sh src/matxin_allen.sh Hau proba da $ echo "How are you?" | sh src/matxin_allen.sh Nola zu da? $ echo "Otto plays football and tennis" | sh src/matxin_allen.sh Otto-ak jokatzen du futbola tenis-a eta
Speed[edit]
Between 25--30 words per second.
Troubleshooting[edit]
libdb[edit]
g++ -g -O2 -ansi -march=i686 -O3 -fno-pic -fomit-frame-pointer -L/usr/local/lib -L/usr/lib -o Analyzer Analyzer.o IORedirectHandler.o -lmorfo -lcfg+ -ldb_cxx -lfries -lomlet -lboost_filesystem -L/usr/local/lib -llttoolbox3 -lxml2 -lpcre /usr/local/lib/libmorfo.so: undefined reference to `Db::set_partition_dirs(char const**) [and a lot of similar lines]
Try installing libdb4.8++-dev[1]
libcfg+
If you get the following error:
ld: ../src/cfg+.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
Delete the unpacked libcfg+ directory and start from scratch; this time, call make as make CFLAGS=-fPIC
Various errors
If you get the following error (the German message means "No such file or directory"):
<pre>
g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include -I/usr/local/include/lttoolbox-2.0 -I/usr/include/libxml2 -g -O2 -ansi -march=i686 -O3 -fno-pic -fomit-frame-pointer -MT Analyzer.o -MD -MP -MF .deps/Analyzer.Tpo -c -o Analyzer.o Analyzer.C
--->Analyzer.C:10:22: error: freeling.h: Datei oder Verzeichnis nicht gefunden
In file included from Analyzer.C:9:
config.h: In constructor 'config::config(char**)':
config.h:413: warning: deprecated conversion from string constant to 'char*'
Analyzer.C: In function 'void PrintResults(std::list<sentence, std::allocator<sentence> >&, const config&, int&)':
Analyzer.C:123: error: aggregate 'std::ofstream log_file' has incomplete type and cannot be defined
Analyzer.C:126: error: incomplete type 'std::ofstream' used in nested name s...
</pre>
Then change the header includes in src/Analyzer.C to:
<pre>
//#include "freeling.h"

#include "util.h"
#include "tokenizer.h"
#include "splitter.h"
#include "maco.h"
#include "nec.h"
#include "senses.h"
#include "tagger.h"
#include "hmm_tagger.h"
#include "relax_tagger.h"
#include "chart_parser.h"
#include "maco_options.h"
#include "dependencies.h"
</pre>
Upon finding yourself battling the following compile problem,
<pre>
Analyzer.C: In function ‘int main(int, char**)’:
Analyzer.C:226: error: no matching function for call to ‘hmm_tagger::hmm_tagger(std::string, char*&, int&, int&)’
/home/fran/local/include/hmm_tagger.h:108: note: candidates are: hmm_tagger::hmm_tagger(const std::string&, const std::string&, bool)
/home/fran/local/include/hmm_tagger.h:84: note:                 hmm_tagger::hmm_tagger(const hmm_tagger&)
Analyzer.C:230: error: no matching function for call to ‘relax_tagger::relax_tagger(char*&, int&, double&, double&, int&, int&)’
/home/fran/local/include/relax_tagger.h:74: note: candidates are: relax_tagger::relax_tagger(const std::string&, int, double, double, bool)
/home/fran/local/include/relax_tagger.h:51: note:                 relax_tagger::relax_tagger(const relax_tagger&)
Analyzer.C:236: error: no matching function for call to ‘senses::senses(char*&, int&)’
/home/fran/local/include/senses.h:52: note: candidates are: senses::senses(const std::string&)
/home/fran/local/include/senses.h:45: note:                 senses::senses(const senses&)
</pre>
Make the following changes in the file src/Analyzer.C:
<pre>
   if (cfg.TAGGER_which == HMM)
-    tagger = new hmm_tagger(cfg.Lang, cfg.TAGGER_HMMFile, cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+    tagger = new hmm_tagger(string(cfg.Lang), string(cfg.TAGGER_HMMFile), false);
   else if (cfg.TAGGER_which == RELAX)
-    tagger = new relax_tagger(cfg.TAGGER_RelaxFile, cfg.TAGGER_RelaxMaxIter,
+    tagger = new relax_tagger(string(cfg.TAGGER_RelaxFile), cfg.TAGGER_RelaxMaxIter,
                               cfg.TAGGER_RelaxScaleFactor, cfg.TAGGER_RelaxEpsilon,
-                              cfg.TAGGER_Retokenize, cfg.TAGGER_ForceSelect);
+                              false);

   if (cfg.NEC_NEClassification)
     neclass = new nec("NP", cfg.NEC_FilePrefix);

   if (cfg.SENSE_SenseAnnotation!=NONE)
-    sens = new senses(cfg.SENSE_SenseFile, cfg.SENSE_DuplicateAnalysis);
+    sens = new senses(string(cfg.SENSE_SenseFile)); //, cfg.SENSE_DuplicateAnalysis);
</pre>
Then probably there will be issues with actually running Matxin.
If you get the error:
config.h:33:29: error: freeling/traces.h: No such file or directory
Then change the include in src/config.h to:
<pre>
//#include "freeling/traces.h"
#include "traces.h"
</pre>
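The same edit to src/config.h can be scripted. A minimal sketch, assuming the include sits at the start of a line exactly as quoted in the error message; fix_traces_include is a hypothetical helper name, and sed -i.bak leaves a backup copy next to the file.

```shell
#!/bin/sh
# fix_traces_include FILE: comment out the old include line and add
# '#include "traces.h"' right below it. A backup is kept as FILE.bak.
# (Hypothetical helper; run it on src/config.h.)
fix_traces_include() {
    # "&" re-inserts the matched include; backslash-newline starts a new line.
    sed -i.bak 's|^#include "freeling/traces.h"|//&\
#include "traces.h"|' "$1"
}

# Usage:
#   fix_traces_include src/config.h
```

Running it a second time changes nothing, because the commented-out line no longer starts with #include.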
If you get this error:
<pre>
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 2. Syntax error: Unexpected 'SETS' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 7. Syntax error: Unexpected 'DetFem' found.
Constraint Grammar '/home/fran/local//share/matxin/freeling/es/constr_gram.dat'. Line 10. Syntax error: Unexpected 'VerbPron' found.
</pre>
You can switch the tagger from RelaxCG to HMM: edit the file <prefix>/share/matxin/config/es-eu.cfg so that the tagger options read:
<pre>
#### Tagger options
#Tagger=relax
Tagger=hmm
</pre>
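The tagger switch can also be applied with sed. This is a sketch that assumes the configuration file contains a line reading exactly Tagger=relax (no trailing spaces); use_hmm_tagger is a hypothetical helper name.

```shell
#!/bin/sh
# use_hmm_tagger CFG: comment out "Tagger=relax" and add "Tagger=hmm"
# on the following line. A backup is kept as CFG.bak.
# (Hypothetical helper, not part of Matxin.)
use_hmm_tagger() {
    # "&" re-inserts the matched line; backslash-newline starts a new line.
    sed -i.bak 's|^Tagger=relax$|#&\
Tagger=hmm|' "$1"
}

# Usage (replace <prefix> with your install prefix):
#   use_hmm_tagger <prefix>/share/matxin/config/es-eu.cfg
```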
Then there might be a problem in the dependency grammar:
<pre>
$ echo "Esto es una prueba" | ./Analyzer -f /home/fran/local/share/matxin/config/es-eu.cfg
DEPENDENCIES: Error reading dependencies from '/home/fran/local//share/matxin/freeling/es/dep/dependences.dat'. Unregistered function d:sn.tonto
</pre>
The easiest thing to do here is to just remove references to the stuff it complains about:
<pre>
cat <prefix>/share/matxin/freeling/es/dep/dependences.dat | grep -v d:grup-sp.lemma > newdep
cat newdep | grep -v d\.class > newdep2
cat newdep2 | grep -v d:sn.tonto > <prefix>/share/matxin/freeling/es/dep/dependences.dat
</pre>
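The three filtering passes above can be collapsed into a single grep. A sketch under assumptions: strip_unregistered is a hypothetical helper, and it matches the three names as fixed strings (-F), which is slightly stricter than the regular-expression matching of the commands above.

```shell
#!/bin/sh
# strip_unregistered FILE: drop every line that mentions one of the
# functions the dependency parser complains about, then replace FILE.
# (Hypothetical helper, not part of Matxin.)
strip_unregistered() {
    grep -v -F -e 'd:grup-sp.lemma' -e 'd.class' -e 'd:sn.tonto' "$1" > "$1.new"
    status=$?
    # grep exits with 1 when no line survives; only 2 and above is a real error.
    if [ "$status" -le 1 ]; then
        mv "$1.new" "$1"
    else
        rm -f "$1.new"
        return "$status"
    fi
}

# Usage (replace <prefix> with your install prefix):
#   strip_unregistered <prefix>/share/matxin/freeling/es/dep/dependences.dat
```

Writing to a temporary file first avoids truncating the grammar while it is still being read.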
Error in db
If you get:
- SEMDB: Error 13 while opening database /usr/local/share/matxin/freeling/es/dep/../senses16.db
rebuild senses16.db from source:
- cat senses16.src | indexdict senses16.db
- (remove senses16.db before rebuild)
Error when reading xml files
If reading XML files fails with an error like ERROR: invalid document: found <corpus i> when <corpus> was expected..., make the following changes in src/XML_reader.cc:
1. Add the following subroutine after line 43:
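<pre>
// Convert a multibyte string to a wide string using the current locale.
wstring mystows(string const &str)
{
  wchar_t* result = new wchar_t[str.size()+1];
  size_t retval = mbstowcs(result, str.c_str(), str.size());
  if (retval == (size_t) -1)
    retval = 0;  // invalid multibyte sequence: return an empty string
  result[retval] = L'\0';
  wstring result2 = result;
  delete[] result;
  return result2;
}
</pre>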
2. Replace all occurrences of XMLParseUtil::stows with mystows.
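Step 2 can be done with one sed call. A minimal sketch; replace_stows is a hypothetical helper name, and sed -i.bak keeps the original file as a backup.

```shell
#!/bin/sh
# replace_stows FILE: replace every XMLParseUtil::stows with mystows.
# A backup is kept as FILE.bak. (Hypothetical helper.)
replace_stows() {
    sed -i.bak 's/XMLParseUtil::stows/mystows/g' "$1"
}

# Usage:
#   replace_stows src/XML_reader.cc
```

Add the mystows subroutine from step 1 before compiling, otherwise the renamed calls will not resolve.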
Version 3.1.1 of lttoolbox does not have this error any more.
Results of the individual steps:
Step 1:
<pre>
en@anonymous:/usr/local/bin$ echo "Esto es una prueba" | ./Analyzer -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
<CHUNK ord='2' alloc='5' type='grup-verb' si='top'>
  <NODE ord='2' alloc='5' form='es' lem='ser' mi='VSIP3S0'>
  </NODE>
  <CHUNK ord='1' alloc='0' type='sn' si='subj'>
    <NODE ord='1' alloc='0' form='Esto' lem='este' mi='PD0NS000'>
    </NODE>
  </CHUNK>
  <CHUNK ord='3' alloc='8' type='sn' si='att'>
    <NODE ord='4' alloc='12' form='prueba' lem='prueba' mi='NCFS000'>
      <NODE ord='3' alloc='8' form='una' lem='uno' mi='DI0FS0'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 2:
<pre>
[glabaka@siuc05 bin]$ cat /tmp/x | ./LT -f $MATXIN_DIR/share/matxin/config/es-eu.cfg
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
  <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
  </NODE>
  <CHUNK ref='1' type='is' alloc='0' si='subj'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ref='3' type='is' alloc='8' si='att'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 3:
<pre>
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top'>
  <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
  </NODE>
  <CHUNK ref='1' type='is' alloc='0' si='subj'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ref='3' type='is' alloc='8' si='att'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 4:
<pre>
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
  <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
  </NODE>
  <CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 5:
<pre>
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' length='1' trans='DU' cas='[ABS]'>
  <NODE ref='2' alloc='5' UpCase='none' lem='_izan_' mi='VSIP3S0' pos='[ADI][SIN]'>
  </NODE>
  <CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ref='3' type='is' alloc='8' si='att' length='2' cas='[ABS]'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 6:
<pre>
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
  <NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
  </NODE>
  <CHUNK ref='1' type='is' alloc='0' si='subj' cas='[ERG]' length='1'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 7:
<pre>
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
  <NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
  </NODE>
  <CHUNK ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 8:
<pre>
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
  <NODE ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
  </NODE>
  <CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
    <NODE ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
    <NODE ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 9:
<pre>
<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
  <NODE ord='0' ref='2' alloc='5' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
  </NODE>
  <CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
    <NODE ord='0' ref='1' alloc='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
    <NODE ord='0' ref='4' alloc='12' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE ord='1' ref='3' alloc='8' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 10:
<pre>
<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='5' si='top' cas='[ABS]' trans='DU' length='1'>
  <NODE form='da' ref ='2' alloc ='5' ord='0' lem='izan' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
  </NODE>
  <CHUNK ord='0' ref='1' type='is' alloc='0' si='subj' length='1' cas='[ERG]'>
    <NODE form='hau' ref ='1' alloc ='0' ord='0' UpCase='none' lem='hau' pos='[DET][ERKARR]'>
    </NODE>
  </CHUNK>
  <CHUNK ord='1' ref='3' type='is' alloc='8' si='att' cas='[ABS]' length='2'>
    <NODE form='proba' ref ='4' alloc ='12' ord='0' UpCase='none' lem='proba' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
      <NODE form='bat' ref ='3' alloc ='8' ord='1' UpCase='none' lem='bat' pos='[DET][DZH]' vpost='IZO'>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>
</pre>
Step 11:
<pre>
Hau proba bat da
</pre>
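When comparing intermediate outputs it can help to look only at the generated word forms. The sketch below pulls every form='...' attribute out of an XML dump such as the step 10 output; list_forms is a hypothetical helper, it relies on grep -o (available in GNU and BSD grep, not in POSIX), and it assumes forms are written as form='...' with single quotes and no spaces around the equals sign.

```shell
#!/bin/sh
# list_forms: read Matxin XML on stdin and print the value of every
# form='...' attribute, one per line, in document order.
# (Hypothetical helper, not part of Matxin.)
list_forms() {
    grep -o "form='[^']*'" | sed "s/^form='//; s/'\$//"
}

# Usage (step10.xml is a placeholder for a saved intermediate output):
#   list_forms < step10.xml
```

The forms come out in tree order, not in the final word order, since the ordering is carried by the ord attributes and applied in the last step.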