Talk:ReTraTos
Revision as of 22:26, 12 March 2008 by Francis Tyers
DESCRIPTION
The ReTraTos package consists of two programs that induce bilingual resources:
- ReTraTos.pl: induces rules from corpora
- ReTraTos_lex.pl: induces bilingual dictionaries from corpora
At the moment there is no engine (in this package) to perform translation based
on the induced resources.
INPUT FORMAT
Both inductors take two parallel texts as input. In these texts each sentence
must be enclosed in an initial (<s>) and a final (</s>) tag. The initial sentence
tag (<s>) has an attribute (snum) whose value is an identifier for the
sentence. Parallel sentences have the same identifier in the source and target files.
Example:
Source sentence
<s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>
Target sentence (translation of source sentence identified as 1)
<s snum=1>targettoken1 targettoken2 ... targettokenn</s>
The tokens in a sentence must be separated by white space, as shown above.
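The sentence markup described above can be read with a few lines of code. The following is only a minimal sketch in Python, not part of the ReTraTos package: the function names (read_sentences, pair_sentences) are hypothetical, and it assumes that every sentence is written as <s snum=ID>...</s> with an identifier that contains no white space.

```python
import re

# One sentence: <s snum=ID>token token ...</s>
# Tokens may contain "<" and ">" (e.g. house<n><pl>), so the body is matched
# lazily up to the first closing </s> tag.
SENTENCE_RE = re.compile(r"<s snum=(\S+?)>(.*?)</s>", re.DOTALL)


def read_sentences(text):
    """Return {snum: [token, ...]} for every sentence found in text."""
    sentences = {}
    for snum, body in SENTENCE_RE.findall(text):
        sentences[snum] = body.split()  # tokens are separated by white space
    return sentences


def pair_sentences(source_text, target_text):
    """Pair the source and target sentences that share the same snum."""
    src = read_sentences(source_text)
    tgt = read_sentences(target_text)
    return [(snum, src[snum], tgt[snum]) for snum in src if snum in tgt]


source = "<s snum=1>sourcetoken1 sourcetoken2</s>\n<s snum=2>sourcetoken3</s>"
target = "<s snum=1>targettoken1 targettoken2</s>\n<s snum=3>targettoken3</s>"
pairs = pair_sentences(source, target)
print(pairs)
```

Sentences whose identifier appears in only one of the two files are skipped by this sketch; whether the inductors do the same is not stated here.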
Each token can have at most 5 pieces of information:
1. sur: the surface form of a word or a special character, that is,
the token as it was found in the original sentences. For example: houses,
living and .
2. bas: the lemma of a word or a special character, a number, etc. when
it was tagged by the PoS tagger. For example: house, live and .
3. pos: the PoS of the lexical item according to the PoS tagger. Words unknown
to the tagger (not tagged) and many special characters do not have this
information. For example: n (noun), vblex (verb) or nothing.
4. atr: the value of each morphological attribute of a PoS tag. Each attribute
value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).
5. ali: a sequence of one or more numbers (separated by "_") referring to the
positions of aligned items in the parallel sentences. For example: 14, 3, 7_8, 0.
This information is derived from preprocessing the parallel texts with at
least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).
The tokens are formatted as shown below (sup stands for the surface form, sur):
1. \*sup/sup:ali
Unknown words. For example: *piquia/piquia:4
2. sup:ali
Special characters not tagged by the PoS tagger. For example: ":27
3. sup/C[\+C]*:ali
Other words and special characters tagged by the PoS tagger, in which
C = base<pos>A* and
A = [attribute]+
For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25
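The three token shapes above can be split into their pieces of information as follows. This is a hedged sketch in Python, not the parser used by ReTraTos: parse_token and the dictionary keys are hypothetical names, and the code assumes that the surface form contains no "/", that lemmas contain no "+" or "<", and that a leading "*" always marks an unknown word.

```python
import re

# One analysis chunk C = base<pos>A*, e.g. "can<vaux><pres>"
CHUNK_RE = re.compile(r"^([^<]*)((?:<[^>]+>)*)$")


def parse_token(token):
    """Split one input token into sur, bas/pos/atr chunks and ali."""
    # ali is whatever follows the last ":"; positions are joined by "_"
    rest, _, ali = token.rpartition(":")
    alignment = [int(n) for n in ali.split("_")]

    # Format 1: *sup/sup:ali (word unknown to the tagger)
    unknown = rest.startswith("*")
    if unknown:
        rest = rest[1:]

    # Format 2 (sup:ali) has no "/", so there is no analysis at all
    sur, slash, analysis = rest.partition("/")

    chunks = []
    if slash and not unknown:
        # Format 3: one or more chunks joined by "+"
        for part in analysis.split("+"):
            match = CHUNK_RE.match(part)
            if match is None:
                raise ValueError("malformed analysis: " + part)
            tags = re.findall(r"<([^>]+)>", match.group(2))
            chunks.append({
                "bas": match.group(1),
                "pos": tags[0] if tags else None,
                "atr": tags[1:],
            })
    return {"sur": sur, "unknown": unknown, "chunks": chunks, "ali": alignment}


for example in ["*piquia/piquia:4", '":27', "cannot/can<vaux><pres>+not<adv>:7_8"]:
    print(parse_token(example))
```

A token without a ":" raises ValueError in this sketch, because the alignment part cannot be converted to integers.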
Example of input parallel sentences:
Portuguese
<s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>
English
<s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><sg>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>
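To show how the ali field links the two sentences above, the sketch below lists, for each Portuguese token, the English surface forms it points to. It is an illustration in Python under stated assumptions, not code from the package: the helper names are hypothetical, positions are taken to be 1-based (as the example suggests), and any position outside the target sentence, such as 0, is treated as "no alignment".

```python
def surface_and_ali(token):
    """Return (surface form, [aligned positions]) for one token."""
    rest, _, ali = token.rpartition(":")
    if rest.startswith("*"):  # unknown word: *sup/sup:ali
        rest = rest[1:]
    sur = rest.split("/")[0]
    return sur, [int(n) for n in ali.split("_")]


def aligned_pairs(source_tokens, target_tokens):
    """For each source token, collect the target surface forms it is aligned to."""
    target_sur = [surface_and_ali(t)[0] for t in target_tokens]
    pairs = []
    for token in source_tokens:
        sur, positions = surface_and_ali(token)
        # Positions are assumed to be 1-based; out-of-range values are ignored
        linked = [target_sur[p - 1] for p in positions if 1 <= p <= len(target_sur)]
        pairs.append((sur, linked))
    return pairs


pt = ("Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 "
      "do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 "
      "antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 "
      "de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9").split()
en = ("The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 "
      "the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 "
      "school/school<n><sg>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8").split()

for sur, linked in aligned_pairs(pt, en):
    print(sur, "->", " ".join(linked))
```

Under these assumptions "do" is linked to "of the" and both "mais" and "antigo" are linked to "oldest", which matches the 4_5 alignment on the English side.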
OUTPUT FORMAT
* Bilingual dictionaries are in an XML format very similar to that used by the
Apertium open-source machine translation platform (http://apertium.sourceforge.net/)
* Transfer rules are in a human-readable format, and a new module is being
developed to convert them to Apertium's XML format
REQUIREMENTS
* ReTraTos needs Perl installed in the system.
QUICK START
1) Download the package retratos-VERSION.tar.gz
2) Unpack the package and run the following ('#' means 'run with root privileges'):
$ cd retratos-VERSION
$ ./configure
$ make
# make install
3) Use the dictionary inductor (ReTraTos_lex.pl)
USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu]
-sourcefile|s sourcefile file with examples in source language (required)
-targetfile|t targetfile file with examples in target language (required)
-beginning|b headerfile file with the beginning of a bilingual dictionary (required)
-ending|e footerfile file with the ending of a bilingual dictionary (required)
-attrsfile|a attfile file with information about attributes (optional)
-multifreq|f freqmwu frequency threshold to filter multiword units (default=1)
Sample:
$ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50
4) Use the rule inductor (ReTraTos.pl)
USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v]
-sourcefile|s sourcefile file with examples in source language (required)
-targetfile|t targetfile file with examples in target language (required)
-type|ty type alignment type: 0, 1, 2 or 3 (all) (default=3)
-level|l level rules' abstraction level(s) (default=pos)
-include_gra|ig inpos PoS for which to induce rules (default=all)
-exclude_gra|eg outpos PoS for which not to induce rules (default=none)
-per_ident|pi percident % for frequency threshold on pattern identification (default=0.0015)
-filter|fi determines if filter will be applied (default=no)
-per_filter|pf percfilt % for frequency threshold on rule filtering (default=0.0015)
-sort|so determines if sorting will be done (default=no)
-remove|r remove auxiliary files
-verbose|v verbose
Sample:
$ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so
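When the dictionary inductor from step 3 is called from a script, the command line can be assembled from its documented options. The sketch below is a hypothetical Python helper, not part of the package; it uses only the short options listed in the ReTraTos_lex.pl usage text (-s, -t, -b, -e, -a, -f) and assumes that perl and ReTraTos_lex.pl are reachable from the current directory.

```python
import shlex


def build_lex_command(source, target, header, footer, attfile=None, freqmwu=None):
    """Build the argument list for the dictionary inductor (hypothetical helper)."""
    cmd = ["perl", "ReTraTos_lex.pl",
           "-s", source, "-t", target,
           "-b", header, "-e", footer]
    if attfile is not None:   # optional attributes file
        cmd += ["-a", attfile]
    if freqmwu is not None:   # optional multiword-unit frequency threshold
        cmd += ["-f", str(freqmwu)]
    return cmd


command = build_lex_command("test/pt.txt", "test/en.txt",
                            "test/dic_header.txt", "test/dic_footer.txt",
                            freqmwu=50)
print(shlex.join(command))  # shlex.join requires Python 3.8 or later
```

The resulting list can be passed to subprocess.run(command, check=True) to run the inductor; the sketch only prints the command, so it works without the package installed.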