Talk:ReTraTos

Revision as of 22:26, 12 March 2008 by Francis Tyers
DESCRIPTION

The ReTraTos package consists of two programs that induce bilingual resources:
 - ReTraTos.pl: induces rules from corpora
 - ReTraTos_lex.pl: induces bilingual dictionaries from corpora

At the moment there is no engine (in this package) to perform translation based 
on the induced resources.

INPUT FORMAT

Both inductors take two parallel texts as input. In these texts, each sentence 
must be enclosed in an initial (<s>) and a final (</s>) tag. The initial tag 
(<s>) has an attribute (snum) whose value is an identifier for the sentence. 
Parallel sentences have the same identifier in the source and target files. 

 Example:
  
  Source sentence
  <s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>
  
  Target sentence (translation of source sentence identified as 1)
  <s snum=1>targettoken1 targettoken2 ... targettokenn</s>
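
The pairing of sentences by snum can be sketched in Python as follows. This is
only an illustration of the input format and is not part of ReTraTos; the names
read_sentences and pair_sentences are hypothetical, and the sketch assumes one
sentence per line with an unquoted snum value, as in the example above.

```python
import re

# Matches one sentence line of the form <s snum=ID>tokens</s>.
SENTENCE_RE = re.compile(r"<s snum=(\S+?)>(.*?)</s>")

def read_sentences(lines):
    """Map each sentence identifier (snum) to its whitespace-separated tokens."""
    sentences = {}
    for line in lines:
        match = SENTENCE_RE.search(line)
        if match:
            sentences[match.group(1)] = match.group(2).split()
    return sentences

def pair_sentences(source_lines, target_lines):
    """Pair source and target sentences that share the same snum."""
    source = read_sentences(source_lines)
    target = read_sentences(target_lines)
    return {snum: (source[snum], target[snum]) for snum in source if snum in target}

source = ["<s snum=1>sourcetoken1 sourcetoken2</s>"]
target = ["<s snum=1>targettoken1 targettoken2 targettoken3</s>"]
pairs = pair_sentences(source, target)
print(pairs["1"])
```

Sentences whose identifier appears in only one of the two files are dropped by
this sketch; whether the inductors do the same is not stated here.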

The tokens in each sentence have to be separated by white space, as shown above.
Each token can have at most 5 pieces of information:

        1. sur: the surface form of a word or a special character, that is, 
        the token as it was found in the original sentences. For example: houses, 
        living and .

        2. bas: the lemma of a word, special character, number, etc., if it 
        was tagged by the PoS tagger. For example: house, live and .

        3. pos: the PoS of the lexical item according to the PoS tagger. Words 
        unknown to the tagger (not tagged) and many special characters do not 
        have this information. For example: n (noun), vblex (verb) or nothing.

        4. atr: the value of each morphological attribute of a PoS tag. Each attribute 
        value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).

        5. ali: a sequence of one or more numbers (separated by "_") referring to the 
        positions of aligned items in the parallel sentence. For example: 14, 3, 7_8, 0.

        This information is derived from preprocessing the parallel texts with at 
        least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).

  The tokens are formatted as shown below:

  1. \*sur/sur:ali
     Unknown words. For example: *piquia/piquia:4
  2. sur:ali
     Special characters not tagged by the PoS tagger. For example: ":27
  3. sur/C[\+C]*:ali
     Other words and special characters tagged by the PoS tagger, in which
     C = bas<pos>A* and
     A = [attribute]+
     For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
     cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25
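
  The three token formats above can be read with a short Python sketch such as
  the one below. It is an illustration only, not the parser used by ReTraTos;
  the names Unit, Token and parse_token are hypothetical. It assumes that every
  token ends in ":ali", that surface forms contain no "/", and that "+" joins
  analyses only directly after a closing ">".

```python
import re
from dataclasses import dataclass, field

@dataclass
class Unit:
    bas: str    # lemma
    pos: str    # PoS tag, "" if absent
    atr: list   # morphological attribute values, without "<" and ">"

@dataclass
class Token:
    sur: str                # surface form
    ali: list               # aligned positions in the parallel sentence
    unknown: bool = False   # True for the "*sur/sur:ali" format
    units: list = field(default_factory=list)

def parse_token(text):
    """Parse one token written in any of the three formats."""
    # The alignment is whatever follows the last ":".
    rest, _, ali_text = text.rpartition(":")
    ali = [int(n) for n in ali_text.split("_")]
    # Unknown words are marked with a leading "*" and repeat the surface form.
    unknown = rest.startswith("*") and "/" in rest
    if unknown:
        rest = rest[1:]
    sur, slash, analysis = rest.partition("/")
    token = Token(sur=sur, ali=ali, unknown=unknown)
    if slash and not unknown:
        # Split multiword analyses such as can<vaux><pres>+not<adv>.
        for part in re.split(r"(?<=>)\+", analysis):
            tags = re.findall(r"<([^>]+)>", part)
            bas = part.split("<", 1)[0]
            token.units.append(Unit(bas, tags[0] if tags else "", tags[1:]))
    return token

for example in ['*piquia/piquia:4', '":27', 'cannot/can<vaux><pres>+not<adv>:7_8']:
    print(parse_token(example))
```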
     
  Example of input parallel sentences:
  
  Portuguese
  <s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>
  
  English
  <s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><sg>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>
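
  To see how the ali values link the two sentences, the following Python sketch
  lists the aligned word pairs for the opening tokens of the example above. It
  is an illustration only; split_token and aligned_pairs are hypothetical names.
  It assumes that positions are counted from 1 in the parallel sentence, as the
  example suggests, and it simply skips any position outside the sentence
  (such as 0).

```python
def split_token(token):
    """Return (surface form, list of aligned positions) for one token."""
    rest, _, ali_text = token.rpartition(":")
    # Drop the "*" that marks unknown words.
    if rest.startswith("*") and "/" in rest:
        rest = rest[1:]
    surface = rest.partition("/")[0]
    return surface, [int(n) for n in ali_text.split("_")]

def aligned_pairs(source_tokens, target_tokens):
    """List (source surface, target surface) pairs given by the source ali values."""
    targets = [split_token(t)[0] for t in target_tokens]
    pairs = []
    for token in source_tokens:
        surface, positions = split_token(token)
        for position in positions:
            # Assumption: positions are 1-based; out-of-range values are skipped.
            if 1 <= position <= len(targets):
                pairs.append((surface, targets[position - 1]))
    return pairs

# Opening tokens of the Portuguese and English example sentences.
pt = "Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4".split()
en = "The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3".split()

print(aligned_pairs(pt, en))
# [('Os', 'The'), ('alunos', 'students'), ('do', 'of'), ('do', 'the')]
```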


OUTPUT FORMAT

* Bilingual dictionaries are in an XML format very similar to that used by 
the Apertium open-source machine translation platform (http://apertium.sourceforge.net/)

* Transfer rules are in a human-readable format, and a new module is being
developed to convert them to Apertium's XML format

REQUIREMENTS

* ReTraTos requires Perl to be installed on the system.

QUICK START

1) Download the package retratos-VERSION.tar.gz

2) Unpack retratos and run the following ('#' means 'run this with root privileges'):
   $ cd retratos-VERSION
   $ ./configure
   $ make
   # make install

3) Use the dictionary inductor (ReTraTos_lex.pl)
   
   USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu]
    -sourcefile|s sourcefile    file with examples in source language (required)
    -targetfile|t targetfile    file with examples in target language (required)
    -beginning|b  headerfile    file with the beginning of a bilingual dictionary (required)
    -ending|e     footerfile    file with the ending of a bilingual dictionary (required)
    -attrsfile|a  attfile       file with information about attributes (optional)
    -multifreq|f  freqmwu       frequency threshold to filter multiword units (default=1)

   Sample:

   $ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50

4) Use the rule inductor (ReTraTos.pl)
   
   USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v]
    -sourcefile|s sourcefile  file with examples in source language (required)
    -targetfile|t targetfile  file with examples in target language (required)
    -type|ty      type        alignment type: 0, 1, 2 or 3 (all) (default=3)
    -level|l      level       rules' abstraction level(s) (default=pos)
    -include_gra|ig inpos     PoS for which to induce rules (default=all)
    -exclude_gra|eg outpos    PoS for which not to induce rules (default=none)
    -per_ident|pi percident   % for frequency threshold on pattern ident. (df=0.0015)
    -filter|fi                determines if filter will be applied (default=no)
    -per_filter|pf percfilt   % for frequency threshold on rule filtering (df=0.0015)
    -sort|so                  determines if sorting will be done (default=no)
    -remove|r                 remove auxiliary files
    -verbose|v                verbose   

   Sample:

   $ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so