Difference between revisions of "Talk:ReTraTos"

(New page: <pre> DESCRIPTION ReTraTos package is composed of two bilingual resources induction programs: - ReTraTos.pl: induces rules from corpora - ReTraTos_lex.pl: induces bilingual dictionaries...)
 
 
(3 intermediate revisions by the same user not shown)
<pre>
DESCRIPTION

The ReTraTos package is composed of two programs that induce bilingual resources:
- ReTraTos.pl: induces rules from corpora
- ReTraTos_lex.pl: induces bilingual dictionaries from corpora

At the moment there is no engine (in this package) to perform translation based
on the induced resources.

INPUT FORMAT

Two parallel texts are used as input for both inductors. In these texts each sentence
has to be delimited by an initial (<s>) and a final (</s>) tag. The initial sentence
tag (<s>) has an attribute (snum) whose value is an identifier for the
sentence. Parallel sentences have the same identifier in the source and target files.

Example:
Source sentence
<s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>
Target sentence (translation of source sentence identified as 1)
<s snum=1>targettoken1 targettoken2 ... targettokenn</s>

The tokens in each sentence have to be separated by white space, as shown above.
Each token can have at most 5 pieces of information:

1. sur: the surface form of a word or a special character, that is,
the token as it was found in the original sentences. For example: houses,
living and .

2. bas: the lemma of a word or a special character, a number, etc. when
it was tagged by the PoS tagger. For example: house, live and .

3. pos: PoS of the lexical item according to the PoS tagger. Words unknown
to the tagger (not tagged) and many special characters do not have this
information. For example: n (noun), vblex (verb) or nothing.

4. atr: the value of each morphological attribute of a PoS tag. Each attribute
value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).

5. ali: a sequence of one or more numbers (separated by "_") referring to the
positions of aligned items in the parallel sentences. For example: 14, 3, 7_8, 0.

This information is derived from preprocessing the parallel texts with at
least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).

The tokens are formatted as shown below:

1. \*sur/sur:ali
Unknown words. For example: *piquia/piquia:4
2. sur:ali
Special characters not tagged by the PoS tagger. For example: ":27
3. sur/C[\+C]*:ali
Other words and special characters tagged by the PoS tagger, in which
C = bas<pos>A* and
A = [attribute]+
For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25

Example of input parallel sentences:
Portuguese
<s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>
English
<s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><pl>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>

OUTPUT FORMAT

* Bilingual dictionaries are in an XML format very similar to that used by the
Apertium open-source machine translation platform (http://apertium.sourceforge.net/)

* Transfer rules are in a human-readable format, and a new module is being
developed to put them in Apertium's XML format

REQUIREMENTS

* ReTraTos needs Perl installed on the system.

QUICK START

1) Download the package retratos-VERSION.tar.gz

2) Unpack retratos and do ('#' means 'do that with root privileges'):
$ cd retratos-VERSION
$ ./configure
$ make
# make install

3) Use the dictionary inductor (ReTraTos_lex.pl)
USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu]
-sourcefile|s sourcefile file with examples in source language (required)
-targetfile|t targetfile file with examples in target language (required)
-beginning|b headerfile file with the beginning of a bilingual dictionary (required)
-ending|e footerfile file with the ending of a bilingual dictionary (required)
-attrsfile|a attfile file with information about attributes (optional)
-multifreq|f freqmwu frequency threshold to filter multiword units (default=1)

Sample:

$ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50

4) Use the rule inductor (ReTraTos.pl)
USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v]
-sourcefile|s sourcefile file with examples in source language (required)
-targetfile|t targetfile file with examples in target language (required)
-type|ty type alignment type: 0, 1, 2 or 3 (all) (default=3)
-level|l level rules' abstraction level(s) (default=pos)
-include_gra|ig inpos PoS for which to induce rules (default=all)
-exclude_gra|eg outpos PoS for which not to induce rules (default=none)
-per_ident|pi percident % for frequency threshold on pattern ident. (df=0.0015)
-filter|fi determines if filter will be applied (default=no)
-per_filter|pf percfilt % for frequency threshold on rule filtering (df=0.0015)
-sort|so determines if sorting will be done (default=no)
-remove|r remove auxiliary files
-verbose|v verbose

Sample:

$ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so

</pre>
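To make the token format above concrete, here is a minimal Perl sketch that splits one token into its sur, bas, pos, atr and ali pieces. The `parse_token` helper and its hash layout are assumptions made for illustration, not code from the ReTraTos package. It also assumes that the surface form contains no "/" and that lemmas contain no "+".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ReTraTos): split one token of the form
# sur/bas<pos><atr>...:ali into its pieces of information.
sub parse_token {
    my ($token) = @_;

    # The alignment is everything after the last ":" (numbers joined by "_").
    my ($body, $ali) = $token =~ /^(.*):(\d+(?:_\d+)*)$/
        or return;
    my %info = (ali => [ split /_/, $ali ], unknown => 0, analyses => []);

    if ($body =~ m{^\*([^/]+)/}) {
        # Format 1: unknown word, marked with a leading "*".
        @info{qw(sur unknown)} = ($1, 1);
    }
    elsif ($body =~ m{^(.+?)/(.+)$}) {
        # Format 3: surface form followed by one or more "+"-joined analyses.
        $info{sur} = $1;
        for my $part (split /\+/, $2) {
            my ($bas, $tags) = $part =~ /^([^<]*)((?:<[^>]+>)*)$/ or next;
            # The first tag is the PoS; any remaining tags are attributes.
            my ($pos, @atr) = $tags =~ /<([^>]+)>/g;
            push @{ $info{analyses} }, { bas => $bas, pos => $pos, atr => \@atr };
        }
    }
    else {
        # Format 2: special character that the PoS tagger left untagged.
        $info{sur} = $body;
    }
    return \%info;
}

my $tok = parse_token('cannot/can<vaux><pres>+not<adv>:7_8');
print "$tok->{sur} -> ",
      join(' + ', map { "$_->{bas}($_->{pos})" } @{ $tok->{analyses} }),
      " aligned to ", join(',', @{ $tok->{ali} }), "\n";
# prints: cannot -> can(vaux) + not(adv) aligned to 7,8
```

The alignment is peeled off first because a special character such as `"` has no "/" part at all, so the three formats can only be told apart once the trailing `:ali` has been removed.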

Latest revision as of 22:44, 1 April 2008

Giza → LIHLA

$ cat giza_to_lihla.pl 
#!/usr/bin/perl
# Program GIZA_to_LIHLA
# Input: lexically aligned file returned by GIZA++
# Output: files aligned by GIZA, in the LIHLA format
# Purpose: converts the GIZA output to the LIHLA format

use strict;
use locale;

if ($#ARGV < 2) {
    print "Uso: $0 <entrada> <dir_fonte> <dir_alvo>\n";
    exit 1;
};

my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);

$entrada = shift(@ARGV);
$dirfonte = shift(@ARGV);
$diralvo = shift(@ARGV);

if ($dirfonte !~ /\/$/) { 
        $dirfonte .= '/'; 
}
if ($diralvo !~ /\/$/) { 
        $diralvo .= '/'; 
}

mkdir($dirfonte);
mkdir($diralvo);

print STDERR "Dir fonte: $dirfonte\n";
print STDERR "Dir alvo: $diralvo\n";

# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
# etiquetados_pos/es/ES-ci-abr03_01 
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 }) 

$ci = 0;
$fonte = $alvo = "";
$sent = -1;

open(ARQ,$entrada) or die "Nao eh possivel abrir o arquivo $entrada\n";
while (<ARQ>) {
        s/\n//g;
        #if (/^([^\s]+\/)+([^\/]+)$/) {
        if  ($ci == 0) {
                if (($fonte ne "") && ($alvo ne "")) { 
                        close OUTF; 
                        close OUTA; 
                }
                $sent = 0;
                $alvo = $diralvo.$entrada;      # $2 was never set: the regex test above is commented out
                #print "$alvo\n";
                #$alvo =~ s/\.\w+[\n\s]*$//g; # remove the original extension
                $alvo .= '.al';                         # and append .al
                #print STDERR "Formatando arquivos $1 e ";
                #$t = <ARQ>;
                #if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) { 
                        $fonte = $dirfonte.$entrada;    # $1 was never set: the regex test above is commented out
                        #$fonte =~ s/\.\w+[\n\s]*$//g;  # remove the original extension
                        $fonte .= '.al';                                # and append .al
                #}
                #print STDERR "$1\n";
                #print STDERR "fonte: $fonte\n";
                #print STDERR "alvo: $alvo\n";
                open(OUTF,">$fonte") or die "Nao eh possivel abrir o arquivo $fonte\n";
                open(OUTA,">$alvo") or die "Nao eh possivel abrir o arquivo $alvo\n";
                $ci = 1;
        }
        elsif (/^\#/) { 
                $sent++;
                next; 
        }
        else {
                s/\n//;
                @talvo = split(/ /,$_);
                $_ = <ARQ>;
                s/\n//;
# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) 
                @tfonte = split(/\s+\}\)\s*/,$_);
# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
                ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
                @ali = split(/\s/,$al);
                while ($#ali >= 0) { 
                        $talvo[shift(@ali)-1] .= ":0"; 
                }
                $i = 0;
                while ($i <= $#tfonte) {
                        ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
                        if ($al =~ /\d+/) {
                                @ali = split(/\s/,$al);
                                $tfonte[$i] = $t.":".join("_",@ali);
                                while ($#ali >= 0) { 
                                        if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) { 
                                                $talvo[$ali[0]-1] .= ":".($i+1); 
                                        } else { 
                                                $talvo[$ali[0]-1] .= "_".($i+1); 
                                        }
                                        shift(@ali);
                                }
                        } else { 
                                $tfonte[$i] =~ s/\s*\(\{/:0/g; 
                        }
                        $i++;
                }

                map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo);

                print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
                print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
        }
}
close OUTF;
close OUTA;
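
The per-sentence conversion in the script above can be tried on a single pair without touching any files. This is only a sketch: the `giza_pair_to_lihla` helper is a hypothetical name, and it rewrites the logic with one regex loop rather than copying the script's code. The alignment line is the one quoted in the script's comments; the target sentence is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one GIZA++ sentence pair (the plain target line
# and the "word ({ positions })" line) into two LIHLA-style token strings.
sub giza_pair_to_lihla {
    my ($target_line, $align_line) = @_;
    my @target = split ' ', $target_line;
    my @links  = map { [] } @target;   # source positions linked to each target word
    my @source;

    # Each match is one "word ({ n n ... })" group; the first group is NULL.
    my $pos = 0;
    while ($align_line =~ /(\S+)\s+\(\{([\d\s]*)\}\)/g) {
        my $word    = $1;
        my @aligned = split ' ', $2;
        if ($pos > 0) {
            # Source token: word plus the target positions it is aligned to (0 = none).
            push @source, $word . ':' . (@aligned ? join('_', @aligned) : 0);
            push @{ $links[$_ - 1] }, $pos for @aligned;
        }
        $pos++;
    }

    # Target token: word plus the source positions pointing at it (0 = none,
    # which also covers words that GIZA++ aligned to NULL).
    my @out = map {
        $target[$_] . ':' . (@{ $links[$_] } ? join('_', @{ $links[$_] }) : 0)
    } 0 .. $#target;

    return (join(' ', @source), join(' ', @out));
}

my ($src, $tgt) = giza_pair_to_lihla(
    'globo analizar la atmosfera tropical',    # invented target sentence
    'NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })',
);
print "<s snum=0>$src</s>\n";
print "<s snum=0>$tgt</s>\n";
```

Here the third target word is aligned only to NULL, so it ends up as `la:0`, and the words after it point at source positions 3 and 4.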

Feature requests

  • Ability to specify constraints -- e.g. only allow nouns to be translated by nouns
  • Ability to turn off multiword generation altogether
  • Script to select "high quality" sentences from an aligned corpus -- e.g. strip out those with excess punctuation or numbers
  • Something to take into account an existing bilingual dictionary -- either as a bootstrap or something like this.