Difference between revisions of "Talk:ReTraTos"

From Apertium
(New page: <pre> DESCRIPTION ReTraTos package is composed of two bilingual resources induction programs: - ReTraTos.pl: induces rules from corpora - ReTraTos_lex.pl: induces bilingual dictionaries...)
 
<pre>
DESCRIPTION

The ReTraTos package is composed of two programs that induce bilingual resources:
- ReTraTos.pl: induces rules from corpora
- ReTraTos_lex.pl: induces bilingual dictionaries from corpora

At the moment there is no engine (in this package) to perform translation based
on the induced resources.

INPUT FORMAT

Two parallel texts are used as input for both inductors. In these texts each sentence
has to be delimited by initial (<s>) and final (</s>) tags. The initial sentence
tag (<s>) has an attribute (snum) whose value is an identifier for this
sentence. Parallel sentences have the same identifier in the source and target files.

Example:
  Source sentence
    <s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>
  Target sentence (translation of the source sentence identified as 1)
    <s snum=1>targettoken1 targettoken2 ... targettokenn</s>

The tokens in each sentence have to be separated by white space, as shown above.
Each token can have at most 5 pieces of information:

1. sur: the surface form of a word or a special character, that is,
   the token as it was found in the original sentences. For example: houses,
   living and .

2. bas: the lemma of a word or a special character, a number, etc. when
   it was tagged by the PoS tagger. For example: house, live and .

3. pos: PoS of the lexical item according to the PoS tagger. Words unknown
   to the tagger (not tagged) and many special characters do not have this
   information. For example: n (noun), vblex (verb) or nothing.

4. atr: the value of each morphological attribute of a PoS tag. Each attribute
   value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).

5. ali: a sequence of one or more numbers (separated by "_") referring to the
   positions of the aligned items in the parallel sentence. For example: 14, 3, 7_8, 0.

This information is derived from preprocessing the parallel texts with at
least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).

The tokens are formatted as shown below:

1. \*sur/sur:ali
   Unknown words. For example: *piquia/piquia:4
2. sur:ali
   Special characters not tagged by the PoS tagger. For example: ":27
3. sur/C[\+C]*:ali
   Other words and special characters tagged by the PoS tagger, in which
     C = bas<pos>A* and
     A = [attribute]+
   For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
   cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25

Example of input parallel sentences:
  Portuguese
    <s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>
  English
    <s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><pl>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>


OUTPUT FORMAT

* Bilingual dictionaries are in an XML format very similar to that used by the
  Apertium open-source machine translation platform (http://apertium.sourceforge.net/)

* Transfer rules are in a human-readable format, and a new module is being
  developed to convert them to Apertium's XML format


REQUIREMENTS

* ReTraTos needs Perl installed on the system.


QUICK START

1) Download the package retratos-VERSION.tar.gz

2) Unpack retratos and run ('#' means 'do that with root privileges'):
   $ cd retratos-VERSION
   $ ./configure
   $ make
   # make install

3) Use the dictionary inductor (ReTraTos_lex.pl)
   USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu]
     -sourcefile|s sourcefile   file with examples in source language (required)
     -targetfile|t targetfile   file with examples in target language (required)
     -beginning|b headerfile    file with the beginning of a bilingual dictionary (required)
     -ending|e footerfile       file with the ending of a bilingual dictionary (required)
     -attrsfile|a attfile       file with information about attributes (optional)
     -multifreq|f freqmwu       frequency threshold to filter multiword units (default=1)

   Sample:

   $ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50

4) Use the rule inductor (ReTraTos.pl)
   USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v]
     -sourcefile|s sourcefile   file with examples in source language (required)
     -targetfile|t targetfile   file with examples in target language (required)
     -type|ty type              alignment type: 0, 1, 2 or 3 (all) (default=3)
     -level|l level             rules' abstraction level(s) (default=pos)
     -include_gra|ig inpos      PoS for which to induce rules (default=all)
     -exclude_gra|eg outpos     PoS for which not to induce rules (default=none)
     -per_ident|pi percident    % for frequency threshold on pattern ident. (df=0.0015)
     -filter|fi                 determines if filter will be applied (default=no)
     -per_filter|pf percfilt    % for frequency threshold on rule filtering (df=0.0015)
     -sort|so                   determines if sorting will be done (default=no)
     -remove|r                  remove auxiliary files
     -verbose|v                 verbose

   Sample:

   $ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so
</pre>
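
The token layout described under INPUT FORMAT can be read back with a couple of regular expressions. The following is only a sketch and not part of the ReTraTos package: parse_token and parse_sentence are hypothetical helper names, and the code assumes that lemmas contain no "<" or "+" and that the alignment always follows the last ":" of a token.

```perl
use strict;
use warnings;

# Split one token such as "houses/house<n><pl>:14" into its pieces.
# Returns a hash reference, or nothing if the token does not fit the format.
sub parse_token {
    my ($token) = @_;

    # The alignment is whatever follows the last ":" (e.g. "14", "7_8", "0").
    my ($body, $ali) = $token =~ /^(.*):(\d+(?:_\d+)*)$/
        or return;
    my %info = (sur => $body, unknown => 0, units => [], ali => [ split /_/, $ali ]);

    if ($body =~ m{^\*(.+)/.+$}) {
        # Form 1, unknown word: "*sur/sur"
        @info{qw(sur unknown)} = ($1, 1);
    }
    elsif ($body =~ m{^(.+?)/(.+>)$}) {
        # Form 3, tagged item: "sur/C+C...", where C = bas<pos><atr>...
        $info{sur} = $1;
        for my $unit (split /\+/, $2) {
            my ($bas, $tags) = $unit =~ /^([^<]*)((?:<[^>]+>)+)$/
                or return;
            my ($pos, @atr) = $tags =~ /<([^>]+)>/g;
            push @{ $info{units} }, { bas => $bas, pos => $pos, atr => [@atr] };
        }
    }
    # Otherwise form 2, untagged special character: the body is the surface form.
    return \%info;
}

# Split a "<s snum=N>...</s>" line into its identifier and parsed tokens.
sub parse_sentence {
    my ($line) = @_;
    my ($snum, $text) = $line =~ m{^<s snum=(\d+)>(.*)</s>\s*$}
        or return;
    my @tokens;
    for my $tok (split ' ', $text) {
        my $info = parse_token($tok);
        push @tokens, $info if $info;    # tokens that do not parse are skipped
    }
    return { snum => $snum, tokens => \@tokens };
}

# Usage: print surface form, lemmas and alignment of each token.
my $sentence = parse_sentence(
    '<s snum=1>houses/house<n><pl>:14 cannot/can<vaux><pres>+not<adv>:7_8 *piquia/piquia:4 </s>'
);
for my $tok (@{ $sentence->{tokens} }) {
    my @lemmas = map { $_->{bas} } @{ $tok->{units} };
    print "$tok->{sur}\t@lemmas\t@{ $tok->{ali} }\n";
}
```

Taking the alignment from the last ":" keeps tokens whose surface form is itself a colon parseable; whether the real inductors do the same is not stated in the description above.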

Revision as of 11:30, 20 March 2008

Giza → LIHLA

#!/usr/bin/perl
# Program GIZA_to_LIHLA
# Input: lexically aligned file returned by GIZA++
# Output: the files aligned by GIZA, in LIHLA format
# Function: converts the output of GIZA to the LIHLA format

use strict;
use locale;

if ($#ARGV < 2) {
    print "Usage: $0 <input> <source_dir> <target_dir>\n";
    exit 1;
};

my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);

$entrada = shift(@ARGV);
$dirfonte = shift(@ARGV);
$diralvo = shift(@ARGV);

if ($dirfonte !~ /\/$/) { 
        $dirfonte .= '/'; 
}
if ($diralvo !~ /\/$/) { 
        $diralvo .= '/'; 
}

mkdir($dirfonte);
mkdir($diralvo);

print STDERR "Source dir: $dirfonte\n";
print STDERR "Target dir: $diralvo\n";

# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
# etiquetados_pos/es/ES-ci-abr03_01 
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 }) 

$ci = 0;
$fonte = $alvo = "";
$sent = -1;

open(ARQ,'<',$entrada) or die "Cannot open file $entrada\n";
while (<ARQ>) {
        s/\n//g;
        #if (/^([^\s]+\/)+([^\/]+)$/) {
        if  ($ci == 0) {
                if (($fonte ne "") && ($alvo ne "")) { 
                        close OUTF; 
                        close OUTA; 
                }
                $sent = 0;
                # Both output files are named after the input file, with ".al" appended.
                $alvo = $diralvo.$entrada.'.al';
                $fonte = $dirfonte.$entrada.'.al';
                open(OUTF,">$fonte") or die "Cannot open file $fonte\n";
                open(OUTA,">$alvo") or die "Cannot open file $alvo\n";
                $ci = 1;
        }
        elsif (/^\#/) { 
                $sent++;
                next; 
        }
        else {
                s/\n//;
                @talvo = split(/ /,$_);
                $_ = <ARQ>;
                s/\n//;
                # The alignment line read above looks like:
                # NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })
                @tfonte = split(/\s+\}\)\s*/,$_);
                # After the split, @tfonte holds:
                # NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
                ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
                @ali = split(/\s/,$al);
                while ($#ali >= 0) { 
                        $talvo[shift(@ali)-1] .= ":0"; 
                }
                $i = 0;
                while ($i <= $#tfonte) {
                        ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
                        if ($al =~ /\d+/) {
                                @ali = split(/\s/,$al);
                                $tfonte[$i] = $t.":".join("_",@ali);
                                while ($#ali >= 0) { 
                                        if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) { 
                                                $talvo[$ali[0]-1] .= ":".($i+1); 
                                        } else { 
                                                $talvo[$ali[0]-1] .= "_".($i+1); 
                                        }
                                        shift(@ali);
                                }
                        } else { 
                                $tfonte[$i] =~ s/\s*\(\{/:0/g; 
                        }
                        $i++;
                }

                map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo);

                print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
                print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
        }
}
close OUTF;
close OUTA;
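
The per-sentence step of the script can also be written as a function, which makes it easier to try on a single sentence pair. This is a sketch under assumptions, not a replacement for the script above: giza_pair_to_lihla is a hypothetical name, the alignment line is assumed to have the shape "NULL ({ ... }) word ({ ... }) ...", and the words s1..s4 and t1..t5 in the usage example are made-up placeholders.

```perl
use strict;
use warnings;

# Convert one GIZA++ sentence pair into two LIHLA-style token strings.
#   $target_line: the plain sentence line, tokens separated by spaces
#   $align_line : "NULL ({ 3 }) s1 ({ 1 }) ...", that is, each word of the other
#                 language followed by the 1-based positions it is aligned to
# Returns (source string, target string), each token suffixed with ":ali".
sub giza_pair_to_lihla {
    my ($target_line, $align_line) = @_;
    my @target = split ' ', $target_line;
    my (@source, @links);    # $links[$j] lists the source positions linked to target $j

    my $i = 0;               # entry 0 is NULL; real source words start at 1
    while ($align_line =~ /(\S+)\s+\(\{\s*([\d\s]*?)\s*\}\)/g) {
        my ($word, $list) = ($1, $2);
        my @ali = split ' ', $list;
        if ($i > 0) {
            # Source token: word plus its target positions, or 0 if unaligned.
            push @source, $word . ':' . (@ali ? join('_', @ali) : 0);
            # Record the reverse link on every valid target position.
            push @{ $links[$_ - 1] }, $i for grep { $_ >= 1 && $_ <= @target } @ali;
        }
        $i++;
    }

    my @target_out;
    for my $j (0 .. $#target) {
        my @l = @{ $links[$j] || [] };
        push @target_out, $target[$j] . ':' . (@l ? join('_', @l) : 0);
    }
    return (join(' ', @source), join(' ', @target_out));
}

# Usage with placeholder words.
my ($src, $tgt) = giza_pair_to_lihla(
    't1 t2 t3 t4 t5',
    'NULL ({ 3 }) s1 ({ 1 }) s2 ({ 2 }) s3 ({ 4 5 }) s4 ({ }) '
);
print "<s snum=0>$src</s>\n";
print "<s snum=0>$tgt</s>\n";
```

Positions aligned to NULL simply end up with no links and therefore get ":0", which is the same result the script reaches by appending ":0" to those tokens first.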