Difference between revisions of "Talk:ReTraTos"

From Apertium
(New page: <pre> DESCRIPTION ReTraTos package is composed of two bilingual resources induction programs: - ReTraTos.pl: induces rules from corpora - ReTraTos_lex.pl: induces bilingual dictionaries...)
 
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
  +
==Giza → LIHLA==
<pre>
 
DESCRIPTION
 
   
The ReTraTos package consists of two programs that induce bilingual resources:
 
- ReTraTos.pl: induces rules from corpora
 
- ReTraTos_lex.pl: induces bilingual dictionaries from corpora
 
   
  +
<pre>
At the moment there is no engine (in this package) to perform translation based
 
  +
$ cat giza_to_lihla.pl
on the induced resources.
 
  +
#!/usr/bin/perl
  +
# Program GIZA_to_LIHLA
  +
# Input: lexically aligned file produced by GIZA++
  +
# Output: files aligned by GIZA, in LIHLA format
  +
# Purpose: converts GIZA output to the LIHLA format
   
  +
use strict;
INPUT FORMAT
 
  +
use locale;
   
  +
if ($#ARGV < 2) {
Two parallel texts are used as input for both inductors. In these texts, each sentence
 
  +
print "Uso: $0 <entrada> <dir_fonte> <dir_alvo>\n";
has to be tagged with initial (<s>) and final (</s>) tags. The initial sentence
 
  +
exit 1;
tag (<s>) has an attribute (snum) whose value is an identifier for this
 
  +
};
sentence. Parallel sentences have the same identifier in source and target files.
 
   
  +
my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);
Example:
 
 
Source sentence
 
<s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>
 
 
Target sentence (translation of source sentence identified as 1)
 
<s snum=1>targettoken1 targettoken2 ... targettokenn</s>
 
   
  +
$entrada = shift(@ARGV);
Each token in each sentence has to be separated by white space, as shown above.
 
  +
$dirfonte = shift(@ARGV);
Each token can have at most 5 pieces of information:
 
  +
$diralvo = shift(@ARGV);
   
  +
if ($dirfonte !~ /\/$/) {
1. sur: the surface form of a word or a special character, that is,
 
  +
$dirfonte .= '/';
the token as it was found in the original sentences. For example: houses,
 
  +
}
living and .
 
  +
if ($diralvo !~ /\/$/) {
  +
$diralvo .= '/';
  +
}
   
  +
mkdir($dirfonte);
2. bas: the lemma of a word or a special character, a number, etc. when
 
  +
mkdir($diralvo);
it was tagged by the PoS tagger. For example: house, live and .
 
   
  +
print STDERR "Dir fonte: $dirfonte\n";
3. pos: PoS of lexical item according to the PoS tagger. The words unknown
 
  +
print STDERR "Dir alvo: $diralvo\n";
by the tagger (not tagged) and many special characters do not have this
 
information. For example: n (noun), vblex (verb) or nothing.
 
   
  +
# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
4. atr: the value of each morphological attribute of a PoS tag. Each attribute
 
  +
# etiquetados_pos/es/ES-ci-abr03_01
value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).
 
  +
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 })
   
  +
$ci = 0;
5. ali: a sequence of one or more numbers (separated by "_") referring to the
 
  +
$fonte = $alvo = "";
positions of aligned items in the parallel sentences. For example: 14, 3, 7_8, 0.
 
  +
$sent = -1;
   
  +
open(ARQ,$entrada) or die "Nao eh possivel abrir o arquivo $entrada\n";
This information is derived from preprocessing the parallel texts with at
 
  +
while (<ARQ>) {
least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).
 
  +
s/\n//g;
  +
#if (/^([^\s]+\/)+([^\/]+)$/) {
  +
if ($ci == 0) {
  +
if (($fonte ne "") && ($alvo ne "")) {
  +
close OUTF;
  +
close OUTA;
  +
}
  +
$sent = 0;
  +
$alvo = $diralvo.$entrada.$2; # $2 is undefined here: the match above is commented out
  +
#print "$alvo\n";
  +
#$alvo =~ s/\.\w+[\n\s]*$//g; # remove the original extension
  +
$alvo .= '.al'; # and append .al
  +
#print STDERR "Formatando arquivos $1 e ";
  +
#$t = <ARQ>;
  +
#if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) {
  +
$fonte = $dirfonte.$entrada.$1; # $1 is undefined here: the match above is commented out
  +
#$fonte =~ s/\.\w+[\n\s]*$//g; # remove the original extension
  +
$fonte .= '.al'; # and append .al
  +
#}
  +
#print STDERR "$1\n";
  +
#print STDERR "fonte: $fonte\n";
  +
#print STDERR "alvo: $alvo\n";
  +
open(OUTF,">$fonte") or die "Nao eh possivel abrir o arquivo $fonte\n";
  +
open(OUTA,">$alvo") or die "Nao eh possivel abrir o arquivo $alvo\n";
  +
$ci = 1;
  +
}
  +
elsif (/^\#/) {
  +
$sent++;
  +
next;
  +
}
  +
else {
  +
s/\n//;
  +
@talvo = split(/ /,$_);
  +
$_ = <ARQ>;
  +
s/\n//;
  +
# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })
  +
@tfonte = split(/\s+\}\)\s*/,$_);
  +
# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
  +
($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
  +
@ali = split(/\s/,$al);
  +
while ($#ali >= 0) {
  +
$talvo[shift(@ali)-1] .= ":0";
  +
}
  +
$i = 0;
  +
while ($i <= $#tfonte) {
  +
($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
  +
if ($al =~ /\d+/) {
  +
@ali = split(/\s/,$al);
  +
$tfonte[$i] = $t.":".join("_",@ali);
  +
while ($#ali >= 0) {
  +
if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) {
  +
$talvo[$ali[0]-1] .= ":".($i+1);
  +
} else {
  +
$talvo[$ali[0]-1] .= "_".($i+1);
  +
}
  +
shift(@ali);
  +
}
  +
} else {
  +
$tfonte[$i] =~ s/\s*\(\{/:0/g;
  +
}
  +
$i++;
  +
}
   
  +
map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo);
The tokens are formatted as shown below:
 
   
  +
print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
1. \*sup/sup:ali
 
  +
print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
Unknown words. For example: *piquia/piquia:4
 
2. sup:ali
+
}
  +
}
Special characters not tagged by the PoS tagger. For example: ":27
 
  +
close OUTF;
3. sup/C[\+C]*:ali
 
  +
close OUTA;
Other words and special characters tagged by the PoS tagger, in which
 
  +
</pre>
C = base<pos>A* and
 
A = [attribute]+
 
For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
 
cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25
 
 
Example of input parallel sentences:
 
 
Portuguese
 
<s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>
 
 
English
 
<s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><pl>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>
 
   
  +
==Feature requests==
   
  +
* Ability to specify constraints -- e.g. only allow nouns to be translated by nouns
OUTPUT FORMAT
 
  +
* Ability to turn off multiword generation altogether
 
  +
* Script to select "high quality" sentences from an aligned corpus -- e.g. strip out those with excess punctuation or numbers
* Bilingual dictionaries are in an XML format very similar to that used by
 
  +
* Something to take into account an existing bilingual dictionary -- either as a bootstrap or something like this.
Apertium open-source machine translation platform (http://apertium.sourceforge.net/)
 
 
* Transfer rules are in a human-readable format, and a new module is being
 
developed to convert them to Apertium's XML format
 
 
REQUIREMENTS
 
 
* ReTraTos needs Perl installed in the system.
 
 
QUICK START
 
 
1) Download the package retratos-VERSION.tar.gz
 
 
2) Unpack retratos and do ('#' means 'do that with root privileges'):
 
$ cd retratos-VERSION
 
$ ./configure
 
$ make
 
# make install
 
 
3) Use the dictionary inductor (ReTraTos_lex.pl)
 
 
USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu]
 
-sourcefile|s sourcefile file with examples in source language (required)
 
-targetfile|t targetfile file with examples in target language (required)
 
-beginning|b headerfile file with the beginning of a bilingual dictionary (required)
 
-ending|e footerfile file with the ending of a bilingual dictionary (required)
 
-attrsfile|a attfile file with information about attributes (optional)
 
-multifreq|f freqmwu frequency threshold to filter multiword units (default=1)
 
 
Sample:
 
 
$ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50
 
 
4) Use the rule inductor (ReTraTos.pl)
 
 
USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v]
 
-sourcefile|s sourcefile file with examples in source language (required)
 
-targetfile|t targetfile file with examples in target language (required)
 
-type|ty type alignment type: 0, 1, 2 or 3 (all) (default=3)
 
-level|l level rules\' abstraction level(s) (default=pos)
 
-include_gra|ig inpos PoS for which induce rules (default=all)
 
-exclude_gra|eg outpos PoS for which do not induce rules (default=none)
 
-per_ident|pi percident % for frequency threshold on pattern ident. (df=0.0015)
 
-filter|fi determines if filter will be applied (default=no)
 
-per_filter|pf percfilt % for frequency threshold on rule filtering (df=0.0015)
 
-sort|so determines if sorting will be done (default=no)
 
-remove|r remove auxiliary files
 
-verbose|v verbose
 
 
Sample:
 
 
$ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so
 
 
</pre>
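
As an illustration of the token format described in the README above (sur/bas<pos><atr>:ali), the following Perl sketch splits one token into its parts. It is not part of the ReTraTos package: the function name parse_token and the layout of the returned hash are assumptions made for this example, and surface forms that themselves contain "/" or "+" are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from ReTraTos): parse one token of the form
#   *sur/sur:ali        unknown word
#   sur:ali             untagged special character
#   sur/C[+C]*:ali      tagged item, with C = bas<pos><atr>...
# Returns a hash ref, or nothing if the token does not fit the format.
sub parse_token {
    my ($tok) = @_;

    # The alignment is the trailing ":n" or ":n_m_..." part.
    my ($body, $ali) = $tok =~ /^(.+):(\d+(?:_\d+)*)$/ or return;

    # Surface form comes before the first "/"; the analyses come after it.
    my ($sur, $rest) = split m{/}, $body, 2;

    # A leading "*" marks a word unknown to the tagger.
    my $unknown = ($sur =~ s/^\*(?=.)//) ? 1 : 0;

    my @lex;
    for my $c (defined $rest ? split(/\+/, $rest) : ()) {
        my ($bas, $tags) = $c =~ /^([^<]*)((?:<[^>]+>)*)$/ or return;
        my @tags = $tags =~ /<([^>]+)>/g;
        my $pos  = shift @tags;            # first tag is the PoS, the rest are attributes
        push @lex, { bas => $bas, pos => $pos, atr => [@tags] };
    }

    return {
        sur     => $sur,
        unknown => $unknown,
        lex     => \@lex,
        ali     => [ split /_/, $ali ],
    };
}

my $t = parse_token('cannot/can<vaux><pres>+not<adv>:7_8');
print join(' + ', map { "$_->{bas}/$_->{pos}" } @{ $t->{lex} }), "\n";
```

A token aligned to nothing carries the alignment 0, so ali then holds the single value 0.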
 

Latest revision as of 22:44, 1 April 2008

Giza → LIHLA

$ cat giza_to_lihla.pl 
#!/usr/bin/perl
# Program GIZA_to_LIHLA
# Input: lexically aligned file produced by GIZA++
# Output: files aligned by GIZA, in LIHLA format
# Purpose: converts GIZA output to the LIHLA format

use strict;
use locale;

if ($#ARGV < 2) {
    print "Uso: $0 <entrada> <dir_fonte> <dir_alvo>\n";
    exit 1;
};

my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);

$entrada = shift(@ARGV);
$dirfonte = shift(@ARGV);
$diralvo = shift(@ARGV);

if ($dirfonte !~ /\/$/) { 
        $dirfonte .= '/'; 
}
if ($diralvo !~ /\/$/) { 
        $diralvo .= '/'; 
}

mkdir($dirfonte);
mkdir($diralvo);

print STDERR "Dir fonte: $dirfonte\n";
print STDERR "Dir alvo: $diralvo\n";

# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
# etiquetados_pos/es/ES-ci-abr03_01 
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 }) 

$ci = 0;
$fonte = $alvo = "";
$sent = -1;

open(ARQ,$entrada) or die "Nao eh possivel abrir o arquivo $entrada\n";
while (<ARQ>) {
        s/\n//g;
        #if (/^([^\s]+\/)+([^\/]+)$/) {
        if  ($ci == 0) {
                if (($fonte ne "") && ($alvo ne "")) { 
                        close OUTF; 
                        close OUTA; 
                }
                $sent = 0;
                $alvo = $diralvo.$entrada.$2;           # $2 is undefined here: the match above is commented out
                #print "$alvo\n";
                #$alvo =~ s/\.\w+[\n\s]*$//g; # remove the original extension
                $alvo .= '.al';                         # and append .al
                #print STDERR "Formatando arquivos $1 e ";
                #$t = <ARQ>;
                #if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) { 
                        $fonte = $dirfonte.$entrada.$1; # $1 is likewise undefined
                        #$fonte =~ s/\.\w+[\n\s]*$//g;  # remove the original extension
                        $fonte .= '.al';                                # and append .al
                #}
                #print STDERR "$1\n";
                #print STDERR "fonte: $fonte\n";
                #print STDERR "alvo: $alvo\n";
                open(OUTF,">$fonte") or die "Nao eh possivel abrir o arquivo $fonte\n";
                open(OUTA,">$alvo") or die "Nao eh possivel abrir o arquivo $alvo\n";
                $ci = 1;
        }
        elsif (/^\#/) { 
                $sent++;
                next; 
        }
        else {
                s/\n//;
                @talvo = split(/ /,$_);
                $_ = <ARQ>;
                s/\n//;
# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) 
                @tfonte = split(/\s+\}\)\s*/,$_);
# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
                ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
                @ali = split(/\s/,$al);
                while ($#ali >= 0) { 
                        $talvo[shift(@ali)-1] .= ":0"; 
                }
                $i = 0;
                while ($i <= $#tfonte) {
                        ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
                        if ($al =~ /\d+/) {
                                @ali = split(/\s/,$al);
                                $tfonte[$i] = $t.":".join("_",@ali);
                                while ($#ali >= 0) { 
                                        if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) { 
                                                $talvo[$ali[0]-1] .= ":".($i+1); 
                                        } else { 
                                                $talvo[$ali[0]-1] .= "_".($i+1); 
                                        }
                                        shift(@ali);
                                }
                        } else { 
                                $tfonte[$i] =~ s/\s*\(\{/:0/g; 
                        }
                        $i++;
                }

                map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo);

                print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
                print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
        }
}
close OUTF;
close OUTA;
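
The core of the conversion above is turning one GIZA++ alignment line into LIHLA-style tokens. The following is a minimal sketch of that step only, written as a standalone function so it can be tried without input files. The function name giza_pair_to_lihla and the placeholder target tokens t1..t5 are assumptions for this example; it is a restatement of the idea, not a drop-in replacement for the script.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one GIZA++ sentence pair to LIHLA-style tokens.
#   $target_line: the plain sentence line, tokens separated by single spaces
#   $align_line:  e.g. "NULL ({ 3 }) balao ({ 1 }) analisar ({ 2 })"
# Returns two array refs: tokens of the aligned line and tokens of the plain
# line, each suffixed with ":positions" (":0" when there is no alignment).
sub giza_pair_to_lihla {
    my ($target_line, $align_line) = @_;

    my @target = split / /, $target_line;
    my @target_ali;                        # per target token: aligned source positions
    push @target_ali, [] for @target;

    my @source;
    my $src_pos = 0;                       # position 0 is the NULL entry

    # Each entry looks like "token ({ n m ... })"; the list may be empty.
    while ($align_line =~ /(\S+)\s+\(\{\s*([\d\s]*?)\s*\}\)/g) {
        my ($token, $positions) = ($1, $2);
        my @ali = split ' ', $positions;
        if ($src_pos > 0) {                # skip NULL: its targets simply stay unaligned
            push @source, $token . ':' . (@ali ? join('_', @ali) : '0');
            push @{ $target_ali[$_ - 1] }, $src_pos for @ali;
        }
        $src_pos++;
    }

    my @target_out = map {
        my @a = @{ $target_ali[$_] };
        $target[$_] . ':' . (@a ? join('_', @a) : '0');
    } 0 .. $#target;

    return (\@source, \@target_out);
}

my ($src, $tgt) = giza_pair_to_lihla(
    't1 t2 t3 t4 t5',
    'NULL ({ 3 }) balao ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) ',
);
print "<s snum=0>", join(' ', @$src), "</s>\n";
print "<s snum=0>", join(' ', @$tgt), "</s>\n";
```

Collecting the positions in a list per token and joining them once with "_" avoids the string tests on ":" that the original loop needs to decide between ":" and "_".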

Feature requests

  • Ability to specify constraints -- e.g. only allow nouns to be translated by nouns
  • Ability to turn off multiword generation altogether
  • Script to select "high quality" sentences from an aligned corpus -- e.g. strip out those with excess punctuation or numbers
  • Something to take into account an existing bilingual dictionary -- either as a bootstrap or something like this.
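
To make the third request more concrete, here is one possible shape for such a filter, as a rough sketch only. It assumes input lines in the `<s snum=N>token token</s>` format used by the script above, treats a token as "noise" when its surface form consists only of punctuation or digits, and keeps a pair when both sides stay at or below a threshold. The names noise_ratio and keep_pair, and the threshold itself, are hypothetical; nothing like this exists in ReTraTos as far as this page shows.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of tokens in one <s>...</s> line whose
# surface form is made up only of punctuation characters and digits.
sub noise_ratio {
    my ($line) = @_;
    (my $body = $line) =~ s{^<s snum=\d+>|\s*</s>\s*$}{}g;
    my @tokens = split ' ', $body;
    return 1 unless @tokens;               # an empty sentence counts as pure noise

    my $noisy = 0;
    for my $tok (@tokens) {
        (my $sur = $tok) =~ s/:\d+(?:_\d+)*$//;   # drop the ":ali" suffix
        $sur =~ s{(?<=.)/.*$}{}s;                 # drop "/analysis", keep the surface form
        $sur =~ s/^\*(?=.)//;                     # drop the unknown-word marker
        $noisy++ if $sur =~ /^[[:punct:]\d]+$/;
    }
    return $noisy / @tokens;
}

# Keep a sentence pair only if neither side exceeds the noise threshold.
sub keep_pair {
    my ($src_line, $tgt_line, $max_ratio) = @_;
    return noise_ratio($src_line) <= $max_ratio
        && noise_ratio($tgt_line) <= $max_ratio;
}

my $ok = keep_pair(
    '<s snum=1>houses/house<n><pl>:1 ,/,<cm>:2</s>',
    '<s snum=1>casas/casa<n><f><pl>:1 ,/,<cm>:2</s>',
    0.5,
);
print $ok ? "keep\n" : "drop\n";
```

A threshold on the ratio, rather than an absolute count, keeps long sentences with ordinary punctuation from being dropped; the right value would have to be found by experiment on a real corpus.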