Difference between revisions of "Talk:ReTraTos"

Latest revision as of 22:44, 1 April 2008

Giza → LIHLA

$ cat giza_to_lihla.pl 
#!/usr/bin/perl
# Program GIZA_to_LIHLA
# Input: lexically aligned file produced by GIZA++
# Output: the GIZA++ alignments written as files in LIHLA format
# Purpose: converts GIZA++ output to the LIHLA format

use strict;
use locale;

if ($#ARGV < 2) {
    print "Usage: $0 <entrada> <dir_fonte> <dir_alvo>\n";
    exit 1;
};

my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);

$entrada = shift(@ARGV);
$dirfonte = shift(@ARGV);
$diralvo = shift(@ARGV);

if ($dirfonte !~ /\/$/) { 
        $dirfonte .= '/'; 
}
if ($diralvo !~ /\/$/) { 
        $diralvo .= '/'; 
}

mkdir($dirfonte);
mkdir($diralvo);

print STDERR "Source dir: $dirfonte\n";
print STDERR "Target dir: $diralvo\n";

# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
# etiquetados_pos/es/ES-ci-abr03_01 
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 }) 

$ci = 0;
$fonte = $alvo = "";
$sent = -1;

open(ARQ,$entrada) or die "Cannot open file $entrada\n";
while (<ARQ>) {
        s/\n//g;
        # The first input line only triggers opening the two output files.
        # The regex that used to capture $1 and $2 is commented out, so both
        # were always empty and the names reduce to <dir><entrada>.al.
        if  ($ci == 0) {
                if (($fonte ne "") && ($alvo ne "")) { 
                        close OUTF; 
                        close OUTA; 
                }
                $sent = 0;
                $alvo = $diralvo.$entrada.'.al';        # target output file
                $fonte = $dirfonte.$entrada.'.al';      # source output file
                open(OUTF,">$fonte") or die "Cannot open file $fonte\n";
                open(OUTA,">$alvo") or die "Cannot open file $alvo\n";
                $ci = 1;
        }
        elsif (/^\#/) { 
                $sent++;
                next; 
        }
        else {
                s/\n//;
                @talvo = split(/ /,$_);
                $_ = <ARQ>;
                s/\n//;
# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) 
                @tfonte = split(/\s+\}\)\s*/,$_);
# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
                ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
                @ali = split(/\s/,$al);
                while ($#ali >= 0) { 
                        $talvo[shift(@ali)-1] .= ":0"; 
                }
                $i = 0;
                while ($i <= $#tfonte) {
                        ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
                        if ($al =~ /\d+/) {
                                @ali = split(/\s/,$al);
                                $tfonte[$i] = $t.":".join("_",@ali);
                                while ($#ali >= 0) { 
                                        if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) { 
                                                $talvo[$ali[0]-1] .= ":".($i+1); 
                                        } else { 
                                                $talvo[$ali[0]-1] .= "_".($i+1); 
                                        }
                                        shift(@ali);
                                }
                        } else { 
                                $tfonte[$i] =~ s/\s*\(\{/:0/g; 
                        }
                        $i++;
                }

                # Any target token still lacking an alignment suffix is unaligned (":0")
                for (@talvo) { $_ .= ":0" unless /:\d+(_\d+)*$/; }

                print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
                print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
        }
}
close OUTF;
close OUTA;
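
The core of the conversion above can be illustrated on a single sentence pair. The following is only a minimal sketch, not the script itself: the function name giza_pair_to_lihla and the sample target sentence are hypothetical, accents are dropped from the sample tokens to keep the example ASCII-only, and the NULL entry is simply skipped because target tokens without any link end up as ":0" anyway.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert one GIZA++ sentence pair into
# LIHLA-style token strings.
#   $target_line : plain target tokens separated by spaces
#   $source_line : source tokens, each followed by "({ i j ... })"
sub giza_pair_to_lihla {
    my ($target_line, $source_line) = @_;
    my @target = split ' ', $target_line;

    # Collect [token, [aligned target positions]] pairs, NULL included.
    my @pairs;
    while ($source_line =~ /(\S+) \(\{([^}]*)\}\)/g) {
        push @pairs, [ $1, [ split ' ', $2 ] ];
    }
    shift @pairs;    # drop the NULL entry

    # For each target token, the list of source positions linked to it.
    my @target_ali = map { [] } @target;

    my @source_out;
    for my $i (0 .. $#pairs) {
        my ($token, $ali) = @{ $pairs[$i] };
        # Source token gets its target positions joined by "_", or 0.
        push @source_out, $token . ':' . (@$ali ? join('_', @$ali) : 0);
        # Record the 1-based source position on every linked target token.
        push @{ $target_ali[$_ - 1] }, $i + 1 for @$ali;
    }

    my @target_out = map {
        my $ali = $target_ali[$_];
        $target[$_] . ':' . (@$ali ? join('_', @$ali) : 0);
    } 0 .. $#target;

    return (join(' ', @source_out), join(' ', @target_out));
}

my ($src, $tgt) = giza_pair_to_lihla(
    'globo analizar la atmosfera tropical',
    'NULL ({ 3 }) balao ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) ',
);
print "<s snum=1>$src</s>\n";
print "<s snum=1>$tgt</s>\n";
```

Note that the positions written on the source side refer to target tokens, while the positions written on the target side refer to source tokens counted without NULL, which is why "atmosfera" is linked to target position 4 but appears as source position 3 on the target side.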

Feature requests

Ability to specify constraints -- e.g. only allow nouns to be translated by nouns
Ability to turn off multiword generation altogether
Script to select "high quality" sentences from an aligned corpus -- e.g. strip out those with excess punctuation or numbers
A way to take an existing bilingual dictionary into account -- e.g. as a bootstrap
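
As a starting point for the sentence-selection request, one possible heuristic is to reject sentences in which too many tokens are pure punctuation or digits. The sketch below is an assumption about what "high quality" could mean, not an existing ReTraTos feature: the function name looks_high_quality and the default threshold of 0.3 are made up for illustration, and it works on plain space-separated surface tokens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if at most $max_ratio of the tokens in
# $sentence consist solely of punctuation and/or digits, 0 otherwise.
sub looks_high_quality {
    my ($sentence, $max_ratio) = @_;
    $max_ratio = 0.3 unless defined $max_ratio;    # arbitrary default
    my @tokens = split ' ', $sentence;
    return 0 unless @tokens;                       # empty sentences are rejected
    my $noisy = grep { /^[[:punct:][:digit:]]+$/ } @tokens;
    return ($noisy / scalar(@tokens)) <= $max_ratio ? 1 : 0;
}

for my $sentence ('the balloon analyses the tropical atmosphere .',
                  '1 , 2 , 3 , 4 total') {
    printf "%s\t%s\n", (looks_high_quality($sentence) ? 'keep' : 'drop'), $sentence;
}
```

For a parallel corpus, a pair would presumably be kept only when both the source and the target sentence pass, so that the two files stay aligned.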


@@ Line 1: / Line 1: @@
+==Giza → LIHLA==
-<pre>
-DESCRIPTION
-ReTraTos package is composed of two bilingual resources induction programs:
- - ReTraTos.pl: induces rules from corpora
- - ReTraTos_lex.pl: induces bilingual dictionaries from corpora
+<pre>
-At the moment there is no engine (in this package) to perform translation based
+$ cat giza_to_lihla.pl
-on the induced resources.
+#!/usr/bin/perl
+# Programa GIZA_to_LIHLA
+# Entrada: arquivo alinhado lexicalmente retornado por GIZA++
+# Saida: Arquivos alinhados por GIZA no formato de LIHLA
+# Funcao: Converte a saida de GIZA no padrao de LIHLA
+use strict;
-INPUT FORMAT
+use locale;
+if ($#ARGV < 2) {
-Two parallel texts are used as input for both inductors. In this text each sentence
+    print "Uso: $0 <entrada> <dir_fonte> <dir_alvo>\n";
-has to be tagged with initial (<s>) and final (</s>) tags. The initial sentence
+    exit 1;
-tag (<s>) has an attribute (snum) whose value is an identifier for this
+};
-sentence. Parallel sentences have the same identifier in source and target files.
+my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);
- Example:
-  Source sentence
-  <s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>
-  Target sentence (translation of source sentence identified as 1)
-  <s snum=1>targettoken1 targettoken2 ... targettokenn</s>
+$entrada = shift(@ARGV);
-Each token in each sentence has to be separated by a white space as shown above.
+$dirfonte = shift(@ARGV);
-Each token can have at most 5 pieces of information:
+$diralvo = shift(@ARGV);
+if ($dirfonte !~ /\/$/) {
-. sur: the surface form of a word or a special character, that is,
+        $dirfonte .= '/';
-        the token as it was found in the original sentences. For example: houses,
+}
-        living and .
+if ($diralvo !~ /\/$/) {
+        $diralvo .= '/';
+}
+mkdir($dirfonte);
-. bas: the lemma of a word or a special character, a number, etc. when
+mkdir($diralvo);
-        it was tagged by the PoS tagger. For example: house, live and .
+print STDERR "Dir fonte: $dirfonte\n";
-. pos: PoS of lexical item according to the PoS tagger. The words unknown
+print STDERR "Dir alvo: $diralvo\n";
-        by the tagger (not tagged) and many special characters do not have this
-        information. For example: n (noun), vblex (verb) or nothing.
+# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
-. atr: the value of each morphological attribute of a PoS tag. Each attribute
+# etiquetados_pos/es/ES-ci-abr03_01
-        value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).
+# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 })
+$ci = 0;
-. ali: a sequence of one or more numbers (separated by "_") referring to the
+$fonte = $alvo = "";
-        positions of aligned items in the parallel sentences. For example: 14, 3, 7_8, 0.
+$sent = -1;
+open(ARQ,$entrada) or die "Nao eh possivel abrir o arquivo $entrada\n";
-        This information is derived from preprocessing the parallel texts with at
+while (<ARQ>) {
-        least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).
+        s/\n//g;
+        #if (/^([^\s]+\/)+([^\/]+)$/) {
+        if  ($ci == 0) {
+                if (($fonte ne "") && ($alvo ne "")) {
+                        close OUTF;
+                        close OUTA;
+                }
+                $sent = 0;
+                $alvo = $diralvo.$entrada.$2;
+                #print "$alvo\n";
+                #$alvo =~ s/\.\w+[\n\s]*$//g; # remove a extensao original
+                $alvo .= '.al';                         # e poe .al
+                #print STDERR "Formatando arquivos $1 e ";
+                #$t = <ARQ>;
+                #if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) {
+                        $fonte = $dirfonte.$entrada.$1;
+                        #$fonte =~ s/\.\w+[\n\s]*$//g;  # remove a extensao original
+                        $fonte .= '.al';                                # e poe .al
+                #}
+                #print STDERR "$1\n";
+                #print STDERR "fonte: $fonte\n";
+                #print STDERR "alvo: $alvo\n";
+                open(OUTF,">$fonte") or die "Nao eh possivel abrir o arquivo $fonte\n";
+                open(OUTA,">$alvo") or die "Nao eh possivel abrir o arquivo $alvo\n";
+                $ci = 1;
+        }
+        elsif (/^\#/) {
+                $sent++;
+                next;
+        }
+        else {
+                s/\n//;
+                @talvo = split(/ /,$_);
+                $_ = <ARQ>;
+                s/\n//;
+# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })
+                @tfonte = split(/\s+\}\)\s*/,$_);
+# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
+                ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
+                @ali = split(/\s/,$al);
+                while ($#ali >= 0) {
+                        $talvo[shift(@ali)-1] .= ":0";
+                }
+                $i = 0;
+                while ($i <= $#tfonte) {
+                        ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
+                        if ($al =~ /\d+/) {
+                                @ali = split(/\s/,$al);
+                                $tfonte[$i] = $t.":".join("_",@ali);
+                                while ($#ali >= 0) {
+                                        if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) {
+                                                $talvo[$ali[0]-1] .= ":".($i+1);
+                                        } else {
+                                                $talvo[$ali[0]-1] .= "_".($i+1);
+                                        }
+                                        shift(@ali);
+                                }
+                        } else {
+                                $tfonte[$i] =~ s/\s*\(\{/:0/g;
+                        }
+                        $i++;
+                }
+                map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo);
-  The tokens are formatted as shown below:
+                print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
-. \*sup/sup:ali
+                print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
-     Unknown words. For example: *piquia/piquia:4
-. sup:ali
+        }
+}
-     Special characters not tagged by the PoS tagger. For example: ":27
+close OUTF;
-. sup/C[\+C]*:ali
+close OUTA;
-     Other words and special characters tagged by the PoS tagger, in which
+</pre>
-     C = base<pos>A* and
-     A = [attribute]+
-     For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
-     cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25
-  Example of input parallel sentences:
-  Portuguese
-  <s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>
-  English
-  <s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><pl>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>
+==Feature requests==
+* Ability to specify constraints -- e.g. only allow nouns to be translated by nouns
-OUTPUT FORMAT
+* Ability to turn off multiword generation altogether
+* Script to select "high quality" sentences from an aligned corpora -- e.g. strip out those with excess punctuation or numbers
-* Bilingual dictionaries are in an XML format very similar to that used by
+* Something to take into account an existing bilingual dictionary -- either as a bootstrap or something like this.
-Apertium open-source machine translation platform (http://apertium.sourceforge.net/)
-* Transfer rules are in a human-readable format, and a new module is being
-developed to put them in the Apertium's XML format
-REQUIREMENTS
-* ReTraTos needs Perl installed in the system.
-QUICK START
-) Download the package for retratos-VERSION.tar.gz
-) Unpack retratos and do ('#' means 'do that with root privileges'):
-   $ cd retratos-VERSION
-   $ ./configure
-   $ make
-   # make install
-) Use the dictionary inductor (ReTraTos_lex.pl)
-   USAGE: perl ReTraTos_lex.pl -s sorcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu]
-    -sourcefile|s sourcefile    file with examples in source language (required)
-    -targetfile|t targetfile    file with examples in target language (required)
-    -beginning|b  headerfile    file with the beginning of a bilingual dictionary (required)
-    -ending|e     footerfile    file with the ending of a bilingual dictionary (required)
-    -attrsfile|a  attfile       file with information about attributes (optional)
-    -multifreq|f  freqmwu       frequency threshold to filter multiword units (default=1)
-   Sample:
-   $ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50
-) Use the rule inductor (ReTraTos.pl)
-   USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v]
-    -sourcefile|s sourcefile  file with examples in source language (required)
-    -targetfile|t targetfile  file with examples in target language (required)
-    -type|ty      type        alignment type: 0, 1, 2 or 3 (all) (default=3)
-    -level|l      level       rules\' abstraction level(s) (default=pos)
-    -include_gra|ig inpos     PoS for which induce rules (default=all)
-    -exclude_gra|eg outpos    PoS for which do not induce rules (default=none)
-    -per_ident|pi percident   % for frequency threshold on pattern ident. (df=0.0015)
-    -filter|fi                determines if filter will be applied (default=no)
-    -per_filter|pf percfilt   % for frequency threshold on rule filtering (df=0.0015)
-    -sort|so                  determines if sorting will be done (default=no)
-    -remove|r                 remove auxiliary files
-    -verbose|v                verbose
-   Sample:
-   $ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so
-</pre>