Difference between revisions of "Talk:ReTraTos"
(New page: <pre> DESCRIPTION ReTraTos package is composed of two bilingual resources induction programs: - ReTraTos.pl: induces rules from corpora - ReTraTos_lex.pl: induces bilingual dictionaries...) |
|||
Line 1: | Line 1: | ||
+ | ==Giza → LIHLA== |
||
+ | |||
<pre> |
<pre> |
||
+ | #!/usr/bin/perl |
||
− | DESCRIPTION |
||
+ | # Program GIZA_to_LIHLA |
||
− | |||
+ | # Input: lexically aligned file produced by GIZA++ |
||
− | ReTraTos package is composed of two bilingual resources induction programs: |
||
+ | # Output: the GIZA alignments as files in LIHLA format |
||
− | - ReTraTos.pl: induces rules from corpora |
||
+ | # Purpose: converts GIZA output into the LIHLA format |
||
− | - ReTraTos_lex.pl: induces bilingual dictionaries from corpora |
||
− | |||
− | At the moment there is no engine (in this package) to perform translation based |
||
− | on the induced resources. |
||
− | |||
− | INPUT FORMAT |
||
− | |||
− | Two parallel texts are used as input for both inductors. In this text each sentence |
||
− | has to be tagged with initial (<s>) and final (</s>) tags. The initial sentence |
||
− | tag (<s>) has an attribute (snum) whose value is an identifier for this |
||
− | sentence. Parallel sentences have the same identifier in source and target files. |
||
− | |||
− | Example: |
||
− | |||
− | Source sentence |
||
− | <s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s> |
||
− | |||
− | Target sentence (translation of source sentence identified as 1) |
||
− | <s snum=1>targettoken1 targettoken2 ... targettokenn</s> |
||
− | |||
− | Each token in each sentence has to be separated by a white space as shown above. |
||
− | Each token can have at most 5 pieces of information: |
||
− | |||
− | 1. sur: the surface form of a word or a special character, that is, |
||
− | the token as it was found in the original sentences. For example: houses, |
||
− | living and . |
||
− | |||
− | 2. bas: the lemma of a word or a special character, a number, etc. when |
||
− | it was tagged by the PoS tagger. For example: house, live and . |
||
− | |||
− | 3. pos: PoS of lexical item according to the PoS tagger. The words unknown |
||
− | by the tagger (not tagged) and many special characters do not have this |
||
− | information. For example: n (noun), vblex (verb) or nothing. |
||
− | |||
− | 4. atr: the value of each morphological attribute of a PoS tag. Each attribute |
||
− | value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund). |
||
− | |||
− | 5. ali: a sequence of one or more numbers (separated by "_") referring to the |
||
− | positions of aligned items in the parallel sentences. For example: 14, 3, 7_8, 0. |
||
− | |||
− | This information is derived from preprocessing the parallel texts with at |
||
− | least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali). |
||
− | |||
− | The tokens are formatted as shown below: |
||
− | |||
− | 1. \*sup/sup:ali |
||
− | Unknown words. For example: *piquia/piquia:4 |
||
− | 2. sup:ali |
||
− | Special characters not tagged by the PoS tagger. For example: ":27 |
||
− | 3. sup/C[\+C]*:ali |
||
− | Other words and special characters tagged by the PoS tagger, in which |
||
− | C = base<pos>A* and |
||
− | A = [attribute]+ |
||
− | For example: houses/house<n><pl>:14, living/live<vblex><ger>:3, |
||
− | cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25 |
||
− | |||
− | Example of input parallel sentences: |
||
− | |||
− | Portuguese |
||
− | <s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s> |
||
− | |||
− | English |
||
− | <s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><pl>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s> |
||
− | |||
− | |||
− | OUTPUT FORMAT |
||
− | |||
− | * Bilingual dictionaries are in an XML format very similar to that used by the |
||
− | Apertium open-source machine translation platform (http://apertium.sourceforge.net/) |
||
+ | use strict; |
||
− | * Transfer rules are in a human readable format and a new module is being |
||
+ | use locale; |
||
− | developed to put them in the Apertium's XML format |
||
+ | if ($#ARGV < 2) { |
||
− | REQUIREMENTS |
||
+ | print "Uso: $0 <entrada> <dir_fonte> <dir_alvo>\n"; |
||
+ | exit 1; |
||
+ | }; |
||
+ | my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci); |
||
− | * ReTraTos needs Perl installed in the system. |
||
+ | $entrada = shift(@ARGV); |
||
− | QUICK START |
||
+ | $dirfonte = shift(@ARGV); |
||
+ | $diralvo = shift(@ARGV); |
||
+ | if ($dirfonte !~ /\/$/) { |
||
− | 1) Download the package for retratos-VERSION.tar.gz |
||
+ | $dirfonte .= '/'; |
||
+ | } |
||
+ | if ($diralvo !~ /\/$/) { |
||
+ | $diralvo .= '/'; |
||
+ | } |
||
+ | mkdir($dirfonte); |
||
− | 2) Unpack retratos and do ('#' means 'do that with root privileges'): |
||
+ | mkdir($diralvo); |
||
− | $ cd retratos-VERSION |
||
− | $ ./configure |
||
− | $ make |
||
− | # make install |
||
+ | print STDERR "Dir fonte: $dirfonte\n"; |
||
− | 3) Use the dictionary inductor (ReTraTos_lex.pl) |
||
+ | print STDERR "Dir alvo: $diralvo\n"; |
||
− | |||
− | USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu] |
||
− | -sourcefile|s sourcefile file with examples in source language (required) |
||
− | -targetfile|t targetfile file with examples in target language (required) |
||
− | -beginning|b headerfile file with the beginning of a bilingual dictionary (required) |
||
− | -ending|e footerfile file with the ending of a bilingual dictionary (required) |
||
− | -attrsfile|a attfile file with information about attributes (optional) |
||
− | -multifreq|f freqmwu frequency threshold to filter multiword units (default=1) |
||
+ | # Sentence pair (1) source length 1 target length 1 alignment score : 0.977598 |
||
− | Sample: |
||
+ | # etiquetados_pos/es/ES-ci-abr03_01 |
||
+ | # NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 }) |
||
+ | $ci = 0; |
||
− | $ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50 |
||
+ | $fonte = $alvo = ""; |
||
+ | $sent = -1; |
||
+ | open(ARQ,$entrada) or die "Nao eh possivel abrir o arquivo $entrada\n"; |
||
− | 4) Use the rule inductor (ReTraTos.pl) |
||
+ | while (<ARQ>) { |
||
− | |||
+ | s/\n//g; |
||
− | USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v] |
||
+ | #if (/^([^\s]+\/)+([^\/]+)$/) { |
||
− | -sourcefile|s sourcefile file with examples in source language (required) |
||
+ | if ($ci == 0) { |
||
− | -targetfile|t targetfile file with examples in target language (required) |
||
− | + | if (($fonte ne "") && ($alvo ne "")) { |
|
− | + | close OUTF; |
|
+ | close OUTA; |
||
− | -include_gra|ig inpos PoS for which induce rules (default=all) |
||
− | + | } |
|
+ | $sent = 0; |
||
− | -per_ident|pi percident % for frequency threshold on pattern ident. (df=0.0015) |
||
− | + | $alvo = $diralvo.$entrada.$2; |
|
+ | #print "$alvo\n"; |
||
− | -per_filter|pf percfilt % for frequency threshold on rule filtering (df=0.0015) |
||
− | + | #$alvo =~ s/\.\w+[\n\s]*$//g; # remove the original extension |
|
− | + | $alvo .= '.al'; # and append .al |
|
− | + | #print STDERR "Formatando arquivos $1 e "; |
|
+ | #$t = <ARQ>; |
||
+ | #if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) { |
||
+ | $fonte = $dirfonte.$entrada.$1; |
||
+ | #$fonte =~ s/\.\w+[\n\s]*$//g; # remove the original extension |
||
+ | $fonte .= '.al'; # and append .al |
||
+ | #} |
||
+ | #print STDERR "$1\n"; |
||
+ | #print STDERR "fonte: $fonte\n"; |
||
+ | #print STDERR "alvo: $alvo\n"; |
||
+ | open(OUTF,">$fonte") or die "Nao eh possivel abrir o arquivo $fonte\n"; |
||
+ | open(OUTA,">$alvo") or die "Nao eh possivel abrir o arquivo $alvo\n"; |
||
+ | $ci = 1; |
||
+ | } |
||
+ | elsif (/^\#/) { |
||
+ | $sent++; |
||
+ | next; |
||
+ | } |
||
+ | else { |
||
+ | s/\n//; |
||
+ | @talvo = split(/ /,$_); |
||
+ | $_ = <ARQ>; |
||
+ | s/\n//; |
||
+ | # NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) |
||
+ | @tfonte = split(/\s+\}\)\s*/,$_); |
||
+ | # NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5, |
||
+ | ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte)); |
||
+ | @ali = split(/\s/,$al); |
||
+ | while ($#ali >= 0) { |
||
+ | $talvo[shift(@ali)-1] .= ":0"; |
||
+ | } |
||
+ | $i = 0; |
||
+ | while ($i <= $#tfonte) { |
||
+ | ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]); |
||
+ | if ($al =~ /\d+/) { |
||
+ | @ali = split(/\s/,$al); |
||
+ | $tfonte[$i] = $t.":".join("_",@ali); |
||
+ | while ($#ali >= 0) { |
||
+ | if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) { |
||
+ | $talvo[$ali[0]-1] .= ":".($i+1); |
||
+ | } else { |
||
+ | $talvo[$ali[0]-1] .= "_".($i+1); |
||
+ | } |
||
+ | shift(@ali); |
||
+ | } |
||
+ | } else { |
||
+ | $tfonte[$i] =~ s/\s*\(\{/:0/g; |
||
+ | } |
||
+ | $i++; |
||
+ | } |
||
+ | map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo); |
||
− | Sample: |
||
+ | print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n"; |
||
− | $ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so |
||
+ | print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n"; |
||
+ | } |
||
+ | } |
||
+ | close OUTF; |
||
+ | close OUTA; |
||
</pre> |
</pre> |
Revision as of 11:30, 20 March 2008
Giza → LIHLA
<pre>
#!/usr/bin/perl

# Program GIZA_to_LIHLA
# Input: lexically aligned file produced by GIZA++
# Output: the GIZA alignments as files in LIHLA format
# Purpose: converts GIZA output into the LIHLA format

use strict;
use locale;

if ($#ARGV < 2) {
	print "Uso: $0 <entrada> <dir_fonte> <dir_alvo>\n";
	exit 1;
}

my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);

$entrada = shift(@ARGV);
$dirfonte = shift(@ARGV);
$diralvo = shift(@ARGV);

# Make sure both output directories end with a slash
if ($dirfonte !~ /\/$/) {
	$dirfonte .= '/';
}
if ($diralvo !~ /\/$/) {
	$diralvo .= '/';
}

mkdir($dirfonte);
mkdir($diralvo);

print STDERR "Dir fonte: $dirfonte\n";
print STDERR "Dir alvo: $diralvo\n";

# Sample of the GIZA++ input:
# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
# etiquetados_pos/es/ES-ci-abr03_01
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 })

$ci = 0;	# becomes 1 once the output files are open
$fonte = $alvo = "";
$sent = -1;

open(ARQ,'<',$entrada) or die "Nao eh possivel abrir o arquivo $entrada\n";
while (<ARQ>) {
	s/\n//g;
	#if (/^([^\s]+\/)+([^\/]+)$/) {
	if ($ci == 0) {
		# Runs once, on the first line of the input (which is consumed here)
		if (($fonte ne "") && ($alvo ne "")) {
			close OUTF;
			close OUTA;
		}
		$sent = 0;
		# The original appended $2 and $1 here; with the regex above
		# commented out they are always undefined, so they are dropped.
		$alvo = $diralvo.$entrada;
		#print "$alvo\n";
		#$alvo =~ s/\.\w+[\n\s]*$//g; # remove the original extension
		$alvo .= '.al'; # and append .al
		#print STDERR "Formatando arquivos $1 e ";
		#$t = <ARQ>;
		#if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) {
		$fonte = $dirfonte.$entrada;
		#$fonte =~ s/\.\w+[\n\s]*$//g; # remove the original extension
		$fonte .= '.al'; # and append .al
		#}
		#print STDERR "$1\n";
		#print STDERR "fonte: $fonte\n";
		#print STDERR "alvo: $alvo\n";
		open(OUTF,'>',$fonte) or die "Nao eh possivel abrir o arquivo $fonte\n";
		open(OUTA,'>',$alvo) or die "Nao eh possivel abrir o arquivo $alvo\n";
		$ci = 1;
	}
	elsif (/^\#/) {
		# "# Sentence pair ..." header: start of the next sentence pair
		$sent++;
		next;
	}
	else {
		# Plain sentence line, followed by the alignment line
		s/\n//;
		@talvo = split(/ /,$_);
		$_ = <ARQ>;
		s/\n//;
		# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })
		@tfonte = split(/\s+\}\)\s*/,$_);
		# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
		# The first entry is NULL: tokens aligned to it get ":0"
		($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
		@ali = split(/\s/,$al);
		while ($#ali >= 0) {
			$talvo[shift(@ali)-1] .= ":0";
		}
		$i = 0;
		while ($i <= $#tfonte) {
			($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
			if ($al =~ /\d+/) {
				@ali = split(/\s/,$al);
				$tfonte[$i] = $t.":".join("_",@ali);
				# Record the back-link (position $i+1) on every aligned token
				while ($#ali >= 0) {
					if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) {
						$talvo[$ali[0]-1] .= ":".($i+1);
					} else {
						$talvo[$ali[0]-1] .= "_".($i+1);
					}
					shift(@ali);
				}
			} else {
				# No alignment for this token
				$tfonte[$i] =~ s/\s*\(\{/:0/g;
			}
			$i++;
		}
		# Any token still lacking an alignment suffix gets ":0"
		for (@talvo) {
			$_ .= ":0" unless /\:\d+(\_\d+)*$/;
		}
		print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
		print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
	}
}
close OUTF;
close OUTA;
</pre>
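The core of the conversion can be tried on a single sentence pair without touching any files. The following is only a minimal sketch, not part of the script above: the helper name giza_to_lihla and the example sentence are made up for illustration, and it uses a single regular expression instead of the chain of split calls. It takes the plain sentence line and the "NULL ({ ... })" alignment line of one GIZA++ record and returns the two token strings that would go between the <s> tags.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only, not in the original script).
#   $plain_line: the plain sentence line of a GIZA++ record
#   $align_line: the "NULL ({ ... }) word ({ ... }) ..." line of the record
# Returns (annotated alignment-line tokens, annotated plain-line tokens).
sub giza_to_lihla {
    my ($plain_line, $align_line) = @_;
    my @plain = split ' ', $plain_line;
    my @links = map { [] } @plain;    # back-links collected per plain token
    my @annotated;
    my $pos = 0;                      # entry 0 of the alignment line is NULL
    while ($align_line =~ /(\S+)\s+\(\{([\d\s]*)\}\)/g) {
        my ($word, $nums) = ($1, $2);
        my @idx = split ' ', $nums;   # 1-based positions in the plain line
        if ($pos > 0) {
            # word:positions, or word:0 when nothing is aligned
            push @annotated, $word . ':' . (@idx ? join('_', @idx) : 0);
            push @{ $links[$_ - 1] }, $pos for @idx;
        }
        # Tokens listed under NULL get no back-link and end up as ":0"
        $pos++;
    }
    my @out = map {
        $plain[$_] . ':' . (@{ $links[$_] } ? join('_', @{ $links[$_] }) : 0)
    } 0 .. $#plain;
    return (join(' ', @annotated), join(' ', @out));
}

# Made-up record, modelled on the sample comment in the script
my ($src, $tgt) = giza_to_lihla(
    'globo analizar la atmosfera tropical',
    'NULL ({ 3 }) balao ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })',
);
print "<s snum=0>$src</s>\n";
print "<s snum=0>$tgt</s>\n";
```

For this record the first line carries balao:1 analisar:2 atmosfera:4 tropical:5 and the second carries globo:1 analizar:2 la:0 atmosfera:3 tropical:4, which is the "token:ali" form the ReTraTos inductors expect.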