Difference between revisions of "Talk:ReTraTos"
(New page: <pre> DESCRIPTION ReTraTos package is composed of two bilingual resources induction programs: - ReTraTos.pl: induces rules from corpora - ReTraTos_lex.pl: induces bilingual dictionaries...) |
|||
Line 1: | Line 1: | ||
==Giza → LIHLA== |
|||
<pre> |
<pre> |
||
#!/usr/bin/perl |
|||
DESCRIPTION |
|||
# Program GIZA_to_LIHLA |
|||
# Input: lexically aligned file produced by GIZA++ |
|||
The ReTraTos package consists of two programs that induce bilingual resources: |
|||
# Output: files aligned by GIZA++, in the LIHLA format |
|||
- ReTraTos.pl: induces rules from corpora |
|||
# Function: converts the GIZA++ output to the LIHLA format |
|||
- ReTraTos_lex.pl: induces bilingual dictionaries from corpora |
|||
At the moment there is no engine (in this package) to perform translation based
on the induced resources.

INPUT FORMAT

Two parallel texts are used as input for both inductors. In these texts, each sentence
has to be tagged with initial (<s>) and final (</s>) tags. The initial sentence
tag (<s>) has an attribute (snum) whose value is an identifier for this
sentence. Parallel sentences have the same identifier in the source and target files.

Example:

Source sentence
<s snum=1>sourcetoken1 sourcetoken2 ... sourcetokenn</s>

Target sentence (translation of source sentence identified as 1)
<s snum=1>targettoken1 targettoken2 ... targettokenn</s>

The tokens in each sentence have to be separated by a white space, as shown above.

Each token can have at most 5 pieces of information:

1. sur: the surface form of a word or a special character, that is,
   the token as it was found in the original sentences. For example: houses,
   living and .
2. bas: the lemma of a word or a special character, a number, etc. when
   it was tagged by the PoS tagger. For example: house, live and .
3. pos: PoS of the lexical item according to the PoS tagger. Words unknown
   to the tagger (not tagged) and many special characters do not have this
   information. For example: n (noun), vblex (verb) or nothing.
4. atr: the value of each morphological attribute of a PoS tag. Each attribute
   value has to be between "<" and ">". For example: <pl> (plural), <ger> (gerund).
5. ali: a sequence of one or more numbers (separated by "_") referring to the
   positions of aligned items in the parallel sentences. For example: 14, 3, 7_8, 0.

This information is derived from preprocessing the parallel texts with at
least 2 tools: a PoS tagger (bas, pos and atr) and a lexical aligner (ali).

The tokens are formatted as shown below:

1. \*sup/sup:ali
   Unknown words. For example: *piquia/piquia:4
2. sup:ali
   Special characters not tagged by the PoS tagger. For example: ":27
3. sup/C[\+C]*:ali
   Other words and special characters tagged by the PoS tagger, in which
   C = base<pos>A* and
   A = [attribute]+
   For example: houses/house<n><pl>:14, living/live<vblex><ger>:3,
   cannot/can<vaux><pres>+not<adv>:7_8, ,/,<cm>:25

Example of input parallel sentences:

Portuguese
<s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 mais/mais<adv>:5 antigo/antigo<adj><m><sg>:5 colégio/colégio<n><m><sg>:6 de/de<pr>:7 São_Paulo/São_Paulo<np><loc>:8_9 </s>

English
<s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 oldest/old<adj><sint><sup>:4_5 school/school<n><pl>:6 of/of<pr>:7 *São/São:8 *Paulo/Paulo:8 </s>

OUTPUT FORMAT

* Bilingual dictionaries are in an XML format very similar to that used by
  the Apertium open-source machine translation platform (http://apertium.sourceforge.net/)
|||
use strict; |
|||
* Transfer rules are in a human-readable format, and a new module is being |
|||
use locale; |
|||
developed to put them in Apertium's XML format |
|||
if ($#ARGV < 2) { |
|||
REQUIREMENTS |
|||
print "Usage: $0 <input> <source_dir> <target_dir>\n"; |
|||
exit 1; |
|||
}; |
|||
my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci); |
|||
* ReTraTos needs Perl installed in the system. |
|||
$entrada = shift(@ARGV); |
|||
QUICK START |
|||
$dirfonte = shift(@ARGV); |
|||
$diralvo = shift(@ARGV); |
|||
if ($dirfonte !~ /\/$/) { |
|||
1) Download the package retratos-VERSION.tar.gz |
|||
$dirfonte .= '/'; |
|||
} |
|||
if ($diralvo !~ /\/$/) { |
|||
$diralvo .= '/'; |
|||
} |
|||
mkdir($dirfonte); |
|||
2) Unpack retratos and do ('#' means 'do that with root privileges'): |
|||
mkdir($diralvo); |
|||
$ cd retratos-VERSION |
|||
$ ./configure |
|||
$ make |
|||
# make install |
|||
print STDERR "Source dir: $dirfonte\n"; |
|||
3) Use the dictionary inductor (ReTraTos_lex.pl) |
|||
print STDERR "Target dir: $diralvo\n"; |
|||
USAGE: perl ReTraTos_lex.pl -s sourcefile -t targetfile -b headerfile -e footerfile [-a attfile] [-f freqmwu] |
|||
-sourcefile|s sourcefile file with examples in source language (required) |
|||
-targetfile|t targetfile file with examples in target language (required) |
|||
-beginning|b headerfile file with the beginning of a bilingual dictionary (required) |
|||
-ending|e footerfile file with the ending of a bilingual dictionary (required) |
|||
-attrsfile|a attfile file with information about attributes (optional) |
|||
-multifreq|f freqmwu frequency threshold to filter multiword units (default=1) |
|||
# Sentence pair (1) source length 1 target length 1 alignment score : 0.977598 |
|||
Sample: |
|||
# etiquetados_pos/es/ES-ci-abr03_01 |
|||
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 }) |
|||
$ci = 0; |
|||
$ perl ReTraTos_lex.pl -s test/pt.txt -t test/en.txt -b test/dic_header.txt -e test/dic_footer.txt -f 50 |
|||
$fonte = $alvo = ""; |
|||
$sent = -1; |
|||
open(ARQ,$entrada) or die "Cannot open file $entrada\n"; |
|||
4) Use the rule inductor (ReTraTos.pl) |
|||
while (<ARQ>) { |
|||
s/\n//g; |
|||
USAGE: perl ReTraTos.pl -s sourcefile -t targetfile [-ty type] [-l level] [-ig inpos] [-eg outpos] [-pi percident] [-fi] [-pf percfilt] [-so] [-r] [-v] |
|||
#if (/^([^\s]+\/)+([^\/]+)$/) { |
|||
-sourcefile|s sourcefile file with examples in source language (required) |
|||
if ($ci == 0) { |
|||
-targetfile|t targetfile file with examples in target language (required) |
|||
if (($fonte ne "") && ($alvo ne "")) { |
|||
close OUTF; |
|||
close OUTA; |
|||
-include_gra|ig inpos PoS for which to induce rules (default=all) |
|||
} |
|||
$sent = 0; |
|||
-per_ident|pi percident % for frequency threshold on pattern ident. (df=0.0015) |
|||
$alvo = $diralvo.$entrada.$2; |
|||
#print "$alvo\n"; |
|||
-per_filter|pf percfilt % for frequency threshold on rule filtering (df=0.0015) |
|||
#$alvo =~ s/\.\w+[\n\s]*$//g; # strip the original extension |
|||
$alvo .= '.al'; # and append .al |
|||
#print STDERR "Formatting files $1 and "; |
|||
#$t = <ARQ>; |
|||
#if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) { |
|||
$fonte = $dirfonte.$entrada.$1; |
|||
#$fonte =~ s/\.\w+[\n\s]*$//g; # strip the original extension |
|||
$fonte .= '.al'; # and append .al |
|||
#} |
|||
#print STDERR "$1\n"; |
|||
#print STDERR "fonte: $fonte\n"; |
|||
#print STDERR "alvo: $alvo\n"; |
|||
open(OUTF,">$fonte") or die "Cannot open file $fonte\n"; |
|||
open(OUTA,">$alvo") or die "Cannot open file $alvo\n"; |
|||
$ci = 1; |
|||
} |
|||
elsif (/^\#/) { |
|||
$sent++; |
|||
next; |
|||
} |
|||
else { |
|||
s/\n//; |
|||
@talvo = split(/ /,$_); |
|||
$_ = <ARQ>; |
|||
s/\n//; |
|||
# NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 }) |
|||
@tfonte = split(/\s+\}\)\s*/,$_); |
|||
# NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5, |
|||
($t,$al) = split(/\s\(\{\s*/,shift(@tfonte)); |
|||
@ali = split(/\s/,$al); |
|||
while ($#ali >= 0) { |
|||
$talvo[shift(@ali)-1] .= ":0"; |
|||
} |
|||
$i = 0; |
|||
while ($i <= $#tfonte) { |
|||
($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]); |
|||
if ($al =~ /\d+/) { |
|||
@ali = split(/\s/,$al); |
|||
$tfonte[$i] = $t.":".join("_",@ali); |
|||
while ($#ali >= 0) { |
|||
if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) { |
|||
$talvo[$ali[0]-1] .= ":".($i+1); |
|||
} else { |
|||
$talvo[$ali[0]-1] .= "_".($i+1); |
|||
} |
|||
shift(@ali); |
|||
} |
|||
} else { |
|||
$tfonte[$i] =~ s/\s*\(\{/:0/g; |
|||
} |
|||
$i++; |
|||
} |
|||
map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo); |
|||
Sample: |
|||
print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n"; |
|||
$ perl ReTraTos.pl -s test/pt.txt -t test/en.txt -f 0.0007 -eg cm -fi -so |
|||
print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n"; |
|||
} |
|||
} |
|||
close OUTF; |
|||
close OUTA; |
|||
</pre> |
</pre> |
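The token layout described under INPUT FORMAT in the earlier revision can be checked with a small parser. The following is only a sketch: `parse_token` and the hash keys it returns are hypothetical names, not part of ReTraTos, and it assumes that "/" separates the surface form from the analysis and that "+" never occurs inside a lemma.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ReTraTos): splits one input token into
# its sur, bas, pos, atr and ali pieces. Returns undef if no ":ali" suffix.
sub parse_token {
    my ($token) = @_;

    # The alignment is the trailing ":n" or ":n_m_..." part.
    my ($body, $ali) = $token =~ /^(.*):(\d+(?:_\d+)*)$/ or return;

    # Everything before the first "/" is the surface form.
    my ($sur, $rest) = split m{/}, $body, 2;

    # A leading "*" on an analysed token marks a word unknown to the tagger.
    my $unknown = (defined $rest && $sur =~ s/^\*//) ? 1 : 0;

    my @parts;
    for my $part (defined $rest ? split(/\+/, $rest) : ()) {
        # Each part is a lemma followed by zero or more <tag> groups.
        my ($bas, $tags) = $part =~ /^([^<]*)((?:<[^>]*>)*)$/ or next;
        my @tags = $tags =~ /<([^>]*)>/g;
        my $pos  = shift @tags;          # first tag is the PoS, if any
        push @parts, { bas => $bas, pos => $pos, atr => [@tags] };
    }

    return {
        sur     => $sur,
        unknown => $unknown,
        ali     => [ split /_/, $ali ],
        parts   => \@parts,
    };
}

my $tok = parse_token('cannot/can<vaux><pres>+not<adv>:7_8');
print "$tok->{sur} -> ",
      join(' + ', map { "$_->{bas}/$_->{pos}" } @{ $tok->{parts} }),
      " aligned to ", join(',', @{ $tok->{ali} }), "\n";
# prints: cannot -> can/vaux + not/adv aligned to 7,8
```

Untagged special characters such as `":27` come back with an empty `parts` list, and unknown words such as `*piquia/piquia:4` come back with `unknown` set to 1 and no PoS.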
Revision as of 11:30, 20 March 2008
Giza → LIHLA
#!/usr/bin/perl
# Program GIZA_to_LIHLA
# Input: lexically aligned file produced by GIZA++
# Output: files aligned by GIZA++, in the LIHLA format
# Function: converts the GIZA++ output to the LIHLA format

use strict;
use locale;

if ($#ARGV < 2) {
    print "Usage: $0 <input> <source_dir> <target_dir>\n";
    exit 1;
};

my ($entrada,$fonte,$alvo,@tfonte,@talvo,$t,$al,@ali,$i,$sent,$dirfonte,$diralvo,$ci);

$entrada = shift(@ARGV);
$dirfonte = shift(@ARGV);
$diralvo = shift(@ARGV);

# Make sure both directory names end with a slash
if ($dirfonte !~ /\/$/) {
    $dirfonte .= '/';
}
if ($diralvo !~ /\/$/) {
    $diralvo .= '/';
}

mkdir($dirfonte);
mkdir($diralvo);

print STDERR "Source dir: $dirfonte\n";
print STDERR "Target dir: $diralvo\n";

# Sample of the GIZA++ input:
# # Sentence pair (1) source length 1 target length 1 alignment score : 0.977598
# etiquetados_pos/es/ES-ci-abr03_01
# NULL ({ }) etiquetados_pos/pt/ParC-PB-o_RE-IF-F-ci-abr03_01 ({ 1 })

$ci = 0;
$fonte = $alvo = "";
$sent = -1;

open(ARQ,$entrada) or die "Cannot open file $entrada\n";

while (<ARQ>) {
    s/\n//g;
    #if (/^([^\s]+\/)+([^\/]+)$/) {
    if ($ci == 0) {
        # First line of the input: open the two output files
        if (($fonte ne "") && ($alvo ne "")) {
            close OUTF;
            close OUTA;
        }
        $sent = 0;
        # NOTE: the match that set $1 and $2 is commented out above,
        # so both are empty here
        $alvo = $diralvo.$entrada.$2;
        #print "$alvo\n";
        #$alvo =~ s/\.\w+[\n\s]*$//g; # strip the original extension
        $alvo .= '.al'; # and append .al
        #print STDERR "Formatting files $1 and ";
        #$t = <ARQ>;
        #if ($t =~ /^.+\/([^\/]+)\s+\(\{.+\}\).*$/) {
        $fonte = $dirfonte.$entrada.$1;
        #$fonte =~ s/\.\w+[\n\s]*$//g; # strip the original extension
        $fonte .= '.al'; # and append .al
        #}
        #print STDERR "$1\n";
        #print STDERR "fonte: $fonte\n";
        #print STDERR "alvo: $alvo\n";
        open(OUTF,">$fonte") or die "Cannot open file $fonte\n";
        open(OUTA,">$alvo") or die "Cannot open file $alvo\n";
        $ci = 1;
    }
    elsif (/^\#/) {
        # "# Sentence pair ..." header: start of the next sentence pair
        $sent++;
        next;
    }
    else {
        # Current line: target tokens; next line: source tokens with alignments
        s/\n//;
        @talvo = split(/ /,$_);
        $_ = <ARQ>;
        s/\n//;
        # NULL ({ 3 }) balão ({ 1 }) analisar ({ 2 }) atmosfera ({ 4 }) tropical ({ 5 })
        @tfonte = split(/\s+\}\)\s*/,$_);
        # NULL ({ 3,balão ({ 1,analisar ({ 2,atmosfera ({ 4,tropical ({ 5,
        # Target tokens aligned to NULL get alignment 0
        ($t,$al) = split(/\s\(\{\s*/,shift(@tfonte));
        @ali = split(/\s/,$al);
        while ($#ali >= 0) {
            $talvo[shift(@ali)-1] .= ":0";
        }
        $i = 0;
        while ($i <= $#tfonte) {
            ($t,$al) = split(/\s\(\{\s*/,$tfonte[$i]);
            if ($al =~ /\d+/) {
                @ali = split(/\s/,$al);
                $tfonte[$i] = $t.":".join("_",@ali);
                # Record the source position on every aligned target token
                while ($#ali >= 0) {
                    if (($talvo[$ali[0]-1] !~ /:/) || ($talvo[$ali[0]-1] eq ":")) {
                        $talvo[$ali[0]-1] .= ":".($i+1);
                    } else {
                        $talvo[$ali[0]-1] .= "_".($i+1);
                    }
                    shift(@ali);
                }
            } else {
                # Source token without alignment
                $tfonte[$i] =~ s/\s*\(\{/:0/g;
            }
            $i++;
        }
        # Any target token still lacking an alignment suffix gets ":0"
        map($_ !~ /\:\d+(\_\d+)*$/ ? $_ .= ":0" : (),@talvo);
        print OUTF "<s snum=$sent>",join(" ",@tfonte),"</s>\n";
        print OUTA "<s snum=$sent>",join(" ",@talvo),"</s>\n";
    }
}

close OUTF;
close OUTA;
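The core of the conversion above, turning one GIZA++ sentence pair into two token strings with `:ali` suffixes, can be isolated in a function, which makes it easier to try on a single pair. This is a minimal sketch and not the script's own code: `giza_pair_to_lihla` is a hypothetical name, the tokens `s1`..`s4` and `t1`..`t5` are made-up placeholders, and a single regular expression replaces the script's `split` calls.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): converts one GIZA++
# sentence pair into two LIHLA-style token strings.
#   $plain_line: tokens of the sentence that GIZA++ prints without alignments
#   $align_line: "NULL ({ ... }) tok ({ ... }) ..." for the other sentence
sub giza_pair_to_lihla {
    my ($plain_line, $align_line) = @_;

    my @plain = split ' ', $plain_line;
    my @links = map { [] } @plain;   # per plain token: aligned positions

    # Collect every "token ({ positions })" group, in order.
    my @groups;
    while ($align_line =~ /(\S+)\s+\(\{([\d\s]*)\}\)/g) {
        push @groups, [ $1, [ split ' ', $2 ] ];
    }
    shift @groups;   # drop NULL: tokens aligned to it simply stay unlinked

    my @aligned;
    for my $i (0 .. $#groups) {
        my ($token, $pos) = @{ $groups[$i] };
        # Positions are joined by "_"; 0 means "no alignment".
        push @aligned, $token . ':' . (@$pos ? join('_', @$pos) : 0);
        # Mirror each link on the other side (positions are 1-based).
        push @{ $links[$_ - 1] }, $i + 1 for @$pos;
    }

    my @plain_out = map {
        $plain[$_] . ':' . (@{ $links[$_] } ? join('_', @{ $links[$_] }) : 0)
    } 0 .. $#plain;

    return (join(' ', @aligned), join(' ', @plain_out));
}

my ($with_ali, $without_ali) = giza_pair_to_lihla(
    't1 t2 t3 t4 t5',
    'NULL ({ 3 }) s1 ({ 1 }) s2 ({ 2 }) s3 ({ 4 5 }) s4 ({ })',
);
print "$with_ali\n";      # prints: s1:1 s2:2 s3:4_5 s4:0
print "$without_ali\n";   # prints: t1:1 t2:2 t3:0 t4:3 t5:3
```

Wrapping each returned string in `<s snum=N>` and `</s>` gives the sentence lines that the script writes to its two output files.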