Difference between revisions of "ReTraTos"

Line 2: Line 2:
 
'''ReTraTos''' is a toolbox to build linguistic resources useful for machine translation (MT): bilingual dictionaries and transfer rules. The induction systems and open linguistic data can be used with the [[Apertium]] toolbox to build open-source MT systems.
 
'''ReTraTos''' is a toolbox to build linguistic resources useful for machine translation (MT): bilingual dictionaries and transfer rules. The induction systems and open linguistic data can be used with the [[Apertium]] toolbox to build open-source MT systems.
   
  +
==Bilingual dictionaries==
==Input format==
 
   
  +
This section describes how to use ReTraTos to create a bilingual dictionary for your Apertium language pair. You will need:
The input sentences need to be given in two separate files, for example <code>en.txt</code> for English and <code>pt.txt</code> for Portuguese.
 
   
  +
* An aligned [[corpus]] of the two languages. For any pair of european languages, the JRC-Acquis corpus is recommended.
;pt.txt
 
  +
* [[GIZA++]]
<pre><nowiki>
 
  +
* ReTraTos
<s snum=1>Os/O<det><def><m><pl>:1 alunos/aluno<n><m><pl>:2 do/de<pr>+o<det><def><m><sg>:3_4 colégio/colégio<n><m><sg>:5
 
  +
* a lot of patience
 
...
 
</nowiki></pre>
 
   
  +
===Preparing the corpus===
;en.txt
 
<pre><nowiki>
 
<s snum=1>The/The<det><def><sp>:1 students/student<n><pl>:2 of/of<pr>:3 the/the<det><def><sp>:3 school/school<n><pl>:5
 
   
  +
The corpus should be in two files, each with one sentence per line. For example,
...
 
</nowiki></pre>
 
   
 
;es.txt
==Usage==
 
  +
<pre>
  +
Reconociendo que , en particular , sería mutuamente beneficioso cooperar mediante el establecimiento de un programa común de investigaciones y de
  +
desarrollo ;
  +
Considerando que un acuerdo que establezca una cooperación en el ámbito de las utilizaciones pacíficas de la energía atómica iniciaría fructíferos
  +
intercambios de experiencia
 
...
  +
</pre>
 
;it.txt
  +
<pre>
  +
Riconoscendo in particolare che sarebbe loro reciproco vantaggio cooperare con lo stabilire un programma comune di ricerche e di
  +
sviluppo ;
  +
Considerando che un accordo inteso a stabilire una cooperazione nel campo degli usi pacifici dell'energia atomica darebbe inizio ad un proficuo
  +
scambio di esperienze
 
...
  +
</pre>
  +
  +
Depending on if you want to create entries for proper names, you could lower-case the whole corpus. Make sure that there are spaces in between any punctuation characters, otherwise the punctuation will be counted as part of the word.
  +
  +
===Tagging the corpus===
  +
  +
Tag both of the files using the apertium-tagger. So, for example for Spanish:
  +
  +
<pre>
  +
$ cat es.txt | apertium-destxt | lt-proc es-it.automorf.bin | apertium-tagger -g es-it.prob | apertium-retxt > es.tagged.txt &
  +
</pre>
  +
  +
After this, strip out the '^' and the '$' symbols from the es-tagged.txt file. This will result in lines that look something like:
  +
  +
<pre>
  +
Reconocer<vblex><ger> que<cnjsub> ,<cm> en<pr> particular<adj><mf><sg> ,<cm> ser<vbser><cni><p3><sg> mutuamente<adv>
  +
beneficioso<adj><m><sg> *cooperar mediante<pr> el<det><def><m><sg> establecimiento<n><m><sg> de<pr>
  +
uno<det><ind><m><sg> programa<n><m><sg> común<adj><mf><sg> de<pr> investigación<n><f><pl> y<cnjcoo>
  +
de<pr> desarrollo<n><m><sg> ;<sent>
  +
</pre>
  +
  +
===Aligning the corpus===
  +
{{see-also|Using GIZA++}}
  +
  +
Use GIZA++ to align both of the files, the instructions can be found in the page [[using GIZA++]]. Once the alignment has been made, you will end up with a file that ends in <code>.A3.final</code>, this is your alignment file. Next you will need to convert the alignment to the LIHLA format that ReTraTos uses. The script on the [[Talk:ReTraTos|talk page]] serves this purpose. For Spanish--Italian, it would be called thusly:
  +
  +
<pre>
  +
$ perl giza_to_lihla.pl es_it.aligned.A3.final ./es/ ./it/
  +
</pre>
  +
  +
This will put two files into the directories <code>./es/</code> and <code>./it/</code> which correspond to the lines in Spanish and Italian respectively. These LIHLA alignment files will end in <code>.al</code> and will look like the following:
  +
  +
<pre>
  +
<s snum=6>Reconocer<vblex><ger>:1 que<cnjsub>:0 ,<cm>:0 en<pr>:2 particular<adj><mf><sg>:3 ,<cm>:0 ser<vbser><cni><p3><sg>:5 mutuamente<adv>:7
  +
beneficioso<adj><m><sg>:8 *cooperar:9 mediante<pr>:10 el<det><def><m><sg>:11 establecimiento<n><m><sg>:12 de<pr>:0 uno<det><ind><m><sg>:13
  +
programa<n><m><sg>:14 común<adj><mf><sg>:15 de<pr>:16 investigación<n><f><pl>:17 y<cnjcoo>:18 de<pr>:19 desarrollo<n><m><sg>:20 ;<sent>:0</s>
  +
</pre>
   
===ReTraTos_lex===
+
===Running ReTraTos_lex===
   
You will need the header and footer of a bilingual dictionary in two separate files, for example, <code>dic_header.txt</code> and <code>dic_footer.txt</code> (see the examples in the package). Example sentences, in the format described above will be in the files <code>en.txt</code> and <code>pl.txt</code>.
+
You will need the header and footer of a bilingual dictionary in two separate files, for example, <code>dic_header.txt</code> and <code>dic_footer.txt</code> (see the examples in the package). Then the program for generating the dictionary (<code>ReTraTos_lex</code>) can be called like this:
   
 
<pre>
 
<pre>
$ ReTraTos_lex -s pt.txt -t en.txt -b dic_header.txt -e dic_footer.txt
+
$ ReTraTos_lex -s ./es/es_it.aligned.A3.final.al -t ./it/es_it.aligned.A3.final.al -b dic_header.txt -e dic_footer.txt
   
 
PRE-PROCESSAMENTO
 
PRE-PROCESSAMENTO
   
Reading the examples ... 100 examples read
+
Reading the examples ... 100000 examples read
Reading the examples ... 100 examples read
+
Reading the examples ... 100000 examples read
   
 
GERANDO LEXICO
 
GERANDO LEXICO
Line 47: Line 93:
 
</pre>
 
</pre>
   
The output file will be <code>ReTraTos_lex_ptXen_1.dix</code>.
+
The output file will be <code>.dix</code> file.
   
 
==See also==
 
==See also==

Revision as of 19:08, 29 March 2008

ReTraTos is a toolbox to build linguistic resources useful for machine translation (MT): bilingual dictionaries and transfer rules. The induction systems and open linguistic data can be used with the Apertium toolbox to build open-source MT systems.

Bilingual dictionaries

This section describes how to use ReTraTos to create a bilingual dictionary for your Apertium language pair. You will need:

  • An aligned corpus of the two languages. For any pair of European languages, the JRC-Acquis corpus is recommended.
  • GIZA++
  • ReTraTos
  • A lot of patience

Preparing the corpus

The corpus should be in two files, each with one sentence per line. For example:

es.txt
Reconociendo que , en particular , sería mutuamente beneficioso cooperar mediante el establecimiento de un programa común de investigaciones y de 
desarrollo ;
Considerando que un acuerdo que establezca una cooperación en el ámbito de las utilizaciones pacíficas de la energía atómica iniciaría fructíferos 
intercambios de experiencia 
...
it.txt
Riconoscendo in particolare che sarebbe loro reciproco vantaggio cooperare con lo stabilire un programma comune di ricerche e di 
sviluppo ;
Considerando che un accordo inteso a stabilire una cooperazione nel campo degli usi pacifici dell'energia atomica darebbe inizio ad un proficuo 
scambio di esperienze
...
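
Since the two files are meant to be parallel line by line, it can be worth checking that they contain the same number of lines before going any further. A minimal Python sketch of such a check; the function names are hypothetical and not part of Apertium, GIZA++ or ReTraTos:

```python
def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(src_path, trg_path):
    """Return (same_length, src_count, trg_count) for two corpus files."""
    src_count = count_lines(src_path)
    trg_count = count_lines(trg_path)
    return src_count == trg_count, src_count, trg_count


if __name__ == "__main__":
    import sys

    # Usage: python check_parallel.py es.txt it.txt
    if len(sys.argv) == 3:
        same, n_src, n_trg = check_parallel(sys.argv[1], sys.argv[2])
        print("source: %d lines, target: %d lines" % (n_src, n_trg))
        if not same:
            sys.exit("The two files are not the same length.")
```

An equal line count does not prove that the sentences really correspond, but a mismatch is a reliable sign that something went wrong during extraction.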

Depending on whether you want to create entries for proper names, you may want to lower-case the whole corpus. Make sure that every punctuation character is separated from the neighbouring words by spaces; otherwise the punctuation will be counted as part of the word.
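
The two preparation steps above could be sketched in Python roughly as follows. This is only an illustration: the function name is hypothetical, and leaving apostrophes and hyphens attached to words (so that forms such as dell'energia stay in one piece) is an assumption that may not suit every language pair.

```python
import re

# Any character that is not a letter, digit, underscore, whitespace,
# apostrophe or hyphen is treated as punctuation. \w is Unicode-aware in
# Python 3, so accented letters such as "ú" are kept inside the word.
# Note that this also splits numbers such as "1.5" into "1 . 5".
PUNCT = re.compile(r"([^\w\s'\-])")


def normalise(line, lowercase=True):
    """Space-separate punctuation and optionally lower-case one line."""
    line = PUNCT.sub(r" \1 ", line)
    line = " ".join(line.split())  # collapse runs of whitespace
    return line.lower() if lowercase else line


if __name__ == "__main__":
    import sys

    # Usage: python normalise.py < es.raw.txt > es.txt
    for raw in sys.stdin:
        print(normalise(raw))
```

Pass lowercase=False if you want to keep the original casing, for instance to retain proper names.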

Tagging the corpus

Tag both files using apertium-tagger. For example, for Spanish:

$ cat es.txt | apertium-destxt | lt-proc es-it.automorf.bin | apertium-tagger -g es-it.prob | apertium-retxt > es.tagged.txt &  

After this, strip out the '^' and '$' symbols from the es.tagged.txt file. This will result in lines that look something like:

Reconocer<vblex><ger> que<cnjsub> ,<cm> en<pr> particular<adj><mf><sg> ,<cm> ser<vbser><cni><p3><sg> mutuamente<adv> 
beneficioso<adj><m><sg> *cooperar mediante<pr> el<det><def><m><sg> establecimiento<n><m><sg> de<pr> 
uno<det><ind><m><sg> programa<n><m><sg> común<adj><mf><sg> de<pr> investigación<n><f><pl> y<cnjcoo> 
de<pr> desarrollo<n><m><sg> ;<sent>
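
One possible way to strip the delimiters is the short Python filter below. It is a sketch rather than an official Apertium tool: the function name is hypothetical, and it assumes that a '^' or '$' preceded by a backslash is an escaped literal character that should be kept.

```python
import re

# Match '^' or '$' unless it is preceded by a backslash (treated here as an
# escaped literal; this is a simplification of the stream format).
DELIMITER = re.compile(r"(?<!\\)[\^$]")


def strip_delimiters(line):
    """Remove the lexical-unit delimiters '^' and '$' from a tagged line."""
    return DELIMITER.sub("", line)


if __name__ == "__main__":
    import sys

    # Usage: python strip_delimiters.py < es.tagged.txt > es.stripped.txt
    for raw in sys.stdin:
        sys.stdout.write(strip_delimiters(raw))
```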

Aligning the corpus

See also: Using GIZA++

Use GIZA++ to align the two files; the instructions can be found on the page Using GIZA++. Once the alignment has been made, you will end up with a file that ends in .A3.final; this is your alignment file. Next, you will need to convert the alignment to the LIHLA format that ReTraTos uses. The script on the talk page serves this purpose. For Spanish-Italian, it would be called as follows:

$ perl giza_to_lihla.pl es_it.aligned.A3.final ./es/ ./it/

This will write two files, one into the directory ./es/ and one into ./it/, containing the Spanish and Italian lines respectively. These LIHLA alignment files will end in .al and will look like the following:

<s snum=6>Reconocer<vblex><ger>:1 que<cnjsub>:0 ,<cm>:0 en<pr>:2 particular<adj><mf><sg>:3 ,<cm>:0 ser<vbser><cni><p3><sg>:5 mutuamente<adv>:7 
beneficioso<adj><m><sg>:8 *cooperar:9 mediante<pr>:10 el<det><def><m><sg>:11 establecimiento<n><m><sg>:12 de<pr>:0 uno<det><ind><m><sg>:13 
programa<n><m><sg>:14 común<adj><mf><sg>:15 de<pr>:16 investigación<n><f><pl>:17 y<cnjcoo>:18 de<pr>:19 desarrollo<n><m><sg>:20 ;<sent>:0</s>
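
To inspect the converted files, a sentence in this format can be read back with a few lines of Python. The sketch below is based only on the example above and is not part of ReTraTos. It assumes that the number after the last colon is the aligned position in the other language, that 0 means "not aligned", and that several positions may be joined with underscores (for example 3_4). The function name is hypothetical.

```python
import re

# "<s snum=N>" ... optional "</s>"; DOTALL lets a sentence span wrapped lines.
SENTENCE = re.compile(r"^<s snum=(\d+)>(.*?)(?:</s>)?\s*$", re.DOTALL)
# A token, then ':' and one or more positions joined by '_'.
TOKEN = re.compile(r"(\S.*?):(\d+(?:_\d+)*)(?=\s|$)")


def parse_lihla(sentence):
    """Return (sentence number, [(token, [aligned positions]), ...])."""
    match = SENTENCE.match(sentence.strip())
    if match is None:
        raise ValueError("not a LIHLA sentence: %r" % sentence[:40])
    tokens = []
    for token, positions in TOKEN.findall(match.group(2)):
        aligned = [int(p) for p in positions.split("_") if int(p) != 0]
        tokens.append((token, aligned))
    return int(match.group(1)), tokens
```

For instance, parsing the start of the sentence above gives sentence number 6, with Reconocer<vblex><ger> aligned to position 1 and que<cnjsub> left unaligned.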

Running ReTraTos_lex

You will need the header and footer of a bilingual dictionary in two separate files, for example dic_header.txt and dic_footer.txt (see the examples in the package). The program that generates the dictionary (ReTraTos_lex) can then be called like this:

$ ReTraTos_lex -s ./es/es_it.aligned.A3.final.al -t ./it/es_it.aligned.A3.final.al -b dic_header.txt -e dic_footer.txt 

PRE-PROCESSAMENTO

        Reading the examples ...  100000 examples read
        Reading the examples ...  100000 examples read

GERANDO LEXICO

        Generating source-target dictionary ... OK
        Generating target-source dictionary ... OK
        Processing bilingual dictionary ... OK
        Generalizing bilingual dictionary ... OK
        Cleaning equal attributes ... OK

IMPRIMINDO LEXICO

        Printing bilingual dictionary ... OK

The output will be a .dix file.
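
To get a quick overview of what was induced, the entries of the resulting dictionary can be listed with Python's standard XML parser. The sketch below assumes that the output follows the usual Apertium bilingual dictionary layout, with each pair stored as <e><p><l>...</l><r>...</r></p></e> and tags written as <s n="..."/>; whether every entry produced by ReTraTos_lex has exactly this shape has not been verified here. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def side_text(element):
    """Render an <l> or <r> element as lemma followed by <tag> symbols."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "s":          # morphological tag
            parts.append("<%s>" % child.get("n"))
        elif child.tag == "b":        # blank inside a multiword
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts)


def read_entries(dix_path):
    """Return a list of (left, right) pairs found in a .dix file."""
    root = ET.parse(dix_path).getroot()
    entries = []
    for pair in root.iter("p"):
        left, right = pair.find("l"), pair.find("r")
        if left is not None and right is not None:
            entries.append((side_text(left), side_text(right)))
    return entries
```

Calling read_entries on the generated file and printing the first few pairs is a simple way to spot obviously wrong alignments before the dictionary is used in a language pair.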

See also

External links

Further reading

  • Helena M. Caseli, Maria das Graças V. Nunes, Mikel L. Forcada. (2008) "From free shallow monolingual resources to machine translation systems: easing the task", in Mixing Approaches To Machine Translation, MATMT2008, proceedings (Donostia, Spain, Feb. 14, 2008), pp. 41-48.
  • Helena M. Caseli, Maria das Graças V. Nunes, Mikel L. Forcada. (2008) "Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation". Machine Translation (to appear).