Difference between revisions of "Aligning a corpus with fast align"

From Apertium

Revision as of 09:11, 9 December 2015


What you need

  • A sentence-aligned parallel corpus
  • fast_align (get it from https://github.com/clab/fast_align)
  • Two Apertium language packages

Process

First analyse the corpus with the language packages.

$ cat /tmp/udhr.kaz | apertium-destxt | lt-proc -w ~/source/apertium/languages/apertium-kaz/kaz.automorf.bin | cg-proc -n -1 ~/source/apertium/languages/apertium-kaz/kaz.rlx.bin | apertium-retxt > /tmp/udhr.kaz.tagged
$ cat /tmp/udhr.kir | apertium-destxt | lt-proc -w ~/source/apertium/languages/apertium-kir/kir.automorf.bin | cg-proc -n -1 ~/source/apertium/languages/apertium-kir/kir.rlx.bin | apertium-retxt > /tmp/udhr.kir.tagged

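Then remove superfluous tags. For lexical alignment, for example, case is not really interesting, so the commands below keep only the part-of-speech tag on nouns, adjectives and adverbs, and only the transitivity tag on verbs.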

$ cat /tmp/udhr.kaz.tagged | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' | apertium-pretransfer > /tmp/udhr.kaz.trimmed
$ cat /tmp/udhr.kir.tagged | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' | apertium-pretransfer > /tmp/udhr.kir.trimmed
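
To see what the two sed expressions do, it can help to run them on a single line. The sketch below uses an invented sample in Apertium stream format (the lemmas and tags are placeholders, not real output of apertium-kaz) and leaves out apertium-pretransfer. It assumes GNU sed, since the expressions rely on \| and \+.

```shell
# Hypothetical analysed line; lemmas and tags are invented for illustration.
sample='^alma<n><pl><acc>$ ^kel<v><iv><past><p3><sg>$'

# The same two expressions as in the commands above (GNU sed syntax):
#  1. after <n>, <adj> or <adv>, drop every following tag
#  2. after <v><tv> or <v><iv>, drop every following tag
trimmed=$(printf '%s\n' "$sample" \
  | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' \
  | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g')

printf '%s\n' "$trimmed"
```

With GNU sed the noun should come out as ^alma<n>$ and the verb as ^kel<v><iv>$.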

Create the input file for fast_align:
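
One possible way to do this, as a sketch rather than the page's original command: fast_align reads one sentence pair per line, with the source and target sentences separated by " ||| ". The helper name make_fa_input and the output path /tmp/udhr.kaz-kir are assumptions made for illustration. The two trimmed files must have the same number of lines.

```shell
# Join two sentence-aligned files into fast_align's "source ||| target" format.
# make_fa_input is a hypothetical helper name, not part of fast_align.
make_fa_input() {
    # paste puts line N of both files on one line, separated by a tab;
    # awk then prints the two fields with " ||| " between them.
    paste "$1" "$2" | awk -F'\t' '{ print $1 " ||| " $2 }'
}

# Example call (the output path is an assumption):
# make_fa_input /tmp/udhr.kaz.trimmed /tmp/udhr.kir.trimmed > /tmp/udhr.kaz-kir
```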



Run fast_align:
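
A minimal sketch of this step, assuming fast_align is on your PATH and the input file from the previous step was saved as /tmp/udhr.kaz-kir (an illustrative path). The flags are the ones used in the fast_align README: -d, -o and -v select the recommended model options, and -r trains the model in the reverse direction. The wrapper name run_fast_align and the output file names are assumptions.

```shell
# Run fast_align in both directions on a "source ||| target" file.
# run_fast_align is a hypothetical wrapper name, not part of fast_align.
run_fast_align() {
    input=$1    # file in "source ||| target" format
    outdir=$2   # directory that receives the two alignment files
    fast_align -i "$input" -d -o -v    > "$outdir/forward.align" || return 1
    fast_align -i "$input" -d -o -v -r > "$outdir/reverse.align" || return 1
}

# Example call (paths are assumptions):
# run_fast_align /tmp/udhr.kaz-kir /tmp
```

Each output line corresponds to one sentence pair and holds i-j pairs, meaning that word i of the left-hand sentence (counting from zero) is aligned to word j of the right-hand sentence.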




Symmetrise the alignments:
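
A sketch of this step using atools, which is built together with fast_align. It assumes atools is on your PATH and that the forward and reverse alignments were saved as forward.align and reverse.align. The grow-diag-final-and heuristic is the one shown in the fast_align README; the wrapper name symmetrise and the output path are assumptions.

```shell
# Combine forward and reverse alignments into one symmetrised alignment.
# symmetrise is a hypothetical wrapper name, not part of fast_align.
symmetrise() {
    # $1 = forward alignment file, $2 = reverse alignment file
    atools -i "$1" -j "$2" -c grow-diag-final-and
}

# Example call (paths are assumptions):
# symmetrise /tmp/forward.align /tmp/reverse.align > /tmp/udhr.kaz-kir.sym
```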