Aligning a corpus with fast align
Revision as of 09:15, 9 December 2015
What you need
- A sentence-aligned parallel corpus
- Fast_align (get it here)
- Two apertium language packages
Process
First analyse the corpus with the language packages.
$ cat /tmp/udhr.kaz | apertium-destxt | lt-proc -w ~/source/apertium/languages/apertium-kaz/kaz.automorf.bin | cg-proc -n -1 ~/source/apertium/languages/apertium-kaz/kaz.rlx.bin | apertium-retxt > /tmp/udhr.kaz.tagged
$ cat /tmp/udhr.kir | apertium-destxt | lt-proc -w ~/source/apertium/languages/apertium-kir/kir.automorf.bin | cg-proc -n -1 ~/source/apertium/languages/apertium-kir/kir.rlx.bin | apertium-retxt > /tmp/udhr.kir.tagged
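Then remove superfluous tags. For lexical alignment, for example, case is not really interesting, so the commands below keep only the part-of-speech tag on nouns, adjectives and adverbs, and only the transitivity tag on verbs.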
$ cat /tmp/udhr.kaz.tagged | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' | apertium-pretransfer > /tmp/udhr.kaz.trimmed
$ cat /tmp/udhr.kir.tagged | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' | apertium-pretransfer > /tmp/udhr.kir.trimmed
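To see what the two sed expressions do, it can help to run them on a single line first. The following is a minimal sketch, assuming GNU sed (the `\|` and `\+` operators are GNU extensions). The function name `trim_tags` and the sample tokens are invented for illustration and are not real analyser output.

```shell
# Sketch only: trim_tags is a hypothetical helper and the tokens are made up.
# First sed: keep only the part-of-speech tag on nouns, adjectives and adverbs.
# Second sed: keep only the transitivity tag on verbs.
trim_tags() {
    sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' |
    sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g'
}

printf '%s\n' '^alma<n><pl><nom>$ ^kel<v><iv><past><p3><sg>$' | trim_tags
# prints: ^alma<n>$ ^kel<v><iv>$
```

A token that carries only a single tag is left untouched, because both expressions require at least one further tag after the one they keep.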
Create the input file for fast_align:
$ paste /tmp/udhr.kaz.trimmed /tmp/udhr.kir.trimmed | sed 's/ *\t */ ||| /g' > /tmp/udhr.kaz-kir.input
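fast_align expects one sentence pair per line, with the two sides separated by ` ||| `. Since `paste` silently pads the shorter file, it may be worth checking that both sides have the same number of lines before joining them. A minimal sketch, assuming GNU sed (for `\t`); the function name `make_fast_align_input` and the sample tokens are hypothetical:

```shell
# Sketch only: make_fast_align_input is a hypothetical helper.
# $1: source-side file, $2: target-side file, one sentence per line.
make_fast_align_input() {
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    # Refuse to build the input if the corpus is not line-aligned.
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $src_lines vs $tgt_lines" >&2
        return 1
    fi
    # Same join as above: the tab from paste becomes the ||| separator.
    paste "$1" "$2" | sed 's/ *\t */ ||| /g'
}

tmp=$(mktemp -d)
printf '%s\n' 'a<n> b<v><iv>' 'c<adj>' > "$tmp/src"
printf '%s\n' 'x<n> y<v><iv>' 'z<adj>' > "$tmp/tgt"
make_fast_align_input "$tmp/src" "$tmp/tgt"
# prints:
# a<n> b<v><iv> ||| x<n> y<v><iv>
# c<adj> ||| z<adj>
```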
Run fast_align:
$ ./fast_align -d -v -o -i /tmp/udhr.kaz-kir.input > /tmp/udhr.kaz-kir.align
$ ./fast_align -d -v -o -r -i /tmp/udhr.kaz-kir.input > /tmp/udhr.kir-kaz.align
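Each output line corresponds to one input line and holds links of the form `i-j`, where `i` is the zero-based position of a token on the left side and `j` the position of a token on the right side. To inspect the result, the links can be mapped back to the tokens. A minimal sketch using `paste` and `awk`; the function name `show_aligned_pairs` and the sample data are hypothetical:

```shell
# Sketch only: show_aligned_pairs is a hypothetical helper.
# $1: fast_align input file (left ||| right), $2: alignment file (i-j links).
show_aligned_pairs() {
    paste "$1" "$2" | awk -F'\t' '{
        # Split the sentence pair on the ||| separator, then into tokens.
        split($1, sides, / \|\|\| /)
        split(sides[1], src, " ")
        split(sides[2], tgt, " ")
        n = split($2, links, " ")
        for (k = 1; k <= n; k++) {
            split(links[k], ij, "-")
            # Links are zero-based; awk arrays from split() are one-based.
            print src[ij[1] + 1] "\t" tgt[ij[2] + 1]
        }
    }'
}

tmp=$(mktemp -d)
printf '%s\n' 'a b ||| x y' > "$tmp/input"
printf '%s\n' '0-0 1-1' > "$tmp/align"
show_aligned_pairs "$tmp/input" "$tmp/align"
# prints two tab-separated pairs, one per line: a x, then b y
```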
Symmetrise the alignments:
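As an illustration of what symmetrisation does, the simplest heuristic is intersection: keep only the links that appear in both directions. The sketch below is not the tool's own method, only a minimal awk version of that one heuristic. It assumes both alignment files list their links in the same left-right index order and have one line per sentence pair; the function name `intersect_alignments` is hypothetical.

```shell
# Sketch only: intersect_alignments is a hypothetical helper that implements
# plain intersection, assuming both files use the same i-j index order.
# $1: forward alignment file, $2: reverse alignment file.
intersect_alignments() {
    paste "$1" "$2" | awk -F'\t' '{
        split("", seen)                  # clear the set for this sentence
        n = split($1, fwd, " ")
        for (i = 1; i <= n; i++) seen[fwd[i]] = 1
        m = split($2, rev, " ")
        out = ""
        for (i = 1; i <= m; i++)
            if (rev[i] in seen) out = out (out == "" ? "" : " ") rev[i]
        print out
    }'
}

tmp=$(mktemp -d)
printf '%s\n' '0-0 1-1 2-1' > "$tmp/fwd"
printf '%s\n' '0-0 2-1 2-2' > "$tmp/rev"
intersect_alignments "$tmp/fwd" "$tmp/rev"
# prints: 0-0 2-1
```

Intersection gives high-precision but sparse alignments; heuristics that grow the intersection towards the union trade some precision for coverage.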