Difference between revisions of "Aligning a corpus with fast align"
Latest revision as of 13:05, 21 January 2016
What you need
- A sentence-aligned parallel corpus
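- fast_align (get it from https://github.com/clab/fast_align)
- Two Apertium language packages
- fast_align merge script (get it from https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/ordlist-fra-fastalign.py)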
Process
Warning: On Mac, you will need to use the GNU versions of sed, awk etc.; the sed expressions below rely on GNU extensions such as \| and \+.
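A quick way to find out which sed is the GNU one is sketched below. It assumes GNU sed is installed either as sed or as gsed (the name commonly used for it on Mac); the helper name find_gnu_sed is made up for this example and is not part of Apertium or fast_align.

```shell
# find_gnu_sed: print the name of a GNU sed on this machine, or fail.
# Hypothetical helper; the candidate names "sed" and "gsed" are assumptions.
find_gnu_sed() {
  for candidate in sed gsed; do
    # GNU sed answers --version; BSD sed rejects the option.
    if "$candidate" --version 2>/dev/null | grep -q 'GNU'; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo 'no GNU sed found' >&2
  return 1
}

SED=$(find_gnu_sed) && echo "using $SED"
```

If this prints gsed, substitute gsed for sed in the commands on this page.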
First analyse the corpus with the language packages.
$ cat /tmp/udhr.kaz | apertium-destxt | lt-proc -w ~/source/apertium/languages/apertium-kaz/kaz.automorf.bin | cg-proc -n -1 ~/source/apertium/languages/apertium-kaz/kaz.rlx.bin | apertium-retxt > /tmp/udhr.kaz.tagged
$ cat /tmp/udhr.kir | apertium-destxt | lt-proc -w ~/source/apertium/languages/apertium-kir/kir.automorf.bin | cg-proc -n -1 ~/source/apertium/languages/apertium-kir/kir.rlx.bin | apertium-retxt > /tmp/udhr.kir.tagged
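Then remove superfluous tags; for lexical alignment, case (for example) is not really interesting. The commands keep only the part-of-speech tag on nouns, adjectives and adverbs, keep only <v><tv> or <v><iv> on verbs, put a space after each $ and collapse runs of spaces.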
$ cat /tmp/udhr.kaz.tagged | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\$/$ /g' | sed 's/  */ /g' | apertium-pretransfer > /tmp/udhr.kaz.trimmed
$ cat /tmp/udhr.kir.tagged | sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' | sed 's/\$/$ /g' | sed 's/  */ /g' | apertium-pretransfer > /tmp/udhr.kir.trimmed
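To see what the trimming does, the sed chain can be tried on a single line. This is only a sketch: the sample analyses are invented for illustration (they are not real output of apertium-kaz), apertium-pretransfer is left out, and the last expression is written with two spaces before the * so that it collapses runs of spaces. GNU sed is required.

```shell
# trim_tags: the sed chain from this step, without apertium-pretransfer.
trim_tags() {
  sed 's/\(<\(n\|adj\|adv\)>\)\(<[^>]\+>\)\+/\1/g' \
    | sed 's/\(<v><tv>\|<v><iv>\)\(<[^>]\+>\)\+/\1/g' \
    | sed 's/\$/$ /g' \
    | sed 's/  */ /g'
}

# Invented sample line: one noun and one intransitive verb.
sample='^alma<n><pl><nom>$ ^kel<v><iv><past><p3><sg>$'
trimmed=$(printf '%s\n' "$sample" | trim_tags)
printf '[%s]\n' "$trimmed"
# prints: [^alma<n>$ ^kel<v><iv>$ ]
```

Note the trailing space: every $ gets a space after it, including the last one on the line.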
Create the input file for fast_align:
$ paste /tmp/udhr.kaz.trimmed /tmp/udhr.kir.trimmed | sed 's/ *\t */ ||| /g' > /tmp/udhr.kaz-kir.input
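Each line of the input file should have the form "source ||| target". Before running the aligner it can be worth checking that no line is missing a side, since an empty side is likely to cause trouble. The check below is a sketch; check_input is a made-up name and the demo lines are invented.

```shell
# check_input FILE: print the numbers of lines that do not have exactly
# two non-empty sides separated by " ||| ". Hypothetical helper.
check_input() {
  awk -F ' [|][|][|] ' 'NF != 2 || $1 ~ /^ *$/ || $2 ~ /^ *$/ { print NR }' "$1"
}

# Demo on a throwaway file: line 1 is fine, line 2 has no source side,
# line 3 has no target side.
demo=$(mktemp)
printf '%s\n' '^a<n>$ ^b<n>$ ||| ^c<n>$' ' ||| ^d<n>$' '^e<n>$ ||| ' > "$demo"
bad_lines=$(check_input "$demo")
echo "$bad_lines"
rm -f "$demo"
```

On the real data this would be run as check_input /tmp/udhr.kaz-kir.input; no output means every line has both sides.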
Run fast_align in both directions (-r produces the reverse alignment):
$ ./fast_align -d -v -o -i /tmp/udhr.kaz-kir.input > /tmp/udhr.kaz-kir.align
$ ./fast_align -d -v -o -r -i /tmp/udhr.kaz-kir.input > /tmp/udhr.kir-kaz.align
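The alignment files contain one line per sentence pair, made up of i-j pairs where i is a zero-based token position on the left side of the input line and j a position on the right side. To eyeball the result, the pairs can be turned back into tokens. A minimal sketch; show_pairs is a made-up name and the demo data is invented.

```shell
# show_pairs INPUT ALIGN: print "left_token<TAB>right_token" for every
# alignment point. Hypothetical helper; assumes both files have the same
# number of lines.
show_pairs() {
  paste -d '\n' "$1" "$2" | awk '
    NR % 2 == 1 {                  # odd lines: the "source ||| target" text
      split($0, sides, " [|][|][|] ")
      split(sides[1], src, " ")
      split(sides[2], trg, " ")
      next
    }
    {                              # even lines: the alignment points
      for (k = 1; k <= NF; k++) {
        split($k, ij, "-")
        print src[ij[1] + 1] "\t" trg[ij[2] + 1]
      }
    }'
}

# Demo with one invented sentence pair and an invented alignment.
input=$(mktemp); align=$(mktemp)
echo '^a<n>$ ^b<adj>$ ^c<n>$ ||| ^x<n>$ ^y<n>$ ^z<adj>$' > "$input"
echo '0-0 1-2 2-1' > "$align"
pairs=$(show_pairs "$input" "$align")
echo "$pairs"
rm -f "$input" "$align"
```

On the real data: show_pairs /tmp/udhr.kaz-kir.input /tmp/udhr.kaz-kir.align | sort | uniq -c | sort -rn gives a rough frequency list of aligned pairs.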
Create the wordlist:
$ python3 ~/scripts/ordlist-fra-fastalign.py /tmp/udhr.kaz-kir.input /tmp/udhr.kaz-kir.align /tmp/udhr.kir-kaz.align
To symmetrise the alignments (atools prints the result to standard output):
$ ./atools -i /tmp/udhr.kaz-kir.align -j /tmp/udhr.kir-kaz.align -c grow-diag-final-and
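For intuition about what symmetrisation does, the simplest possible variant is plain intersection: keep only the points that both directions agree on. The sketch below is not grow-diag-final-and (which starts from the intersection and then adds further points), and it assumes both alignment files list their points in the same left-right order. intersect_alignments is a made-up name and the demo alignments are invented.

```shell
# intersect_alignments FWD REV: per sentence, print the alignment points
# present in both files. Hypothetical helper, illustration only.
intersect_alignments() {
  paste -d '\n' "$1" "$2" | awk '
    NR % 2 == 1 {                  # odd lines: forward alignment
      split("", fwd)               # clear the set from the previous sentence
      for (k = 1; k <= NF; k++) fwd[$k] = 1
      next
    }
    {                              # even lines: reverse alignment
      out = ""
      for (k = 1; k <= NF; k++)
        if ($k in fwd) out = out (out == "" ? "" : " ") $k
      print out
    }'
}

# Demo: the two directions disagree on the second token.
fwd_file=$(mktemp); rev_file=$(mktemp)
echo '0-0 1-1 2-2' > "$fwd_file"
echo '0-0 1-2 2-2' > "$rev_file"
common=$(intersect_alignments "$fwd_file" "$rev_file")
echo "$common"
# prints: 0-0 2-2
rm -f "$fwd_file" "$rev_file"
```

Intersection gives few but reliable points; heuristics such as grow-diag-final-and trade some of that precision for better coverage.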