Difference between revisions of "Somewhere"
Latest revision as of 09:09, 8 November 2018
this should go somewhere on the wiki
We should treat NULs as hard separators – if we don't, apertium-apy (and thus www.apertium.org) risks sending output meant for person1 to person2. (I have an inkling there might still be bugs in apertium-transfer related to this.)
This also means we should be able to treat NULs as "record separators" when e.g. translating a corpus of individual sentences, where we don't want one sentence to affect the translation of the next.
If we at least handle NULs correctly in lt-proc and cg-proc, you can append a NUL after each line break (first deleting any existing NULs in the corpus) and tag with the -z option to lt-proc/cg-proc:
 cat corpus.txt \
   | tr -d '\0' \
   | apertium-deshtml -n \
   | sed 's/\[$/[][/; s/^]/]\x00/' \
   | lt-proc -z -w 'tat.automorf.bin' \
   | cg-proc -z -w -1 'tat.rlx.bin' \
   | tr -d '\0' \
   | apertium-rehtml-noent
… finally deleting the NULs. (Note that each NUL has to appear after the string "[][\n]" for tools like lt-proc to handle it correctly; the sed step is what puts it there.)
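To see what the sed step does without having any Apertium tools installed, here is a small sketch. The sample input is made up: it only imitates text in which every line break sits inside square brackets, which is the shape the sed expression expects. The function name is hypothetical, and \x00 in the replacement assumes GNU sed.

```shell
# Hypothetical helper wrapping the sed expression from the pipeline.
# Assumes GNU sed (for \x00 in the replacement text).
add_nul_after_linebreak_blanks() {
  # 1st command: a line ending in "[" gets "][" appended, giving "[][".
  # 2nd command: a line starting with "]" gets a NUL right after that "]".
  sed 's/\[$/[][/; s/^]/]\x00/'
}

# Made-up sample: three pieces of text whose line breaks are wrapped as [\n].
# NULs are turned into @ at the end only so they are visible in a terminal.
printf 'one[\n]two[\n]three\n' \
  | add_nul_after_linebreak_blanks \
  | tr '\0' '@'
```

Each wrapped line break comes out as "[][", newline, "]", followed by the NUL (shown as @), so the NUL lands after "[][\n]" as the note requires.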
The same works for full pipelines.
(Unfortunately, /usr/bin/apertium can't add the -z's for you; you'll have to grab the pipeline from modes/foo-bar.mode and insert the -z's yourself.)
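Inserting the -z's by hand can be partly automated. The following is only a rough sketch under several assumptions: that the mode file holds a single shell pipeline, that the commands are invoked by their bare names, and that only lt-proc and cg-proc (the two tools named above) need the flag. The function name and the sample pipeline are hypothetical; a real pipeline will contain other tools, and whether each of them accepts -z should be checked before adding it to the list.

```shell
# Hypothetical helper: add " -z" after every lt-proc / cg-proc command name
# in a pipeline read from stdin. Other tools are left untouched.
add_z_flags() {
  sed -E 's/(^|[|[:space:]])(lt-proc|cg-proc)([[:space:]])/\1\2 -z\3/g'
}

# Made-up pipeline standing in for the contents of a .mode file.
echo "lt-proc 'foo-bar.automorf.bin' | cg-proc 'foo-bar.rlx.bin' | lt-proc -g 'foo-bar.autogen.bin'" \
  | add_z_flags
```

The result still has to be reviewed by eye, since a command called through a full path or a tool missing from the list would be skipped silently.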