Difference between revisions of "Somewhere"


Latest revision as of 09:09, 8 November 2018

this should go somewhere on the wiki

We should treat NULs as hard separators: if we don't, apertium-apy (and thus www.apertium.org) risks sending output meant for person1 to person2. (I have an inkling there might still be bugs in apertium-transfer related to this.)

This also means we should be able to treat NULs as "record separators" when e.g. translating a corpus of individual sentences, where we don't want one sentence to affect the translation of the next.

If we at least handle NULs correctly in lt-proc and cg-proc, you can append a NUL after each linebreak (first deleting any NULs already in the corpus) and tag with the -z option to lt-proc and cg-proc:


   cat corpus.txt                     \
   | tr -d '\0'                       \
   | apertium-deshtml -n              \
   | sed 's/\[$/[][/; s/^]/]\x00/'    \
   | lt-proc -z -w 'tat.automorf.bin' \
   | cg-proc -z -w -1 'tat.rlx.bin'   \
   | tr -d '\0'                       \
   | apertium-rehtml-noent


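… finally deleting the NULs. (Note that each NUL has to appear right after the string "[][\n]" for tools like lt-proc to handle it correctly; that is what the sed step arranges. The \x00 escape in the sed replacement is a GNU sed extension.)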
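To see what the sed step does without running any Apertium tools, here is a minimal sketch that applies the same sed expression to made-up input. The sample text and the function name mark_records are only for illustration; the sample assumes the deformatter has wrapped each linebreak in a superblank, so that lines end in "[" and the next one starts with "]". GNU sed is assumed, for the \x00 escape.

```shell
# Same sed expression as in the pipeline above, wrapped in a (hypothetical) function.
# A "[" at end of line becomes "[][", and a "]" at start of line becomes "]" + NUL,
# so each linebreak superblank ends up as "[][\n]" followed by a NUL.
mark_records() {
  sed 's/\[$/[][/; s/^]/]\x00/'
}

# Made-up sample: two "sentences" (a and b), each followed by a [\n] superblank.
# od -c prints NUL bytes as \0 so you can check where they ended up.
printf 'a[\n]b[\n]' | mark_records | od -c
```

In the od output each sentence should be followed by "[][\n]" and then a \0.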

The same goes for full pipelines.

(Unfortunately, /usr/bin/apertium can't add the -z options for you; you'll have to grab the pipeline from modes/foo-bar.mode and insert them yourself.)
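Inserting the options by hand is the safe route, but as a rough sketch the edit can be scripted. The helper below (add_null_flush is a made-up name, not an Apertium tool) reads a pipeline as text on stdin and adds -z after lt-proc and cg-proc only, since those are the two programs discussed here. It assumes the pipeline has no "|" inside quoted arguments; any other program in your mode file needs checking by hand for its own null-flush option.

```shell
# Hypothetical helper: add -z to lt-proc and cg-proc in a pipeline given as text.
# Splits each line on "|", and in every segment that starts with one of the two
# program names (and has no -z yet) inserts " -z" right after the name.
add_null_flush() {
  awk 'BEGIN { FS = OFS = "|" }
       {
         for (i = 1; i <= NF; i++)
           if ($i !~ / -z( |$)/)
             sub(/^[ \t]*(lt-proc|cg-proc)/, "& -z", $i)
         print
       }'
}

# Example with a pipeline shaped like the one above (file names as in the example).
printf '%s\n' "lt-proc -w 'tat.automorf.bin' | cg-proc -w -1 'tat.rlx.bin'" | add_null_flush
```

With a real mode file you would run something like add_null_flush < modes/foo-bar.mode and then read through the result before using it.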