Daemon
Keeping processes open and using them as daemons lets you avoid some startup time if you do a lot of short translation requests. This is useful if you want to use Apertium on a web server, where starting up all programs in the pipeline for every single request would lead to a lot of lag.
If you want a pre-packaged daemon solution, see Apertium services. This page gives details on how to use some of the same techniques yourself.
Background
Apertium is implemented as a set of separate programs, each performing its own task and communicating with the others in the Unix pipeline manner.
Each linguistic package contains a modes.xml file, which specifies which programs are invoked, in which order, and specifies the parameters and datafiles specific to that language pair. Each language pair can contain a number of modes; most of these are used for debugging each stage of the pipeline. At the moment, the modes are converted to a shell script, which is called by the apertium script.
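As a rough illustration of how a mode maps to a pipeline, the sketch below reads a modes.xml fragment and returns one argument list per program, in order. The embedded XML is a hand-written sample, not copied from any language pair: it assumes the usual layout of program elements with nested file elements, and the file names are taken from the examples on this page. The helper name read_pipeline is hypothetical.

```python
import shlex
import xml.etree.ElementTree as ET

# Illustrative modes.xml fragment (an assumption, not a real language pair's file).
SAMPLE_MODES = """\
<modes>
  <mode name="nn-nb" install="yes">
    <pipeline>
      <program name="lt-proc -w -e">
        <file name="nn-nb.automorf-no-cp.bin"/>
      </program>
      <program name="cg-proc -w">
        <file name="nn-nb.rlx.bin"/>
      </program>
    </pipeline>
  </mode>
</modes>
"""


def read_pipeline(modes_xml, mode_name):
    """Return one argv list per program of the named mode, in pipeline order."""
    root = ET.fromstring(modes_xml)
    for mode in root.iter("mode"):
        if mode.get("name") == mode_name:
            return [
                # The program name carries its flags; data files follow as arguments.
                shlex.split(program.get("name"))
                + [f.get("name") for f in program.findall("file")]
                for program in mode.iter("program")
            ]
    raise KeyError(mode_name)


for argv in read_pipeline(SAMPLE_MODES, "nn-nb"):
    print(argv)
```

Each returned list can be passed straight to subprocess.Popen, with the stdout of one program connected to the stdin of the next.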
Each program is also effectively implemented as a library; the main() function sets up the environment, parses arguments, and calls a function which performs the task at hand. One way of keeping daemons would be to call the programs through a C++ API, but this has not been implemented yet.
The method described below uses NUL flushing: we start each program as usual, but with an argument that tells it to flush its output, and not exit, when it sees a NUL byte. We send each text to the program with a NUL byte at the end, and get the output back without the program exiting.
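The exchange can be tried without any Apertium program installed. In the sketch below, a small Python child process stands in for a NUL-flushing program: it upper-cases each NUL-terminated request, writes the result followed by a NUL byte, flushes, and keeps running. The child and the exchange helper are made up for illustration; only the framing (write text plus NUL, read until NUL) reflects the method described above.

```python
import subprocess
import sys

# Hypothetical stand-in for a program started with a NUL-flush option.
# Raw string, so the child's source contains the escape \0, not a literal NUL.
CHILD = r"""
import sys
buf = bytearray()
while True:
    byte = sys.stdin.buffer.read(1)
    if not byte:                      # EOF: the parent closed our stdin
        break
    if byte == b"\0":                 # end of one request
        sys.stdout.buffer.write(bytes(buf).upper() + b"\0")
        sys.stdout.buffer.flush()     # flush, but do not exit
        buf.clear()
    else:
        buf += byte
"""


def exchange(proc, request):
    """Send one NUL-terminated request and read the NUL-terminated reply."""
    proc.stdin.write(request + b"\0")
    proc.stdin.flush()
    reply = bytearray()
    while True:
        byte = proc.stdout.read(1)
        if byte in (b"", b"\0"):      # EOF or end of reply
            return bytes(reply)
        reply += byte


proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
first = exchange(proc, b"ein krik")
second = exchange(proc, b"ein krok")   # same process, no restart
proc.stdin.close()                     # EOF lets the child exit
proc.stdout.close()
proc.wait()
print(first, second)
```

Both requests are served by the same child process, which is the point of the technique: the startup cost is paid once.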
NUL flushing
This section describes how to use the NUL flushing features.
bash
Here's a small set of scripts showing how to do NUL flushing in bash, using apertium-nn-nb as an example.
This is the "daemon"/server script:
    #!/bin/bash
    # file: daemon
    rm -f to from
    mkfifo to from
    while true; do
        lt-proc -zwe nn-nb.automorf-no-cp.bin <to >from &
        pid=$!
        # Ensure the FIFOs are open for the duration of the lt-proc process:
        exec 3>to
        exec 4<from
        wait $pid
        echo "restarting..." 1>&2
    done
This is the "client" script:
    #!/bin/bash
    # file: client
    exec 3>to
    exec 4<from
    cat >&3
    echo -e '\0' >&3
    while read -rd '' <&4; do echo -n "$REPLY"; break; done
From the apertium-nn-nb directory, run ./daemon in one terminal; then, in another terminal in the same directory, run:
    echo Det der med krikar og krokar | apertium-destxt | ./client
python3
Here's a small example in Python 3, using just two programs and hardcoded paths for simplicity:
    #!/usr/bin/env python3
    from subprocess import Popen, PIPE


    class translator():
        def __init__(self):
            # Start both programs once; -z makes them flush on NUL instead of exiting.
            self.pipeBeg = Popen(["lt-proc", "-z", "-e", "-w", "/l/n/nn-nb.automorf-no-cp.bin"],
                                 stdin=PIPE, stdout=PIPE)
            self.pipeEnd = Popen(["cg-proc", "-z", "-w", "/l/n/nn-nb.rlx.bin"],
                                 stdin=self.pipeBeg.stdout, stdout=PIPE)

        def translate(self, string):
            bstring = string
            if isinstance(string, str):
                bstring = bytes(string, 'utf-8')
            # Send the text followed by NUL, then flush so lt-proc sees it right away.
            self.pipeBeg.stdin.write(bstring)
            self.pipeBeg.stdin.write(b'\0')
            self.pipeBeg.stdin.flush()
            # Read the reply byte by byte until the NUL that ends it.
            char = self.pipeEnd.stdout.read(1)
            output = []
            while char and char != b'\0':
                output.append(char)
                char = self.pipeEnd.stdout.read(1)
            return b"".join(output)


    t = translator()
    print(t.translate("Det der med krikar og krokar[][\n]").decode('utf-8'))
    print(t.translate("Eg veit jo kva ein krok er,[][\n]").decode('utf-8'))
    print(t.translate("men kva er ein krik?[][\n]").decode('utf-8'))
    print(t.translate("Munk peikte på ein krik.[][\n]").decode('utf-8'))
    print(t.translate("– Det er ein krik, sa han.[][\n]").decode('utf-8'))
The __init__ method builds up the pipeline. Note that the -z argument is given to both lt-proc and cg-proc; the other arguments are the ones used in the modes.xml of the language pair.
The [][\n] is needed because the tools expect the NUL byte to appear after a blank (if you try without it, you'll see the last word get swallowed up). In a real example, you should run natural language text through apertium-destxt (or -deshtml) first. The formatters have a low enough startup time that you don't need to keep their processes open.
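For text that has already been deformatted, or for quick tests, the trailing-blank rule can be wrapped in a small helper. This is only a sketch: the name frame_request is hypothetical, and it assumes the [][\n] blank shown in the example above is the one you want before the NUL byte.

```python
# The blank that must precede the NUL byte, as used in the example above.
SUPERBLANK = b"[][\n]"


def frame_request(text):
    """Return text as UTF-8 bytes, ending in a blank followed by a NUL byte."""
    data = text.encode("utf-8") if isinstance(text, str) else bytes(text)
    if not data.endswith(SUPERBLANK):
        data += SUPERBLANK            # avoid the last word being swallowed
    return data + b"\0"


print(frame_request("men kva er ein krik?"))
```

The result can be written to the first program's stdin in one call, followed by a flush.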
Below is an example wrapper function that does the deformatting and reformatting:
    # Method of the translator class above.
    def translateHTML(self, string):
        bstring = string
        if isinstance(string, str):
            bstring = bytes(string, 'utf-8')
        # The formatters start quickly, so a fresh process per request is fine.
        deformat = Popen("apertium-deshtml", stdin=PIPE, stdout=PIPE)
        # communicate() sends the input, closes stdin and collects the output.
        translated = self.translate(deformat.communicate(bstring)[0])
        reformat = Popen("apertium-rehtml", stdin=PIPE, stdout=PIPE)
        return reformat.communicate(translated)[0]

    # […]
    print(t.translateHTML("– <em>Det</em> er ein krik, sa han").decode('utf-8'))
History
User:Wynand.winterbach, who developed apertium-dbus, started work towards daemon-like operation, adding the NUL flush option to lt-proc. User:Deadbeef added options to other Apertium programs during GSoC 2009.
Ideas
- Transfer, interchunk, and postchunk should reread the variables section (optimally, caching the location in the XML file on the first read).
- Reuse thread.c from memcached to handle worker threads
- Read the modes.xml file directly, and generate the pipeline from it
  - apertium-apy does this
- Where possible, link to the apertium functions directly, rather than spawning separate processes (though that will still be required by some language modes)
- Add a sentence splitter: preferably with SRX support (to allow for translation caching)
- Make the deformatters work as libraries
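On the worker-thread idea: a NUL-flushed pipeline can only serve one request at a time, since interleaved writes from two threads would mix their texts. The sketch below is not based on memcached's thread.c; it is a plain Python substitute that keeps a fixed set of long-lived pipelines in a queue and lends each one to a single thread at a time. PipelinePool and FakePipeline are hypothetical names, and FakePipeline stands in for a class like the translator shown earlier.

```python
import queue
import threading
from contextlib import contextmanager


class PipelinePool:
    """Lend out long-lived pipelines so each serves one request at a time."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def borrow(self):
        pipeline = self._idle.get()   # blocks until a pipeline is free
        try:
            yield pipeline
        finally:
            self._idle.put(pipeline)  # hand it back for the next request


class FakePipeline:
    """Stand-in for a real pipeline object (assumption, for illustration)."""

    def translate(self, text):
        return text.upper()


pool = PipelinePool(FakePipeline, size=2)
results = []


def worker(text):
    with pool.borrow() as pipeline:
        results.append(pipeline.translate(text))


threads = [threading.Thread(target=worker, args=(w,))
           for w in ["krik", "krok", "munk"]]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(sorted(results))
```

A real pool would also need to detect a crashed pipeline and replace it rather than return it to the queue.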