Keeping processes open and using them as daemons (alternatively, using programs as libraries in separate threads) lets you avoid some startup time if you do a lot of short translation requests. This is useful if you want to use Apertium on a web server, where starting up all programs in the pipeline for every single request would lead to a lot of lag.

If you want a pre-packaged daemon solution, see Apertium services. This page gives details on how to use some of the same techniques yourself.


Background

Apertium is implemented as a set of separate programs, each performing its own task and communicating with the others in the Unix pipeline manner.

Each linguistic package contains a modes.xml file, which specifies which programs are invoked, in which order, and specifies the parameters and data files specific to that language pair. Each language pair can contain a number of modes; most of these are used for debugging each stage of the pipeline. At the moment, the modes are converted to a shell script, which is called by the apertium script.
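As a rough illustration of how a mode maps onto a pipeline of commands, the Python sketch below reads a modes.xml-style document and builds one argument list per program. The sample XML, the mode name nn-nb-morph-disam and the helper pipeline_commands are made up for this example, and the element layout (mode, pipeline, program, file) is an assumption that should be checked against the modes.xml of a real language pair. The null_flush option simply inserts the -z flag used elsewhere on this page.

```python
import os
import shlex
import xml.etree.ElementTree as ET

# Illustrative input only: the mode name and layout are assumptions,
# not copied from a real language pair.
SAMPLE_MODES = """\
<modes>
  <mode name="nn-nb-morph-disam" install="no">
    <pipeline>
      <program name="lt-proc -w -e">
        <file name="nn-nb.automorf-no-cp.bin"/>
      </program>
      <program name="cg-proc -w">
        <file name="nn-nb.rlx.bin"/>
      </program>
    </pipeline>
  </mode>
</modes>
"""

def pipeline_commands(modes_xml, mode_name, pair_dir=".", null_flush=False):
    """Return one argv list per <program> of the named mode, in order."""
    root = ET.fromstring(modes_xml)
    for mode in root.iter("mode"):
        if mode.get("name") != mode_name:
            continue
        commands = []
        for program in mode.iter("program"):
            # The name attribute holds the executable and its options
            argv = shlex.split(program.get("name"))
            if null_flush:
                argv.insert(1, "-z")  # ask the program to flush on NUL
            # Data files follow the options, resolved against pair_dir
            for data_file in program.findall("file"):
                argv.append(os.path.join(pair_dir, data_file.get("name")))
            commands.append(argv)
        return commands
    raise KeyError("no such mode: " + mode_name)

for argv in pipeline_commands(SAMPLE_MODES, "nn-nb-morph-disam",
                              pair_dir="/path/to/pair", null_flush=True):
    print(argv)
```

Each returned list can be handed to subprocess.Popen as is, which avoids going through the generated shell script.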

The section #NUL flushing describes running the programs as daemons by controlling flushing: we start each program as usual, but with an argument that tells it to flush output – and not exit – when it sees a NUL byte. We send text to the program with a NUL byte at the end, and get the output back without the program exiting.

The section #Using as libraries describes running the programs as libraries, instead of as separate process daemons. Each program is effectively implemented as a library where the main() function sets up the environment, parses arguments, and calls a function which performs the task at hand. So an alternative to starting lots of separate processes is to use the programs through the C++ API, as multiple threads, redirecting their C input and output file streams.

NUL flushing

This section describes how to use the NUL flushing features of apertium to create daemons/server programs that don't quit on receiving input, and how to send input to them and receive output from them.

Note: if you want to use the vislcg3 command, it flushes on seeing a line with <STREAMCMD:FLUSH> instead of the NUL byte. With cg-proc you still use NUL, though.

Flushing examples in bash

For a minimal example of NUL flushing from the command line, see NUL flushing.

Here's a small set of scripts showing how to do NUL flushing in bash for use in e.g. a web server. We use apertium-nn-nb as an example, but it should work with any language pair; the modules lt-proc/cg-proc/apertium-{tagger,pretransfer,transfer,interchunk,postchunk}/lrx-proc all support NUL flushing.

The trick here is to use named pipes, also known as fifos. Typically a process reads input from the special file descriptor "standard in" and writes output to "standard out". When you do echo foo > file.txt you redirect standard out to file.txt instead of the terminal. But you can also redirect it to a named pipe, which is a sort of virtual file. Then you can make that named pipe the input of another process. (In fact, this is what happens under the hood when you type grep foo file.txt | sed 's/o/l/', except here you get an unnamed pipe between grep and sed.)

We use mkfifo file to create a named pipe/fifo called "file". We can also use exec 3>file to open "file descriptor 3" for writing to that named pipe.
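For readers more at home in Python, the same mechanism can be sketched as follows: one thread writes into a named pipe while the main thread reads from it. The function name fifo_roundtrip and the temporary directory are invented for this illustration, and os.mkfifo is only available on Unix-like systems.

```python
import os
import tempfile
import threading

def fifo_roundtrip(message):
    """Send bytes through a named pipe from a writer thread to a reader."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "to")
        os.mkfifo(path)  # same effect as `mkfifo to` in the shell

        def writer():
            # Opening for writing blocks until a reader has opened the fifo
            with open(path, "wb") as fifo:
                fifo.write(message)

        thread = threading.Thread(target=writer)
        thread.start()
        # read() returns at end-of-file, i.e. once the last writer has closed
        with open(path, "rb") as fifo:
            received = fifo.read()
        thread.join()
        return received

print(fifo_roundtrip(b"foo\n"))
```

The reader sees end-of-file as soon as the last writer closes the fifo. That is why a long-running daemon needs something to hold the fifo open between requests.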

This is the "daemon"/server script:

# file: daemon

rm -f to from
mkfifo to from
lt-proc -zwe nn-nb.automorf-no-cp.bin  <to  >from  &  pid=$!
# Open some file descriptors so the fifo's are open for the duration of the lt-proc process:
exec 3>to
exec 4<from
wait $pid

This is the "client" script:

# file: client

exec 3>to
exec 4<from

cat >&3
printf '\0' >&3
awk 'BEGIN{RS="\0"}{printf "%s", $0;exit}' <&4
# With GNU head (e.g. on Ubuntu), you can just use -z for 0-terminated "lines":
#head -z -n 1 <&4

From the apertium-nn-nb directory, try running ./daemon in one terminal, then in another terminal, in the same directory, do

echo Det der med krikar og krokar | apertium-destxt | ./client

(or run ./daemon & and then the above command in the same terminal)

vislcg3 CG format

If you want vislcg3 to flush, you have to insert <STREAMCMD:FLUSH> instead of NUL in the stream. The client then becomes something like

# file: vislcg3-client

exec 3>to
exec 4<from

cat >&3
echo '<STREAMCMD:FLUSH>' >&3
awk '{print} /^<STREAMCMD:FLUSH>$/{exit}' <&4

while the server is essentially the same.
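The same framing can be sketched in Python. The two helpers below, send_cg_request and read_cg_response, are hypothetical names; they work on any binary stream, such as the stdin and stdout of a vislcg3 process started with subprocess.Popen. Unlike the awk one-liner above, the reader drops the marker line instead of printing it. To keep the sketch runnable without vislcg3, the demonstration uses an in-memory buffer as a stand-in that echoes its input.

```python
import io

FLUSH_MARKER = b"<STREAMCMD:FLUSH>"

def send_cg_request(stream, text):
    """Write text, then the flush marker on a line of its own."""
    data = text.encode("utf-8")
    if not data.endswith(b"\n"):
        data += b"\n"  # the marker must start at the beginning of a line
    stream.write(data + FLUSH_MARKER + b"\n")
    stream.flush()

def read_cg_response(stream):
    """Collect lines up to, but not including, the flush marker line."""
    lines = []
    for line in iter(stream.readline, b""):  # b"" means end of stream
        if line.rstrip(b"\r\n") == FLUSH_MARKER:
            break
        lines.append(line)
    return b"".join(lines).decode("utf-8")

# Stand-in for a daemon that echoes its input unchanged
loopback = io.BytesIO()
send_cg_request(loopback, '"<krik>"\n\t"krik" n m sg ind')
loopback.seek(0)
print(read_cg_response(loopback))
```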

Chaining several daemons

To improve upon the server/daemon script, we can let it take the program to start as an argument, along with an identifier used to keep the fifo names unique. This lets us start several daemons and chain them. To keep things clean, we put it all in one script and use bash functions.

We make one function for the client, one function to initialise the fifos, and one function (which can be backgrounded) to start the server. We can't create the fifos in that backgrounded function, as this would lead to a race condition with the client trying to access the fifos before they exist.


msg () {
    # Print a debug message to stderr
    echo '{{{' "$@" '}}}' 1>&2
}

server_setup () {
    # Set up fifos to handle input and output to our daemon; argument
    # 1 is used as an id, unique to this daemon:
    local to="$1.to" from="$1.from"
    rm -f "${to}" "${from}"
    mkfifo "${to}" "${from}"
    msg "${to} and ${from} fifos set up"
}

server_start () {
    # Usage: server_start ID PROGRAM [arg]...
    # The id is the same as the one used in server_setup. The other
    # args are as you would run the program normally.
    local to="$1.to" from="$1.from"
    if ! [[ -p "${to}" && -p "${from}" ]]; then
        msg "Server not set up yet? Expected ${to} and ${from} to be named pipes."
        return 1
    fi

    shift  # the rest of the args are the executable and its arguments

    "$@"  <"${to}"  >"${from}"  &  pid=$!

    # Ensure the fifos are open for the duration of the process:
    exec 3>"${to}"
    exec 4<"${from}"
    msg "${to} and ${from} server started"
    wait $pid
}

client () {
    # Sends input to the daemon with id of argument 1
    local to="$1.to" from="$1.from"
    if ! [[ -p "${to}" && -p "${from}" ]]; then
        msg "Server not started yet? Expected ${to} and ${from} to be named pipes."
        return 1
    fi

    exec 3>"${to}"
    exec 4<"${from}"

    cat >&3
    # Terminate the request with a single NUL byte (no trailing newline)
    printf '\0' >&3
    # Read up to the first NUL, then stop: the fifo stays open, so a
    # second read would block.
    while read -rd '' <&4; do
        printf '%s' "$REPLY"
        break
    done
}

To test these functions, save the script as adaemons.sh and try the following:

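source adaemons.sh

# Assumption: the compiled language pair is in the current directory; adjust as needed
path_to_pair=.

server_setup morph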
server_start morph lt-proc -zwe "${path_to_pair}"/nn-nb.automorf-no-cp.bin &

echo "Det der med krikar og krokar"| apertium-destxt | client morph
echo "Eg veit jo kva ein krok er,"| apertium-destxt | client morph
echo "men kva er ein krik?" | apertium-destxt | client morph
echo "Munk peikte på ein krik." | apertium-destxt | client morph
echo "– Det er ein krik, sa han." | apertium-destxt | client morph

server_setup cg
server_start cg cg-proc -zw "${path_to_pair}"/nn-nb.rlx.bin &

echo "Det der med krikar og krokar"| apertium-destxt | client morph | client cg
echo "Eg veit jo kva ein krok er,"| apertium-destxt | client morph | client cg

echo '– ^Det/den<det><dem><nt><sg>/det<prn><p3><nt><sg><nom>/det<prn><p3><nt><sg><acc>$ ^er/vere<vblex><pres>$ ^ein/ein<prn><sg>/eine<vblex><imp>/ein<det><qnt><m><sg>/ein<det><qnt><m><sg><ind>$ ^krik/krik<n><m><sg><ind>$^,/,<cm>/,<cm><clb>$ ^sa/seie<vblex><pret>$ ^han/han<prn><p3><m><sg><nom>/han<prn><p3><m><sg><acc>$^./.<sent><clb>$' | client cg

You can also put those commands (except the first "source" command) at the end of adaemons.sh and run it like bash adaemons.sh.

Allowing longer input

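If the input is too large, the daemons will hang: the client writes all of its input before it reads anything back, so the output can fill the fifo buffer while nobody is reading it. The buffer size depends on the system (64 KiB by default on current Linux, a single 4 KiB page on older kernels). You can avoid this by splitting large input into chunks small enough to fit in the buffer.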

TODO write example of wrapper around "client" function

Flushing examples in python3

Here's a small example in Python 3, using just two programs and hardcoded paths for simplicity:

#!/usr/bin/env python3
import os
from subprocess import Popen, PIPE

class translator():
    def __init__(self):
        self.pipeBeg = Popen(["lt-proc", "-z", "-e", "-w", "/l/n/nn-nb.automorf-no-cp.bin"], stdin=PIPE, stdout=PIPE)
        self.pipeEnd = Popen(["cg-proc", "-z", "-w", "/l/n/nn-nb.rlx.bin"], stdin=self.pipeBeg.stdout, stdout=PIPE)

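    def translate(self, string):
        bstring = string
        if isinstance(string, str):
            bstring = string.encode('utf-8')

        # Send the text followed by a NUL byte, which makes each program flush
        self.pipeBeg.stdin.write(bstring)
        self.pipeBeg.stdin.write(b'\0')
        self.pipeBeg.stdin.flush()

        # Read the output up to (not including) the terminating NUL byte
        char = self.pipeEnd.stdout.read(1)
        output = []
        while char and char != b'\0':
            output.append(char)
            char = self.pipeEnd.stdout.read(1)

        return b"".join(output)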

t = translator()

print(t.translate("Det der med krikar og krokar[][\n]").decode('utf-8'))
print(t.translate("Eg veit jo kva ein krok er,[][\n]").decode('utf-8'))
print(t.translate("men kva er ein krik?[][\n]").decode('utf-8'))
print(t.translate("Munk peikte på ein krik.[][\n]").decode('utf-8'))
print(t.translate("– Det er ein krik, sa han.[][\n]").decode('utf-8'))

The init function builds up the pipeline. Note the -z argument is given to both lt-proc and cg-proc – the other arguments are the ones used in the modes.xml of the language pair.

The [][\n] is needed because the tools expect the NUL byte to appear after a blank (if you try without it, you'll see the last word get swallowed up). In a real example, you should run natural language text through apertium-destxt (or -deshtml) first. The formatters have a low enough startup time that you don't need to keep their processes open.

Below is an example wrapper function that does the deformatting and reformatting; use it in the same class as the above example:

    def translateHTML(self, string):
        bstring = string
        if isinstance(string, str):
            bstring = string.encode('utf-8')

        # communicate() writes the input, closes stdin and returns (stdout, stderr)
        deformat = Popen("apertium-deshtml", stdin=PIPE, stdout=PIPE)
        deformatted = deformat.communicate(bstring)[0]

        translated = self.translate(deformatted)

        reformat = Popen("apertium-rehtml", stdin=PIPE, stdout=PIPE)
        return reformat.communicate(translated)[0]

# […]

print(t.translateHTML("– <em>Det</em> er ein krik, sa han").decode('utf-8'))

Allowing longer input

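If the input is too large, the daemons will hang: translate writes all of its input before it reads anything back, so the output can fill the pipe buffer while nobody is reading it. The buffer size depends on the system (64 KiB by default on current Linux, a single 4 KiB page on older kernels). You can avoid this by splitting large input into chunks small enough to fit in the buffer.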

TODO write example of wrapper around "translate" function
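One possible shape for such a wrapper is sketched below, under some loudly stated assumptions. The names split_chunks and translate_long are invented here, and the default of 1024 bytes is a guess kept deliberately small, because morphological analysis makes the output much larger than the input. The split happens only after whitespace, so it is only safe for text without superblanks that contain spaces. The translate argument is any callable taking a string and returning bytes, for example lambda s: t.translate(s + "[][\n]") with the class above; a stand-in is used below so that the sketch runs without Apertium.

```python
import re

def split_chunks(text, max_bytes=1024):
    """Split text into chunks of at most max_bytes UTF-8 bytes.

    Breaks fall only after whitespace, and "".join(chunks) == text.
    A single word longer than max_bytes becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    # Each token is a word with its surrounding whitespace attached
    for token in re.findall(r"\s*\S+\s*|\s+", text):
        token_size = len(token.encode("utf-8"))
        if current and size + token_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(token)
        size += token_size
    if current:
        chunks.append("".join(current))
    return chunks

def translate_long(translate, text, max_bytes=1024):
    """Run translate on each chunk in turn and concatenate the results."""
    return b"".join(translate(chunk) for chunk in split_chunks(text, max_bytes))

def fake_translate(chunk):
    # Stand-in for the real daemon call: upper-cases instead of translating
    return chunk.upper().encode("utf-8")

print(translate_long(fake_translate, "men kva er ein krik? " * 3, max_bytes=24))
```

Chunking keeps each request small, but every chunk is translated without the context of its neighbours, so splitting at sentence boundaries would be preferable where a sentence splitter is available.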

Using as libraries

Apertium-service runs apertium language pairs by using lttoolbox, apertium-transfer etc. as libraries. This requires each apertium program to make itself available as a library.

The way Apertium-service does this is by redirecting the C FILE pointers from stdin/stdout to new file descriptors. This is in fact conceptually very similar to what happens with the #NUL flushing method, but requires fewer system resources since no new processes are started, only new threads.

History

User:Wynand.winterbach, who developed apertium-dbus, started work towards daemon-like operation, adding the NUL flush option to lt-proc.

User:Deadbeef added support for using lttoolbox and apertium as libraries during GSoC 2009.

Ideas

  • Transfer, interchunk, and postchunk should reread the variables section (optimally, caching the location in the XML file on the first read).
  • Reuse thread.c from memcached to handle worker threads
  • Read the modes.xml file directly, and generate the pipeline from it
  • Where possible, link to the apertium functions directly, rather than spawning separate processes (though that will still be required by some language modes)
  • Add a sentence splitter: preferably with SRX support (to allow for translation caching)
  • Make the deformatters work as libraries
