Daemon

From Apertium
Revision as of 14:53, 15 January 2014 by Unhammer (talk | contribs) (apy)
Jump to navigation Jump to search

Keeping processes open and using them as daemons lets you avoid some startup time if you do a lot of short translation requests. This is useful if you want to use Apertium on a web server, where starting up all programs in the pipeline for every single request would lead to a lot of lag.

If you want a pre-packaged daemon solution, see Apertium services. This page gives details on how to use some of the same techniques yourself.

Background

Apertium is implemented as a set of separate programs, each performing their individual tasks separately, communicating in the Unix pipeline manner.

Each linguistic package contains a modes.xml file, which specifies which programs are invoked, in which order, and specifies the parameters and datafiles specific to that language pair. Each language pair can contain a number of modes; most of these are used for debugging each stage of the pipeline. At the moment, the modes are converted to a shell script, which is called by the apertium script.

Each program is also effectively implemented as a library; the main() function sets up the environment, parses arguments, and calls a function which performs the task at hand. One method of keeping daemons would be to call the programs through a C++ API, this has not been implemented yet though.

The method described below uses NUL flushing: we start each program as usual, but with an argument that tells it to flush output – and not exit – when it sees a NUL byte. We each text to the program with a NUL byte at the end, and get the output back without the program exiting.

NUL flushing

This section describes how to use the NUL flushing features.

bash

Here's a small set of scripts showing how to do NUL flushing in bash, using apertium-nn-nb as an example.

This is the "daemon"/server script:

#!/bin/bash
# file: daemon

rm -f to from
mkfifo to from
while true; do
  lt-proc -zwe nn-nb.automorf-no-cp.bin & pid=$!
  # Ensure the fifo's are open for the duration of the lt-proc process:
  exec 3>to; exec 4<from
  wait $pid
  echo "restarting..." 1>&2
done

This is the "client" script:

#!/bin/bash
# file: client

exec 3>to
exec 4<from

cat >&3
echo -e '\0' >&3
while read -rd '' <&4; do echo -n "$REPLY"; break; done

python3

Here's a small example in Python 3, using just two programs and hardcoded paths for simplicity:

#!/usr/bin/env python3
import os
from subprocess import Popen, PIPE

class translator():
    def __init__(self):
        self.pipeBeg = Popen(["lt-proc", "-z", "-e", "-w", "/l/n/nn-nb.automorf-no-cp.bin"], stdin=PIPE, stdout=PIPE)
        self.pipeEnd = Popen(["cg-proc", "-z", "-w", "/l/n/nn-nb.rlx.bin"], stdin=self.pipeBeg.stdout, stdout=PIPE)

    def translate(self, string):
        bstring = string
        if type(string) == type(''): bstring = bytes(string, 'utf-8')

        self.pipeBeg.stdin.write(bstring)
        self.pipeBeg.stdin.write(b'\0')
        self.pipeBeg.stdin.flush()

        char = self.pipeEnd.stdout.read(1)
        output = []
        while char and char != b'\0':
            output.append(char)
            char = self.pipeEnd.stdout.read(1)

        return b"".join(output)

t = translator()

print(t.translate("Det der med krikar og krokar[][\n]").decode('utf-8'))
print(t.translate("Eg veit jo kva ein krok er,[][\n]").decode('utf-8'))
print(t.translate("men kva er ein krik?[][\n]").decode('utf-8'))
print(t.translate("Munk peikte på ein krik.[][\n]").decode('utf-8'))
print(t.translate("– Det er ein krik, sa han.[][\n]").decode('utf-8'))

The init function builds up the pipeline. Note the -z argument is given to both lt-proc and cg-proc – the other arguments are the ones used in the modes.xml of the language pair.

The [][\n] is needed because the tools expect the NUL byte to appear after a blank (if you try without it, you'll see the last word get swallowed up). In a real example, you should run natural language text through apertium-destxt (or -deshtml) first. The formatters have a low enough startup time that you don't need to keep their processes open.

Below is an example wrapper function does the deformatting and reformatting:

    def translateHTML(self, string):
        bstring = string
        if type(string) == type(''): bstring = bytes(string, 'utf-8')

        deformat = Popen("apertium-deshtml", stdin=PIPE, stdout=PIPE)
        deformat.stdin.write(bstring)

        translated = self.translate(deformat.communicate()[0])

        reformat = Popen("apertium-rehtml", stdin=PIPE, stdout=PIPE)
        reformat.stdin.write(translated)
        return reformat.communicate()[0]

# […]

print(t.translateHTML("– <em>Det</em> er ein krik, sa han").decode('utf-8'))

History

User:Wynand.winterbach, who developed apertium-dbus, started work towards daemon-like operation, adding the NUL flush option to lt-proc. User:Deadbeef added options to other apertium programs during GsoC 2009.

Ideas

  • Transfer, interchunk, and postchunk should reread the variables section (optimally, caching the location in the XML file on the first read).
  • Reuse thread.c from memcached to handle worker threads
  • Read the modes.xml file directly, and generate the pipeline from it
  • Where possible, link to the apertium functions directly, rather than spawning separate processes (though that will still be required by some language modes)
  • Add a sentence splitter: preferably with SRX support (to allow for translation caching)
  • Make the deformatters work as libraries

See also