Difference between revisions of "Daemon"

Revision as of 10:30, 15 January 2014

Keeping processes open and using them as daemons lets you avoid some startup time if you do a lot of short translation requests. This is useful if you want to use Apertium on a web server, where starting up all programs in the pipeline for every single request would lead to a lot of lag.

If you want a pre-packaged daemon solution, see Apertium services. This page gives details on how to use some of the same techniques yourself.

Background

Apertium is implemented as a set of separate programs, each performing its own task and communicating with the others through Unix pipes.

Each linguistic package contains a modes.xml file, which specifies which programs are invoked and in which order, together with the parameters and data files specific to that language pair. Each language pair can contain a number of modes; most of these are used for debugging individual stages of the pipeline. At the moment, the modes are converted to a shell script, which is called by the apertium script.
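To make the description of modes.xml concrete, here is a minimal sketch that turns one mode into a list of command lines. The sample XML is hand-written for illustration, and the element layout (program elements with a name attribute and nested file elements) is an assumption; check it against the modes.xml of your own language pair. The function name pipeline_for_mode is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the layout assumed here; a real modes.xml may differ.
SAMPLE_MODES = """\
<modes>
  <mode name="nn-nb">
    <pipeline>
      <program name="lt-proc -w -e">
        <file name="nn-nb.automorf-no-cp.bin"/>
      </program>
      <program name="cg-proc -w">
        <file name="nn-nb.rlx.bin"/>
      </program>
    </pipeline>
  </mode>
</modes>
"""

def pipeline_for_mode(modes_xml, mode_name):
    """Return the pipeline of one mode as a list of argv lists (hypothetical helper)."""
    root = ET.fromstring(modes_xml)
    for mode in root.iter("mode"):
        if mode.get("name") != mode_name:
            continue
        commands = []
        for program in mode.iter("program"):
            # The name attribute holds the program and its options ...
            argv = program.get("name", "").split()
            # ... and each nested file element adds one data file argument.
            argv += [f.get("name") for f in program.findall("file")]
            commands.append(argv)
        return commands
    raise KeyError(mode_name)

for argv in pipeline_for_mode(SAMPLE_MODES, "nn-nb"):
    print(" ".join(argv))
```

Each argv list could then be handed to a process launcher, with the NUL flushing option added where the program supports it.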

Each program is also effectively implemented as a library; the main() function sets up the environment, parses arguments, and calls a function which performs the task at hand. One way of keeping daemons would be to call the programs through a C++ API, but this has not been implemented yet.

The method described below uses NUL flushing: we start each program as usual, but with an argument that tells it to flush its output, rather than exit, when it sees a NUL byte. We send each text to the program with a NUL byte at the end, and get the output back without the program exiting.
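The framing itself can be tried without Apertium installed. The sketch below uses in-memory streams in place of the real pipes; send_text and read_until_nul are hypothetical helper names, not part of any Apertium API.

```python
import io

def send_text(stream, text):
    """Write one request: the UTF-8 text followed by a single NUL byte."""
    stream.write(text.encode("utf-8"))
    stream.write(b"\0")
    stream.flush()  # make sure the program on the other end sees it now

def read_until_nul(stream):
    """Read one reply: all bytes up to, but not including, the next NUL byte."""
    chunks = []
    while True:
        byte = stream.read(1)
        if not byte or byte == b"\0":  # end of stream or end of this reply
            return b"".join(chunks)
        chunks.append(byte)

# In-memory stand-ins for the pipes to and from a NUL-flushing program.
outgoing = io.BytesIO()
send_text(outgoing, "ein krik")
print(outgoing.getvalue())

incoming = io.BytesIO(b"first reply\0second reply\0")
print(read_until_nul(incoming))
print(read_until_nul(incoming))
```

Because every reply ends in NUL, several requests can share one long-lived stream without the reader needing to know the reply length in advance.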

NUL flushing

This section describes how to use the NUL flushing features.

bash

Here's a small set of scripts showing how to do NUL flushing in bash, using apertium-nn-nb as an example.

This is the "daemon"/server script:

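#!/bin/bash
# file: daemon

# Recreate the two named pipes: requests arrive on "to", replies leave on "from".
rm -f to from
mkfifo to from
while true; do
  # -z makes lt-proc flush its output on each NUL byte instead of exiting.
  lt-proc -zwe nn-nb.automorf-no-cp.bin <to >from & pid=$!
  # Ensure the FIFOs are open for the duration of the lt-proc process:
  exec 3>to; exec 4<from
  wait $pid
  echo "restarting..." 1>&2
done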

This is the "client" script:

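#!/bin/bash
# file: client

# Open the request FIFO for writing and the reply FIFO for reading.
exec 3>to
exec 4<from

# Forward stdin to the daemon, then send the NUL byte that triggers the flush.
cat >&3
echo -e '\0' >&3
# Read the reply up to the next NUL byte and print it.
while read -rd '' <&4; do echo -n "$REPLY"; break; done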

python3

Here's a small example in Python 3, using just two programs and hardcoded paths for simplicity:

#!/usr/bin/env python3
from subprocess import Popen, PIPE

class translator():
    def __init__(self):
        # -z makes each program flush its output when it reads a NUL byte.
        self.pipeBeg = Popen(["lt-proc", "-z", "-e", "-w", "/l/n/nn-nb.automorf-no-cp.bin"], stdin=PIPE, stdout=PIPE)
        self.pipeEnd = Popen(["cg-proc", "-z", "-w", "/l/n/nn-nb.rlx.bin"], stdin=self.pipeBeg.stdout, stdout=PIPE)

    def translate(self, string):
        bstring = string
        if isinstance(string, str): bstring = string.encode('utf-8')

        # Send the text followed by NUL, and flush so lt-proc sees it right away.
        self.pipeBeg.stdin.write(bstring)
        self.pipeBeg.stdin.write(b'\0')
        self.pipeBeg.stdin.flush()

        # Read the reply one byte at a time, up to the terminating NUL.
        char = self.pipeEnd.stdout.read(1)
        output = []
        while char and char != b'\0':
            output.append(char)
            char = self.pipeEnd.stdout.read(1)

        return b"".join(output)

t = translator()

print(t.translate("Det der med krikar og krokar[][\n]").decode('utf-8'))
print(t.translate("Eg veit jo kva ein krok er,[][\n]").decode('utf-8'))
print(t.translate("men kva er ein krik?[][\n]").decode('utf-8'))
print(t.translate("Munk peikte på ein krik.[][\n]").decode('utf-8'))
print(t.translate("– Det er ein krik, sa han.[][\n]").decode('utf-8'))

The __init__ method builds the pipeline. Note that the -z argument is given to both lt-proc and cg-proc; the other arguments are the ones used in the modes.xml of the language pair.
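The same construction extends to longer pipelines. Below is a sketch of a generic builder that chains any number of commands; build_pipeline and request are hypothetical names. So that it runs without Apertium installed, the demo uses a small Python copy program as a stand-in for the real tools, which would each be started with their own NUL flushing option.

```python
import sys
from subprocess import Popen, PIPE

def build_pipeline(commands):
    """Start each command with its stdin connected to the previous stdout."""
    procs = []
    upstream = PIPE
    for argv in commands:
        proc = Popen(argv, stdin=upstream, stdout=PIPE)
        if procs:
            # The new child owns the read end now; close the parent's copy.
            procs[-1].stdout.close()
        procs.append(proc)
        upstream = proc.stdout
    return procs

def request(procs, data):
    """Send data plus a NUL byte into the pipeline and read one reply."""
    procs[0].stdin.write(data + b"\0")
    procs[0].stdin.flush()
    reply = []
    while True:
        byte = procs[-1].stdout.read(1)
        if not byte or byte == b"\0":
            return b"".join(reply)
        reply.append(byte)

# Stand-in for a real program: copies stdin to stdout, flushing after each byte.
COPY = [sys.executable, "-c",
        "import sys\n"
        "for b in iter(lambda: sys.stdin.buffer.read(1), b''):\n"
        "    sys.stdout.buffer.write(b)\n"
        "    sys.stdout.buffer.flush()\n"]

procs = build_pipeline([COPY, COPY])
reply = request(procs, b"hei")
print(reply)

# Shut down: closing the first stdin lets end-of-file travel down the pipeline.
procs[0].stdin.close()
for proc in procs:
    proc.wait()
procs[-1].stdout.close()
```

Closing the parent's copy of each intermediate pipe means that a program whose downstream neighbour has died gets an error on write instead of blocking forever.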

The [][\n] is needed because the tools expect the NUL byte to appear after a blank (if you try without it, you'll see the last word get swallowed up). In a real example, you should run natural language text through apertium-destxt (or -deshtml) first. The formatters have a low enough startup time that you don't need to keep their processes open.
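For quick tests, the blank and the NUL byte can be added by a small helper instead of being typed into every string. This is only a sketch: frame_request is a hypothetical name, and it assumes the [][\n] blank from the examples above is acceptable for your input. For real text, the deformatters remain the better choice.

```python
# The blank used in the examples above, followed later by the NUL byte.
SUPERBLANK = b"[][\n]"

def frame_request(text):
    """Encode text, make sure it ends in the blank, and append the NUL byte."""
    data = text.encode("utf-8") if isinstance(text, str) else bytes(text)
    if not data.endswith(SUPERBLANK):
        data += SUPERBLANK
    return data + b"\0"

print(frame_request("men kva er ein krik?"))
```

The result can be written directly to the stdin of the first program in the pipeline, followed by a flush.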

Below is an example wrapper function that does the deformatting and reformatting:

    def translateHTML(self, string):
        bstring = string
        if isinstance(string, str): bstring = string.encode('utf-8')

        # communicate() writes the input, closes stdin and collects the output.
        deformat = Popen("apertium-deshtml", stdin=PIPE, stdout=PIPE)
        deformatted = deformat.communicate(bstring)[0]

        translated = self.translate(deformatted)

        reformat = Popen("apertium-rehtml", stdin=PIPE, stdout=PIPE)
        return reformat.communicate(translated)[0]

# […]

print(t.translateHTML("– <em>Det</em> er ein krik, sa han").decode('utf-8'))

History

User:Wynand.winterbach, who developed apertium-dbus, started work towards daemon-like operation, adding the NUL flush option to lt-proc. User:Deadbeef added NUL flushing options to other Apertium programs during GSoC 2009.

Ideas

  • Transfer, interchunk, and postchunk should reread the variables section (optimally, caching the location in the XML file on the first read).
  • Reuse thread.c from memcached to handle worker threads
  • Read the modes.xml file directly, and generate the pipeline from it
  • Where possible, link to the apertium functions directly, rather than spawning separate processes (though that will still be required by some language modes)
  • Add a sentence splitter: preferably with SRX support (to allow for translation caching)
  • Make the deformatters work as libraries

See also

  • Apertium services