Apertium-apy

From Apertium
Jump to navigation Jump to search

Apertium-APy stands for "Apertium API in Python". It's a simple Apertium API server written in Python 3, meant as a drop-in replacement for ScaleMT. It is currently found in the SVN under trunk/apertium-tools/apertium-apy, where servlet.py contains the relevant web server bits. This is meant for front ends like apertium-html-tools.


The http://apertium.org page uses the installation at http://apy.projectjj.com which currently only runs released language pairs. However, APY is very easy to set up on your own server, where you can run all the development pairs and even analysers and taggers, read on for how to do that.

Installation

First, compile and install apertium/lttoolbox/apertium-lex-tools, and compile your language pairs. See Minimal_installation_from_SVN for how to do this.

APY uses Tornado as its web framework. Ensure that you install the Python 3 versions of any dependencies. On Debian/Ubuntu, you can do

sudo apt-get install python3-tornado

Or you can install it via pip3 install tornado or other variants depending on your environment.

Then checkout APY from SVN and run it:

svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-apy
cd apertium-apy
./servlet.py /usr/local/share/apertium   # all .mode files from under this directory are included


Optional arguments include:

  • -l --lang-names: path to sqlite database of localized language names (unicode.db by default)
  • -p --port: port to run server on (2737 by default)
  • -c --ssl-cert: path to SSL certificate
  • -k --ssl-key: path to SSL key file
  • -x --num-processes: number of child processes (default to number of cores)
  • -s --nonpairs-path: include .mode files from this directory, like with the main arg, but skip translator (pair) modes, only include analyser/translator/generator modes from this directory (handy for use with apertium SVN)

Optional features

List localised language names

If you have sqlite3, you can do

make

to create the unicode.db used for the /listLanguageNames function.

Language identification

The /identifyLang function can provide language identification.

If you install Compact Language Detection 2 (CLD2), you get fast and fairly accurate language detection. Installation can be a bit tricky though.


Alternatively, you can start servlet.py with the -s argument pointing to a directory of language pairs with analyser modes, in which case APY will try to do language detection by analysing the text and finding which analyser had the least unknowns. This is a bit slow though :-)

APY will prefer using CLD2 if it's available, otherwise fall back to analyser coverage.

Usage

APY supports three types of requests: GET, POST, and JSONP. Using GET/POST are possible only if APY is running on the same server as the client due to cross-site scripting restrictions; however, JSONP requests are permitted in any context and will be useful. Using curl, APY can easily be tested:

curl -G --data "lang=kaz-tat&modes=morph&q=алдым" http://localhost:2737/perWord

It can also be tested through your browser or through HTTP calls. Unfortunately, curl does not decode JSON output by default and to make testing easier, a APY Sandbox is provided in the SVN with Apertium-html-tools.

URL Function Parameters Output
/listPairs List available language pairs
  • include_deprecated_codes: give this parameter to include old ISO-639-1 codes in output
To be consistent with ScaleMT, the returned JS Object contains a responseData key with an Array of language pair objects with keys sourceLanguage and targetLanguage.
$ curl 'http://localhost:2737/listPairs'

{"responseStatus": 200, "responseData": [
 {"sourceLanguage": "kaz", "targetLanguage": "tat"}, 
 {"sourceLanguage": "tat", "targetLanguage": "kaz"}, 
 {"sourceLanguage": "mk", "targetLanguage": "en"}
], "responseDetails": null}
/list List available mode information
  • q: type of information to list
    • pairs (alias for /listPairs)
    • analyzers/analysers
    • generators
    • taggers/disambiguators
The returned JS Object contains a mapping from language pairs to mode names (used internally by Apertium).
$ curl 'http://localhost:2737/list?q=analyzers'
{"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph", 
 "tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", "kaz": "kaz-morph"}
$ curl 'http://localhost:2737/list?q=generators'
{"en-es": "en-es-generador", "fin": "fin-gener", "es-en": "es-en-generador"}
$ curl 'http://localhost:2737/list?q=taggers'
{"es-en": "es-en-tagger", "en-es": "en-es-tagger", "mk-en": "mk-en-tagger",
 "tat-kaz": "tat-kaz-tagger", "kaz-tat": "kaz-tat-tagger", "kaz": "kaz-tagger"}
/translate Translate text
  • langpair: language pair to use for translation
  • q: text to translate
To be consistent with ScaleMT, the returned JS Object contains a responseData key with an JS Object that has key translatedText that contains the translated text.
$ curl 'http://localhost:2737/translate?langpair=kaz|tat&q=Сен+бардың+ба?'
{"responseStatus": 200, "responseData": {"translatedText": "Син барныңмы?"}, "responseDetails": null}
/analyze or /analyse Morphologically analyze text
  • mode: language to use for analysis
  • q: text to analyze
The returned JS Array contains JS Arrays in the format [analysis, input-text].
$ curl -G --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "], ["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"], ["?/?<sent>","?"]]
/generate Generate surface forms from text
  • mode: language to use for generation
  • q: text to generate
The returned JS Array contains JS Arrays in the format [generated, input-text].
$ curl -G --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$" http://localhost:2737/generate
[["сен","^сен<v><tv><imp><p2><sg>$ "]]
/perWord Perform morphological tasks per word
  • lang: language to use for tasks
  • modes: morphological tasks to perform on text (15 combinations possible - delimit using '+')
    • tagger/disambig
    • biltrans
    • translate
    • morph
  • q: text to perform tasks on
The returned JS Array contains JS Objects each containing the key input and up to 4 other keys corresponding to the requested modes (tagger, morph, biltrans and translate).
curl 'http://localhost:2737/perWord?lang=en-es&modes=morph&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"]}, {"input": "there", "morph": ["there<adv>"]}, {"input": "be", "morph": ["be<vbser><inf>"]}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger&q=let+there+be+light'
[{"input": "let", "tagger": "let<vblex><pp>"}, {"input": "there", "tagger": "there<adv>"}, {"input": "be", "tagger": "be<vbser><inf>"}, {"input": "light", "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+tagger&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "morph": ["there<adv>"], "tagger": "there<adv>"}, {"input": "be", "morph": ["be<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=translate&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"]}, {"input": "there", "translate": ["all\u00ed<adv>"]}, {"input": "be", "translate": ["ser<vbser><inf>"]}, {"input": "light", "translate": ["ligero<adj>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=biltrans&q=let+there+be+light'
[{"input": "let", "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=translate+biltrans&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "translate": ["all\u00ed<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "translate": ["ser<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "translate": ["ligero<adj>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+biltrans&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger+biltrans&q=let+there+be+light'
[{"input": "let", "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger+translate&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "translate": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "translate": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "translate": ["ligero<adj>"], "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"]}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"]}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"]}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=translate+biltrans+tagger&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "translate": ["all\u00ed<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "translate": ["ser<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "translate": ["ligero<adj>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+biltrans+tagger&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+tagger&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "tagger": "there<adv>"}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "tagger": "light<adj><sint>"}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+biltrans&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]

curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+biltrans+tagger&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]

/listLanguageNames Get localized language names
  • locale: language to get localized language names in
  • languages: list of '+' delimited language codes to retrieve localized names for (optional - if not specified, all available codes will be returned)
The returned JS Object contains a mapping of requested language codes to localized language names
$ curl 'http://localhost:2737/listLanguageNames?locale=fr&languages=ca+en+mk+tat+kk'
{"ca": "catalan", "en": "anglais", "kk": "kazakh", "mk": "macédonien", "tat": "tatar"}
/coverage Get coverage of a language on a text
  • mode: language to analyze with
  • q: text to analyze for coverage
The returned JS Array contains a single floating point value ≤ 1 that indicates the coverage.
$ curl 'http://localhost:2737/getCoverage?mode=en-es&q=Whereas disregard and contempt for which have outraged the conscience of mankind'
[0.9230769230769231]
/identifyLang Return a list of languages with probabilities of the text being in that language. Uses CLD2 if that's installed, otherwise will try any analyser modes.
  • q: text which you would like to compute probabilities for
The returned JS Object contains a mapping from language codes to probabilities.
$ curl 'http://localhost:2737/identifyLang?q=This+is+a+piece+of+text.'
{"ca": 0.19384234, "en": 0.98792465234, "kk": 0.293442432, "zh": 0.002931001}

SSL

APY supports HTTPS out of the box. To test with a self-signed signature, create a certificate and key by running:

openssl req -new -x509 -keyout server.key -out server.crt -days 365 -nodes

Then run APY with --sslKey server.key --sslCert server.crt, and test with HTTPS and the -k argument to curl (-k means curl accepts self-signed or even slightly "lying" signatures):

curl -k -G --data "mode=kaz-tat&q=Сен+бардың+ба?" https://localhost:2737/analyze

If you have a real signed certificate, you should be able to use curl without -k for the domain which the certificate is signed for:

curl -G --data "mode=kaz-tat&q=Сен+бардың+ба?" https://oohlookatmeimencrypted.com:2737/analyze

Remember to open port 2737 to your server.

Gateway

A gateway for APY is located in the same SVN directory and provides functionality such as silently intercepting and forwarding requests, and aggregating APY instance capabilities for overriding /list requests. For example, a gateway provided access to two servers with varied capabilities, in terms of language pairs, will report aggregated capabilities to the client, hiding the existence of two servers.

A list of APY servers is a required positional argument; an example server list is provided in the same SVN directory. If the gateway is requested to run on a already occupied port, it will attempt to traverse the available ports until it can bind on to a free one.

The gateway currently operates on a Fastest paradigm load balancer that continuously adapts to changing circumstances by basing its routing on the client's requests. On initialization, all servers are assigned a weight of 0 and consequently, each server will be eventually utilized as the gateway determines the server speeds. The gateway stores a moving average of the last x requests for each (mode, language) and forwards requests to the fastest server as measured in units of response time per response length.

Upstart scripts

You can use upstart scripts to automatically run the apy and html-tools on startup and respawn the processes when they get killed. If you don't have upstart installed: sudo apt-get install upstart

The apertiumconfig file contains paths of some apertium directories and the serverlist file. It can be saved anywhere. Make sure the paths are correct!

/home/user/apertiumconfig

APERTIUMPATH=/home/user
APYPATH=/home/user/apertium-apy
SERVERLIST=/home/user/serverlist
HTMLTOOLSPATH=/home/user/apertium-html-tools
#optional, see 'Logging':
LOGFILE=/home/user/apertiumlog  

The following upstart scripts have to be saved in /etc/init.

apertium-all.conf

description "start/stop all apertium services"
     
start on startup

apertium-apy.conf

description "apertium-apy init script"

start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300

env CONFIG=/etc/default/apertium

script
    . $CONFIG
    python3 $APYPATH/servlet.py $APERTIUMPATH
end script

apertium-apy-gateway.conf

description "apertium-apy gateway init script"
     
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
     
env CONFIG=/home/user/apertiumconfig

script
    . $CONFIG
    python3 $APYPATH/gateway.py $SERVERLIST
end script 

apertium-html-tools.conf

description "apertium-html-tools init script"
           
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
     
env CONFIG=/etc/default/apertium

script
    . $CONFIG
    cd $HTMLTOOLSPATH
    python3 -m http.server 8888
end script

Use sudo start apertium-all to start all services. Just like the filenames, the jobs are called apertium-apy, apertium-apy-gateway and apertium-html-tools.

The jobs can be independently started by: sudo start JOB

You can stop them by using sudo stop JOB

Restart: sudo restart JOB

View the status and PID: sudo status JOB

Logging

The log files of the processes can be found in the /var/log/upstart/ folder.

The starting/stopping of the jobs can be logged by appending this to the end of apertium-apy.conf, apertium-apy-gateway.conf and apertium-html-tools.conf files.

pre-start script
	. $CONFIG
	touch $LOGFILE
	echo "`date` $UPSTART_JOB started" >> $LOGFILE	
end script

post-stop script
	. $CONFIG
	touch $LOGFILE
	echo "`date` $UPSTART_JOB stoppped" >> $LOGFILE	
end script

TODO

  • It should be possible to set a time-out for translation threads, so if a translation is taking too long, it gets killed and the queue moves along.
  • It should use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running.
  • http://stackoverflow.com/a/487281/69663 recommends select/polling over threading (http://docs.python.org/3.3/library/socketserver.html for more on the differences) but requires either lots of manually written dispatching code (http://pymotw.com/2/select/) or a framework like Twisted.
  • some language pairs still don't work (sme-nob?)
  • hfst-proc -g doesn't work with null-flushing (or?)
  • translation cache
  • add support for ca_valencia, oc_aran and pt_BR
  • http://apy.projectjj.com/ currently shows a 404, / should show some sort of general info about the server and a link to this wiki page

Troubleshooting

If you encounter errors involving enable_pretty_logging() while starting APY, comment out the line with a leading # to solve the issue.

What was the error? This should be possible to fix / work around.

Docs