Apertium-apy
Apertium-APy stands for "Apertium API in Python". It's a simple Apertium API server written in Python 3, meant as a drop-in replacement for ScaleMT. It is mainly intended to serve requests from web applications, though it's fairly versatile. It is currently found in the SVN under trunk/apertium-tools/apertium-apy, where servlet.py contains the relevant web server bits. The server is used by front ends like apertium-html-tools (on apertium.org) and Mediawiki Content Translation.
The http://apertium.org page uses the installation at http://apy.projectjj.com which currently only runs released language pairs. However, APY is very easy to set up on your own server, where you can run all the development pairs and even analysers and taggers (like what http://turkic.apertium.org does). Read on for how to do that.
Installation
First, install apertium/lttoolbox/apertium-lex-tools, and your language pairs. See Installation for how to do this.
You will need Python 3.3 or newer.
APY uses Tornado 4 + toro as its web framework. Ensure that you install the Python 3 versions of any dependencies. On Debian/Ubuntu, you can do
sudo apt-get install build-essential python3-dev python3-pip zlib1g-dev
sudo pip3 install --upgrade toro tornado
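Before starting the server it can help to confirm that the prerequisites above are in place. The following is a minimal sketch of such a check and is not part of APY; the function name check_prerequisites and the constants are made up for illustration.

```python
import importlib
import sys

MIN_PYTHON = (3, 3)  # minimum version stated above
REQUIRED_MODULES = ("tornado", "toro")  # dependencies named above


def check_prerequisites(version_info=sys.version_info, modules=REQUIRED_MODULES):
    """Return a list of problems; an empty list means everything was found."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            problems.append("missing Python module: %s" % name)
    return problems


if __name__ == "__main__":
    # Print one line per problem; no output means the check passed.
    for problem in check_prerequisites():
        print(problem)
```

Run it with the same python3 interpreter that will run servlet.py, since the modules have to be installed for that interpreter.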
Then check out APY from SVN and run it:
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-apy
cd apertium-apy
./servlet.py /usr/local/share/apertium # the server will use all .mode files from under this directory
Optional arguments include:
- -l --lang-names: path to sqlite3 database of localized language names (see #List localised language names; you should include this if you're using apertium-html-tools)
- -p --port: port to run server on (2737 by default)
- -c --ssl-cert: path to SSL certificate
- -k --ssl-key: path to SSL key file
- -j --num-processes: number of http processes to run (default = 1; use 0 to run one http server per core, where each http server runs all available language pairs)
- -s --nonpairs-path: include .mode files from this directory, like with the main arg, but skip translator (pair) modes and only include analyser/tagger/generator modes from this directory (handy for use with apertium SVN)
- -f --missing-freqs: path to sqlite3 database of words that were unknown (requires sudo apt-get install sqlite3)
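When APY is launched from another script, the options above can be assembled programmatically. This is a minimal sketch that only uses the long flags listed above; the helper name build_apy_command and its keyword names are hypothetical.

```python
def build_apy_command(modes_dir, lang_names=None, port=None, ssl_cert=None,
                      ssl_key=None, num_processes=None, nonpairs_path=None,
                      missing_freqs=None):
    """Assemble an argument list for servlet.py from the options listed above."""
    command = ["./servlet.py"]
    options = [
        ("--lang-names", lang_names),
        ("--port", port),
        ("--ssl-cert", ssl_cert),
        ("--ssl-key", ssl_key),
        ("--num-processes", num_processes),
        ("--nonpairs-path", nonpairs_path),
        ("--missing-freqs", missing_freqs),
    ]
    for flag, value in options:
        # Compare against None so that a meaningful 0 (e.g. -j 0) is kept.
        if value is not None:
            command += [flag, str(value)]
    command.append(modes_dir)  # positional argument: directory with .mode files
    return command


if __name__ == "__main__":
    print(" ".join(build_apy_command("/usr/local/share/apertium", port=2737)))
```

The resulting list can be passed to subprocess.call() from the apertium-apy directory.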
Installing dependencies without root
If you don't have root, you can still install the python dependencies with
$ pip3 install --user --upgrade toro tornado
(But your server still needs build-essential python3-dev python3-pip zlib1g-dev installed.)
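Then you just need to run
PYTHONPATH="/usr/local/lib/python3.2/site-packages:${PYTHONPATH}"; export PYTHONPATH
before starting APY (adjust the path to match your Python version and the directory the packages were actually installed into).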
Optional features
List localised language names
If you use apertium-html-tools, you probably want localised language names instead of three-letter codes. To get this, first install sqlite3 (on Debian/Ubuntu that's sudo apt-get install sqlite3), then do
make
to create the langNames.db used for the /listLanguageNames function.
Language identification
The /identifyLang function can provide language identification.
If you install Compact Language Detection 2 (CLD2), you get fast and fairly accurate language detection. Installation can be a bit tricky though.
- Ubuntu: see http://blog.xanda.org/2014/04/02/installing-compact-language-detection-2-cld2-on-ubuntu/
- Arch Linux: install python-cld2-hg from AUR.
Alternatively, you can start servlet.py with the -s argument pointing to a directory of language pairs with analyser modes, in which case APY will try to do language detection by analysing the text and finding which analyser had the least unknowns. This is a bit slow though :-)
APY will prefer using CLD2 if it's available, otherwise fall back to analyser coverage.
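The analyser-coverage fallback can be pictured as follows. This is a minimal sketch of the idea, not APY's actual code: it assumes each analyser returns [analysis, input-text] pairs as /analyze does, and that unknown words carry a reading starting with * (the usual Apertium convention). The function names and the sample data for "eng" are invented for illustration.

```python
def is_unknown(analysis):
    """True if an analysis like 'word/*word' has no known reading."""
    readings = analysis.split("/")[1:]
    return not readings or all(r.startswith("*") for r in readings)


def coverage(analyses):
    """Fraction of tokens that received at least one known reading."""
    if not analyses:
        return 0.0
    known = sum(1 for analysis, _surface in analyses if not is_unknown(analysis))
    return known / len(analyses)


def guess_language(analyses_by_lang):
    """Pick the language whose analyser left the fewest unknowns."""
    return max(analyses_by_lang, key=lambda lang: coverage(analyses_by_lang[lang]))


if __name__ == "__main__":
    sample = {
        "kaz": [["Сен/сен<prn><pers><p2><sg><nom>", "Сен "], ["?/?<sent>", "?"]],
        "eng": [["Сен/*Сен", "Сен "], ["?/?<sent>", "?"]],  # invented output
    }
    print(guess_language(sample))
```

Every analyser has to process the whole text before a winner can be picked, which is why this approach is slower than CLD2.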
Usage
APY supports three types of requests: GET, POST, and JSONP. From browser JavaScript, plain GET/POST requests are possible only if APY is running on the same server as the client, due to cross-site scripting restrictions (the browser's same-origin policy); JSONP requests, however, are permitted in any context. APY can easily be tested using curl:
curl -G --data "lang=kaz-tat&modes=morph&q=алдым" http://localhost:2737/perWord
It can also be tested through your browser or any other HTTP client. Unfortunately, curl does not decode JSON output by default, so to make testing easier an APY Sandbox is provided in the SVN with Apertium-html-tools.
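The same kind of request can be made from Python, which also takes care of decoding the JSON. This is a minimal sketch using only the standard library; the helper names build_url and call_apy are hypothetical, and the base URL assumes a local server on the default port.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

APY_URL = "http://localhost:2737"  # assumes a local APY on the default port


def build_url(endpoint, **params):
    """Build a GET URL for an APY endpoint, URL-encoding the parameters."""
    return "%s/%s?%s" % (APY_URL, endpoint.lstrip("/"), urlencode(params))


def call_apy(endpoint, **params):
    """Send a GET request to APY and return the decoded JSON response."""
    with urlopen(build_url(endpoint, **params)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Equivalent to the curl call above; needs a running APY with kaz-tat.
    print(call_apy("perWord", lang="kaz-tat", modes="morph", q="алдым"))
```

urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into +, so queries can be passed as ordinary Python strings.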
URL | Function | Parameters | Output |
---|---|---|---|
/listPairs | List available language pairs |
|
To be consistent with ScaleMT, the returned JS Object contains a responseData key with an Array of language pair objects with keys sourceLanguage and targetLanguage.
$ curl 'http://localhost:2737/listPairs'
{"responseStatus": 200, "responseData": [ {"sourceLanguage": "kaz", "targetLanguage": "tat"}, {"sourceLanguage": "tat", "targetLanguage": "kaz"}, {"sourceLanguage": "mk", "targetLanguage": "en"} ], "responseDetails": null} |
/list | List available mode information |
|
The returned JS Object contains a mapping from language pairs to mode names (used internally by Apertium).
$ curl 'http://localhost:2737/list?q=analyzers'
{"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph", "tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", "kaz": "kaz-morph"}
$ curl 'http://localhost:2737/list?q=generators'
{"en-es": "en-es-generador", "fin": "fin-gener", "es-en": "es-en-generador"}
$ curl 'http://localhost:2737/list?q=taggers'
{"es-en": "es-en-tagger", "en-es": "en-es-tagger", "mk-en": "mk-en-tagger", "tat-kaz": "tat-kaz-tagger", "kaz-tat": "kaz-tat-tagger", "kaz": "kaz-tagger"} |
/translate | Translate text |
|
To be consistent with ScaleMT, the returned JS Object contains a responseData key with a JS Object whose translatedText key holds the translated text.
$ curl 'http://localhost:2737/translate?langpair=kaz|tat&q=Сен+бардың+ба?'
{"responseStatus": 200, "responseData": {"translatedText": "Син барныңмы?"}, "responseDetails": null} |
/analyze or /analyse | Morphologically analyze text |
|
The returned JS Array contains JS Arrays in the format [analysis, input-text].
$ curl -G --data "lang=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "], ["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"], ["?/?<sent>","?"]] |
/generate | Generate surface forms from text |
|
The returned JS Array contains JS Arrays in the format [generated, input-text].
$ curl -G --data "lang=kaz&q=^сен<v><tv><imp><p2><sg>$" http://localhost:2737/generate
[["сен","^сен<v><tv><imp><p2><sg>$ "]] |
/perWord | Perform morphological tasks per word |
|
The returned JS Array contains JS Objects each containing the key input and up to 4 other keys corresponding to the requested modes (tagger, morph, biltrans and translate).
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"]}, {"input": "there", "morph": ["there<adv>"]}, {"input": "be", "morph": ["be<vbser><inf>"]}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger&q=let+there+be+light'
[{"input": "let", "tagger": "let<vblex><pp>"}, {"input": "there", "tagger": "there<adv>"}, {"input": "be", "tagger": "be<vbser><inf>"}, {"input": "light", "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+tagger&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "morph": ["there<adv>"], "tagger": "there<adv>"}, {"input": "be", "morph": ["be<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=translate&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"]}, {"input": "there", "translate": ["all\u00ed<adv>"]}, {"input": "be", "translate": ["ser<vbser><inf>"]}, {"input": "light", "translate": ["ligero<adj>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=biltrans&q=let+there+be+light'
[{"input": "let", "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=translate+biltrans&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "translate": ["all\u00ed<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "translate": ["ser<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "translate": ["ligero<adj>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+biltrans&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger+biltrans&q=let+there+be+light'
[{"input": "let", "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger+translate&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "translate": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "translate": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "translate": ["ligero<adj>"], "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"]}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"]}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"]}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=translate+biltrans+tagger&q=let+there+be+light'
[{"input": "let", "translate": ["dejar<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "translate": ["all\u00ed<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "translate": ["ser<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "translate": ["ligero<adj>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+biltrans+tagger&q=let+there+be+light'
[{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+tagger&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "tagger": "there<adv>"}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "tagger": "light<adj><sint>"}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+biltrans&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]
$ curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+biltrans+tagger&q=let+there+be+light'
[{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}] |
/listLanguageNames | Get localized language names |
|
The returned JS Object contains a mapping of requested language codes to localized language names.
$ curl 'http://localhost:2737/listLanguageNames?locale=fr&languages=ca+en+mk+tat+kk'
{"ca": "catalan", "en": "anglais", "kk": "kazakh", "mk": "macédonien", "tat": "tatar"} |
/calcCoverage | Get coverage of a language on a text |
|
The returned JS Array contains a single floating point value ≤ 1 that indicates the coverage.
$ curl 'http://localhost:2737/calcCoverage?lang=en-es&q=Whereas+disregard+and+contempt+for+which+have+outraged+the+conscience+of+mankind'
[0.9230769230769231] |
/identifyLang | Return a list of languages with probabilities of the text being in that language. Uses CLD2 if that's installed, otherwise will try any analyser modes. |
|
The returned JS Object contains a mapping from language codes to probabilities.
$ curl 'http://localhost:2737/identifyLang?q=This+is+a+piece+of+text.'
{"ca": 0.19384234, "en": 0.98792465234, "kk": 0.293442432, "zh": 0.002931001} |
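The ScaleMT-style responses of /listPairs and /translate wrap their payload in responseData, so a client usually unwraps them first. The following is a minimal sketch using the sample responses from the table; the function names are hypothetical, and treating any responseStatus other than 200 as an error is an assumption, since the table does not show what error responses look like.

```python
import json


def _unwrap(body):
    """Return responseData from a ScaleMT-style JSON body."""
    data = json.loads(body)
    if data.get("responseStatus") != 200:  # assumed error convention
        raise ValueError("APY returned status %r" % data.get("responseStatus"))
    return data["responseData"]


def parse_pairs(body):
    """Turn a /listPairs response body into (source, target) tuples."""
    return [(p["sourceLanguage"], p["targetLanguage"]) for p in _unwrap(body)]


def parse_translation(body):
    """Extract the translated text from a /translate response body."""
    return _unwrap(body)["translatedText"]


if __name__ == "__main__":
    pairs_body = ('{"responseStatus": 200, "responseData": ['
                  '{"sourceLanguage": "kaz", "targetLanguage": "tat"}, '
                  '{"sourceLanguage": "tat", "targetLanguage": "kaz"}], '
                  '"responseDetails": null}')
    print(parse_pairs(pairs_body))
```

The other endpoints in the table return plain arrays or objects, so json.loads on the body is enough for them.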
SSL
APY supports HTTPS out of the box. To test with a self-signed certificate, create a certificate and key by running:
openssl req -new -x509 -keyout server.key -out server.crt -days 365 -nodes
Then run APY with --ssl-key server.key --ssl-cert server.crt, and test with HTTPS and the -k argument to curl (-k means curl accepts self-signed or even slightly "lying" certificates):
curl -k -G --data "lang=kaz-tat&q=Сен+бардың+ба?" https://localhost:2737/analyze
If you have a real signed certificate, you should be able to use curl without -k for the domain that the certificate was issued for:
curl -G --data "lang=kaz-tat&q=Сен+бардың+ба?" https://oohlookatmeimencrypted.com:2737/analyze
Remember to open port 2737 to your server.
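A Python client can mirror the two curl invocations above. This is a minimal sketch with the standard library (it needs Python 3.4.3 or newer for the context argument of urlopen); the helper names are hypothetical, and the unverified context should only be used for testing against a self-signed certificate.

```python
import ssl
from urllib.request import urlopen


def insecure_context():
    """SSL context that skips certificate checks, like curl -k. Testing only."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, verify=True):
    """Fetch a URL over HTTPS, optionally without verifying the certificate."""
    context = ssl.create_default_context() if verify else insecure_context()
    with urlopen(url, context=context) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Needs a local APY started with --ssl-key and --ssl-cert.
    print(fetch("https://localhost:2737/listPairs", verify=False))
```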
Gateway
A gateway for APY is located in the same SVN directory. It silently intercepts and forwards requests, and aggregates the capabilities of the APY instances behind it so that it can answer /list requests itself. For example, a gateway with access to two servers that offer different language pairs will report the combined capabilities to the client, hiding the existence of the two servers.
A list of APY servers is a required positional argument; an example server list is provided in the same SVN directory. If the gateway is asked to run on an already occupied port, it will try the following ports in turn until it can bind to a free one.
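The port traversal described above can be sketched as follows. This is an illustration of the technique with plain sockets, not the gateway's actual code; the function name and the max_tries limit are assumptions.

```python
import socket


def bind_first_free_port(start_port, host="127.0.0.1", max_tries=100):
    """Try start_port, start_port + 1, ... and return (socket, port) for the
    first port that can be bound."""
    for port in range(start_port, min(start_port + max_tries, 65536)):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port is taken (or not permitted), try the next one
            continue
        return sock, port
    raise RuntimeError("no free port found starting at %d" % start_port)


if __name__ == "__main__":
    sock, port = bind_first_free_port(2737)
    print("bound to port", port)
    sock.close()
```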
The gateway currently balances load with a "fastest" strategy that continuously adapts to changing circumstances by basing its routing on measurements of the clients' requests. On initialization, all servers are assigned a weight of 0, so each server will eventually be tried as the gateway determines the server speeds. The gateway stores a moving average of the last x requests for each (mode, language) and forwards requests to the fastest server, as measured in units of response time per response length.
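The "fastest" strategy can be pictured with a small class. This is a minimal sketch of the idea as described above, not the gateway's implementation; the class name, method names and the default window size are made up.

```python
from collections import defaultdict, deque


class FastestBalancer:
    """Route each (mode, language) to the server with the lowest recent
    response time per response length."""

    def __init__(self, servers, window=10):
        self.servers = list(servers)
        # (server, mode, language) -> the last `window` time-per-length samples
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def weight(self, server, mode, language):
        """Moving average for a server; 0 until it has been measured."""
        history = self.samples[(server, mode, language)]
        return sum(history) / len(history) if history else 0.0

    def choose(self, mode, language):
        """Pick the server with the lowest weight (ties go to the first)."""
        return min(self.servers, key=lambda s: self.weight(s, mode, language))

    def record(self, server, mode, language, seconds, response_length):
        """Store one measurement after a request has completed."""
        sample = seconds / max(response_length, 1)
        self.samples[(server, mode, language)].append(sample)


if __name__ == "__main__":
    balancer = FastestBalancer(["server-a", "server-b"], window=3)
    balancer.record("server-a", "translate", "kaz-tat", 0.4, 20)
    print(balancer.choose("translate", "kaz-tat"))
```

Because unmeasured servers keep a weight of 0, they look fastest and get picked until a measurement exists, which is how every server ends up being tried.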
Upstart scripts
You can use upstart scripts to automatically run the apy and html-tools on startup and respawn the processes when they get killed. If you don't have upstart installed: sudo apt-get install upstart
The apertiumconfig file contains paths of some apertium directories and the serverlist file. It can be saved anywhere. Make sure the paths are correct!
/home/user/apertiumconfig
APERTIUMPATH=/home/user
APYPATH=/home/user/apertium-apy
SERVERLIST=/home/user/serverlist
HTMLTOOLSPATH=/home/user/apertium-html-tools
#optional, see 'Logging':
LOGFILE=/home/user/apertiumlog
The following upstart scripts have to be saved in /etc/init.
apertium-all.conf
description "start/stop all apertium services"
start on startup
apertium-apy.conf
description "apertium-apy init script"
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
env CONFIG=/home/user/apertiumconfig
script
    . $CONFIG
    python3 $APYPATH/servlet.py $APERTIUMPATH
end script
apertium-apy-gateway.conf
description "apertium-apy gateway init script"
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
env CONFIG=/home/user/apertiumconfig
script
    . $CONFIG
    python3 $APYPATH/gateway.py $SERVERLIST
end script
apertium-html-tools.conf
description "apertium-html-tools init script"
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
env CONFIG=/home/user/apertiumconfig
script
    . $CONFIG
    cd $HTMLTOOLSPATH
    python3 -m http.server 8888
end script
Use sudo start apertium-all to start all services. Just like the filenames, the jobs are called apertium-apy, apertium-apy-gateway and apertium-html-tools.
The jobs can be independently started by: sudo start JOB
You can stop them by using sudo stop JOB
Restart: sudo restart JOB
View the status and PID: sudo status JOB
Logging
The log files of the processes can be found in the /var/log/upstart/ folder.
The starting/stopping of the jobs can be logged by appending this to the end of the apertium-apy.conf, apertium-apy-gateway.conf and apertium-html-tools.conf files.
pre-start script
    . $CONFIG
    touch $LOGFILE
    echo "`date` $UPSTART_JOB started" >> $LOGFILE
end script

post-stop script
    . $CONFIG
    touch $LOGFILE
    echo "`date` $UPSTART_JOB stopped" >> $LOGFILE
end script
TODO
- hfst-proc -g and lrx-proc don't work with null-flushing, see https://sourceforge.net/p/hfst/bugs/240/ and https://sourceforge.net/p/apertium/tickets/45/
- translation cache
- variants like ca_valencia, oc_aran and pt_BR look odd on the web page?
- gateway: we need a way to have a second server running only the most popular language pairs, and a gateway that sends requests to whichever server has the requested pair. Simply doing -j2 is not a good solution, since we'd waste a lot of RAM on keeping open pipelines that are rarely used. (Or we could turn off pipelines after not being used for a while …)
Troubleshooting
If you encounter errors involving enable_pretty_logging() while starting APY, comment out the line with a leading # to solve the issue.
- What was the error? This should be possible to fix / work around.
High IO usage
If you are logging unknowns (-f / --missing-freqs), you should probably also give some value to -M (e.g. -M1000), otherwise you might get a lot of disk usage on that sqlite file.
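Why an in-memory limit reduces disk usage can be illustrated with a small buffer in front of sqlite. This page does not spell out what -M does; the sketch below assumes it caps how many unknown words are held in memory before they are written out in one batch. The class name, table layout and method names are invented for illustration and are not APY's actual schema.

```python
import sqlite3
from collections import Counter


class UnknownWordLog:
    """Buffer unknown-word counts in memory and write them in batches."""

    def __init__(self, db_path, memory_limit=1000):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS unknowns (word TEXT PRIMARY KEY, freq INTEGER)")
        self.memory_limit = memory_limit
        self.pending = Counter()

    def add(self, word):
        """Count a word in memory; flush once enough distinct words pile up."""
        self.pending[word] += 1
        if len(self.pending) >= self.memory_limit:
            self.flush()

    def flush(self):
        """Write all buffered counts in a single transaction."""
        with self.conn:  # commits on success, rolls back on error
            for word, count in self.pending.items():
                self.conn.execute(
                    "INSERT OR IGNORE INTO unknowns (word, freq) VALUES (?, 0)", (word,))
                self.conn.execute(
                    "UPDATE unknowns SET freq = freq + ? WHERE word = ?", (count, word))
        self.pending.clear()

    def stored_frequency(self, word):
        """Frequency already written to the database (ignores the buffer)."""
        row = self.conn.execute(
            "SELECT freq FROM unknowns WHERE word = ?", (word,)).fetchone()
        return row[0] if row else 0
```

With a limit of 1, every unknown word triggers its own transaction, which is the kind of behaviour that causes heavy IO.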
'return' with argument inside generator on python 3.2 or older
Traceback (most recent call last):
  File "./servlet.py", line 25, in <module>
    import translation
  File "translation.py", line 132
    return proc_reformat.communicate()[0].decode('utf-8')
SyntaxError: 'return' with argument inside generator
Solution: upgrade to Python 3.3 or newer.
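The error comes from a language change: since Python 3.3, a generator may return a value, which is delivered through the StopIteration exception, while older versions reject such code at compile time. The following small example, unrelated to APY's own code, shows the behaviour; the function names are made up.

```python
def read_chunks():
    """Generator that yields values and then returns a final result."""
    yield "first"
    yield "second"
    return "final"  # a SyntaxError on Python 3.2 and older


def drain(generator):
    """Collect the yielded items and the value carried by StopIteration."""
    items = []
    while True:
        try:
            items.append(next(generator))
        except StopIteration as stop:
            return items, stop.value


if __name__ == "__main__":
    print(drain(read_chunks()))
```

Because the check happens when the module is compiled, the failure shows up as soon as translation.py is imported, before any request is served.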