Difference between revisions of "Apertium-apy"
Revision as of 12:27, 31 May 2014
Apertium-APy stands for "Apertium API in Python". It's a simple Apertium API server written in Python 3, meant as a drop-in replacement for ScaleMT. It is currently found in the SVN under trunk/apertium-tools/apertium-apy, where servlet.py contains the relevant web server bits. This is meant for front ends like apertium-html-tools.
The http://apertium.org page uses the installation at http://apy.projectjj.com, which currently runs only released language pairs. However, APY is very easy to set up on your own server, where you can run all the development pairs and even analysers and taggers; read on for how to do that.
Installation
First, compile and install apertium/lttoolbox/apertium-lex-tools, and compile your language pairs. See Minimal_installation_from_SVN for how to do this.
APY uses Tornado as its web framework. Ensure that you install the Python 3 versions of any dependencies. On Debian/Ubuntu, you can do
sudo apt-get install python3-tornado
Or you can install it via pip install tornado, or other variants depending on your environment.
Then check out APY from SVN and run it:
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-apy
cd apertium-apy
./servlet.py /usr/local/share/apertium # all .mode files from under this directory are included
Optional arguments include:
- -l --lang-names: path to sqlite database of localized language names (unicode.db by default)
- -p --port: port to run server on (2737 by default)
- -c --ssl-cert: path to SSL certificate
- -k --ssl-key: path to SSL key file
- -x --num-processes: number of child processes (default to number of cores)
- -s --nonpairs-path: include .mode files from this directory, as with the main argument, but skip translator (pair) modes and include only analyser/tagger/generator modes from this directory (handy for use with apertium SVN)
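To make the shape of the command line easier to see, here is a rough argparse sketch that mirrors the options listed above. It is not taken from servlet.py: the function name build_parser, the destination names and the help strings are assumptions made for illustration, and only the flags and defaults come from the list.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not APY's own code)."""
    parser = argparse.ArgumentParser(description="sketch of an APY-like command line")
    # Main positional argument: all .mode files under this directory are included.
    parser.add_argument("pairs_path", help="directory searched for .mode files")
    parser.add_argument("-l", "--lang-names", default="unicode.db",
                        help="sqlite database of localized language names")
    parser.add_argument("-p", "--port", type=int, default=2737,
                        help="port to run the server on")
    parser.add_argument("-c", "--ssl-cert", help="path to SSL certificate")
    parser.add_argument("-k", "--ssl-key", help="path to SSL key file")
    # Assumption: "number of cores" is approximated here with os.cpu_count().
    parser.add_argument("-x", "--num-processes", type=int, default=os.cpu_count(),
                        help="number of child processes")
    parser.add_argument("-s", "--nonpairs-path",
                        help="directory whose non-pair modes are included")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["/usr/local/share/apertium", "-p", "8080"])
    print(args.pairs_path, args.port, args.lang_names)
```

Running it with only the positional argument leaves every option at the default given in the list.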
Optional features
List localised language names
If you have sqlite3, you can do
make
to create the unicode.db used for the /listLanguageNames function.
Language identification
The /identifyLang function can provide language identification.
If you install CLD2, you get fast and fairly accurate language detection; see the section Apertium-apy#CLD2_for_better_language_detection.
Alternatively, you can use -s to point to a directory of language pairs with analyser modes, in which case APY will try to do language detection by analysing the text and finding which analyser had the least unknowns. This is a bit slow though :-)
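The analyser-based fallback can be pictured roughly as follows. This is only a sketch of the idea, not APY's actual code. It assumes each analyser's output is already available as a string in Apertium stream format, where an unknown word is marked with a * right after the slash (for example ^foo/*foo$). The names unknown_ratio and guess_language, and the sample analyses, are made up for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading(/reading...)$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def unknown_ratio(analysed: str) -> float:
    """Fraction of lexical units whose analysis starts with '*' (unknown word)."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 1.0  # nothing was analysed at all: treat as fully unknown
    unknown = sum(1 for _surface, readings in units if readings.startswith("*"))
    return unknown / len(units)


def guess_language(analyses: dict) -> str:
    """Pick the language whose analyser left the fewest unknown words.

    `analyses` maps a language code to that analyser's output for the same
    input text (a hypothetical structure, for illustration only).
    """
    return min(analyses, key=lambda lang: unknown_ratio(analyses[lang]))


if __name__ == "__main__":
    outputs = {
        "eng": "^this/this<prn><dem><sg>$ ^is/be<vbser><pri><p3><sg>$ ^text/text<n><sg>$",
        "kaz": "^this/*this$ ^is/*is$ ^text/*text$",
    }
    print(guess_language(outputs))
```

Because every candidate analyser has to run over the whole input before the ratios can be compared, the cost grows with the number of installed analysers, which fits the remark above about this approach being slow.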
Usage
APY supports three types of requests: GET, POST, and JSONP. From a browser, GET and POST are possible only if APY is served from the same origin as the client page, because of the same-origin policy; JSONP requests are permitted from any origin and are therefore the option to use for cross-site front ends. Using curl, APY can easily be tested:
curl -G --data "lang=kaz-tat&modes=morph&q=алдым" http://localhost:2737/perWord
It can also be tested through your browser or through HTTP calls. Unfortunately, curl does not decode JSON output by default, so to make testing easier, an APY Sandbox is provided in the SVN with Apertium HTML-Tools at /trunk/apertium-tools/apertium-html-tools.
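Since curl prints the raw JSON, a few lines of Python can stand in for it when decoded output is wanted. The sketch below uses only the standard library. The helper names build_url, pairs_from_response and fetch are illustrative, not part of APY, and the sample response is a shortened copy of the kind of /listPairs output documented on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base: str, endpoint: str, **params: str) -> str:
    """Build a GET URL for an APY endpoint, percent-encoding the parameters."""
    query = urlencode(params)
    url = f"{base.rstrip('/')}/{endpoint.lstrip('/')}"
    return f"{url}?{query}" if query else url


def pairs_from_response(body: str) -> list:
    """Decode a /listPairs response into (source, target) tuples."""
    data = json.loads(body)
    return [(p["sourceLanguage"], p["targetLanguage"]) for p in data["responseData"]]


def fetch(url: str) -> str:
    """Fetch a URL and return the body as text (needs a running APY server)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Same request as the curl example above, built from Python.
    print(build_url("http://localhost:2737", "perWord",
                    lang="kaz-tat", modes="morph", q="алдым"))
    sample = ('{"responseStatus": 200, "responseData": ['
              '{"sourceLanguage": "kaz", "targetLanguage": "tat"}, '
              '{"sourceLanguage": "tat", "targetLanguage": "kaz"}], '
              '"responseDetails": null}')
    print(pairs_from_response(sample))
```

With a server running locally, fetch(build_url("http://localhost:2737", "listPairs")) would return the body that pairs_from_response expects.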
URL | Function | Parameters | Output |
---|---|---|---|
/listPairs | List available language pairs | None | To be consistent with ScaleMT, the returned JS Object contains a responseData key with an Array of language pair objects with keys sourceLanguage and targetLanguage.
$ curl 'http://localhost:2737/listPairs' {"responseStatus": 200, "responseData": [ {"sourceLanguage": "kaz", "targetLanguage": "tat"}, {"sourceLanguage": "tat", "targetLanguage": "kaz"}, {"sourceLanguage": "mk", "targetLanguage": "en"} ], "responseDetails": null} |
/list | List available mode information |
|
The returned JS Object contains a mapping from language pairs to mode names (used internally by Apertium).
$ curl 'http://localhost:2737/list?q=analyzers' {"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph", "tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", "kaz": "kaz-morph"} $ curl 'http://localhost:2737/list?q=generators' {"en-es": "en-es-generador", "fin": "fin-gener", "es-en": "es-en-generador"} $ curl 'http://localhost:2737/list?q=taggers' {"es-en": "es-en-tagger", "en-es": "en-es-tagger", "mk-en": "mk-en-tagger", "tat-kaz": "tat-kaz-tagger", "kaz-tat": "kaz-tat-tagger", "kaz": "kaz-tagger"} |
/translate | Translate text |
|
To be consistent with ScaleMT, the returned JS Object contains a responseData key with a JS Object whose translatedText key contains the translated text.
$ curl 'http://localhost:2737/translate?langpair=kaz|tat&q=Сен+бардың+ба?' {"responseStatus": 200, "responseData": {"translatedText": "Син барныңмы?"}, "responseDetails": null} |
/analyze or /analyse | Morphologically analyze text |
|
The returned JS Array contains JS Arrays in the format [analysis, input-text] .
$ curl -G --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze [["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "], ["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"], ["?/?<sent>","?"]] |
/generate | Generate surface forms from text |
|
The returned JS Array contains JS Arrays in the format [generated, input-text] .
$ curl -G --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$" http://localhost:2737/generate [["сен","^сен<v><tv><imp><p2><sg>$ "]] |
/perWord | Perform morphological tasks per word |
|
The returned JS Array contains JS Objects, each containing the key input and up to four other keys corresponding to the requested modes (tagger, morph, biltrans and translate).
curl 'http://localhost:2737/perWord?lang=en-es&modes=morph&q=let+there+be+light' [{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"]}, {"input": "there", "morph": ["there<adv>"]}, {"input": "be", "morph": ["be<vbser><inf>"]}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger&q=let+there+be+light' [{"input": "let", "tagger": "let<vblex><pp>"}, {"input": "there", "tagger": "there<adv>"}, {"input": "be", "tagger": "be<vbser><inf>"}, {"input": "light", "tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+tagger&q=let+there+be+light' [{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "morph": ["there<adv>"], "tagger": "there<adv>"}, {"input": "be", "morph": ["be<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=translate&q=let+there+be+light' [{"input": "let", "translate": ["dejar<vblex><pp>"]}, {"input": "there", "translate": ["all\u00ed<adv>"]}, {"input": "be", "translate": ["ser<vbser><inf>"]}, {"input": "light", "translate": ["ligero<adj>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=biltrans&q=let+there+be+light' [{"input": "let", "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=translate+biltrans&q=let+there+be+light' [{"input": "let", "translate": 
["dejar<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "translate": ["all\u00ed<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "translate": ["ser<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "translate": ["ligero<adj>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+biltrans&q=let+there+be+light' [{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger+biltrans&q=let+there+be+light' [{"input": "let", "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=tagger+translate&q=let+there+be+light' [{"input": "let", "translate": ["dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "translate": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "translate": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "translate": 
["ligero<adj>"], "tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate&q=let+there+be+light' [{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"]}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"]}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"]}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=translate+biltrans+tagger&q=let+there+be+light' [{"input": "let", "translate": ["dejar<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "translate": ["all\u00ed<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "translate": ["ser<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "translate": ["ligero<adj>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+biltrans+tagger&q=let+there+be+light' [{"input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, {"input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], 
"tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+tagger&q=let+there+be+light' [{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "tagger": "there<adv>"}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "tagger": "light<adj><sint>"}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+biltrans&q=let+there+be+light' [{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"]}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"]}, {"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"]}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}] curl 'http://localhost:2737/perWord?lang=en-es&modes=morph+translate+biltrans+tagger&q=let+there+be+light' [{"translate": ["dejar<vblex><pp>"], "input": "let", "morph": ["let<vblex><inf>", "let<vblex><pres>", "let<vblex><past>", "let<vblex><pp>"], "biltrans": ["dejar<vblex><inf>", "dejar<vblex><pres>", "dejar<vblex><past>", "dejar<vblex><pp>"], "tagger": "let<vblex><pp>"}, {"translate": ["all\u00ed<adv>"], "input": "there", "morph": ["there<adv>"], "biltrans": ["all\u00ed<adv>"], "tagger": "there<adv>"}, 
{"translate": ["ser<vbser><inf>"], "input": "be", "morph": ["be<vbser><inf>"], "biltrans": ["ser<vbser><inf>"], "tagger": "be<vbser><inf>"}, {"translate": ["ligero<adj>"], "input": "light", "morph": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "biltrans": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"], "tagger": "light<adj><sint>"}] |
/listLanguageNames | Get localized language names |
|
The returned JS Object contains a mapping of requested language codes to localized language names.
$ curl 'http://localhost:2737/listLanguageNames?locale=fr&languages=ca+en+mk+tat+kk' {"ca": "catalan", "en": "anglais", "kk": "kazakh", "mk": "macédonien", "tat": "tatar"} |
/coverage | Get coverage of a language on a text |
|
The returned JS Array contains a single floating point value ≤ 1 that indicates the coverage.
$ curl 'http://localhost:2737/getCoverage?mode=en-es&q=Whereas disregard and contempt for which have outraged the conscience of mankind' [0.9230769230769231] |
/identifyLang | Return a list of languages with probabilities of the text being in that language. Uses CLD2 if that's installed, otherwise will try any analyser modes. |
|
The returned JS Object contains a mapping from language codes to probabilities.
$ curl 'http://localhost:2737/identifyLang?q=This+is+a+piece+of+text.' {"ca": 0.19384234, "en": 0.98792465234, "kk": 0.293442432, "zh": 0.002931001} |
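As an illustration of consuming the /analyze output described in the table, the following sketch turns the array of [analysis, input-text] pairs into a friendlier structure. The function names are hypothetical, and the splitting is deliberately naive: it assumes that a / never occurs escaped inside a surface form or reading.

```python
import json


def split_analysis(analysis: str):
    """Split 'surface/reading1/reading2' into (surface, [readings]).

    Simplification: assumes '/' never occurs escaped inside a form.
    """
    surface, *readings = analysis.split("/")
    return surface, readings


def parse_analyze_response(body: str) -> list:
    """Turn the JSON array of [analysis, input-text] pairs into dicts."""
    result = []
    for analysis, input_text in json.loads(body):
        surface, readings = split_analysis(analysis)
        result.append({"input": input_text, "surface": surface, "readings": readings})
    return result


if __name__ == "__main__":
    # Shortened from the /analyze example in the table.
    body = '[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "], ["?/?<sent>","?"]]'
    for unit in parse_analyze_response(body):
        print(unit["surface"], unit["readings"])
```

Each word then carries a list of candidate readings, which is the ambiguity the tagger mode of /perWord resolves to a single reading.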
CLD2 for better language detection
APY uses Compact Language Detection 2 for language detection if it's available (otherwise, it will try to use analyser modes if any are available).
See http://blog.xanda.org/2014/04/02/installing-compact-language-detection-2-cld2-on-ubuntu/ on how to install this on Ubuntu.
On Arch Linux, install python-cld2-hg from AUR.
SSL
APY supports HTTPS out of the box. To test with a self-signed certificate, create a certificate and key by running:
openssl req -new -x509 -keyout server.key -out server.crt -days 365 -nodes
Then run APY with --ssl-key server.key --ssl-cert server.crt (the option names listed under Installation), and test with HTTPS and the -k argument to curl (-k means curl accepts self-signed or even slightly "lying" certificates):
curl -k -G --data "mode=kaz-tat&q=Сен+бардың+ба?" https://localhost:2737/analyze
If you have a real signed certificate, you should be able to use curl without -k for the domain which the certificate is signed for:
curl -G --data "mode=kaz-tat&q=Сен+бардың+ба?" https://oohlookatmeimencrypted.com:2737/analyze
Remember to open port 2737 to your server.
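For clients written in Python rather than curl, the equivalent of the -k switch is an SSL context with verification turned off. The sketch below uses only the standard library. The names make_context and fetch are illustrative, and disabling verification is meant for testing against a self-signed certificate only.

```python
import ssl
from urllib.request import urlopen


def make_context(verify: bool = True) -> ssl.SSLContext:
    """Return an SSL context; verify=False mimics curl -k (testing only)."""
    ctx = ssl.create_default_context()
    if not verify:
        # Order matters: hostname checking must be off before CERT_NONE is set.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url: str, verify: bool = True) -> str:
    """GET an HTTPS URL and return the body (needs a running APY server)."""
    with urlopen(url, context=make_context(verify)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    ctx = make_context(verify=False)
    print(ctx.verify_mode == ssl.CERT_NONE, ctx.check_hostname)
```

With a properly signed certificate, the default verify=True path should work unchanged for the domain the certificate was issued for.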
Gateway
A gateway for APY is located in the same SVN directory. It provides functionality such as silently intercepting and forwarding requests, and aggregating the capabilities of several APY instances in order to override /list requests. For example, a gateway providing access to two servers with different capabilities, in terms of language pairs, will report the aggregated capabilities to the client, hiding the existence of two servers.
A list of APY servers is a required positional argument; an example server list is provided in the same SVN directory. If the gateway is asked to run on an already occupied port, it will traverse the available ports until it can bind to a free one.
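The port fallback can be sketched with plain sockets. This is not the gateway's own code: the function name bind_first_free, the upward scan from the requested port and the limit of 100 attempts are all assumptions chosen for the example.

```python
import socket


def bind_first_free(start_port: int, host: str = "127.0.0.1",
                    max_tries: int = 100) -> socket.socket:
    """Bind to start_port, or to the next higher port that is free."""
    last_port = min(start_port + max_tries, 65536)
    for port in range(start_port, last_port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port is occupied (or not permitted): close and try the next one.
            sock.close()
            continue
        return sock
    raise OSError(f"no free port in range {start_port}-{last_port - 1}")


if __name__ == "__main__":
    server = bind_first_free(2737)
    print("bound to port", server.getsockname()[1])
    server.close()
```

Returning the bound socket, rather than just a port number, avoids a race in which another process grabs the port between the check and the actual bind.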
The gateway currently uses a "Fastest" load-balancing paradigm that continuously adapts to changing circumstances by basing its routing on the requests it has observed. On initialization, all servers are assigned a weight of 0; consequently, each server will eventually be used as the gateway determines the server speeds. The gateway stores a moving average of the last x requests for each (mode, language) pair and forwards requests to the fastest server, as measured in units of response time per response length.
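A minimal sketch of this routing scheme, assuming a fixed window of x samples per server and (mode, language) pair, might look as follows. The class FastestBalancer and its method names are invented for the example and do not come from gateway.py.

```python
from collections import defaultdict, deque


class FastestBalancer:
    """Route each request to the server with the lowest moving-average cost."""

    def __init__(self, servers, window=10):
        self.servers = list(servers)
        # (server, mode, language) -> last `window` samples of time per length
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def weight(self, server, mode, language):
        """Moving average of response time per response length; 0 if unseen."""
        history = self.samples[(server, mode, language)]
        return sum(history) / len(history) if history else 0.0

    def pick(self, mode, language):
        """Choose the server with the smallest weight for this kind of request."""
        return min(self.servers, key=lambda s: self.weight(s, mode, language))

    def record(self, server, mode, language, seconds, length):
        """Store one observation after a response has come back."""
        self.samples[(server, mode, language)].append(seconds / max(length, 1))


if __name__ == "__main__":
    balancer = FastestBalancer(["http://a:2737", "http://b:2737"], window=3)
    first = balancer.pick("translate", "kaz-tat")
    balancer.record(first, "translate", "kaz-tat", seconds=1.0, length=10)
    print(balancer.pick("translate", "kaz-tat"))
```

Because an unseen server has weight 0, it always beats any server that has already been measured, so every server gets tried at least once before the averages start to decide.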
Upstart scripts
You can use upstart scripts to automatically run APY and html-tools on startup and to respawn the processes when they get killed. If you don't have upstart installed: sudo apt-get install upstart
The apertiumconfig file contains the paths of some Apertium directories and of the serverlist file. It can be saved anywhere, as long as the CONFIG variable in each upstart script below points to it. Make sure the paths are correct!
/home/user/apertiumconfig
APERTIUMPATH=/home/user
APYPATH=/home/user/apertium-apy
SERVERLIST=/home/user/serverlist
HTMLTOOLSPATH=/home/user/apertium-html-tools
#optional, see 'Logging':
LOGFILE=/home/user/apertiumlog
The following upstart scripts have to be saved in /etc/init.
apertium-all.conf
description "start/stop all apertium services"
start on startup
apertium-apy.conf
description "apertium-apy init script"
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
env CONFIG=/etc/default/apertium
script
    . $CONFIG
    python3 $APYPATH/servlet.py $APERTIUMPATH
end script
apertium-apy-gateway.conf
description "apertium-apy gateway init script"
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
env CONFIG=/home/user/apertiumconfig
script
    . $CONFIG
    python3 $APYPATH/gateway.py $SERVERLIST
end script
apertium-html-tools.conf
description "apertium-html-tools init script"
start on starting apertium-all
stop on stopped apertium-all
respawn
respawn limit 50 300
env CONFIG=/etc/default/apertium
script
    . $CONFIG
    cd $HTMLTOOLSPATH
    python3 -m http.server 8888
end script
Use sudo start apertium-all to start all services. Just like the filenames, the jobs are called apertium-apy, apertium-apy-gateway and apertium-html-tools.
The jobs can be independently started by: sudo start JOB
You can stop them by using sudo stop JOB
Restart: sudo restart JOB
View the status and PID: sudo status JOB
Logging
The log files of the processes can be found in the /var/log/upstart/ folder.
The starting and stopping of the jobs can be logged by appending the following to the end of the apertium-apy.conf, apertium-apy-gateway.conf and apertium-html-tools.conf files.
pre-start script
    . $CONFIG
    touch $LOGFILE
    echo "`date` $UPSTART_JOB started" >> $LOGFILE
end script
post-stop script
    . $CONFIG
    touch $LOGFILE
    echo "`date` $UPSTART_JOB stopped" >> $LOGFILE
end script
TODO
- It should be possible to set a time-out for translation threads, so if a translation is taking too long, it gets killed and the queue moves along.
- It should use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running.
- http://stackoverflow.com/a/487281/69663 recommends select/polling over threading (http://docs.python.org/3.3/library/socketserver.html for more on the differences) but requires either lots of manually written dispatching code (http://pymotw.com/2/select/) or a framework like Twisted.
- some language pairs still don't work (sme-nob?)
- hfst-proc -g doesn't work with null-flushing (or?)
- translation cache
- add support for ca_valencia, oc_aran and pt_BR
- http://apy.projectjj.com/ currently shows a 404, / should show some sort of general info about the server and a link to this wiki page
Troubleshooting
If you encounter errors involving enable_pretty_logging() while starting APY, comment out the line that calls it by adding a leading # to work around the issue.
- What was the error? This should be possible to fix / work around.