Apertium-apy
Apertium-APy stands for "Apertium API in Python". It is a simple Apertium API server written in Python, meant as a drop-in replacement for ScaleMT. It currently lives in SVN under trunk/apertium-tools/apertium-apy, where servlet.py makes up nearly all of the code. It is meant to serve front ends such as the simple one in trunk/apertium-tools/simple-html (where index.html is the main file).
Installation
First, compile and install apertium, lttoolbox and apertium-lex-tools, and compile your language pairs. See Minimal_installation_from_SVN for how to do this. APY uses Tornado as its web framework; install it with pip install tornado
or the equivalent for your environment. Then check out APY from SVN and run it:
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-apy
cd apertium-apy
export APERTIUMPATH="/path/to/apertium/svn/trunk"
./servlet.py "$APERTIUMPATH"
Optional arguments include:
- --langNamesDB: path to database of localized language names
- -port --port: port to run server on (2737 by default)
- --ssl: path to SSL certificate
Usage
APY supports three types of requests: GET, POST, and JSONP. From a browser, plain GET and POST requests work only if APY is served from the same origin as the client page, because of the same-origin policy; JSONP requests are permitted in any context, which makes them the practical choice for front ends hosted elsewhere. APY can easily be tested with curl:
curl --data "lang=kaz-tat&modes=morph&q=алдым" http://localhost:2737/perWord
Note that this sends a POST request; sending a GET request with curl or your browser is also possible.
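The same GET and POST requests can also be prepared from Python. The following is a minimal sketch using only the standard library; the helper name build_request is hypothetical, and the base URL assumes APY is running locally on the default port. It only builds the requests, so it can be read and run without a server.

```python
from urllib.parse import urlencode
from urllib.request import Request

APY_URL = "http://localhost:2737"  # assumption: APY running locally on the default port


def build_request(endpoint, params, method="GET"):
    """Build (but do not send) a GET or POST request for an APY endpoint."""
    query = urlencode(params)  # percent-encodes non-ASCII text such as Kazakh input
    if method == "POST":
        # POST carries the parameters in the request body, like `curl --data`
        return Request(f"{APY_URL}/{endpoint}", data=query.encode("ascii"), method="POST")
    # GET carries the parameters in the query string
    return Request(f"{APY_URL}/{endpoint}?{query}", method="GET")


params = {"lang": "kaz-tat", "modes": "morph", "q": "алдым"}
post = build_request("perWord", params, "POST")
get = build_request("perWord", params)
print(post.get_method(), post.full_url)
print(get.get_method(), get.full_url)
```

To actually send one of these, pass it to urllib.request.urlopen and read the response body.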
URL | Function | Parameters | Example |
---|---|---|---|
/listPairs | List available language pairs | None | $ curl http://localhost:2737/listPairs {"responseStatus": 200, "responseData": [ {"sourceLanguage": "kaz", "targetLanguage": "tat"}, {"sourceLanguage": "tat", "targetLanguage": "kaz"}, {"sourceLanguage": "mk", "targetLanguage": "en"} ], "responseDetails": null} |
/list | List available mode information | q: one of analyzers, generators or taggers | $ curl http://localhost:2737/list?q=analyzers {"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph", "tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", "kaz": "kaz-morph"} $ curl http://localhost:2737/list?q=generators {"en-es": "en-es-generador", "fin": "fin-gener", "es-en": "es-en-generador"} $ curl http://localhost:2737/list?q=taggers {"es-en": "es-en-tagger", "en-es": "en-es-tagger", "mk-en": "mk-en-tagger", "tat-kaz": "tat-kaz-tagger", "kaz-tat": "kaz-tat-tagger", "kaz": "kaz-tagger"} |
/translate | Translate text | langpair: source and target language codes separated by a vertical bar; q: text to translate | $ curl 'http://localhost:2737/translate?langpair=kaz|tat&q=Сен+бардың+ба?' output |
/analyze | Morphologically analyze text | mode: analyzer mode; q: text to analyze | $ curl --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze [["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "], ["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"], ["?/?<sent>","?"],["./.<sent>",".\n"]] |
/generate | Generate surface forms from text | mode: generator mode; q: lexical units to generate from | $ curl --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$" http://localhost:2737/generate [["сен","^сен<v><tv><imp><p2><sg>$ "]] |
/perWord | Perform morphological tasks per word | lang: language or language pair; modes: one or more of morph, tagger, biltrans, translate, joined with +; q: input text | $ curl "http://localhost:2737/perWord?lang=en-es&modes=morph&q=light" [{"analyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "input": "light"}] $ curl "http://localhost:2737/perWord?lang=en-es&modes=tagger&q=light" [{"analyses": ["light<adj><sint>"], "input": "light"}] $ curl "http://localhost:2737/perWord?lang=en-es&modes=biltrans&q=light" [{"input": "light", "translations": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}] $ curl "http://localhost:2737/perWord?lang=en-es&modes=translate&q=light" [{"input": "light", "translations": ["ligero<adj>"]}] $ curl "http://localhost:2737/perWord?lang=en-es&modes=biltrans+morph&q=light" [{"analyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "input": "light", "translations": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}] $ curl "http://localhost:2737/perWord?lang=en-es&modes=translate+tagger&q=light" [{"analyses": ["light<adj><sint>"], "input": "light", "translations": ["ligero<adj>"]}] $ curl "http://localhost:2737/perWord?lang=en-es&modes=morph+tagger&q=light" [{"ambiguousAnalyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "input": "light", "disambiguatedAnalyses": ["light<adj><sint>"]}] |
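A client will usually want to turn these JSON replies into native data structures. As a small sketch, the following parses the /listPairs sample reply from the table into (source, target) tuples; the function name parse_pairs is hypothetical, and the status check assumes that 200 is the only success value.

```python
import json

# Sample /listPairs reply, copied from the table above
sample = '''{"responseStatus": 200, "responseData": [
 {"sourceLanguage": "kaz", "targetLanguage": "tat"},
 {"sourceLanguage": "tat", "targetLanguage": "kaz"},
 {"sourceLanguage": "mk", "targetLanguage": "en"}
], "responseDetails": null}'''


def parse_pairs(body):
    """Return (source, target) tuples from a /listPairs JSON body."""
    reply = json.loads(body)
    if reply["responseStatus"] != 200:  # assumption: 200 is the only success status
        raise RuntimeError(f"APY returned status {reply['responseStatus']}")
    return [(p["sourceLanguage"], p["targetLanguage"]) for p in reply["responseData"]]


pairs = parse_pairs(sample)
print(pairs)
```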
Threading
The server currently uses a TCPServer that inherits from ThreadingMixIn, so each request is handled in its own thread. A lock around translateNULFlush, which may be used by at most one thread per pipeline at a time, keeps that part single-threaded (so that Alice does not get Bob's text).
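The pattern can be sketched with Python's socketserver module. This is not the code in servlet.py: translate_locked is a stand-in for translateNULFlush that simply upper-cases its input, and the line-based protocol, the class names and the roundtrip helper are assumptions made for illustration.

```python
import socket
import socketserver
import threading

pipeline_lock = threading.Lock()  # guards the single shared pipeline


def translate_locked(text):
    """Stand-in for translateNULFlush: one thread at a time may use the pipeline."""
    with pipeline_lock:
        return text.upper()  # placeholder for writing to and reading from the pipeline


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one line of input, "translate" it, and send one line back
        line = self.rfile.readline().decode("utf-8").rstrip("\n")
        self.wfile.write((translate_locked(line) + "\n").encode("utf-8"))


class ThreadedServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    daemon_threads = True       # handler threads do not block interpreter exit
    allow_reuse_address = True


def roundtrip(text):
    """Start a server on a free local port, send one line, return the reply."""
    with ThreadedServer(("127.0.0.1", 0), Handler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        with socket.create_connection(server.server_address) as conn:
            conn.sendall((text + "\n").encode("utf-8"))
            reply = conn.makefile(encoding="utf-8").readline().rstrip("\n")
        server.shutdown()
    return reply


if __name__ == "__main__":
    print(roundtrip("men ikke den"))
```

Requests are accepted and parsed concurrently; only the call inside the with block is serialized.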
Try it out
To test this, run for example:
export APERTIUMPATH="/path/to/svn/trunk"
python3 servlet.py "$APERTIUMPATH" 2737 &
curl -s --data-urlencode 'langpair=nb|nn' --data-urlencode \
    'q@/tmp/reallybigfile' 'http://localhost:2737/translate' >/tmp/output &
curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
The last three requests should (after a slight wait) start producing output before the first request is done.
Morphological Analysis and Generation
To analyze text, send a POST or GET request to /analyze with the parameters mode and q set. For example:
$ curl --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"],["?/?<sent>","?"],["./.<sent>",".\n"]]
The JSON response is a list of lists, each of the form [analysis with following non-analyzed text*, original input token]. To get a list of valid analyzer modes, send a request to /listAnalyzers.
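To work with such a response in a client, each entry can be split into its surface form and its analyses. The sketch below does this for the sample response; parse_analyses is a hypothetical helper, and it assumes that the analyses are separated by plain / characters as in the sample (escaped slashes inside a form are not handled).

```python
import json

# Sample /analyze response, copied from the example above
sample = ('[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],'
          '["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"],'
          '["?/?<sent>","?"],["./.<sent>",".\\n"]]')


def parse_analyses(body):
    """Split each [analysis, original] entry into surface form and analyses.

    Assumes '/' only ever separates the surface form and the analyses.
    """
    result = []
    for unit, original in json.loads(body):
        surface, *analyses = unit.split("/")
        result.append({"input": original, "surface": surface, "analyses": analyses})
    return result


parsed = parse_analyses(sample)
for entry in parsed:
    print(entry["surface"], entry["analyses"])
```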
To generate surface forms from an analysis, send a POST or GET request to /generate with the parameters mode and q set. For example:
$ curl --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$+^сен<v><tv><imp><p2><pl>$" http://localhost:2737/generate
[["сен ","^сен<v><tv><imp><p2><sg>$ "],["сеніңдер","^сен<v><tv><imp><p2><pl>$"]]
The JSON response is a list of lists, each of the form [generated form with following non-analyzed text*, original lexical unit input]. To get a list of valid generator modes, send a request to /listGenerators.
* e.g. whitespace, superblanks
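When building /generate requests in a program, the lexical units contain characters (^, <, >, $) that should be percent-encoded. A minimal sketch, assuming the ^lemma<tag>...$ format seen in the example; lexical_unit is a hypothetical helper name.

```python
from urllib.parse import urlencode


def lexical_unit(lemma, tags):
    """Format a lemma and its tags as a lexical unit: ^lemma<tag1><tag2>...$"""
    return "^" + lemma + "".join(f"<{t}>" for t in tags) + "$"


units = [
    lexical_unit("сен", ["v", "tv", "imp", "p2", "sg"]),
    lexical_unit("сен", ["v", "tv", "imp", "p2", "pl"]),
]
# urlencode escapes ^, <, > and $, and turns the separating space into '+'
body = urlencode({"mode": "kaz", "q": " ".join(units)})
print(units[0])
print(body)
```

The resulting string can be used as a POST body or appended to the /generate URL as a query string.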
SSL
To test with a self-signed certificate, generate one:
openssl req -new -x509 -keyout server.pem -out server.pem -days 365 -nodes
Then run with --ssl server.pem, and test using https and curl's -k option (-k makes curl accept self-signed or otherwise unverifiable certificates):
curl -k --data "mode=kaz-tat&q=Сен+бардың+ба?" https://localhost:2737/analyze
If you have a CA-signed certificate, e.g. one set up for Apache, it is likely to be split into two files, one .key and one .crt. You can concatenate them into a single file to use with servlet.py:
cat server.key server.crt > server.keycrt
Now you should be able to use curl without -k for the domain the certificate was issued for:
curl --data "mode=kaz-tat&q=Сен+бардың+ба?" https://oohlookatmeimencrypted.com:2737/analyze
Remember to open port 2737 on your server.
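A Python client needs the same choice that -k makes for curl. The following sketch uses the standard ssl module; client_context is a hypothetical helper, and disabling verification is only appropriate for local testing against a self-signed certificate.

```python
import ssl


def client_context(allow_self_signed=False):
    """Return an SSLContext for HTTPS requests to APY."""
    context = ssl.create_default_context()  # verifies certificates by default
    if allow_self_signed:
        # Similar in effect to `curl -k`: skip hostname and certificate checks.
        # check_hostname must be switched off before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


insecure = client_context(allow_self_signed=True)
strict = client_context()
print(insecure.verify_mode, strict.verify_mode)
```

Either context can then be passed to urllib.request.urlopen via its context argument, for example with https://localhost:2737/listPairs when the server was started with --ssl server.pem.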
TODO
- It should be possible to set a time-out for translation threads, so if a translation is taking too long, it gets killed and the queue moves along.
- It should use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running.
- http://stackoverflow.com/a/487281/69663 recommends select/polling over threading (see http://docs.python.org/3.3/library/socketserver.html for more on the differences), but that requires either a lot of hand-written dispatching code (http://pymotw.com/2/select/) or a framework like Twisted.
- Some language pairs still don't work (sme-nob?).
- hfst-proc -g does not seem to work with null-flushing (to be confirmed).
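The "one lock per pipeline" item could look roughly like the sketch below. This is one possible approach, not existing APY code: the names lock_for and translate are hypothetical, and run_pipeline stands for whatever callable actually feeds text through a pipeline.

```python
import threading
from collections import defaultdict

# One lock per language pair instead of a single global lock.
# defaultdict creates a lock on first use; the registry lock makes that creation thread-safe.
_registry_lock = threading.Lock()
_pipeline_locks = defaultdict(threading.Lock)


def lock_for(pair):
    """Return the lock dedicated to one pipeline, e.g. 'mk-en'."""
    with _registry_lock:
        return _pipeline_locks[pair]


def translate(pair, text, run_pipeline):
    """Call run_pipeline (a hypothetical callable) while holding only that pair's lock."""
    with lock_for(pair):
        return run_pipeline(text)


# While sme-nob is busy, mk-en can still be acquired
with lock_for("sme-nob"):
    free = lock_for("mk-en").acquire(blocking=False)
    if free:
        lock_for("mk-en").release()
print(free)
```

With this layout a long sme-nob translation only delays other sme-nob requests.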