Apertium-apy
Revision as of 04:21, 20 December 2013

Apertium-APy stands for "Apertium API in Python". It is a simple Apertium API server written in Python, meant as a drop-in replacement for ScaleMT. It currently lives in the SVN repository under trunk/apertium-tools/apertium-apy, where servlet.py is essentially the entire program. It is meant for front ends like the simple one in trunk/apertium-tools/simple-html (where index.html is the main page).

Installation

First, compile and install apertium, lttoolbox, and apertium-lex-tools, and compile your language pairs; see Minimal_installation_from_SVN for how to do this. APY uses Tornado as its web framework; install it via pip install tornado (or however your environment manages Python packages). Then check out APY from SVN and run it:

svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-apy
cd apertium-apy
export APERTIUMPATH="/path/to/apertium/svn/trunk"
./servlet.py "$APERTIUMPATH"

Optional arguments include:

  • -l --langNames: path to database of localized language names
  • -p --port: port to run server on (2737 by default)
  • -c --sslCert: path to SSL certificate
  • -k --sslKey: path to SSL key file

Usage

APY supports three types of requests: GET, POST, and JSONP. Plain GET/POST requests are only possible if APY is running on the same server as the client, due to cross-site scripting restrictions; JSONP requests, however, are permitted from any origin. Using curl, APY can easily be tested:

curl -G --data "lang=kaz-tat&modes=morph&q=алдым" http://localhost:2737/perWord

It can also be tested through your browser or through HTTP calls.
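From Python, the same requests can be built with the standard library. The sketch below (it assumes an APY instance listening on localhost:2737, as in the curl examples) shows how to percent-encode the langpair and q parameters correctly, including the '|' separator and non-ASCII text:

```python
from urllib.parse import urlencode

# Build a properly percent-encoded /translate request URL.
# urlencode escapes the '|' in the language pair and the Cyrillic text in q.
params = {"langpair": "kaz|tat", "q": "Сен бардың ба?"}
url = "http://localhost:2737/translate?" + urlencode(params)
print(url)

# Against a running APY instance you could then do:
#   import urllib.request, json
#   response = json.load(urllib.request.urlopen(url))
```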

URL Function Parameters Example
/listPairs List available language pairs None
$ curl http://localhost:2737/listPairs

{"responseStatus": 200, "responseData": [
 {"sourceLanguage": "kaz", "targetLanguage": "tat"}, 
 {"sourceLanguage": "tat", "targetLanguage": "kaz"}, 
 {"sourceLanguage": "mk", "targetLanguage": "en"}
], "responseDetails": null}
/list List available mode information
  • q: type of information to list
    • pairs (alias for /listPairs)
    • analyzers/analysers
    • generators
    • taggers/disambiguators
$ curl http://localhost:2737/list?q=analyzers
{"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph", 
 "tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", "kaz": "kaz-morph"}
$ curl http://localhost:2737/list?q=generators
{"en-es": "en-es-generador", "fin": "fin-gener", "es-en": "es-en-generador"}
$ curl http://localhost:2737/list?q=taggers
{"es-en": "es-en-tagger", "en-es": "en-es-tagger", "mk-en": "mk-en-tagger",
 "tat-kaz": "tat-kaz-tagger", "kaz-tat": "kaz-tat-tagger", "kaz": "kaz-tagger"}
/translate Translate text
  • langpair: language pair to use for translation
  • q: text to translate
$ curl 'http://localhost:2737/translate?langpair=kaz|tat&q=Сен+бардың+ба?'
output
/analyze Morphologically analyze text
  • mode: language to use for analysis
  • q: text to analyze
$ curl -G --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],
 ["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"],
 ["?/?<sent>","?"],["./.<sent>",".\n"]]
/generate Generate surface forms from text
  • mode: language to use for generation
  • q: text to generate
$ curl -G --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$" http://localhost:2737/generate
[["сен","^сен<v><tv><imp><p2><sg>$ "]]
/perWord Perform morphological tasks per word
  • language: language to use for tasks
  • modes: morphological tasks to perform on text (15 combinations possible - delimit each using a space)
    • morph
    • tagger/disambig
    • biltrans
    • translate
    • biltrans+morph (in any order)
    • translate+tagger (in any order)
    • morph+tagger/morph+disambig (in any order)
  • q: text to perform tasks on
$ curl "http://localhost:2737/perWord?lang=en-es&modes=morph&q=light"
[{"analyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "input": "light"}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=tagger&q=light"
[{"analyses": ["light<adj><sint>"], "input": "light"}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=biltrans&q=light"
[{"input": "light", "translations": ["luz<n><f><sg>", 
 "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=translate&q=light"
[{"input": "light", "translations": ["ligero<adj>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=biltrans+morph&q=light"
[{"analyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"],
 "input": "light", "translations": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>",
 "encender<vblex><pres>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=translate+tagger&q=light"
[{"analyses": ["light<adj><sint>"], "input": "light", "translations": ["ligero<adj>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=morph+tagger&q=light"
[{"ambiguousAnalyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], 
"input": "light", "disambiguatedAnalyses": ["light<adj><sint>"]}]

/listLanguageNames Get localized language names
  • locale: language to get localized language names in
  • languages: list of '+' delimited language codes to retrieve localized names for (optional)
$ curl 'http://localhost:2737/listLanguageNames?locale=fr&languages=ca+en+mk+tat+kk'
{"ca": "catalan", "en": "anglais", "kk": "kazakh", "mk": "macédonien", "tat": "tatar"}
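A client consuming /perWord responses only needs to decode the JSON. As a sketch, the morph+tagger response shown above (the JSON literal is copied from that example output) can be handled like this:

```python
import json

# Sample /perWord response for modes=morph+tagger, taken from the examples above.
raw = '''[{"ambiguousAnalyses": ["light<n><sg>", "light<adj><sint>",
           "light<vblex><inf>", "light<vblex><pres>"],
           "input": "light",
           "disambiguatedAnalyses": ["light<adj><sint>"]}]'''

words = json.loads(raw)
for word in words:
    # Each entry pairs the input token with its ambiguous and disambiguated analyses.
    print(word["input"], "->", word["disambiguatedAnalyses"])
```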

SSL

APY supports HTTPS out of the box. To test with a self-signed signature, create a certificate and key by running:

openssl req -new -x509 -keyout server.key -out server.crt -days 365 -nodes

Then run APY with --sslKey server.key --sslCert server.crt, and test over HTTPS with curl's -k flag (-k makes curl accept self-signed or otherwise untrusted certificates):

curl -k -G --data "mode=kaz-tat&q=Сен+бардың+ба?" https://localhost:2737/analyze

If you have a real signed certificate, you should be able to use curl without -k for the domain which the certificate is signed for:

curl -G --data "mode=kaz-tat&q=Сен+бардың+ба?" https://oohlookatmeimencrypted.com:2737/analyze

Remember to open port 2737 to your server.
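The Python equivalent of curl's -k flag is an SSL context with certificate verification disabled; a minimal sketch (for testing against a self-signed certificate only, never in production):

```python
import ssl

# Build an SSL context that skips certificate verification,
# the urllib analogue of `curl -k`. Testing only!
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

# Against a running HTTPS APY instance you could then do:
#   import urllib.request
#   urllib.request.urlopen("https://localhost:2737/listPairs", context=context)
```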

Threading

Currently APY uses a TCPServer with the ThreadingMixIn. A lock around translateNULFlush (which must have at most one thread per pipeline) keeps that part single-threaded, so that Alice never gets Bob's text.
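The locking idea can be sketched as follows (names here are illustrative, not APY's actual identifiers): many request threads may run concurrently, but at most one of them is inside the NUL-flushing pipeline at a time.

```python
import threading

# One lock guarding the translation pipeline: request handling is threaded,
# but writing to / reading from the pipeline is serialized.
pipeline_lock = threading.Lock()
results = []

def translate_nul_flush(text):
    with pipeline_lock:       # only one thread talks to the pipeline at a time
        results.append(text)  # stands in for the actual pipe I/O

threads = [threading.Thread(target=translate_nul_flush, args=(t,))
           for t in ["Alice's text", "Bob's text"]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```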

Try it out

Try testing with e.g.

   export APERTIUMPATH="/path/to/svn/trunk"
   python3 servlet.py "$APERTIUMPATH" 2737 &
   
   curl -s --data-urlencode 'langpair=nb|nn' --data-urlencode \
   'q@/tmp/reallybigfile' 'http://localhost:2737/translate' >/tmp/output &
   
   curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
   curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
   curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
   

And see how the last three (after a slight wait) start producing output before the first request has finished.

Morphological Analysis and Generation

To analyze text, send a POST or GET request to /analyze with parameters mode and q set. For example:

   $ curl --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
   [["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"],["?/?<sent>","?"],["./.<sent>",".\n"]]

The JSON response consists of a list of pairs, each of the form [analyses with any following non-analyzed text*, original input token]. To get a list of valid analyzer modes, send a request to /listAnalyzers.
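Each analysis string packs the surface form and its readings together, separated by '/'. A sketch of pulling those apart (the JSON literal is an abridged copy of the example output above):

```python
import json

# Abridged /analyze response: each element is
# [surface form plus '/'-separated readings, original input token].
raw = ('[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],'
       '["?/?<sent>","?"]]')

parsed = []
for analysis, token in json.loads(raw):
    surface, *readings = analysis.split("/")
    parsed.append((surface, readings, token))
```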

To generate surface forms from an analysis, send a POST or GET request to /generate with parameters mode and q set. For example:

   $ curl --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$+^сен<v><tv><imp><p2><pl>$" http://localhost:2737/generate
   [["сен ","^сен<v><tv><imp><p2><sg>$ "],["сеніңдер","^сен<v><tv><imp><p2><pl>$"]]

The JSON response consists of a list of pairs, each of the form [generated form with any following non-analyzed text*, original lexical unit]. To get a list of valid generator modes, send a request to /listGenerators.

* e.g. whitespace, superblanks
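Building the q parameter for /generate means wrapping each analysis in ^…$ stream-format markers. A small helper (hypothetical, not part of APY, and assuming space-separated lexical units as in the example above):

```python
def to_lexical_units(analyses):
    # Hypothetical helper: wrap each analysis as an Apertium stream-format
    # lexical unit (^lemma<tags>$), joined by spaces.
    return " ".join("^{}$".format(a) for a in analyses)

q = to_lexical_units(["сен<v><tv><imp><p2><sg>", "сен<v><tv><imp><p2><pl>"])
print(q)
```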

TODO

  • It should be possible to set a time-out for translation threads, so if a translation is taking too long, it gets killed and the queue moves along.
  • It should use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running.
  • http://stackoverflow.com/a/487281/69663 recommends select/polling over threading (http://docs.python.org/3.3/library/socketserver.html for more on the differences) but requires either lots of manually written dispatching code (http://pymotw.com/2/select/) or a framework like Twisted.
  • Some language pairs still don't work (sme-nob?)
  • hfst-proc -g doesn't seem to work with null-flushing (or does it?)