Apertium-apy


Revision as of 01:31, 19 December 2013

Apertium-APy stands for "Apertium API in Python". It is a simple Apertium API server written in Python, meant as a drop-in replacement for ScaleMT. It currently lives in the SVN repository under trunk/apertium-tools/apertium-apy, where servlet.py contains essentially the whole program. It is meant to serve front ends like the simple one in trunk/apertium-tools/simple-html (where index.html does most of the work).

Installation

First, compile and install apertium/lttoolbox/apertium-lex-tools, and compile your language pairs. See Minimal_installation_from_SVN for how to do this. Then

svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-apy
cd apertium-apy
export APERTIUMPATH="/path/to/apertium/svn/trunk"
./servlet.py "$APERTIUMPATH"

Optional arguments include:

  • --langNamesDB: path to database of localized language names
  • --port: port to run the server on (2737 by default)
  • --ssl: path to SSL certificate

Usage

APY supports three types of requests: GET, POST, and JSONP. GET and POST requests are possible only if APY is running on the same server as the client, due to cross-site scripting restrictions; JSONP requests, however, are permitted in any context, which makes them useful for web front ends. APY can easily be tested using curl:

curl --data "lang=kaz-tat&modes=morph&q=алдым" http://localhost:2737/perWord

Note that this sends a POST request; sending a GET request with curl or your browser is also possible.
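The same requests can be built from Python with only the standard library. This is a sketch, not part of APY itself: the helper names are invented here, and the request is only constructed (not sent), so it can be checked without a running server.

```python
from urllib.parse import urlencode
from urllib.request import Request

APY_URL = "http://localhost:2737"  # assumes a local APY instance on the default port

def build_get(endpoint, **params):
    """Build a GET URL for an APY endpoint such as /perWord."""
    return "%s/%s?%s" % (APY_URL, endpoint, urlencode(params))

def build_post(endpoint, **params):
    """Build a POST Request carrying the same parameters as a urlencoded body."""
    return Request("%s/%s" % (APY_URL, endpoint),
                   data=urlencode(params).encode("utf-8"))

# The perWord example from above; urlencode percent-escapes the Cyrillic text.
url = build_get("perWord", lang="kaz-tat", modes="morph", q="алдым")
req = build_post("perWord", lang="kaz-tat", modes="morph", q="алдым")
# Send with urllib.request.urlopen(url) or urlopen(req) once a server is running.
```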

/listPairs — List available language pairs

Parameters: none

$ curl http://localhost:2737/listPairs
{"responseStatus": 200, "responseData": [
 {"sourceLanguage": "kaz", "targetLanguage": "tat"},
 {"sourceLanguage": "tat", "targetLanguage": "kaz"},
 {"sourceLanguage": "mk", "targetLanguage": "en"}
], "responseDetails": null}
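The response is plain JSON, so client code can consume it directly. A minimal sketch, using the exact response shown above so it runs without a server:

```python
import json

# The /listPairs response shown above, pasted verbatim.
response = '''{"responseStatus": 200, "responseData": [
 {"sourceLanguage": "kaz", "targetLanguage": "tat"},
 {"sourceLanguage": "tat", "targetLanguage": "kaz"},
 {"sourceLanguage": "mk", "targetLanguage": "en"}
], "responseDetails": null}'''

data = json.loads(response)
assert data["responseStatus"] == 200
# Collect the pairs in "src-trg" form.
pairs = ["%s-%s" % (p["sourceLanguage"], p["targetLanguage"])
         for p in data["responseData"]]
```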
/list — List available mode information

Parameters:
  • q: type of information to list
    • pairs (alias for /listPairs)
    • analyzers/analysers
    • generators
    • taggers/disambiguators

$ curl http://localhost:2737/list?q=analyzers
{"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph",
 "tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", "kaz": "kaz-morph"}

$ curl http://localhost:2737/list?q=generators
{"en-es": "en-es-generador", "fin": "fin-gener", "es-en": "es-en-generador"}

$ curl http://localhost:2737/list?q=taggers
{"es-en": "es-en-tagger", "en-es": "en-es-tagger", "mk-en": "mk-en-tagger",
 "tat-kaz": "tat-kaz-tagger", "kaz-tat": "kaz-tat-tagger", "kaz": "kaz-tagger"}
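Each /list response is a flat JSON object mapping a pair or language code to its mode name, so lookups are a one-liner. A sketch consuming the q=analyzers response above (runs offline on the pasted output):

```python
import json

# The q=analyzers response shown above, pasted verbatim.
raw = ('{"mk-en": "mk-en-morph", "en-es": "en-es-anmor", "kaz-tat": "kaz-tat-morph", '
       '"tat-kaz": "tat-kaz-morph", "fin": "fin-morph", "es-en": "es-en-anmor", '
       '"kaz": "kaz-morph"}')

analyzers = json.loads(raw)
mode = analyzers["kaz"]  # the analyzer mode name for Kazakh
```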
/translate — Translate text

Parameters:
  • langpair: language pair to use for translation
  • q: text to translate

$ curl 'http://localhost:2737/translate?langpair=kaz|tat&q=Сен+бардың+ба?'
output
/analyze — Morphologically analyze text

Parameters:
  • mode: language to use for analysis
  • q: text to analyze

$ curl --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
[["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],
 ["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"],
 ["?/?<sent>","?"],["./.<sent>",".\n"]]
/generate — Generate surface forms from text

Parameters:
  • mode: language to use for generation
  • q: text to generate

$ curl --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$" http://localhost:2737/generate
[["сен","^сен<v><tv><imp><p2><sg>$ "]]
/perWord — Perform morphological tasks per word

Parameters:
  • lang: language to use for tasks
  • modes: morphological tasks to perform on the text
    • morph
    • tagger/disambig
    • biltrans
    • translate
    • biltrans+morph (in any order)
    • translate+tagger (in any order)
    • morph+tagger/morph+disambig (in any order)
  • q: text to perform tasks on

$ curl "http://localhost:2737/perWord?lang=en-es&modes=morph&q=light"
[{"analyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"], "input": "light"}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=tagger&q=light"
[{"analyses": ["light<adj><sint>"], "input": "light"}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=biltrans&q=light"
[{"input": "light", "translations": ["luz<n><f><sg>",
 "ligero<adj>", "encender<vblex><inf>", "encender<vblex><pres>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=translate&q=light"
[{"input": "light", "translations": ["ligero<adj>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=biltrans+morph&q=light"
[{"analyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"],
 "input": "light", "translations": ["luz<n><f><sg>", "ligero<adj>", "encender<vblex><inf>",
 "encender<vblex><pres>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=translate+tagger&q=light"
[{"analyses": ["light<adj><sint>"], "input": "light", "translations": ["ligero<adj>"]}]

$ curl "http://localhost:2737/perWord?lang=en-es&modes=morph+tagger&q=light"
[{"ambiguousAnalyses": ["light<n><sg>", "light<adj><sint>", "light<vblex><inf>", "light<vblex><pres>"],
 "input": "light", "disambiguatedAnalyses": ["light<adj><sint>"]}]
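As the examples show, combined modes simply add extra keys to each per-word object. A small offline check against the modes=morph+tagger output above:

```python
import json

# The modes=morph+tagger response from the example above, pasted verbatim.
raw = ('[{"ambiguousAnalyses": ["light<n><sg>", "light<adj><sint>", '
       '"light<vblex><inf>", "light<vblex><pres>"], '
       '"input": "light", "disambiguatedAnalyses": ["light<adj><sint>"]}]')

words = json.loads(raw)
for w in words:
    # Each entry describes one input word; the disambiguated reading is
    # always one of the ambiguous candidates.
    assert w["disambiguatedAnalyses"][0] in w["ambiguousAnalyses"]
```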

Threading

Currently the server uses TCPServer with ThreadingMixIn. A lock around translateNULFlush (which must have at most one thread per pipeline) keeps that part single-threaded, to avoid Alice getting Bob's text.
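The locking idea can be sketched in a few lines. This is not APY's actual code: it fakes the pipeline with a plain function and uses one lock per pipeline name (the per-pipeline variant suggested in the TODO section), just to show the pattern.

```python
import threading
from collections import defaultdict

# One Lock per pipeline: requests to different pairs can run concurrently,
# while requests to the same pair stay serialized.
pipeline_locks = defaultdict(threading.Lock)
results = []

def translate_nul_flush(pipeline, text):
    # At most one thread may feed a given pipeline at a time; otherwise
    # Alice's text could interleave with Bob's inside the pipe.
    with pipeline_locks[pipeline]:
        results.append((pipeline, text))

threads = [threading.Thread(target=translate_nul_flush, args=("kaz-tat", t))
           for t in ("alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```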

Try it out

Try testing with e.g.

   export APERTIUMPATH="/path/to/svn/trunk"
   python3 servlet.py "$APERTIUMPATH" 2737 &
   
   curl -s --data-urlencode 'langpair=nb|nn' --data-urlencode \
   'q@/tmp/reallybigfile' 'http://localhost:2737/translate' >/tmp/output &
   
   curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
   curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
   curl 'http://localhost:2737/translate?langpair=nb%7Cnn&q=men+ikke+den'
   

And watch how the last three requests (after a slight wait) start producing output before the first request is done.

Morphological Analysis and Generation

To analyze text, send a POST or GET request to /analyze with parameters mode and q set. For example:

   $ curl --data "mode=kaz&q=Сен+бардың+ба?" http://localhost:2737/analyze
   [["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>","Сен "],["бардың ба/бар<adj><subst><gen>+ма<qst>/бар<v><iv><ifi><p2><sg>+ма<qst>","бардың ба"],["?/?<sent>","?"],["./.<sent>",".\n"]]

The JSON response consists of a list of pairs, each of the form [analysis with any following non-analyzed text*, original input token]. To receive a list of valid analyzer modes, send a request to /listAnalyzers.
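Each pair can be unpacked in a couple of lines; a sketch using the Kazakh example above, where the first element packs the surface form and its analyses separated by "/":

```python
# One /analyze pair from the example above, pasted verbatim.
pair = ["Сен/сен<v><tv><imp><p2><sg>/сен<prn><pers><p2><sg><nom>", "Сен "]

# The surface form comes first; everything after the first "/" is a reading.
surface, *analyses = pair[0].split("/")
```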

To generate surface forms from an analysis, send a POST or GET request to /generate with parameters mode and q set. For example:

   $ curl --data "mode=kaz&q=^сен<v><tv><imp><p2><sg>$+^сен<v><tv><imp><p2><pl>$" http://localhost:2737/generate
   [["сен ","^сен<v><tv><imp><p2><sg>$ "],["сеніңдер","^сен<v><tv><imp><p2><pl>$"]]

The JSON response consists of a list of pairs, each of the form [generated form with any following non-analyzed text*, original lexical unit input]. To receive a list of valid generator modes, send a request to /listGenerators.

* e.g. whitespace, superblanks
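The q parameter for /generate can be assembled mechanically from a list of analyses; a sketch reproducing the two-unit example above:

```python
# Build the q parameter for /generate: each analysis is wrapped in ^...$
# to form a lexical unit. The "+" decodes to a space in urlencoded form
# data, so this joins the units with spaces on the server side.
analyses = ["сен<v><tv><imp><p2><sg>", "сен<v><tv><imp><p2><pl>"]
q = "+".join("^%s$" % a for a in analyses)
```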


SSL

To test with a self-signed certificate:

openssl req -new -x509 -keyout server.pem -out server.pem -days 365 -nodes

Then run with --ssl server.pem, and test with https and the -k argument to curl (-k makes curl accept self-signed or even slightly "lying" certificates):

curl -k --data "mode=kaz-tat&q=Сен+бардың+ба?" https://localhost:2737/analyze


If you have a CA-signed certificate, e.g. one set up for Apache, it is likely split into two files, one .key and one .crt. You can cat them together into a single file to use with servlet.py:

cat server.key server.crt > server.keycrt

Now you should be able to use curl without -k for the domain which the certificate is signed for:

curl --data "mode=kaz-tat&q=Сен+бардың+ба?" https://oohlookatmeimencrypted.com:2737/analyze

Remember to open port 2737 to your server.

TODO

  • It should be possible to set a time-out for translation threads, so if a translation is taking too long, it gets killed and the queue moves along.
  • It should use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running.
  • http://stackoverflow.com/a/487281/69663 recommends select/polling over threading (see http://docs.python.org/3.3/library/socketserver.html for more on the differences), but that requires either a lot of manually written dispatching code (http://pymotw.com/2/select/) or a framework like Twisted.
  • some language pairs still don't work (sme-nob?)
  • hfst-proc -g doesn't work with null-flushing (or?)