GSOC'16 Kira's results. Apertium website improvements: Docs diff

Dictionary Lookup mode

Back-end:

URL	Function	Parameters	Output
/dictionaryLookup	Generate all possible forms of a word.	langpair: language pair to use for translation; q: word to perform the task on	Returns all possible forms.

Example:

 curl -G --data "langpair=eng|spa&q=run" http://localhost:2737/dictionaryLookup
 {"n": ["carrera"], "vblex": ["correr", "funcionar"]}
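The same request can be sketched in Python. This is a minimal illustration, assuming an APY instance on localhost:2737 as in the curl example; the helper names `build_lookup_url` and `flatten_lookup` are hypothetical, and the response shape is taken from the sample output shown for this endpoint.

```python
import json
from urllib.parse import urlencode

APY_URL = "http://localhost:2737"  # assumption: local APY, as in the curl example


def build_lookup_url(langpair, word, base=APY_URL):
    """Return the GET URL for /dictionaryLookup (hypothetical helper)."""
    query = urlencode({"langpair": langpair, "q": word})
    return f"{base}/dictionaryLookup?{query}"


def flatten_lookup(response_text):
    """Turn {"tag": [words, ...]} into a sorted list of (tag, word) pairs."""
    data = json.loads(response_text)
    return sorted((tag, w) for tag, words in data.items() for w in words)


url = build_lookup_url("eng|spa", "run")
# Sample response copied from the documentation of this endpoint.
sample = '{"n": ["carrera"], "vblex": ["correr", "funcionar"]}'
pairs = flatten_lookup(sample)
```

To send the request for real, the URL could be passed to `urllib.request.urlopen(url)` and the decoded body handed to `flatten_lookup`.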

Front-end:

ENABLED_MODES: an array of the enabled interfaces, a non-empty subset of ['translation', 'analyzation', 'generation', 'sandbox', 'lookup']

translation lookup turns on dictionary lookup mode.

Language detection

Back-end:

The new language detection library uses the same query format as the previous one.

The query format is described here: /identifyLang (http://wiki.apertium.org/wiki/Apertium-apy)

How to train a new language model:

1. Install the langdetect library (https://github.com/Mimino666/langdetect).

$ pip install langdetect

Supported Python versions: 2.6, 2.7, 3.x.

2. Prepare the training data.

For instance, use Wikipedia dumps (http://wiki.apertium.org/wiki/Wikipedia_Extractor).

3. Train the model (https://github.com/Mimino666/langdetect#how-to-add-new-language)

You need to create a new language profile. The easiest way to do it is to use the langdetect.jar tool, which can generate language profiles from Wikipedia abstract database files or plain text.

Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" (http://download.wikimedia.org/). Their filenames have the form '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').

usage: java -jar langdetect.jar --genprofile -d [directory path] [language codes]

Specify the directory that contains the abstract database files with the -d option.
The tool can also handle gzip-compressed files.

Remark: The Chinese database filenames look like 'zhwiki-(version)-abstract-zh-cn.xml' or 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' or 'zh-twwiki-(version)-abstract.xml'.

To generate a language profile from plain text, use the genprofile-text command.

usage: java -jar langdetect.jar --genprofile-text -l [language code] [text file path]

For more details see language-detection Wiki: https://code.google.com/archive/p/language-detection/wikis/Tools.wiki.
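The two usage lines above can be wrapped in a small Python script when several profiles need to be trained. This is only a sketch: it uses the flags exactly as quoted above, while the helper names, the default jar location, and the `kaa` / `kaa.txt` arguments are placeholders chosen for illustration.

```python
import subprocess


def genprofile_text_cmd(lang_code, text_path, jar="langdetect.jar"):
    """Command line for building a profile from a plain-text file."""
    return ["java", "-jar", jar, "--genprofile-text", "-l", lang_code, text_path]


def genprofile_wiki_cmd(directory, lang_codes, jar="langdetect.jar"):
    """Command line for building profiles from Wikipedia abstract files."""
    return ["java", "-jar", jar, "--genprofile", "-d", directory, *lang_codes]


def run_training(cmd):
    """Run the tool; raises CalledProcessError if it exits with an error."""
    subprocess.run(cmd, check=True)


# Placeholder arguments: a language code and a training text file.
cmd = genprofile_text_cmd("kaa", "kaa.txt")
```

Calling `run_training(cmd)` requires Java and langdetect.jar to be available in the working directory.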

4. Locate the folder where Langdetect is installed

5. Copy the new language model to the Profiles folder

 cp [options] /usr/local/lib/python3.4/dist-packages/langdetect/profiles/
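Steps 4 and 5 can also be done from Python, which avoids hard-coding the site-packages path. The sketch below assumes the installed langdetect package keeps its profiles in a `profiles` folder next to its modules, as the path in the cp command suggests; `install_profile` and `default_profiles_dir` are hypothetical helper names.

```python
import shutil
from pathlib import Path


def install_profile(profile_file, profiles_dir):
    """Copy a generated language profile into the given profiles folder."""
    src = Path(profile_file)
    dest_dir = Path(profiles_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"profiles folder not found: {dest_dir}")
    return Path(shutil.copy(src, dest_dir / src.name))


def default_profiles_dir():
    """Locate the profiles folder of the installed langdetect package."""
    import langdetect  # imported lazily so install_profile works without it
    return Path(langdetect.__file__).parent / "profiles"
```

A typical call would be `install_profile("kaa", default_profiles_dir())`, where `kaa` stands for the profile file produced in step 3; writing to a system-wide installation may require elevated permissions.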

How to check installed language models

from langdetect import detector_factory

# Load the installed profiles, then list the language codes they cover.
detector_factory.init_factory()
print(detector_factory._factory.langlist)

New models trained for Apertium are available here: https://github.com/Kira-D/apertium-apy/tree/detectLanguage/models

Suggestions

Back-end:

To run apy in Google reCAPTCHA mode, use:

 ./servlet.py /usr/local/share/apertium/ --wiki-username=WikiUsername --wiki-password=WikiPassword -rs=YourRecaptchaSecret

-b --bypass-token: generates a testing token that can be used to bypass reCAPTCHA

URL	Function	Parameters	Output
/suggest	Generate a suggestion on the target wiki page using a testing token.	context: sentence; word: word that will be suggested; newWord: suggestion; langpair: language pair to use for translation; g-recaptcha-response: testing token generated when running apy (note that only the testing token can be used with curl)	Returns the status. If "Success", the suggestion is posted on the target wiki page.

Note that the correct wiki page URL is required (wiki_util.py). Production use of Google reCAPTCHA requires registration (https://developers.google.com/recaptcha/). Note that the correct keys are required both when starting apy and in the html-tools config file.

Example:

 curl --data 'context=otro+mundo&word=*mundo&newWord=MUNDO&langpair=esp|eng&g-recaptcha-response=testingToken' http://localhost:2737/suggest
 {"responseStatus": 200, "responseData": {"status": "Success"}, "responseDetails": null}
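For scripted testing, the POST body of a /suggest request can be assembled in Python. This is a hedged sketch: the field names and the sample response come from the description of this endpoint, while the helper names are hypothetical and `testingToken` stands for whatever token apy prints when started with the bypass option.

```python
import json
from urllib.parse import urlencode


def build_suggest_body(context, word, new_word, langpair, token):
    """Return the form-encoded POST body for /suggest as bytes."""
    fields = {
        "context": context,
        "word": word,
        "newWord": new_word,
        "langpair": langpair,
        "g-recaptcha-response": token,
    }
    return urlencode(fields).encode("ascii")


def suggestion_posted(response_text):
    """True if the response reports that the suggestion was posted."""
    data = json.loads(response_text)
    status = (data.get("responseData") or {}).get("status")
    return data.get("responseStatus") == 200 and status == "Success"


body = build_suggest_body("otro mundo", "*mundo", "MUNDO", "esp|eng", "testingToken")
# Sample response copied from the documentation of this endpoint.
sample = '{"responseStatus": 200, "responseData": {"status": "Success"}, "responseDetails": null}'
ok = suggestion_posted(sample)
```

The body can be sent with `urllib.request.urlopen("http://localhost:2737/suggest", data=body)`, which issues a POST because `data` is supplied.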

Front-end:

ENABLED: turns on the suggestion mode (True/False)
RECAPTCHA_SITE_KEY: reCAPTCHA site key, which can be obtained by registering at https://developers.google.com/recaptcha/
CONTEXT_WRAP: the number of context words taken from the left

Speller

Back-end:

URL	Function	Parameters	Output
/speller	Performs spellchecking for a given language.	lang: language to spellcheck for; q: text to perform the task on	Returns the results of spellchecking.

Example:

 curl -G --data "lang=kaa&q=o'shiriledi" http://localhost:2737/speller
 [{"token": "o'shiriledi", "sugg": [["keshiriledi", "2.000000"]], "known": false}]
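A client usually only needs the unknown tokens and their suggestions. The following sketch shows one way to extract them in Python, assuming the response has the shape of the sample output for this endpoint (a list of objects with `token`, `sugg`, and `known`); the function name `unknown_tokens` is hypothetical.

```python
import json


def unknown_tokens(response_text):
    """Map each unknown token to its (suggestion, score) pairs, in the order returned."""
    result = {}
    for entry in json.loads(response_text):
        if not entry["known"]:
            # Scores arrive as strings such as "2.000000"; convert them to floats.
            result[entry["token"]] = [(s, float(score)) for s, score in entry["sugg"]]
    return result


# Sample response copied from the documentation of this endpoint.
sample = """[{"token": "o'shiriledi", "sugg": [["keshiriledi", "2.000000"]], "known": false}]"""
suggestions = unknown_tokens(sample)
```

Tokens reported as known are skipped, so an empty dictionary means the checked text contained no unknown words.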

Front-end:

ENABLED_MODES: an array of the enabled interfaces, a non-empty subset of ['translation', 'analyzation', 'generation', 'sandbox', 'speller']

speller turns on spell checking mode.
