GSOC'16 Kira's results. Apertium website improvements: Docs diff
Dictionary Lookup mode
Back-end:
URL | Function | Parameters | Output
---|---|---|---
/dictionaryLookup | Looks up a word and returns all possible translations, grouped by part of speech. | langpair, q | JSON object mapping part-of-speech tags to lists of translations

Example:
curl -G --data "langpair=eng|spa&q=run" http://localhost:2737/dictionaryLookup
{"n": ["carrera"], "vblex": ["correr", "funcionar"]}
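A client can build the same request without curl; note that the '|' in the language pair must be percent-encoded in the query string. A minimal sketch (the response string below is the example output from the table, not live server output):

```python
import json
from urllib.parse import urlencode

# Build the query string for /dictionaryLookup; urlencode percent-encodes
# the '|' in the language pair as %7C.
params = urlencode({"langpair": "eng|spa", "q": "run"})
url = "http://localhost:2737/dictionaryLookup?" + params
print(url)

# Parse a response shaped like the example above: a JSON object mapping
# part-of-speech tags to lists of translations.
response = '{"n": ["carrera"], "vblex": ["correr", "funcionar"]}'
by_pos = json.loads(response)
for pos, words in sorted(by_pos.items()):
    print(pos, ", ".join(words))
```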
Front-end:
ENABLED_MODES: an array of the enabled interfaces, a non-empty subset of ['translation', 'analyzation', 'generation', 'sandbox', 'lookup']. Including 'lookup' in ENABLED_MODES turns on dictionary lookup mode.
Language detection
Back-end:
The new language detection library uses the same query format as the previous one.
A description of /identifyLang can be found on the Apertium-apy wiki page: http://wiki.apertium.org/wiki/Apertium-apy
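Assuming /identifyLang returns a JSON object mapping language codes to confidence scores (the exact response shape and the sample values below are assumptions, not live server output), a client can pick the most likely language like this:

```python
import json

# Hypothetical /identifyLang response: language codes mapped to
# confidence scores.
response = '{"spa": 0.94, "cat": 0.04, "eng": 0.02}'
scores = json.loads(response)

# Pick the language code with the highest confidence.
best = max(scores, key=scores.get)
print(best)
```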
How to train a new language model:
1. Install the langdetect library (https://github.com/Mimino666/langdetect):
$ pip install langdetect
Supported Python versions: 2.6, 2.7, 3.x.
2. Prepare the training data.
For instance, use Wikipedia dumps (http://wiki.apertium.org/wiki/Wikipedia_Extractor).
3. Train the model (https://github.com/Mimino666/langdetect#how-to-add-new-language)
You need to create a new language profile. The easiest way to do it is to use the langdetect.jar tool, which can generate language profiles from Wikipedia abstract database files or plain text.
Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" (http://download.wikimedia.org/). They are named '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').
usage: java -jar langdetect.jar --genprofile -d [directory path] [language codes]
- Specify the directory containing the abstract databases with the -d option.
- This tool can handle gzip-compressed files.
Remark: the Chinese database filenames are like 'zhwiki-(version)-abstract-zh-cn.xml' or 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' or 'zh-twwiki-(version)-abstract.xml'.
To generate a language profile from plain text, use the genprofile-text command.
usage: java -jar langdetect.jar --genprofile-text -l [language code] [text file path]
For more details, see the language-detection wiki: https://code.google.com/archive/p/language-detection/wikis/Tools.wiki
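The renaming described in the remark above can be automated; here is a minimal sketch (the helper function and the sample file name are illustrative, not part of langdetect):

```python
import re

def fix_chinese_abstract_name(filename):
    """Turn 'zhwiki-(version)-abstract-zh-cn.xml' into
    'zh-cnwiki-(version)-abstract.xml' (likewise for zh-tw), as the
    remark above requires; other names are returned unchanged."""
    m = re.fullmatch(r"zhwiki-(.+)-abstract-(zh-(?:cn|tw))\.xml", filename)
    if not m:
        return filename
    version, variant = m.groups()
    return "{}wiki-{}-abstract.xml".format(variant, version)

print(fix_chinese_abstract_name("zhwiki-20101004-abstract-zh-cn.xml"))
```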
4. Locate the folder where langdetect is installed.
5. Copy the new language model to the profiles folder:
cp [options] /usr/local/lib/python3.4/dist-packages/langdetect/profiles/
How to check the installed language models:
from langdetect import detector_factory
detector_factory.init_factory()
print(detector_factory._factory.langlist)
New models trained for Apertium are available here: https://github.com/Kira-D/apertium-apy/tree/detectLanguage/models