Shallow syntactic function labeller
Deltamachine (talk | contribs) |
Revision as of 08:24, 17 August 2017
This is a Google Summer of Code 2017 project.
A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller
A workplan and progress notes can be found here: Shallow syntactic function labeller/Workplan
Description
The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and passes it to a classifier. The classifier is a simple RNN model trained on prepared datasets, which were built from parsed, syntax-labelled corpora (mostly UD treebanks). The classifier analyses the given sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string.
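The parsing and label-application steps described above can be sketched as follows. This is only an illustrative sketch, not the project's code: the function names, the toy input and the labels `@SUBJ` and `@FMAINV` are assumptions, the classifier itself is left out, and the regular expressions ignore escaped characters and units with more than one reading.

```python
import re

# One lexical unit in Apertium stream format looks like ^surface/lemma<tag1><tag2>$
LU_RE = re.compile(r"\^(.*?)\$")
TAG_RE = re.compile(r"<([^<>]+)>")


def extract_tag_sequences(stream):
    """Return one list of morphological tags per lexical unit."""
    sequences = []
    for unit in LU_RE.findall(stream):
        parts = unit.split("/")
        # Assumes the input is already disambiguated: one reading per unit.
        analysis = parts[1] if len(parts) > 1 else parts[0]
        sequences.append(TAG_RE.findall(analysis))
    return sequences


def apply_labels(stream, labels):
    """Append one label (as an extra tag) to each lexical unit."""
    units = LU_RE.findall(stream)
    if len(units) != len(labels):
        raise ValueError("need exactly one label per lexical unit")
    remaining = iter(labels)

    def add_label(match):
        return "^" + match.group(1) + "<" + next(remaining) + ">$"

    return LU_RE.sub(add_label, stream)


# Toy input; in the real labeller the labels would come from the classifier.
stream = "^dog/dog<n><sg>$ ^barks/bark<vblex><pres><p3><sg>$"
tags = extract_tag_sequences(stream)
labelled = apply_labels(stream, ["@SUBJ", "@FMAINV"])
```

Here `tags` is what a classifier would receive, and `labelled` is the original string with one hypothetical label appended to each unit.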
Labeller in the pipeline
The labeller runs between the morphological analyser or disambiguator and pretransfer.
For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, in the same position as the original syntax module.
... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...
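To sit in a pipeline like the one above, a script has to read the stream from standard input and write the labelled stream to standard output. A minimal sketch of that plumbing is given below. It is not the actual sme-nob-labeller.py: `run_filter` and the `label_line` callback are hypothetical names, and processing one line at a time is an assumption.

```python
import io
import sys


def run_filter(label_line, infile=sys.stdin, outfile=sys.stdout):
    """Read the stream line by line, label each line and write it out."""
    for line in infile:
        outfile.write(label_line(line))
    # Flush so that the next program in the pipe sees the output promptly.
    outfile.flush()


# Demonstration with in-memory streams and a placeholder labeller that
# returns its input unchanged; a real labeller would call the classifier.
demo_in = io.StringIO("^dog/dog<n><sg>$\n")
demo_out = io.StringIO()
run_filter(lambda line: line, demo_in, demo_out)
```

In a real pipeline the script would call `run_filter` with its labelling function and the default streams.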
Language pairs support
Currently the labeller works with the following language pairs:
- sme-nob: the labeller can fully replace the original syntax module (it does not have all the functionality of the original CG, but works reasonably well)
- kmr-eng: may be tested in the pipeline, but the pair has only a few rules that look at syntax labels
All the needed data is also available for Breton, Kazakh and English (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca have no syntax rules, so the labeller cannot be tested on them.
Labelling performance
The results of validating the labeller on the test set (accuracy = mean accuracy score on the test set).
Language | Accuracy |
---|---|
North Sami | 81.6% |
Kurmanji | 84% |
Breton | 79.7% |
Kazakh | 82.6% |
English | 79.8% |
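One possible reading of "mean accuracy score" is the per-sentence share of correctly predicted labels, averaged over all test sentences. The sketch below shows that computation under this assumption. The function names and the toy labels are illustrative, and the project's own evaluation code may compute the score differently.

```python
def sentence_accuracy(predicted, gold):
    """Share of tokens in one sentence whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold sequences differ in length")
    if not gold:
        raise ValueError("empty sentence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def mean_accuracy(predicted_sentences, gold_sentences):
    """Average of the per-sentence accuracies over the whole test set."""
    if len(predicted_sentences) != len(gold_sentences):
        raise ValueError("different number of predicted and gold sentences")
    scores = [
        sentence_accuracy(p, g)
        for p, g in zip(predicted_sentences, gold_sentences)
    ]
    return sum(scores) / len(scores)


# Toy data: the first sentence is fully correct, the second is half correct.
predicted = [["@SUBJ", "@FMAINV"], ["@SUBJ", "@OBJ", "@ADVL", "@X"]]
gold = [["@SUBJ", "@FMAINV"], ["@SUBJ", "@FMAINV", "@ADVL", "@OBJ"]]
score = mean_accuracy(predicted, gold)
```

Averaging per sentence gives short and long sentences equal weight. Averaging over all tokens at once would be another reasonable choice and can give a slightly different number.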
Installation
Prerequisites
1. Python libraries:
- DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)
- Streamparser (https://github.com/goavki/streamparser)
2. Precompiled language pairs which support the labeller (sme-nob, kmr-eng)
How to install a testpack
Currently only two testpacks are available:
- sme-nob: https://github.com/deltamachine/sfl_sme_testpack.git
- kmr-eng: https://github.com/deltamachine/sfl_kmr_testpack.git
git clone https://github.com/deltamachine/sfl_sme_testpack.git
cd sfl_sme_testpack
The script install_labeller.py adds all the needed files to the language pair directory and updates all the modes files.
Arguments:
- apertium_path: path to your apertium-sme-nob directory
- python_path: path to the current Python interpreter (NB: if you type just "python" instead of the full path, some dependencies might not work)
- work_mode: -install for installing the labeller and changing modes, -change for just changing modes.
- type_of_change: -lb for using the labeller in the pipeline, -cg for using the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) in the pipeline.
For example, this command will install the labeller and add it to the pipeline:
python install_labeller.py /home/user/apertium/apertium-sme-nob /home/user/anaconda3/bin/python -install -lb
And this command will revert the mode changes:
python install_labeller.py /home/user/apertium/apertium-sme-nob /home/user/anaconda3/bin/python -change -cg
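Since python_path should be a full path, one way to avoid typing it by hand is to take it from `sys.executable`, which holds the absolute path of the running interpreter. The helper below is a hypothetical convenience sketch, not part of the testpack. It only builds the argument list shown in the examples above, and it assumes it is run with the same interpreter that has DyNet and Streamparser installed.

```python
import sys

WORK_MODES = ("-install", "-change")
CHANGE_TYPES = ("-lb", "-cg")


def build_install_command(apertium_path, work_mode="-install", type_of_change="-lb"):
    """Build the install_labeller.py command line as a list of arguments."""
    if work_mode not in WORK_MODES:
        raise ValueError("work_mode must be -install or -change")
    if type_of_change not in CHANGE_TYPES:
        raise ValueError("type_of_change must be -lb or -cg")
    # sys.executable is passed twice: once to run the script and once as
    # the python_path argument that the script expects.
    return [
        sys.executable,
        "install_labeller.py",
        apertium_path,
        sys.executable,
        work_mode,
        type_of_change,
    ]


command = build_install_command("/home/user/apertium/apertium-sme-nob")
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the testpack directory.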
To do
- Add the ability to handle more than one sentence.
- Do more tests. MORE.
- Write docstrings and refactor the main code.
- Take the trash out of the github repository before the final evaluation.
- Continue improving the performance of the models.