Shallow syntactic function labeller

Latest revision as of 01:40, 8 March 2018

This is a Google Summer of Code 2017 project.

A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller

A workplan and progress notes can be found here: Shallow syntactic function labeller/Workplan

What was done

1. All the data needed for North Sami, Kurmanji, Breton, Kazakh and English was prepared. There are two scripts: one creates datasets from UD treebanks (it handles Kurmanji, Breton, Kazakh and English) and the other creates datasets from VISL treebanks (it handles North Sami).

2. A simple RNN that labels sentences was built. It uses fastText embeddings for every tag seen in the corpus: the embedding of a word is the sum of the embeddings of all its tags.

3. The labeller itself was created. A testpack for two language pairs was also built: it contains all the data needed for sme-nob and kmr-eng, the labeller and an installation script.
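The word representation described in item 2 can be illustrated with a minimal sketch. The tag vectors, their dimension and the function name below are made up for the illustration; the real model uses fastText embeddings trained on the corpus tags, and how it treats unseen tags is not stated here, so skipping them is an assumption.

```python
# Hypothetical 3-dimensional tag embeddings; real fastText vectors are larger.
TAG_EMBEDDINGS = {
    "n": [0.1, 0.2, 0.3],
    "f": [0.0, 0.5, -0.1],
    "sg": [0.4, -0.2, 0.0],
}


def word_embedding(tags, table, dim=3):
    """Return the element-wise sum of the embeddings of a word's tags."""
    total = [0.0] * dim
    for tag in tags:
        vector = table.get(tag)
        if vector is None:
            continue  # assumption: tags without an embedding contribute nothing
        for i, value in enumerate(vector):
            total[i] += value
    return total


# A word analysed as <n><f><sg> gets the sum of the three tag vectors.
vec = word_embedding(["n", "f", "sg"], TAG_EMBEDDINGS)
```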

List of commits

All commits are listed below:

https://github.com/deltamachine/shallow_syntactic_function_labeller/commits/master

Description

The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and passes that sequence to a classifier. The classifier is a simple RNN model trained on datasets prepared from parsed, syntax-labelled corpora (mostly UD treebanks). It analyzes the sequence of morphological tags and outputs a sequence of labels, which the labeller then applies to the original string.
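The parse-and-apply steps can be sketched roughly as follows. This is not the labeller's actual code: it uses a simplified regular expression instead of Streamparser, assumes one reading per lexical unit and no escaped characters, and takes the labels as a ready-made list where the real labeller would call the classifier. The names LU_RE, extract_tags and apply_labels are hypothetical.

```python
import re

# Matches one lexical unit such as ^peyam<n><f><sg>$
# (simplification: a single reading, no escaped characters).
LU_RE = re.compile(r"\^([^<$^]*)((?:<[^<>]+>)*)\$")


def extract_tags(stream):
    """Return one list of morphological tags per lexical unit."""
    return [re.findall(r"<([^<>]+)>", m.group(2)) for m in LU_RE.finditer(stream)]


def apply_labels(stream, labels):
    """Append one label to each lexical unit, leaving the rest of the text intact."""
    remaining = iter(labels)

    def add_label(match):
        return "^" + match.group(1) + match.group(2) + "<" + next(remaining) + ">$"

    return LU_RE.sub(add_label, stream)


stream = "^di<pr>$ ^peyam<n><f><sg><con><def>$"
tags = extract_tags(stream)
# The labels would normally come from the classifier; they are given by hand here.
labelled = apply_labels(stream, ["@case", "@nmod"])
```

In this sketch the label is appended as the last tag of each unit, which matches the labelled output shown in the Bugs section.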

Labeller in the pipeline

The labeller runs between the morphological analyzer or disambiguator and pretransfer.

For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, like the original syntax module.

... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...

Language pairs support

Currently the labeller works with the following language pairs:

  • sme-nob: the labeller may fully replace the original syntax module (it doesn't have all the functionality of the original CG, but works reasonably well anyway)
  • kmr-eng: may be tested in the pipeline, but the pair has only a few rules that look at syntax labels

All the data needed for Breton, Kazakh and English is also available (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca don't have syntax rules, so the labeller cannot be tested on them.

Labelling performance

Results of evaluating the labeller on the test set (accuracy is the mean accuracy score on the test set):

Language      Accuracy
North Sami    81.6%
Kurmanji      84%
Breton        79.7%
Kazakh        82.6%
English       79.8%
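The page does not spell out how the mean accuracy is computed. One plausible reading, sketched below, is the per-sentence share of correctly predicted labels averaged over all test sentences. The function names and the toy labels are assumptions made for the illustration, not the project's evaluation code.

```python
def sentence_accuracy(gold, predicted):
    """Share of positions where the predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def mean_accuracy(gold_sentences, predicted_sentences):
    """Average the per-sentence accuracy over the whole test set."""
    scores = [
        sentence_accuracy(gold, predicted)
        for gold, predicted in zip(gold_sentences, predicted_sentences)
    ]
    return sum(scores) / len(scores)


# Toy test set: the first sentence is fully correct, the second has one error in four.
gold = [["@nsubj", "@root"], ["@case", "@nmod", "@root", "@punct"]]
pred = [["@nsubj", "@root"], ["@case", "@obj", "@root", "@punct"]]
score = mean_accuracy(gold, pred)
```

Averaging per sentence weights short and long sentences equally; averaging over all tokens instead would give a slightly different number.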

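Installation

Prerequisites

1. Python libraries:

  • DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)
  • Streamparser (https://github.com/apertium/streamparser)

2. Precompiled language pairs which support the labeller (sme-nob, kmr-eng)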

How to install a testpack

NB: currently the testpack contains syntax modules only for sme-nob and kmr-eng.

git clone https://github.com/deltamachine/sfl_testpack.git
cd sfl_testpack

The script setup.py adds all the needed files to the language pair directory and updates all the files that contain modes.

Arguments:

  • work_mode: -lb installs the labeller and changes the modes; -cg reverts the changes and puts the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) back in the pipeline.
  • lang: -sme installs or uninstalls the labeller for sme-nob only, -kmr for kmr-eng only, -all for both.

For example, this command installs the labeller and adds it to the pipeline for both pairs:

python setup.py -lb -all

And this command reverts the modes changes for sme-nob:

python setup.py -cg -sme
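As a rough illustration of how the two arguments combine, the sketch below maps each flag to what it selects. It is not the code of setup.py: the dictionaries, the function name and the assumption that work_mode always comes before lang are made up for the example, and only the flag values themselves are taken from the list above.

```python
# Flag values come from the argument list above; everything else is hypothetical.
WORK_MODES = {"-lb": "labeller", "-cg": "original syntax module"}
LANGS = {
    "-sme": ["sme-nob"],
    "-kmr": ["kmr-eng"],
    "-all": ["sme-nob", "kmr-eng"],
}


def parse_args(argv):
    """Validate <work_mode> <lang> and return (syntax module to use, affected pairs)."""
    if len(argv) != 2:
        raise ValueError("expected two arguments: <work_mode> <lang>")
    work_mode, lang = argv
    if work_mode not in WORK_MODES:
        raise ValueError("unknown work_mode: " + work_mode)
    if lang not in LANGS:
        raise ValueError("unknown lang: " + lang)
    return WORK_MODES[work_mode], LANGS[lang]


# Corresponds to "python setup.py -lb -all".
plan = parse_args(["-lb", "-all"])
```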

Bugs

1. The installation script changes the eng-kmr pipeline along with kmr-eng (fixed)

2. Problems with tag order (the syntactic label is not the last tag)

3. Words-without-a-label bug

<spectre> is it possible that some words don't get a label ?
<spectre> $ echo "Barzanî di peyama xwe de behsa mijarên girîng û kirîtîk kir." | apertium -d . kmr-eng-tagger
<spectre> ^Barzanî<np><ant><m><sg><obl><@dobj>$ ^di<pr><@case>$ ^peyam<n><f><sg><con><def><@nmod>$ ^xwe<prn><ref><mf><sp><@nmod:poss>$ 
^de<post><@case>$ ^behs<n><f><sg><con><def>$ ^mijar<n><f><pl><con><def><@nmod:poss>$ ^girîng<adj><@amod>$ ^û<cnjcoo><@cc>$ ^*kirîtîk$
^kirin<vblex><tv><past><p3><sg>$^..<sent><@punct>$

Bugs 2 and 3 seem to be fixed, but this should be checked carefully.
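A check for bugs 2 and 3 could look roughly like the sketch below, which scans labelled output for lexical units that have no label or whose label is not the last tag. It relies on a simplified regular expression rather than Streamparser, assumes labels are the tags starting with "@", and skips unknown words (marked with "*"); the function name is hypothetical.

```python
import re

# Matches one lexical unit such as ^behs<n><f><sg>$
# (simplification: a single reading, no escaped characters).
LU_RE = re.compile(r"\^([^<$^]*)((?:<[^<>]+>)*)\$")


def check_labels(stream):
    """Return (lemma, problem) pairs for units with a missing or misplaced label."""
    problems = []
    for match in LU_RE.finditer(stream):
        lemma = match.group(1)
        tags = re.findall(r"<([^<>]+)>", match.group(2))
        if lemma.startswith("*"):
            continue  # unknown word: there is no analysis to label
        label_positions = [i for i, tag in enumerate(tags) if tag.startswith("@")]
        if not label_positions:
            problems.append((lemma, "no label"))
        elif label_positions[-1] != len(tags) - 1:
            problems.append((lemma, "label is not the last tag"))
    return problems


# Part of the output quoted above: behs and kirin have no label.
sample = (
    "^de<post><@case>$ ^behs<n><f><sg><con><def>$ "
    "^*kirîtîk$ ^kirin<vblex><tv><past><p3><sg>$"
)
problems = check_labels(sample)
```

Running such a check over the output for a larger test corpus is one way to confirm that both bugs are really gone.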

To do

  • Do more tests. MORE.
  • Fix bugs
  • Refactor the main code.
  • Continue improving the performance of the models.