Shallow syntactic function labeller

From Apertium
Latest revision as of 01:40, 8 March 2018

This is a [http://wiki.apertium.org/wiki/User:Deltamachine/proposal Google Summer of Code 2017 project].

A repository for the whole project: https://github.com/deltamachine/shallow_syntactic_function_labeller

A workplan and progress notes can be found here: [[Shallow syntactic function labeller/Workplan]]

== What was done ==

1. All the needed data for North Sami, Kurmanji, Breton, Kazakh and English was prepared: there are two scripts, one of which creates datasets from UD treebanks (handling Kurmanji, Breton, Kazakh and English), while the other creates datasets from VISL treebanks (handling North Sami).
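As a rough, hypothetical sketch of this dataset-creation step (the real scripts additionally replace UD tags with Apertium tags; the function name and the simplified mapping here are illustrative only), a single CoNLL-U token line can be turned into a (morphological tags, syntactic label) training pair like this:

```python
# Hypothetical sketch: turn one CoNLL-U token line into a
# (morphological tags, syntactic label) training pair.
# The real conversion scripts map UD tags to Apertium tags;
# here we simply lowercase the UD values for illustration.
def conllu_token_to_pair(line):
    cols = line.split("\t")
    upos, feats, deprel = cols[3], cols[5], cols[7]
    tags = ["<%s>" % upos.lower()]
    if feats != "_":
        for feat in feats.split("|"):
            name, value = feat.split("=")
            tags.append("<%s>" % value.lower())
    return tags, "@" + deprel

line = "1\tvino\tvino\tNOUN\t_\tGender=Masc|Number=Sing\t2\tnsubj\t_\t_"
tags, label = conllu_token_to_pair(line)
print(tags, label)  # ['<noun>', '<masc>', '<sing>'] @nsubj
```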

2. A simple RNN that labels sentences was built. It works with fastText embeddings for every tag seen in the corpus: the embedding for a word is simply the sum of the embeddings of all its tags.
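The tag-summing scheme can be sketched as follows; the toy random vectors stand in for the real fastText tag embeddings, and all names are illustrative:

```python
import random

random.seed(0)
DIM = 4  # toy embedding dimension; the real fastText vectors are larger

# Toy stand-ins for fastText tag embeddings: one vector per tag.
tag_vectors = {tag: [random.random() for _ in range(DIM)]
               for tag in ["<n>", "<m>", "<sg>"]}

def word_embedding(tags):
    """Embed a word as the element-wise sum of its tag embeddings."""
    vec = [0.0] * DIM
    for tag in tags:
        for i, x in enumerate(tag_vectors[tag]):
            vec[i] += x
    return vec

vec = word_embedding(["<n>", "<m>", "<sg>"])
```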

3. The labeller itself was created, along with a testpack for two language pairs: it contains all the data needed for sme-nob and kmr-eng, the labeller and an installation script.

== List of commits ==

All commits are listed below:

https://github.com/deltamachine/shallow_syntactic_function_labeller/commits/master

== Description ==

The shallow syntactic function labeller takes a string in Apertium stream format, parses it into a sequence of morphological tags and gives it to a classifier. The classifier is a simple RNN model trained on datasets prepared from parsed syntax-labelled corpora (mostly UD treebanks). The classifier analyzes the given sequence of morphological tags and outputs a sequence of labels; the labeller then applies these labels to the original string.
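As a minimal sketch of this round trip (with a plain regex in place of the real stream-format parsing and a hard-coded label in place of the classifier's prediction), a unit like ^vino&lt;n&gt;&lt;m&gt;&lt;sg&gt;$ would be processed as:

```python
import re

def extract_tags(unit):
    """Pull the morphological tags out of one stream-format unit:
    '^vino<n><m><sg>$' -> ['<n>', '<m>', '<sg>']"""
    return re.findall(r"<[^<>]+>", unit)

def apply_label(unit, label):
    """Insert a syntactic label before the closing '$' of the unit."""
    return unit[:-1] + label + "$"

unit = "^vino<n><m><sg>$"
tags = extract_tags(unit)                  # what the classifier sees
labelled = apply_label(unit, "<@nsubj>")   # label predicted by the model
print(labelled)  # ^vino<n><m><sg><@nsubj>$
```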

=== Labeller in the pipeline ===

The labeller runs between the morphological analyzer (or disambiguator) and pretransfer.

For example, in sme-nob it runs between sme-nob-disam and sme-nob-pretransfer, just like the original syntax module.

<pre>
... | cg-proc 'sme-nob.mor.rlx.bin' | python 'sme-nob-labeller.py' | apertium-pretransfer | lt-proc -b 'sme-nob.autobil.bin' | ...
</pre>

=== Language pair support ===

Currently the labeller works with the following language pairs:

* sme-nob: the labeller can fully replace the original syntax module (it does not have all the functionality of the original CG, but still works reasonably well)
* kmr-eng: can be tested in the pipeline, but the pair has only a few rules that look at syntax labels

There is also all the needed data for Breton, Kazakh and English (https://github.com/deltamachine/shallow_syntactic_function_labeller/tree/master/models), but at the moment br-fr, kk-tat and en-ca simply don't have syntax rules, so the labeller cannot be tested on them.

=== Labelling performance ===

The results of validating the labeller on the test set (accuracy = mean accuracy score on the test set).

{|class=wikitable
|-
! Language !! Accuracy
|-
| North Sami || 81.6%
|-
| Kurmanji || 84%
|-
| Breton || 79.7%
|-
| Kazakh || 82.6%
|-
| English || 79.8%
|}

== Installation ==

=== Prerequisites ===

1. Python libraries:

* DyNet (installation instructions can be found here: http://dynet.readthedocs.io/en/latest/python.html)
* streamparser (https://github.com/apertium/streamparser)

2. Precompiled language pairs that support the labeller (sme-nob, kmr-eng)

=== How to install a testpack ===

NB: currently the testpack contains syntax modules only for sme-nob and kmr-eng.

<pre>
git clone https://github.com/deltamachine/sfl_testpack.git
cd sfl_testpack
</pre>

The script ''setup.py'' adds all the needed files to the language pair directory and modifies the relevant modes files.

'''Arguments:'''

* ''work_mode:'' '''-lb''' installs the labeller and changes the modes; '''-cg''' reverts the changes and restores the original syntax module (sme-nob.syn.rlx.bin or kmr-eng.prob) in the pipeline.
* ''lang:'' '''-sme''' installs/uninstalls the labeller only for sme-nob, '''-kmr''' only for kmr-eng, '''-all''' for both.

For example, this command installs the labeller and adds it to the pipeline for both pairs:

<pre>
python setup.py -lb -all
</pre>

And this command reverts the modes changes for sme-nob:

<pre>
python setup.py -cg -sme
</pre>

== Bugs ==

1. <s>The installation script changes the eng-kmr pipeline along with kmr-eng</s>

2. <s>Problems with tag order (the syntactic label is not the last tag)</s>

3. <s>Words-without-a-label bug</s>

<pre>
<spectre> is it possible that some words don't get a label ?
<spectre> $ echo "Barzanî di peyama xwe de behsa mijarên girîng û kirîtîk kir." | apertium -d . kmr-eng-tagger
<spectre> ^Barzanî<np><ant><m><sg><obl><@dobj>$ ^di<pr><@case>$ ^peyam<n><f><sg><con><def><@nmod>$ ^xwe<prn><ref><mf><sp><@nmod:poss>$
^de<post><@case>$ ^behs<n><f><sg><con><def>$ ^mijar<n><f><pl><con><def><@nmod:poss>$ ^girîng<adj><@amod>$ ^û<cnjcoo><@cc>$ ^*kirîtîk$
^kirin<vblex><tv><past><p3><sg>$^..<sent><@punct>$
</pre>

Bugs 2 and 3 seem to be fixed, but this should be checked carefully.

== To do ==

* Do more tests. MORE.
* '''Fix bugs'''
* Refactor the main code.
* '''Continue improving the performance of the models.'''