Difference between revisions of "Shallow syntactic function labeller/Workplan"

Revision as of 07:33, 15 August 2017

A workplan and all progress notes about Shallow syntactic function labeller GSoC 2017 project.

Workplan

Week Dates To do
1 30th May - 5th June
  • Handling discrepancies between the Apertium sme-nob and Sami corpus tagsets
  • Writing a script for parsing the Sami corpus
2-3 6th June - 19th June
  • Writing scripts for replacing UD tags with suitable Apertium tags and for parsing UD trees into datasets for the Kazakh, Breton and English UD dependency treebanks
4 20th June - 26th June
  • Writing scripts for converting the UD treebanks (dev and test) of the needed languages to Apertium stream format (the converted treebanks will be useful for evaluating the quality of the labeller)
First evaluation: ready-to-use datasets
5-6 27th June - 10th July
  • Building and training the classifier
7 11th July - 17th July
  • Further training
  • Working on improvements of the model
8 18th July - 24th July
  • Working on improvements of the model
Second evaluation: well-trained models
9 25th July - 31st July
  • Collecting all parts of the labeller together
  • Adding the machine-learned module in place of the syntax labelling part of the sme-nob CG module to test it
10 1st August - 7th August
  • Adding the machine-learned module in place of the syntax labelling part of the sme-nob CG module to test it
11 8th August - 14th August
  • Testing
  • Fixing bugs
12 15th August - 21st August
  • Cleaning up the code
  • Writing documentation
Final evaluation: the prototype shallow syntactic function labeller

Progress

Week 1: Datasets for North Sami were created.

  • Some tags in the original corpus were replaced with Apertium North Sami tags (like here: https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
  • Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
  • Where a word had two analysis lines, only one analysis was kept.
  • Information about derivation was removed too.
  • Special "fake" syntactical functions were added for CLB and PUNCT: @CLB and @PUNCT.
  • Two types of datasets were created: the first type contains tags for punctuation and clause boundaries and the second does not.
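
The clean-up steps above can be sketched roughly as follows. This is only an illustration, not the project's actual script: the function names, the list-of-strings representation of an analysis, exact matching of the removed tags, and the choice to keep the first of two analyses are all assumptions. Only the list of removed tags and the @CLB and @PUNCT labels come from the notes above.

```python
# Tags listed above as irrelevant for the labeller.
# Exact matching is an assumption; some may be prefixes in the corpus.
IRRELEVANT_TAGS = {"ABBR", "ACR", "Allegro", "G3", "G7", "<ext>", "Foc_", "Qst"}

# "Fake" syntactic functions for tokens that carry no real one.
FAKE_FUNCTIONS = {"CLB": "@CLB", "PUNCT": "@PUNCT"}


def clean_analysis(tags, keep_punct=True):
    """Return a cleaned tag list, or None if the token is dropped.

    `tags` is a hypothetical representation: the tags of one analysis
    as a list of strings, e.g. ["N", "Sg", "Nom", "@SUBJ"].
    """
    cleaned = [t for t in tags if t not in IRRELEVANT_TAGS]
    for pos, fake in FAKE_FUNCTIONS.items():
        if pos in cleaned:
            if not keep_punct:
                # Second dataset type: no punctuation or clause boundaries.
                return None
            if fake not in cleaned:
                cleaned.append(fake)
    return cleaned


def keep_first_analysis(analyses):
    """Keep one analysis per word; taking the first one is an assumption."""
    return analyses[0] if analyses else None
```

Calling `clean_analysis` with `keep_punct=False` corresponds to the second dataset type, where punctuation and clause-boundary tokens are left out.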

Weeks 2-3: Datasets for Kazakh, Breton and English were created.

NB: the datasets for North Sami and English are fairly large, whereas the Kazakh dataset is comparatively small and the Breton one is even smaller. This gives us an opportunity to check how much data is enough for training the labeller and whether good results can be achieved with a very small amount of data (as in the case of Breton).

  • All dependency treebanks were "flattened": words with the @conj and the @parataxis relation took the label of their head (https://github.com/deltamachine/wannabe_hackerman/blob/master/flatten_conllu.py).
  • For all languages two types of datasets were created: the first type contains tags for punctuation and the second does not.
  • Kazakh
    • some mistakes in the CoNLL-U file were corrected
    • duplicate lines were removed
  • English
    • duplicate lines were removed
    • all UD POS and feature tags were replaced with Apertium tags
  • Breton
    • some mistakes in the CoNLL-U file were corrected
    • duplicate lines were removed
    • all UD feature tags were replaced with Apertium tags
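
The "flattening" step mentioned above can be illustrated with a short sketch. It is not the linked flatten_conllu.py script: the function name, the tuple representation of a sentence and the decision to follow chains of conj/parataxis heads until an ordinary label is found are assumptions made for the example.

```python
FLATTENED_RELATIONS = {"conj", "parataxis"}


def flatten_labels(sentence):
    """Replace conj/parataxis labels with the label of the head word.

    `sentence` is a list of (token_id, head_id, deprel) tuples with
    1-based ids, as in the ID, HEAD and DEPREL columns of CoNLL-U.
    Returns a dict mapping token_id to its flattened label.
    """
    info = {tid: (head, rel) for tid, head, rel in sentence}

    def resolve(tid, seen=()):
        head, rel = info[tid]
        # Stop at ordinary labels, at the root, or on a malformed cycle.
        if rel not in FLATTENED_RELATIONS or head not in info or tid in seen:
            return rel
        return resolve(head, seen + (tid,))

    return {tid: resolve(tid) for tid in info}


# "cats and dogs sleep": "dogs" is a conj of "cats", so it takes nsubj.
example = [(1, 4, "nsubj"), (2, 3, "cc"), (3, 1, "conj"), (4, 0, "root")]
labels = flatten_labels(example)
```

After flattening, each coordinated or paratactic word carries the same function label as the word it depends on, so the labeller never has to predict conj or parataxis itself.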

Week 4: Scripts for converting the Kazakh, Breton and English UD treebanks to Apertium stream format were written.
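
As a rough illustration of the target format, the sketch below builds Apertium stream lexical units of the form ^surface/lemma<tag><tag>$ from already-converted tokens. The function names and the tuple layout are hypothetical, appending the syntactic function as a final <@label> tag is an assumption about the dataset layout, and escaping of reserved characters in surface forms is omitted.

```python
def to_stream_unit(surface, lemma, tags, function=None):
    """Build one lexical unit: ^surface/lemma<tag1><tag2>...$

    `tags` are Apertium tags without angle brackets; `function` is an
    optional syntactic function label, appended as <@label> (assumption).
    """
    all_tags = list(tags)
    if function:
        all_tags.append("@" + function)
    tag_string = "".join("<{}>".format(t) for t in all_tags)
    return "^{}/{}{}$".format(surface, lemma, tag_string)


def sentence_to_stream(tokens):
    """Join the lexical units of one sentence with single spaces."""
    return " ".join(to_stream_unit(*tok) for tok in tokens)


# Hypothetical tokens: (surface, lemma, tags, function)
stream = sentence_to_stream([
    ("dogs", "dog", ["n", "pl"], "nsubj"),
    ("sleep", "sleep", ["vblex", "pres"], "root"),
])
```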

Weeks 5-6: Two types of networks were built: a simple RNN and an encoder-decoder network with an attention mechanism. The simple RNN seems to give better results in all our cases. The encoder-decoder network gives acceptable results only on very large datasets, such as the English one; on a small corpus, such as Breton, it is useless, whereas the simple RNN can work with small datasets.

Weeks 7-8: Newly created:

  • New datasets (tokens + tags instead of just tags)
  • Word2vec and fastText embeddings

fastText embeddings helped to improve the accuracy of the Kazakh model by 10% and the accuracy of the English model by 3%. Current results: 76% for Breton, 78% for Kazakh, 82% for North Sami and 80% for English. However, the models still need to be improved.

Weeks 9-10: The labeller was built and successfully substituted for the original syntax module in the sme-nob pipeline. A script for changing the modes was also written. Everything seems to be okay, though the translation may not be as good as with the original module, perhaps because of some additional functionality in the original sme-nob.syn.rlx.bin.

The main things to do:

  • Add an ability to handle more than one sentence.
  • Do more tests.
  • Write docstrings and refactor the main code.