Difference between revisions of "Shallow syntactic function labeller"

Revision as of 18:40, 5 June 2017

This is a Google Summer of Code 2017 project (proposal: http://wiki.apertium.org/wiki/User:Deltamachine/proposal).

A repository for the project: https://github.com/deltamachine/shallow_syntactic_function_labeller

Architecture

1. The labeller takes a string in Apertium stream format with morphological tags:

^vino<n><m><sg>$ = INPUT

2. Parses it into a sequence of morphological tags:

<n><m><sg>

3. Loads the model for this language (stored in the same directory as a .json or .pkl file).

4. The algorithm analyzes the tag sequence and outputs a sequence of syntactic tags:

<@nsubj>

5. The labeller applies the predicted labels to the original string:

^vino<n><m><sg><@nsubj>$ = OUTPUT

So the final result will consist of the module itself and a file with a model.
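
The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the names `UNIT_RE`, `extract_tags`, `apply_labels` and `dummy_predict` are hypothetical, the predictor is a stand-in for the trained model, and the regular expression only covers the simplified `^lemma<tags>$` form shown in the example (no ambiguous analyses or escaped characters).

```python
import re

# One lexical unit in the simplified form used above: ^lemma<tag><tag>...$
# (assumption: no "/"-separated alternative analyses, no escaped characters)
UNIT_RE = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")


def extract_tags(stream):
    """Step 2: return the morphological tag string of every lexical unit."""
    return [match.group(2) for match in UNIT_RE.finditer(stream)]


def dummy_predict(tag_sequences):
    """Steps 3-4 placeholder: a real model would be loaded from disk and
    queried here. This stub labels everything as @nsubj."""
    return ["@nsubj" for _ in tag_sequences]


def apply_labels(stream, labels):
    """Step 5: append one syntactic label to each lexical unit, in order."""
    remaining = iter(labels)

    def add_label(match):
        return "^" + match.group(1) + match.group(2) + "<" + next(remaining) + ">$"

    return UNIT_RE.sub(add_label, stream)


if __name__ == "__main__":
    text = "^vino<n><m><sg>$"
    tags = extract_tags(text)
    print(apply_labels(text, dummy_predict(tags)))
```

Keeping extraction and label application as separate functions means the model only ever sees tag sequences, so it can be swapped without touching the stream handling.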

Workplan

Week Dates To do
1 30th May – 5th June
  • Handling discrepancies between the Apertium sme-nob and Sami corpus tagsets
  • Writing a script for parsing the Sami corpus
2 6th June – 12th June
3 13th June – 19th June
4 20th June – 26th June
First evaluation: ready-to-use datasets
5 27th June – 3rd July
  • Building the model
6 4th July – 10th July
  • Training the classifier
  • Evaluating the quality of the prototype
7 11th July – 17th July
  • Further training
  • Working on improvements of the model
8 18th July – 24th July
  • Final testing
  • Writing a script that applies labels to the original string in Apertium stream format
Second evaluation: well-trained model at least for North Sami
9 25th July – 31st July
  • Collecting all parts of the labeller together
  • Adding the machine-learned module in place of the syntax labelling part of the sme-nob CG module to test it
10 1st August – 7th August
  • Adding the machine-learned module in place of the syntax labelling part of the sme-nob CG module to test it
11 8th August – 14th August
  • Testing
  • Fixing bugs
12 15th August – 21st August
  • Cleaning up the code
  • Writing documentation
Final evaluation: the prototype shallow syntactic function labeller

Progress

Week 1: Datasets for North Sami were created.

  • Some tags in the original corpus were replaced with Apertium North Sami tags (as in https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
  • Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
  • Where a word had two lines of analysis, only one analysis was kept.
  • Information about derivation was also removed.
  • Special "fake" syntactic functions were added for CLB and PUNCT: @CLB and @PUNCT.
  • Two types of datasets were created: the first type contains tags for punctuation and clause boundaries, and the second does not.
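
A rough sketch of the cleaning steps listed above, under several assumptions: each word is represented as a list of analyses and each analysis as a list of tag strings (the corpus's real format is not shown here), the first analysis is the one kept, and entries ending in `_` such as `Foc_` are treated as prefixes. The function names and the sample tags are hypothetical, and the tag replacement and derivation removal steps are left out.

```python
# Tags listed above as irrelevant. Entries ending in "_" are treated as
# prefixes (an assumption made for this sketch).
IRRELEVANT = ["ABBR", "ACR", "Allegro", "G3", "G7", "<ext>", "Foc_", "Qst"]

# "Fake" syntactic functions for clause boundaries and punctuation.
FAKE_FUNCTIONS = {"CLB": "@CLB", "PUNCT": "@PUNCT"}


def is_irrelevant(tag):
    """True if the tag should be dropped from the dataset."""
    for entry in IRRELEVANT:
        if entry.endswith("_"):
            if tag.startswith(entry):
                return True
        elif tag == entry:
            return True
    return False


def clean_word(analyses):
    """Keep a single analysis (here: the first) and filter its tags."""
    tags = [tag for tag in analyses[0] if not is_irrelevant(tag)]
    for pos_tag, function in FAKE_FUNCTIONS.items():
        if pos_tag in tags:
            tags.append(function)
    return tags


def build_datasets(sentence):
    """Return (with_punctuation, without_punctuation) versions of a sentence."""
    with_punct = [clean_word(analyses) for analyses in sentence]
    fake = set(FAKE_FUNCTIONS.values())
    without_punct = [tags for tags in with_punct if not fake & set(tags)]
    return with_punct, without_punct


if __name__ == "__main__":
    # Made-up sample: a word with two analyses, a word with an irrelevant
    # tag, and a clause boundary.
    sample = [
        [["N", "Sg", "Nom", "@SUBJ"], ["N", "Sg", "Acc", "@OBJ"]],
        [["V", "Qst", "@FMV"]],
        [["CLB"]],
    ]
    print(build_datasets(sample))
```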