Difference between revisions of "Shallow syntactic function labeller"

From Apertium

Revision as of 18:46, 5 June 2017

This is a Google Summer of Code 2017 project.

Project repository: https://github.com/deltamachine/shallow_syntactic_function_labeller

Architecture

1. The labeller takes a string in Apertium stream format with morphological tags:

^vino<n><m><sg>$ = INPUT

2. Parses it into a sequence of morphological tags:

<n><m><sg>

3. Restores the model for this language (stored in the same directory as a .json or .pkl file).

4. The algorithm analyzes the tag sequence and outputs a sequence of syntactic tags:

<@nsubj>

5. The labeller applies the resulting labels to the original string:

^vino<n><m><sg><@nsubj>$ = OUTPUT
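
Steps 1, 2 and 5 can be sketched in Python as follows. This is only an illustration under assumptions, not the project's actual code: the function names are hypothetical, `dummy_labeller` is a stand-in for the trained model, and the regular expression covers only simple unambiguous lexical units (no `/`-separated ambiguous analyses, no escaped characters).

```python
import re

# One lexical unit in Apertium stream format: ^lemma<tag1><tag2>...$
# Simplifying assumption: no ambiguity ("/") and no escaped characters.
LEXICAL_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")


def extract_tags(stream):
    """Return the morphological tag string of every lexical unit."""
    return [match.group(2) for match in LEXICAL_UNIT.finditer(stream)]


def apply_labels(stream, labels):
    """Append one syntactic label to each lexical unit, in order."""
    remaining = iter(labels)

    def add_label(match):
        return "^" + match.group(1) + match.group(2) + next(remaining) + "$"

    return LEXICAL_UNIT.sub(add_label, stream)


def dummy_labeller(tag_sequences):
    """Hypothetical stand-in for the model: labels every unit <@nsubj>."""
    return ["<@nsubj>" for _ in tag_sequences]


stream = "^vino<n><m><sg>$"
tags = extract_tags(stream)
labelled = apply_labels(stream, dummy_labeller(tags))
print(tags)      # ['<n><m><sg>']
print(labelled)  # ^vino<n><m><sg><@nsubj>$
```

Keeping extraction and label application as separate functions means the classifier only ever sees tag sequences, so it can be swapped without touching the stream handling.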

So, in the end there will be two parts: the labeller module itself and a file containing the model.
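
Restoring the model (step 3) might look like the sketch below. The function name and the convention of naming the file after the language code (for example `sme.json`) are assumptions made for illustration; the page only says that the model sits in the same directory as a .json or .pkl file.

```python
import json
import pickle
from pathlib import Path


def restore_model(directory, language):
    """Load <language>.json or <language>.pkl from the given directory.

    The file naming scheme is hypothetical, chosen for this example.
    """
    directory = Path(directory)
    json_path = directory / (language + ".json")
    pkl_path = directory / (language + ".pkl")

    if json_path.exists():
        with open(json_path, encoding="utf-8") as handle:
            return json.load(handle)
    if pkl_path.exists():
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with open(pkl_path, "rb") as handle:
            return pickle.load(handle)
    raise FileNotFoundError(
        "no model file for " + language + " in " + str(directory)
    )
```

A call such as `restore_model(".", "sme")` would then return whatever object was serialized, which the labeller would use as its classifier.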

Workplan

Week Dates To do
1 30th May — 5th June
  • Handling discrepancies between Apertium sme-nob and Sami corpus tagsets
  • Writing a script for parsing Sami corpus
2 6th June — 12th June
3 13th June — 19th June
4 20th June — 26th June
First evaluation: ready-to-use datasets
5 27th June — 3rd July
  • Building the model
6 4th July — 10th July
  • Training the classifier
  • Evaluating the quality of the prototype
7 11th July — 17th July
  • Further training
  • Working on improvements of the model
8 18th July — 24th July
  • Final testing
  • Writing a script that applies labels to the original string in Apertium stream format
Second evaluation: a well-trained model, at least for North Sami
9 25th July — 31st July
  • Collecting all parts of the labeller together
  • Replacing the syntax labelling part of the sme-nob CG module with the machine-learned module to test it
10 1st August — 7th August
  • Replacing the syntax labelling part of the sme-nob CG module with the machine-learned module to test it
11 8th August — 14th August
  • Testing
  • Fixing bugs
12 15th August — 21st August
  • Cleaning up the code
  • Writing documentation
Final evaluation: the prototype shallow syntactic function labeller

Progress

Week 1: Datasets for North Sami were created.

  • Some tags in the original corpus were replaced with Apertium North Sami tags (as in https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
  • Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
  • Where a word had two analysis lines, only one analysis was kept.
  • Derivation information was also removed.
  • Special "fake" syntactic functions were added for CLB and PUNCT: @CLB and @PUNCT.
  • Two types of datasets were created: the first type contains tags for punctuation and clause boundaries and the second does not.
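
The preprocessing steps above could be sketched roughly as follows. This is not the project's actual script, and several details are assumptions: each token is represented as a list of analyses, each analysis as a list of tag strings; the first analysis is the one kept; `Foc_` is treated as a prefix; the second dataset drops CLB and PUNCT tokens entirely; and `TAG_MAP` holds a placeholder entry, since the real replacements live in modify-tags.regex.

```python
# Tags dropped as irrelevant (taken from the list above). Treating "Foc_"
# as a prefix is an assumption based on its trailing underscore.
REMOVED_TAGS = {"ABBR", "ACR", "Allegro", "G3", "G7", "<ext>", "Qst"}
REMOVED_PREFIXES = ("Foc_",)

# Placeholder mapping only; the real replacements are in modify-tags.regex.
TAG_MAP = {"CORPUS_TAG": "apertium_tag"}

# "Fake" syntactic functions for clause boundaries and punctuation.
FAKE_FUNCTIONS = {"CLB": "@CLB", "PUNCT": "@PUNCT"}


def clean_tags(tags):
    """Drop irrelevant tags and map the rest to the target tagset."""
    kept = []
    for tag in tags:
        if tag in REMOVED_TAGS or tag.startswith(REMOVED_PREFIXES):
            continue
        kept.append(TAG_MAP.get(tag, tag))
    return kept


def add_fake_function(tags):
    """Append @CLB or @PUNCT to an analysis tagged CLB or PUNCT."""
    for tag in tags:
        if tag in FAKE_FUNCTIONS:
            return tags + [FAKE_FUNCTIONS[tag]]
    return tags


def build_datasets(tokens):
    """Return (with_punctuation, without_punctuation) lists of tag lists."""
    with_punct, without_punct = [], []
    for analyses in tokens:
        # Assumption: when a word has several analyses, keep the first.
        tags = add_fake_function(clean_tags(analyses[0]))
        with_punct.append(tags)
        if not any(tag in FAKE_FUNCTIONS for tag in tags):
            without_punct.append(tags)
    return with_punct, without_punct
```

Calling `build_datasets` on a parsed sentence yields both dataset variants in one pass, so the two stay aligned apart from the dropped punctuation and clause-boundary tokens.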