Difference between revisions of "Shallow syntactic function labeller"

Revision as of 18:56, 5 June 2017

1. The labeller takes a string in Apertium stream format with morphological tags:

^vino<n><m><sg>$ = INPUT

2. Parses it into a sequence of morphological tags:

<n><m><sg>

3. Loads the model for this language, which is stored in the same directory as a .json or a .pkl file.

4. The model analyzes the tag sequence and outputs a sequence of syntactic tags:

<@nsubj>

5. The labeller attaches the predicted labels to the original string:

^vino<n><m><sg><@nsubj>$ = OUTPUT
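
Steps 1, 2 and 5 can be sketched roughly as follows. This is only an illustration, not the labeller's actual code: the names `extract_tags`, `apply_labels` and `dummy_predict` are hypothetical, the regular expression only covers the simple `^lemma<tags>$` form shown above (no ambiguous readings or surface forms), and the prediction step is a placeholder for the trained model.

```python
import re

# One lexical unit in the simple form used above: ^lemma<tag1><tag2>...$
UNIT_RE = re.compile(r"\^([^<$]*)((?:<[^<>]+>)*)\$")
TAG_RE = re.compile(r"<[^<>]+>")


def extract_tags(stream):
    """Return one list of morphological tags per lexical unit."""
    return [TAG_RE.findall(m.group(2)) for m in UNIT_RE.finditer(stream)]


def apply_labels(stream, labels):
    """Append one syntactic label to each lexical unit, in order."""
    remaining = iter(labels)

    def add_label(match):
        # Rebuild the unit as ^lemma<tags><label>$
        return "^{}{}{}$".format(match.group(1), match.group(2), next(remaining))

    return UNIT_RE.sub(add_label, stream)


def dummy_predict(tag_sequences):
    """Placeholder for the trained model (assumption): always <@nsubj>."""
    return ["<@nsubj>" for _ in tag_sequences]


if __name__ == "__main__":
    line = "^vino<n><m><sg>$"
    tags = extract_tags(line)
    print(apply_labels(line, dummy_predict(tags)))
```

Keeping the parsing and the re-attachment separate from the prediction step means the model can be swapped without touching the stream handling.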

So the end result will be the labeller module itself plus a file with the model.
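
Locating the model file next to the module might look like the sketch below. The function name `load_model` and the file naming scheme (`<language>.json` or `<language>.pkl`) are assumptions made for illustration; the text above only says that the model sits in the same directory as a .json or .pkl file.

```python
import json
import pickle
from pathlib import Path


def load_model(directory, language):
    """Load <language>.json or <language>.pkl from `directory` (hypothetical naming)."""
    directory = Path(directory)
    json_path = directory / (language + ".json")
    pkl_path = directory / (language + ".pkl")
    if json_path.exists():
        with json_path.open(encoding="utf-8") as handle:
            return json.load(handle)
    if pkl_path.exists():
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with pkl_path.open("rb") as handle:
            return pickle.load(handle)
    raise FileNotFoundError(
        "no model file for {!r} in {}".format(language, directory)
    )
```

A caller would typically pass the directory of the module itself, for example `load_model(Path(__file__).parent, "sme")`.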

Week	Dates	To do
1	30th May – 5th June	~~Handling discrepancies between Apertium sme-nob and Sami corpus tagsets~~; ~~Writing a script for parsing Sami corpus~~
2	6th June – 12th June	Writing a script for replacing UD tags with suitable Apertium tags and parsing UD trees into a dataset
3	13th June – 19th June	Writing a script for replacing UD tags with suitable Apertium tags and parsing UD trees into a dataset (continued)
4	20th June – 26th June
First evaluation	Ready-to-use datasets
5	27th June – 3rd July	Building the model
6	4th July – 10th July	Training the classifier; evaluating the quality of the prototype
7	11th July – 17th July	Further training; working on improvements to the model
8	18th July – 24th July	Final testing; writing a script that applies labels to the original string in Apertium stream format
Second evaluation	Well-trained model, at least for North Sami
9	25th July – 31st July	Collecting all parts of the labeller together; replacing the syntax labelling part of the sme-nob CG module with the machine-learned module to test it
10	1st August – 7th August	Replacing the syntax labelling part of the sme-nob CG module with the machine-learned module to test it
11	8th August – 14th August	Testing; fixing bugs
12	15th August – 21st August	Cleaning up the code; writing documentation
Final evaluation	The prototype shallow syntactic function labeller

Week 1: Datasets for North Sami were created.

Some tags in the original corpus were replaced with Apertium North Sami tags (as in https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
Where a word had two lines of analysis, only one analysis was kept.
Information about derivation was also removed.
Special "fake" syntactic functions were added for CLB and PUNCT: @CLB and @PUNCT.
Two types of datasets were created: the first type contains tags for punctuation and clause boundaries and the second does not.
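
The cleaning steps above could be sketched as below. This is a rough illustration rather than the script that was actually written. It assumes each token is a pair of (list of tags, syntactic function), treats an entry ending in `_` (such as `Foc_`) as a tag prefix, and uses hypothetical names (`clean_token`, `build_datasets`). Picking one analysis per word and stripping derivation are left out because they depend on the corpus format.

```python
# Tags listed above as irrelevant. An entry ending in "_" is treated as a
# prefix (assumption); all other entries must match the tag exactly.
IRRELEVANT = ("ABBR", "ACR", "Allegro", "G3", "G7", "<ext>", "Foc_", "Qst")

# "Fake" syntactic functions for clause boundaries and punctuation.
FAKE_FUNCTIONS = {"CLB": "@CLB", "PUNCT": "@PUNCT"}


def is_irrelevant(tag):
    """Return True if the tag should be dropped from the dataset."""
    for entry in IRRELEVANT:
        if entry.endswith("_"):
            if tag.startswith(entry):
                return True
        elif tag == entry:
            return True
    return False


def clean_token(tags, function=None):
    """Drop irrelevant tags and give CLB/PUNCT tokens a fake function."""
    kept = [tag for tag in tags if not is_irrelevant(tag)]
    for tag, fake in FAKE_FUNCTIONS.items():
        if tag in kept:
            function = fake
    return kept, function


def build_datasets(sentence):
    """Return two variants: with and without CLB/PUNCT tokens."""
    cleaned = [clean_token(tags, function) for tags, function in sentence]
    fake_values = set(FAKE_FUNCTIONS.values())
    without = [(tags, f) for tags, f in cleaned if f not in fake_values]
    return cleaned, without
```

Building both variants from the same cleaned sentence keeps the two dataset types aligned, so they differ only in the presence of the punctuation and clause-boundary tokens.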
