Shallow syntactic function labeller

This is a Google Summer of Code 2017 project.

The repository for the project: https://github.com/deltamachine/shallow_syntactic_function_labeller

== Architecture ==

1. The labeller takes a string in Apertium stream format with morphological tags:

^vino<n><m><sg>$ = INPUT

2. Parses it into a sequence of morphological tags:

<n><m><sg>

3. Loads the model for this language (a .json or a .pkl file stored in the same directory).

4. The algorithm analyzes the sequence and outputs a sequence of syntactic tags:

<@nsubj>

5. The labeller applies the predicted labels to the original string:

^vino<n><m><sg><@nsubj>$ = OUTPUT

So, in the end there will be the module itself and a file with the model.
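
For illustration, here is a minimal sketch of steps 1–5 in Python. Everything in it is an assumption made for this sketch: the names (parse_unit, predict_labels, label_unit), the regular expression, and the stub prediction, which stands in for the real model that would be restored from the .json/.pkl file.

<pre>
import re

# A minimal sketch of the pipeline above. All names are hypothetical:
# the real labeller lives in the project repository.
LU_RE = re.compile(r'\^([^<$]*)((?:<[^>]+>)*)\$')

def parse_unit(unit):
    """'^vino<n><m><sg>$' -> ('vino', ['<n>', '<m>', '<sg>'])"""
    match = LU_RE.match(unit)
    lemma, tags = match.group(1), match.group(2)
    return lemma, re.findall(r'<[^>]+>', tags)

def predict_labels(tag_sequence):
    """Stand-in for the trained model, which would map a sequence of
    morphological tags to a syntactic function label."""
    return '<@nsubj>'

def label_unit(unit):
    """Append the predicted label before the closing '$'."""
    lemma, tags = parse_unit(unit)
    return unit[:-1] + predict_labels(tags) + '$'

print(label_unit('^vino<n><m><sg>$'))  # ^vino<n><m><sg><@nsubj>$
</pre>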

== Workplan ==

{| class="wikitable"
! Week !! Dates !! To do
|-
| 1 || 30th May — 5th June ||
* Handling discrepancies between the Apertium sme-nob and Sami corpus tagsets
* Writing a script for parsing the Sami corpus
|-
| 2 || 6th June — 12th June || rowspan="2" align="left" | Writing scripts for replacing UD tags with suitable Apertium tags and parsing UD trees into datasets for the Kazakh, Breton and English UD dependency treebanks
|-
| 3 || 13th June — 19th June
|-
| 4 || 20th June — 26th June || Writing scripts for converting the UD treebanks (dev and test) of the needed languages into Apertium stream format; the converted treebanks will be useful for evaluating the quality of the labeller (a rough conversion sketch follows this table)
|-
! colspan="2" | First evaluation
| Ready-to-use datasets
|-
| 5 || 27th June — 3rd July || Building the model
|-
| 6 || 4th July — 10th July ||
* Training the classifier
* Evaluating the quality of the prototype
|-
| 7 || 11th July — 17th July ||
* Further training
* Working on improvements of the model
|-
| 8 || 18th July — 24th July ||
* Final testing
* Writing a script which applies labels to the original string in Apertium stream format
|-
! colspan="2" | Second evaluation
| A well-trained model at least for North Sami
|-
| 9 || 25th July — 31st July ||
* Collecting all parts of the labeller together
* Adding the machine-learned module in place of the syntax labelling part of the sme-nob CG module to test it
|-
| 10 || 1st August — 7th August || Adding the machine-learned module in place of the syntax labelling part of the sme-nob CG module to test it
|-
| 11 || 8th August — 14th August ||
* Testing
* Fixing bugs
|-
| 12 || 15th August — 21st August ||
* Cleaning up the code
* Writing documentation
|-
! colspan="2" | Final evaluation
| The prototype shallow syntactic function labeller
|}
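
The week 4 conversion can be sketched roughly as below, under the assumption that the treebank's XPOS column already holds dot-separated Apertium tags and DEPREL holds the function label. The column indices follow the CoNLL-U specification; everything else is hypothetical, and the actual converter may differ.

<pre>
# Hypothetical sketch of converting CoNLL-U lines into Apertium stream
# format. Assumes XPOS already holds dot-separated Apertium tags and
# DEPREL the syntactic function.
def conllu_to_stream(conllu_text):
    units = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and sentence-level comments
        cols = line.split('\t')
        lemma, xpos, deprel = cols[2], cols[4], cols[7]
        tags = ''.join('<%s>' % t for t in xpos.split('.'))
        units.append('^%s%s<@%s>$' % (lemma, tags, deprel))
    return ' '.join(units)

row = '1\tvino\tvino\tNOUN\tn.m.sg\t_\t2\tnsubj\t_\t_'
print(conllu_to_stream(row))  # ^vino<n><m><sg><@nsubj>$
</pre>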


== Progress ==

'''Week 1:''' Datasets for North Sami were created.

* Some tags in the original corpus were replaced with Apertium North Sami tags (like here: https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
* Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
* In cases where there were two lines of analysis for one word, only one analysis was kept.
* Information about derivation was removed as well.
* Special "fake" syntactic functions were added for CLB and PUNCT: @CLB and @PUNCT.
* Two types of datasets were created: the first type contains tags for punctuation and clause boundaries and the second does not (a rough sketch of one dataset item follows this list).
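
As a rough illustration of the last two items, one dataset item can be thought of as a sequence of morphological tags paired with a syntactic function label, with the "fake" @CLB/@PUNCT labels treated like any other. The analysed-line format and helper names here are assumptions, not the project's actual file layout.

<pre>
import re

# Hypothetical sketch of one dataset item: the morphological tags are
# the input sequence, the syntactic function is the target label.
def to_training_pair(analysis):
    """'vino<n><m><sg><@nsubj>' -> (['<n>', '<m>', '<sg>'], '<@nsubj>')"""
    tags = re.findall(r'<[^>]+>', analysis)
    morph = [t for t in tags if not t.startswith('<@')]
    labels = [t for t in tags if t.startswith('<@')]
    return morph, labels[0] if labels else None

def strip_boundaries(pairs):
    """Second dataset type: drop punctuation and clause-boundary items."""
    return [(m, l) for m, l in pairs if l not in ('<@PUNCT>', '<@CLB>')]
</pre>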

'''Weeks 2-3:''' Datasets for Kazakh, Breton and English were created.

NB: the datasets for North Sami and English are fairly large, while the Kazakh one is comparatively small and the Breton one is even smaller. This gives us an opportunity to check how much data is enough for training the labeller, and whether it is possible to achieve good results with a very small amount of data (as in the case of Breton).

* All dependency treebanks were "flattened": words with the @conj and the @parataxis relation took the label of their head (https://github.com/deltamachine/wannabe_hackerman/blob/master/flatten_conllu.py; a simplified sketch follows this list).
* For all languages two types of datasets were created: the first type contains tags for punctuation and the second does not.
* ''Kazakh''
** some mistakes in the CoNLL-U file were corrected
** duplicate lines were removed
* ''English''
** duplicate lines were removed
** all UD POS and feature tags were replaced with Apertium tags
* ''Breton''
** some mistakes in the CoNLL-U file were corrected
** duplicate lines were removed
** all UD feature tags were replaced with Apertium tags
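
The flattening mentioned in the first item can be approximated as below. This is a simplified reading, not flatten_conllu.py itself: the sentence representation and the handling of chains of @conj/@parataxis heads are assumptions.

<pre>
# Simplified sketch of "flattening": a token whose relation is conj or
# parataxis takes the relation of its head, following chains upwards.
# `sentence` is a list of CoNLL-U column lists (ID=0, HEAD=6, DEPREL=7).
def flatten(sentence):
    by_id = {cols[0]: cols for cols in sentence}

    def resolve(cols):
        # Walk up the tree while the relation is conj/parataxis.
        while cols[7] in ('conj', 'parataxis') and cols[6] in by_id:
            cols = by_id[cols[6]]
        return cols[7]

    # Resolve all labels first, then assign, so the order of tokens
    # within the sentence does not affect the result.
    new_rels = [resolve(cols) for cols in sentence]
    for cols, rel in zip(sentence, new_rels):
        cols[7] = rel
    return sentence
</pre>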