Shallow syntactic function labeller
Revision as of 18:40, 5 June 2017
This is a Google Summer of Code 2017 project (proposal: http://wiki.apertium.org/wiki/User:Deltamachine/proposal).
A repository for the project: https://github.com/deltamachine/shallow_syntactic_function_labeller
Architecture
1. The labeller takes a string in Apertium stream format with morphological tags:
^vino<n><m><sg>$ = INPUT
2. Parses it into a sequence of morphological tags:
<n><m><sg>
3. Loads the model for this language, which is stored in the same directory as a .json or .pkl file.
4. The algorithm analyzes the tag sequence and outputs a sequence of syntactic function labels:
<@nsubj>
5. The labeller attaches the predicted labels to the original string:
^vino<n><m><sg><@nsubj>$ = OUTPUT
The final product will therefore consist of two parts: the labeller module itself and a file containing the model.
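The parsing and label-attaching steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `extract_tags`, `apply_labels` and `toy_predict` are hypothetical, the trained model is replaced by a stand-in that always answers `@nsubj`, and only simple unambiguous lexical units of the form `^lemma<tag>...$` are handled.

```python
import re

# One lexical unit in Apertium stream format: ^lemma<tag1><tag2>...$
# (assumption: no ambiguity markers or escaped characters inside the unit)
UNIT_RE = re.compile(r"\^([^<$^]*)((?:<[^>]+>)*)\$")
TAG_RE = re.compile(r"<[^>]+>")


def extract_tags(stream):
    """Return one list of morphological tags per lexical unit."""
    return [TAG_RE.findall(m.group(2)) for m in UNIT_RE.finditer(stream)]


def apply_labels(stream, labels):
    """Append one syntactic label to each lexical unit, in order."""
    label_iter = iter(labels)

    def add_label(match):
        lemma, tags = match.group(1), match.group(2)
        return "^{}{}<{}>$".format(lemma, tags, next(label_iter))

    return UNIT_RE.sub(add_label, stream)


def toy_predict(tag_sequences):
    """Hypothetical stand-in for the trained model: labels every unit @nsubj."""
    return ["@nsubj" for _ in tag_sequences]


source = "^vino<n><m><sg>$"
tags = extract_tags(source)
labelled = apply_labels(source, toy_predict(tags))
print(tags)
print(labelled)
```

In a real labeller, `toy_predict` would be replaced by the model loaded from the .json or .pkl file, and the number of predicted labels would have to match the number of lexical units.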
Workplan
Week | Dates | To do
---|---|---
1 | 30th May – 5th June |
2 | 6th June – 12th June |
3 | 13th June – 19th June |
4 | 20th June – 26th June |
First evaluation | | Ready-to-use datasets
5 | 27th June – 3rd July | Building the model
6 | 4th July – 10th July |
7 | 11th July – 17th July |
8 | 18th July – 24th July |
Second evaluation | | Well-trained model at least for North Sami
9 | 25th July – 31st July |
10 | 1st August – 7th August |
11 | 8th August – 14th August |
12 | 15th August – 21st August |
Final evaluation | | The prototype shallow syntactic function labeller
Progress
Week 1: Datasets for North Sami were created.
- Some tags in the original corpus were replaced with Apertium North Sami tags (as in https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
- Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
- When a word had two lines of analysis, only one analysis was kept.
- Derivation information was also removed.
- Special "fake" syntactic functions were added for CLB and PUNCT: @CLB and @PUNCT.
- Two types of datasets were created: the first contains tags for punctuation and clause boundaries, and the second does not.
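Two of the cleaning steps above (dropping irrelevant tags and adding the fake functions) might look roughly like the sketch below. It rests on several assumptions that the text does not confirm: each token is already available as a list of tag strings plus an optional syntactic function, the irrelevant tags are matched literally (if Foc_ is a prefix in the corpus, the check would need adjusting), and the second dataset type is produced by dropping CLB and PUNCT tokens entirely. The function name `clean_token` and the example tags N, Sg, Nom and @SUBJ are hypothetical.

```python
# Tags listed above as irrelevant for the labelling task (matched literally).
IRRELEVANT_TAGS = {"ABBR", "ACR", "Allegro", "G3", "G7", "<ext>", "Foc_", "Qst"}

# "Fake" syntactic functions for clause boundaries and punctuation.
FAKE_FUNCTIONS = {"CLB": "@CLB", "PUNCT": "@PUNCT"}


def clean_token(tags, function=None, keep_boundaries=True):
    """Return (tags, function) for one token, or None if the token is dropped.

    keep_boundaries=True  -> first dataset type (CLB/PUNCT kept, fake label added)
    keep_boundaries=False -> second dataset type (CLB/PUNCT tokens removed; assumption)
    """
    kept = [tag for tag in tags if tag not in IRRELEVANT_TAGS]
    for tag, fake_function in FAKE_FUNCTIONS.items():
        if tag in kept:
            if not keep_boundaries:
                return None
            return kept, fake_function
    return kept, function


print(clean_token(["N", "ABBR", "Sg", "Nom"], "@SUBJ"))
print(clean_token(["CLB"]))
print(clean_token(["PUNCT"], keep_boundaries=False))
```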