Difference between revisions of "Shallow syntactic function labeller"
Revision as of 18:46, 5 June 2017
This is a Google Summer of Code 2017 project.
Project repository: https://github.com/deltamachine/shallow_syntactic_function_labeller
Architecture
1. The labeller takes a string in Apertium stream format with morphological tags:
^vino<n><m><sg>$ = INPUT
2. Parses it into a sequence of morphological tags:
<n><m><sg>
3. Loads the trained model for this language, which is stored in the same directory as a .json or .pkl file.
4. The model analyzes the tag sequence and outputs a sequence of syntactic function labels:
<@nsubj>
5. The labeller attaches the predicted labels to the original string:
^vino<n><m><sg><@nsubj>$ = OUTPUT
The final deliverable will therefore consist of two parts: the labeller module itself and a file containing the model.
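Steps 1, 2 and 5 can be illustrated with a minimal Python sketch. It is not the project's actual code: the names `LU_PATTERN`, `extract_tags`, `apply_labels` and `dummy_model` are hypothetical, the stand-in model simply returns `<@nsubj>` for every word, and the input is assumed to be an unambiguous stream without escaped characters.

```python
import re

# One lexical unit in Apertium stream format: ^surface<tag><tag>...$
# Assumption: no ambiguity ("/") and no escaped characters in the stream.
LU_PATTERN = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")


def extract_tags(stream):
    """Step 2: return the morphological tag string of each lexical unit."""
    return [match.group(2) for match in LU_PATTERN.finditer(stream)]


def apply_labels(stream, labels):
    """Step 5: append one syntactic label to each lexical unit, in order."""
    if len(labels) != len(extract_tags(stream)):
        raise ValueError("expected exactly one label per lexical unit")
    remaining = iter(labels)

    def add_label(match):
        return "^{}{}{}$".format(match.group(1), match.group(2), next(remaining))

    return LU_PATTERN.sub(add_label, stream)


def dummy_model(tag_sequences):
    """Hypothetical stand-in for the trained model of steps 3 and 4."""
    return ["<@nsubj>" for _ in tag_sequences]


stream = "^vino<n><m><sg>$"
tags = extract_tags(stream)
print(apply_labels(stream, dummy_model(tags)))
```

In a real labeller, `dummy_model` would be replaced by the model restored from the .json or .pkl file.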
Workplan
Week | Dates | To do |
---|---|---|
1 | 30th May – 5th June | Handling discrepancies between Apertium sme-nob and Sami corpus tagsets (done); writing a script for parsing Sami corpus (done) |
2 | 6th June – 12th June | |
3 | 13th June – 19th June | |
4 | 20th June – 26th June | |
First evaluation | | Ready-to-use datasets |
5 | 27th June – 3rd July | Building the model |
6 | 4th July – 10th July | |
7 | 11th July – 17th July | |
8 | 18th July – 24th July | |
Second evaluation | | Well-trained model at least for North Sami |
9 | 25th July – 31st July | |
10 | 1st August – 7th August | |
11 | 8th August – 14th August | |
12 | 15th August – 21st August | |
Final evaluation | | The prototype shallow syntactic function labeller |
Progress
Week 1: Datasets for North Sami were created.
- Some tags in the original corpus were replaced with Apertium North Sami tags (like here: https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.regex).
- Some tags were removed from the original corpus as irrelevant: ABBR, ACR, Allegro, G3, G7, <ext>, Foc_, Qst.
- Where a word had two analysis lines, only one analysis was kept.
- Information about derivation was also removed.
- Special "fake" syntactic functions were added for CLB and PUNCT: @CLB and @PUNCT.
- Two types of datasets were created: the first type contains tags for punctuation and clause boundaries and the second does not.
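Part of this preprocessing can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's parsing script: each token is assumed to be already parsed into a list of tag strings, `Foc_` is assumed to be a prefix covering a family of tags, and the names `clean_tags` and `build_datasets` are hypothetical. Tag replacement and derivation removal are not shown.

```python
# Tags listed above as irrelevant for the labelling task.
IRRELEVANT_TAGS = {"ABBR", "ACR", "Allegro", "G3", "G7", "<ext>", "Qst"}
# Assumption: "Foc_" denotes a tag prefix rather than a single tag.
IRRELEVANT_PREFIXES = ("Foc_",)

# "Fake" syntactic functions for clause boundaries and punctuation.
FAKE_FUNCTIONS = {"CLB": "@CLB", "PUNCT": "@PUNCT"}


def clean_tags(tags):
    """Drop irrelevant tags and add @CLB / @PUNCT where applicable."""
    kept = [
        tag for tag in tags
        if tag not in IRRELEVANT_TAGS and not tag.startswith(IRRELEVANT_PREFIXES)
    ]
    for pos_tag, function in FAKE_FUNCTIONS.items():
        if pos_tag in kept and function not in kept:
            kept.append(function)
    return kept


def build_datasets(sentences):
    """Return two datasets: with and without punctuation/clause-boundary tokens."""
    with_punct = [[clean_tags(token) for token in sent] for sent in sentences]
    without_punct = [
        [tok for tok in sent if "@CLB" not in tok and "@PUNCT" not in tok]
        for sent in with_punct
    ]
    return with_punct, without_punct
```

Keeping both variants makes it possible to compare later whether punctuation and clause-boundary tokens help the model.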