User:Ruthenian8/GSOC 2021 progress report
Latest revision as of 09:17, 18 August 2021
- Title: Morphological analyzer for Bagvalal
- Proposal: proposal (https://drive.google.com/file/d/1Y05eQtFP7ioz50z2GlUdvB2Edh4lel6G/view?usp=sharing)
- Abstract: Bagvalal is an endangered, typologically rare Caucasian language of the Nakh-Daghestanian family. Its conservation and study are constrained by the lack of NLP tools that can be used to process field data.
My proposal is to develop an FST-powered morphological analyzer for Bagvalal using all the available grammatical and lexical information. In the future, this project could allow Apertium to support morphological analysis for multiple Nakh-Daghestanian languages and to develop the corresponding language pairs.
- GitHub repo (get the code): bagvalal (https://github.com/ruthenian8/bagvalal)
- Status:
The analyzer was implemented from scratch using .lexd, .twol and the HFST toolkit.
It currently supports both the Cyrillic script and the Caucasiologist transcription and can be used for lemmatizing and glossing Bagvalal corpora.
It covers all the parts of speech present in the language.
Naïve coverage 83%, type coverage 76%.
- What remains to be done:
  - Correct the two-level morphology rules to reduce generation ambiguity (causes test failure).
  - Manually add morphologically exceptional words to raise the coverage.
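The two coverage figures in the status above measure different things. As a minimal sketch, assuming the usual reading in which naïve coverage is the share of corpus tokens that receive at least one analysis and type coverage is the share of distinct word forms that do, they could be computed as follows. The `coverage` function and the `has_analysis` callback are hypothetical names used for illustration; they are not part of the bagvalal repository or of HFST.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def coverage(tokens: Iterable[str],
             has_analysis: Callable[[str], bool]) -> Tuple[float, float]:
    """Return (token_coverage, type_coverage) as fractions in [0, 1].

    has_analysis is a hypothetical callback that reports whether the
    analyzer produced at least one analysis for a given word form.
    """
    counts = Counter(tokens)              # word form -> frequency in the corpus
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0
    known = {form for form in counts if has_analysis(form)}
    # Token-level ("naive") coverage weights each form by its frequency.
    token_cov = sum(counts[form] for form in known) / total_tokens
    # Type-level coverage counts each distinct form once.
    type_cov = len(known) / len(counts)
    return token_cov, type_cov


if __name__ == "__main__":
    # Toy data with placeholder strings, not real Bagvalal word forms.
    corpus = ["a", "a", "a", "b", "c", "d"]
    analysed = {"a", "b"}
    tok, typ = coverage(corpus, analysed.__contains__)
    print(f"token coverage {tok:.0%}, type coverage {typ:.0%}")
```

Because frequent word forms are more likely to be recognized, token-level coverage tends to exceed type coverage, which is consistent with the 83% and 76% reported above. Adding rare, morphologically exceptional words mainly raises the type figure.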
Week | Intended changes | Status |
---|---|---|
Week 1 | Testing and refining the existing rules for the closed word classes (e.g. numerals, clitics and pronouns). | Complete |
Week 2 | Writing documentation and tests. | Complete |
Week 3 & 4 | Testing and refining the existing rules for the open word classes (e.g. verbs, nouns and adjectives). Writing documentation and tests. | Complete |
Week 5 | Adding the missing adjectives and adverbs from the available dictionaries (see the Resources section above). Testing the analysis results and the model performance. | Complete |
Week 6 | Adding the missing nouns. Testing the analysis results and the model performance. | Complete |
Week 7 & 8 | Adding the missing verbs, participles, converbs and masdars. Testing the analysis results and the model performance. | Complete |
Week 9 & 10 | Tokenizing the corpora. Converting the existing annotations to an appropriate format. Creating word-analysis pairs. Writing documentation. | Complete |
Week 11 | Removing false analyses from the model. Testing and debugging. Finishing the work on the documentation. | Complete |
Week 12 | Running all the tests and debugging | Complete |
Development log:

07.08.21

During the discussion with the mentors, it was decided that the Bagvalal-specific glosses, which are present in the grammar and have already been incorporated into the analyzer, should be left unmodified. Since the analyzer is already in a working state, it was also decided to develop additional bilingual plug-ins for Russian and Avar.