User:Mathematic-alpha/gsoc-report
Latest revision as of 10:26, 28 August 2019
PROJECT: Adopt an unreleased language pair with a minimal user interface.
STATS and REPOs
Medumba progress
Date | byv lexicon size | Corpus size | Coverage
---|---|---|---
2019-06-12 | 189 | 8674 | ~0.37
2019-06-20 | 317 | 8674 | ~0.39
2019-07-05 | 317 | 8674 | ~0.56
2019-07-10 | 1186 | 6604 | ~0.83
2019-07-21 | 1703 | 22627 | ~65.21%
2019-07-25 | 1705 | 22637 | ~69.19%
2019-07-25 | 17594 | 22624 | ~77.77%
2019-08-05 | 27896 (???) | 35035 (???) | ~79.62%
2019-08-21 | 12245 | 14823 | ~90.01% (w/o bible text)
2019-08-26 | 22517 | 19223 | ~85.37% (w/ bible text)
Medumba-French progress
Date | byv-fra lexicon size | Coverage | WER, PER
---|---|---|---
2019-06-12 | 1592 | ~0.10 | Something
2019-07-05 | 1592 | ~0.20 | Something
2019-07-10 | 1592 | ~0.35 | Something
Monodix (Apertium-byv)
- My commits
Bidix (Apertium-byv-fra)
- My commits
Interface
Project description
This project adopted the Medumba-Français language pair in Apertium, with the goal of creating a machine translation system between the Medumba and Français languages. Because Medumba is one of the world's low-resource languages, a translation system for it is unlikely to be developed by methods other than rule-based machine translation. As evidence of this, we had only about 12 texts from the Wikipedia incubator (links in Manifest) and almost no Medumba-Français aligned text to use as source material.
Generally, the project consisted of three main parts:
- Morphological analyzer for Medumba
- Medumba-Français bilingual dictionary (bidix)
- Transfer rules
A detailed work plan for the project can be found in the proposal (http://wiki.apertium.org/wiki/User:Mathematic-alpha/proposal).
Morphological Analyzer
We had to develop the morphological analyzer almost from scratch. The most challenging task at the start was finding properly organized word lists, Medumba dictionaries, or even an online corpus. By the end of the Community Bonding period, after 2 weeks of work, we could analyze less than 20% of the words in the wiki corpora. By the end of the Google Summer of Code program, coverage had reached almost 90%, excluding the Bible text, which is written in the old orthography.
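As a rough illustration of how a coverage figure like the ones in this report can be obtained, here is a minimal Python sketch. It assumes the analyzer output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token is marked by a leading `*` in its analysis. The function name `naive_coverage` and the sample tokens (`aaa`, `bbb`, and so on) are placeholders for illustration, not real Medumba data and not the script actually used in the project.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the share of lexical units that received at least one analysis.

    An unknown word carries an analysis starting with '*', e.g. ^foo/*foo$.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Placeholder tokens, not real Medumba words: three known, one unknown.
sample = "^aaa/aaa<n>$ ^bbb/*bbb$ ^ccc/ccc<v>/ccc<n>$ ^ddd/ddd<adj>$"
print(f"{naive_coverage(sample):.2%}")  # prints 75.00%
```

Counting tokens rather than distinct word forms means frequent words weigh more, which is why adding a few common words can raise coverage quickly.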
The morphological analyzer is accompanied by a spellrelax file with rules for transcribing the old orthography to the new one.
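The idea of mapping one orthography onto another can be sketched in Python as follows. This is not the spellrelax format itself, only a stand-in that shows the principle: normalise the text to a single Unicode form (tone marks may be stored as combining characters or as precomposed letters), then rewrite old spellings, longest sequence first. The function name `transcribe` and the pair `sh` → `ʃ` are invented for illustration and are not actual Medumba rules.

```python
import re
import unicodedata


def transcribe(text: str, old_to_new: dict[str, str]) -> str:
    """Rewrite old-orthography sequences on NFC-normalised text."""
    text = unicodedata.normalize("NFC", text)
    if not old_to_new:
        return text
    # Normalise the table too, so keys match the normalised text.
    table = {
        unicodedata.normalize("NFC", old): unicodedata.normalize("NFC", new)
        for old, new in old_to_new.items()
    }
    # Longest keys first, so a two-letter sequence wins over its first letter.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


# Invented pair for illustration only; "u" + combining grave becomes one letter.
print(transcribe("shu\u0300", {"sh": "\u0283"}))
```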
Bilingual Dictionary
Development of the bidix (bilingual dictionary) was halted in favor of the morphological analyzer for Mə̀dʉ̂mbɑ̀, because a good bilingual dictionary requires both monolingual dictionaries to work well.
Transfer rules
We cannot yet correctly reproduce the structure of Français sentences, or generate word forms with the right Français inflections, because we still have few transfer rules. So far, transfer rules have been developed for only a few nouns, but we plan to extend them in the near future.
Plans for future
- Increase coverage as much as possible;
- Write transfer rules;
- Develop the interface.
The community in charge of the development of Medumba (or any Cameroonian language committee) is not well versed in the techniques needed to develop a complete dictionary. The main aim is to provide some sort of abstraction built over GitHub with a combination of CI tools and virtual machines to automate the build.
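A minimal sketch of what such an automated build step might look like is given below. It assumes the module is built with the usual Apertium sequence of `./autogen.sh` followed by `make`; the function name `build_module` and the `runner` parameter are hypothetical, and no particular CI service is implied.

```python
import subprocess
from pathlib import Path

# Assumed build steps: the usual Apertium module sequence.
BUILD_STEPS = [["./autogen.sh"], ["make"]]


def build_module(module_dir: Path, runner=subprocess.run) -> bool:
    """Run each build step inside module_dir and stop at the first failure."""
    for step in BUILD_STEPS:
        result = runner(step, cwd=module_dir)
        if result.returncode != 0:
            return False
    return True
```

A CI job could then call `build_module(Path("apertium-byv"))` after checking out the repository and fail the job when the function returns `False`. Passing `runner` as a parameter makes it possible to try the logic without running a real build.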
Acknowledgements and impressions of GSoC program
First of all, I would like to express my sincere gratitude to my mentor Anastasia Kuznetsova, who has always been very helpful and supportive. Throughout the GSoC program, she helped me whenever I was stuck and was always available. I also thank my mentor Jonathan Washington, who put great effort into making this project a success. Hopefully, this project will motivate others and grow into a larger community effort.