Difference between revisions of "User:Mathematic-alpha/gsoc-report"
Revision as of 01:08, 26 August 2019
PROJECT: Adopt an unreleased language pair with a minimal user interface.
STATS and REPOs
Medumba progress

| Date | byv lexicon size | Corpus size | Coverage |
|---|---|---|---|
| 2019-06-12 | 189 | 8674 | ~37% |
| 2019-06-20 | 317 | 8674 | ~39% |
| 2019-07-05 | 317 | 8674 | ~56% |
| 2019-07-10 | 1186 | 6604 | ~83% |
| 2019-07-21 | 1703 | 22627 | ~65.21% |
| 2019-07-25 | 1705 | 22637 | ~69.19% |
| 2019-07-25 | 17594 | 22624 | ~77.77% |
| 2019-08-05 | 27896 | 35035 | ~79.62% |
| 2019-08-21 | 12245 | 14823 | ~90.01% |
Medumba-French progress

| Date | byv-fra lexicon size | Coverage | WER, PER |
|---|---|---|---|
| 2019-06-12 | 1592 | ~10% | Something |
| 2019-07-05 | 1592 | ~20% | Something |
| 2019-07-10 | 1592 | ~35% | Something |
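The WER column above refers to word error rate. As a rough illustration of what that number measures, here is a minimal sketch that assumes the common definition: word-level edit distance between a reference translation and the system output, divided by the reference length. The function name `word_error_rate` and the sample strings are made up for this example; this is not the evaluation script used in the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenisation and the standard WER definition.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical strings, only to show the call shape.
print(word_error_rate("a b c d", "a x c"))
```

PER (position-independent error rate) ignores word order and would need a separate bag-of-words comparison, which is not sketched here.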
- Monodix (Apertium-byv): my commits
- Bidix (Apertium-byv-fra): my commits
- Interface
Project description
This project adopted the Medumba-French language pair in Apertium, with the goal of building a machine translation system between Medumba and French. Because Medumba is one of the world's low-resource languages, a translation system is unlikely to be built by any method other than rule-based machine translation. As evidence of this, we had only about 12 texts from the Wikipedia Incubator (links in the Manifest) and almost no Medumba-French aligned text to draw on.
Broadly, the project consisted of three main parts:
- Morphological analyzer for Medumba
- Medumba-French bilingual dictionary (bidix)
- Transfer rules
A detailed work plan for the project can be found in the GSoC proposal (http://wiki.apertium.org/wiki/User:Mathematic-alpha/gsoc-proposal).
Morphological Analyzer
We had to develop the morphological analyzer almost from scratch. The most challenging part at the start was finding properly organized word lists or Medumba dictionaries. By the end of the Community Bonding period, after 2 weeks of work, we could analyze only 20% of the words in the wiki corpora. Coverage reached almost 90% by the end of the Google Summer of Code program (excluding the Bible text, which is written in the old orthography).
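For readers unfamiliar with how such coverage figures are obtained, the sketch below shows one simple way to compute naive coverage from analyser output in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word is marked with a leading `*` in its analysis. The function name `naive_coverage` and the sample analyses are illustrative assumptions, not the project's actual script or real Medumba analyses, and the regular expression ignores escaped characters for brevity.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped "/" or "$" inside a unit is not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def naive_coverage(analysed_text):
    """Return the share of tokens that received at least one real analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        analyses = [a for a in match.group(2).split("/") if a]
        total += 1
        # The analyser marks unknown words with a leading asterisk.
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: the tags here are invented for illustration.
sample = "^buma/*buma$ ^mɛn/mɛn<n>$"
print(naive_coverage(sample))
```

Naive coverage counts a token as covered if it gets any analysis at all, so it says nothing about whether that analysis is the correct one.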
The morphological analyzer is accompanied by a twol file that implements 30 phonological rules. Twol is a two-level rule formalism used in Helsinki Finite State Technology (HFST). These rules handle special cases of form changes that could not be resolved in the morphological analyzer itself, for example changes of affixes after nasals. To guard the transducer against phonological errors, we wrote tests for various cases (contained in the "tests" directory).
In addition, we managed to annotate syntactic trees for some of the texts in our wiki corpora, although this work is at an initial stage and requires corrections.
Sources
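Here are some of the sources we used to construct the morphological dictionary:
- Descubrir Corrientes. La Enciclopedia Virtual Correntina (http://descubrircorrientes.com.ar/2012/index.php/diccionario-Medumba/1-Medumba-espanol/1129-tai-c)
- Low Resource Language Dictionaries (https://github.com/LowResourceLanguages/hltdi-l3/tree/master/dicts)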
Bilingual Dictionary
Development of the bidix was paused in favour of the morphological analyser for Mə̀dʉ̂mbɑ̀.
Transfer rules
We cannot yet correctly reproduce French sentence structure or generate word forms with the right French inflections, because we still have few transfer rules. So far, transfer rules exist only for nouns and some postpositions, but we plan to write more in the near future.
Plans for future
- Increase coverage as much as possible;
- Write transfer rules;
- Develop the interface.
Acknowledgements and impressions of GSoC program
First of all, I would like to express my gratitude to my mentor Anastasia Kuznetsova, who was always helpful, supportive, and available, and who helped me whenever I was stuck during the GSoC program. I also thank my mentor Jonathan Washington, who put great effort into making this project a success. Hopefully this project will motivate others and grow into a larger academic work.