Latest revision as of 18:42, 16 September 2018
Adoption of Guarani-Spanish language pair in Apertium
GSoC Commits
All GSoC commits for the project can be found at https://apertium.projectjj.com/gsoc2018/ana-kuznetsova/.
Contacts
Anastasia Kuznetsova
E-mail: menina.indigena.17@gmail.com
GitHub: ana-kuznetsova
IRC: anakuz
Timezone: UTC+3
Project description
The purpose of this project, the adoption of the Guarani-Spanish language pair in Apertium, was to create a machine translation system between Guarani and Spanish. Because Guarani is a low-resource language, such a system is unlikely to be developed by methods other than rule-based machine translation. As evidence of this, our only source data were about 2800 texts from Wikipedia dumps (https://dumps.wikimedia.org/backup-index.html) and a Guarani-Spanish aligned Bible (https://www.bible.com/es/versions/972-guabd-biblia-guarani-tumpa-inee).
In general, the project consisted of three main parts:
- Morphological analyzer for Guarani
- Guarani-Spanish bilingual dictionary (bidix)
- Transfer rules
A detailed work plan for the project can be found here.
Morphological Analyzer
We had to develop the morphological analyzer almost from scratch. The most challenging task at the beginning was finding properly organized word lists or Guarani dictionaries. By the end of the Community Bonding period, after 2 weeks of work, we were able to analyze only 30% of the words in the wiki corpora. Coverage reached almost 90% by the end of the Google Summer of Code program.
GRN-Wiki
    dc ago 8 17:30:19 CEST 2018 109:13135 455256/508418 ~0.89543643222702579374
Bible
    dc set 12 12:15:05 CEST 2018 109:13135 561107/623303 ~0.90021546503065122420
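The figure at the end of each log line is the ratio of analyzed tokens to all tokens. A minimal sketch in Python of how it can be recomputed, assuming the known/total field has the layout shown in the two log lines above (the function name coverage_from_log is ours and is not part of any Apertium tool):

```python
import re

def coverage_from_log(line: str) -> float:
    """Return known/total from a coverage log line (layout assumed from the logs above)."""
    match = re.search(r"(\d+)/(\d+)", line)
    if match is None:
        raise ValueError("no known/total field found")
    known, total = int(match.group(1)), int(match.group(2))
    return known / total

wiki = "dc ago 8 17:30:19 CEST 2018 109:13135 455256/508418 ~0.89543643222702579374"
bible = "dc set 12 12:15:05 CEST 2018 109:13135 561107/623303 ~0.90021546503065122420"

print(f"{coverage_from_log(wiki):.4f}")   # 0.8954
print(f"{coverage_from_log(bible):.4f}")  # 0.9002
```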
In total, our morphological analyzer contains 12 638 lemmas, among them:
- 4455 Nouns
- 2537 Verbs (divided into two groups by transitivity)
- 1668 Adjectives
- 457 Adverbs
This list does not include other lemma categories such as proper names (3052), different kinds of pronouns, barbarisms (most frequently borrowed from Spanish), interjections, etc.
But we are able to analyze even more, because some words have orthographic equivalents (there are several traditions in written Guarani) and some forms found in the corpora are just spelling errors.
Example of annotation produced by the morphological analyzer:
    echo "Ou omba'apo hag̃ua" | lt-proc grn.automorf.bin
    ^Ou/Ou<v><iv><pres>/Ou<v><iv><p3><sg><pres>/Ou<v><iv><p3><pl><pres>$
    ^omba'apo/o<prn><pos><p3><sg>+mbaʼapo<n>$ ^hag̃ua/hag̃ua<post>$
    Viene a trabajo.
We have not done any disambiguation so far, however.
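To see how much ambiguity remains, the analyzer output can be inspected programmatically. The following is a rough Python sketch, not part of the project, that splits the stream output shown above into surface forms and their analyses. It assumes the plain `^surface/analysis/...$` layout and ignores escaped characters, which real Apertium streams may contain; the helper names parse_stream and tags are hypothetical.

```python
import re

def parse_stream(text: str):
    """Split analyzer output into (surface form, [analyses]) pairs."""
    units = []
    for body in re.findall(r"\^(.*?)\$", text):
        surface, *analyses = body.split("/")
        units.append((surface, analyses))
    return units

def tags(analysis: str):
    """Return the tag names of one analysis, e.g. ['v', 'iv', 'pres']."""
    return re.findall(r"<([^>]+)>", analysis)

output = (
    "^Ou/Ou<v><iv><pres>/Ou<v><iv><p3><sg><pres>/Ou<v><iv><p3><pl><pres>$ "
    "^omba'apo/o<prn><pos><p3><sg>+mbaʼapo<n>$ ^hag̃ua/hag̃ua<post>$"
)

for surface, analyses in parse_stream(output):
    status = "ambiguous" if len(analyses) > 1 else "unambiguous"
    print(surface, len(analyses), status)
```

In this example only "Ou" carries more than one analysis, so it is the only form a disambiguator would have to resolve.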
The morphological analyzer is accompanied by a twol file in which 30 phonological rules are implemented. Twol is a formalism used in Helsinki Finite State Technology (HFST). These rules handle special cases of form transformations that could not be solved in the morphological analyzer itself, for example changes of affixes after nasals. To guard our transducer against phonological errors we wrote tests for various cases (contained in the "tests" directory).
In addition, we managed to do syntactic tree annotation (https://github.com/ana-kuznetsova/apertium-grn/blob/master/texts/eval1.txt) for some of the texts from our wiki corpora, although it is at an initial stage and requires corrections.
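Sources
Here are some sources that we used to construct the morphological dictionary:
- Descubrir Corrientes. La Enciclopedia Virtual Correntina (http://descubrircorrientes.com.ar/2012/index.php/diccionario-guarani/1-guarani-espanol/1129-tai-c);
- Low Resource Language Dictionaries (https://github.com/LowResourceLanguages/hltdi-l3/tree/master/dicts).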
Grammar reference
- Estigarribia, B. (2017). Guarani linguistics in the 21st century. 1st ed. BRILL, p.420.
- Krivoshein de Canese, N. and Decoud Larrosa, R. (1983). Gramatica de la lengua guarani. Asuncion: Nemity Krivoshein de Canese.
Bilingual Dictionary
The bilingual dictionary (or bidix) was constructed from the lexicons used in the morphological dictionaries mentioned above. A bidix entry looks like this:
<e><p><l>mbaʼapo<s n="n"/></l><r>tarea<s n="n"/><s n="f"/></r></p><par n="n_n"/></e>
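For readers unfamiliar with the format, the entry pairs a left side (Guarani lemma plus tags) with a right side (Spanish lemma plus tags). A small Python sketch that reads the entry above with the standard library; the helper names are hypothetical, and the sketch assumes simple entries whose lemma is plain text before the first tag:

```python
import xml.etree.ElementTree as ET

ENTRY = '<e><p><l>mbaʼapo<s n="n"/></l><r>tarea<s n="n"/><s n="f"/></r></p><par n="n_n"/></e>'

def read_side(element):
    """Return (lemma, [tags]) for an <l> or <r> element."""
    lemma = element.text or ""
    return lemma, [s.get("n") for s in element.findall("s")]

def read_entry(xml_text: str):
    """Return the left and right sides of a single <e> entry."""
    entry = ET.fromstring(xml_text)
    pair = entry.find("p")
    return read_side(pair.find("l")), read_side(pair.find("r"))

left, right = read_entry(ENTRY)
print(left, right)  # ('mbaʼapo', ['n']) ('tarea', ['n', 'f'])
```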
With this dictionary we can translate left to right, that is, from Guarani to Spanish.
    echo "Ou omba'apo hag̃ua" | apertium -d . grn-spa
    #Venir #el tarea para
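In Apertium output a leading `#` conventionally marks a word the generator could not inflect, `*` a word unknown to the analyzer, and `@` a word missing from the bilingual dictionary. The Python sketch below counts these markers in a translation; splitting on whitespace is a simplification, and the function name marker_report is ours.

```python
from collections import Counter

# Conventional Apertium debugging markers and what they indicate.
MARKERS = {
    "*": "unknown to the analyzer",
    "@": "missing from the bidix",
    "#": "not generated",
}

def marker_report(translation: str) -> Counter:
    """Count whitespace-separated tokens by their leading marker."""
    counts = Counter()
    for token in translation.split():
        counts[MARKERS.get(token[0], "ok")] += 1
    return counts

report = marker_report("#Venir #el tarea para")
print(report["not generated"], report["ok"])  # 2 2
```

For the sentence above, two of the four tokens are flagged as not generated.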
Transfer rules
We cannot yet correctly reproduce the structure of the Spanish sentence, or produce word forms with the right Spanish inflections, because we still have few transfer rules. So far transfer rules have been developed only for nouns and some postpositions, but we plan to write more in the near future.
Plans for the future
- Increase coverage as much as possible;
- Write transfer rules;
- Morphological disambiguation (was not in the list of our goals for GSoC).
Acknowledgements and impressions of the GSoC program
First of all, I would like to express my gratitude to my mentor, Francis Tyers, who has always been very helpful and supportive. In the course of the GSoC program he made me even more curious about linguistic technologies and about the Guarani language. He is highly qualified in both linguistics and technology. Hopefully this project will turn into a larger academic work.
Google Summer of Code was a very good opportunity for me as a person who came from a walk of life other than IT. It helped me to understand the processes of product development. Furthermore, this kind of internship shows that most, though not all, of the tasks in the workplace are painstaking and even monotonous. But eventually the period of monotonous work ends and you face the challenge of finding a nontrivial and interesting solution for concrete cases. This is precisely what holds the interest of the whole work. The creative initiative and the financial support made this program a real work experience for me, with moments of solo work as well as fruitful collaboration with a mentor.
Working in Apertium was a great choice as well, as this organization is already experienced in hosting students for the summer. Its inner structure and the communication among all Apertium members on IRC are well organized and are being improved by a chat-bot. This helps you to understand the challenges of other contributors, and it is always possible to ask a question there if your mentor is unavailable. However, working in an open source organization sometimes becomes difficult because documentation is lacking or not very fresh. That makes you turn to your mentor (or the community) more frequently, even when it would probably be possible to do some things without extra help. My general impression of Apertium is truly positive. It was my pleasure to work there.