Difference between revisions of "User:Anakuznetsova/GSOC 2018 Guarani Spanish"

Revision as of 13:21, 8 August 2018

Adoption of Guarani-Spanish language pair in Apertium

GSoC Commits

All GSoC commits for the project can be found here.

Contacts

Anastasia Kuznetsova

E-mail: menina.indigena.17@gmail.com

GitHub: ana-kuznetsova

IRC: anakuz

Timezone: UTC+3

Project description

The purpose of this project was to create a machine translation system between Guarani and Spanish in Apertium. Because Guarani is a low-resource language, a translation system for it is unlikely to be developed by methods other than Rule-Based Machine Translation. As evidence of this, our only sources were about 2800 texts from Wikipedia dumps [1] and a Guarani-Spanish aligned Bible [2].

The project consisted of three main parts:

  • Morphological analyzer for Guarani
  • Guarani-Spanish bilingual dictionary (bidix)
  • Transfer rules

A detailed work plan for the project can be found here.

Morphological Analyzer

We had to develop the morphological analyzer almost from scratch. The most challenging task at the beginning was finding properly organized word lists or Guarani dictionaries. By the end of the Community Bonding period, after 2 weeks of work, we were able to analyze only 30% of the words in the wiki corpora. Coverage reached (?) % by the end of the Google Summer of Code program; the run from 31 July 2018 below reports about 89.4% (452601/506387).

dt jul 31 11:32:46 CEST 2018 105:13125 452601/506387  ~0.89378479305353415471
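The coverage figure above is a ratio of analyzed tokens to all tokens. As a minimal sketch of how such a ratio could be computed from analyzer output, the snippet below assumes the standard Apertium stream format, in which lt-proc marks an unknown word as `^word/*word$`. The function name `coverage` and the sample string (including the made-up token `xyz`) are hypothetical, and this is not the script that produced the line above.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \^ or \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the share of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word has a single pseudo-analysis that starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical sample: two known tokens and one unknown token ("xyz").
sample = "^Ou/Ou<v><iv><pres>$ ^xyz/*xyz$ ^hag̃ua/hag̃ua<post>$"
print(f"{coverage(sample):.2%}")  # prints 66.67%
```

Applied to the counts reported above, the same ratio is 452601 / 506387, which rounds to 0.8938.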

In total, our morphological analyzer contains 12 638 lemmas, including:

  • 4455 Nouns
  • 2537 Verbs (divided into two groups by transitivity)
  • 1668 Adjectives
  • 457 Adverbs

This list does not include other lemma categories such as Proper Names (3052), different kinds of pronouns, barbarisms (most often borrowed from Spanish), interjections, etc.

We are able to analyze even more forms, because some words have orthographic variants (there are several traditions in written Guarani) and some forms found in the corpora are simply spelling errors.

Example of an annotation produced by the morphological analyzer:

echo "Ou omba'apo hag̃ua" | lt-proc grn.automorf.bin

^Ou/Ou<v><iv><pres>/Ou<v><iv><p2><sg><pres>/Ou<v><iv><p3><pl><pres>$
^omba'apo/o<prn><pos><p3><sg>+mbaʼapo<n>$ ^hag̃ua/hag̃ua<post>$

Vino a trabajar.

Note that we have not done any disambiguation so far.
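To make the ambiguity in the output above concrete, here is a small sketch that splits each lexical unit into its readings and lists the surface forms with more than one reading. It assumes the Apertium stream format shown above; the function name `parse_readings` is hypothetical, and the parser is simplified (it does not handle escaped characters or a literal `+` inside a lemma).

```python
import re

LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_readings(analysed_text):
    """Return a list of (surface, readings) pairs.

    Each reading is a list of (lemma, tags) parts; parts joined by '+'
    in the analysis become separate entries of the same reading.
    """
    result = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        readings = []
        for analysis in analyses.split("/"):
            parts = []
            for part in analysis.split("+"):
                lemma = part.split("<", 1)[0]  # text before the first tag
                parts.append((lemma, TAG.findall(part)))
            readings.append(parts)
        result.append((surface, readings))
    return result


# The analyzer output quoted above, joined into one string.
output = (
    "^Ou/Ou<v><iv><pres>/Ou<v><iv><p2><sg><pres>/Ou<v><iv><p3><pl><pres>$ "
    "^omba'apo/o<prn><pos><p3><sg>+mbaʼapo<n>$ ^hag̃ua/hag̃ua<post>$"
)
parsed = parse_readings(output)
ambiguous = [surface for surface, readings in parsed if len(readings) > 1]
print(ambiguous)  # prints ['Ou']
```

In this example only "Ou" has several readings, so it is the only token a disambiguation step would need to resolve.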

The morphological analyzer is accompanied by a twol file that implements 30 phonological rules. Twol is a formalism used in Helsinki Finite State Technology (HFST). These rules handle special cases of form changes that could not be handled in the morphological analyzer itself, for example changes of affixes after nasals. To protect our transducer from phonological errors, we wrote tests for various cases (contained in the "tests" directory).
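As an illustration of the kind of alternation such rules describe, the sketch below models one well-known case in plain Python: the postposition -pe surfaces as -me after a nasal stem. This is only an assumption about what a rule "after nasals" may look like; it is not taken from the project's twol file, the function names are hypothetical, and the nasality check is a deliberately rough heuristic.

```python
import unicodedata

TILDE = "\u0303"  # combining tilde, present in ã, ñ and g̃ after NFD normalisation


def is_nasal(stem: str) -> bool:
    """Rough heuristic (an assumption): a stem is nasal if it has a tilde-marked
    letter or a plain m/n that does not open a cluster such as mb, nd, ng, nt."""
    decomposed = unicodedata.normalize("NFD", stem.lower())
    if TILDE in decomposed:
        return True
    for i, ch in enumerate(decomposed):
        if ch in "mn" and decomposed[i + 1:i + 2] not in ("b", "d", "g", "t"):
            return True
    return False


def add_locative(stem: str) -> str:
    """Attach the postposition -pe, using the allomorph -me after nasal stems."""
    return stem + ("me" if is_nasal(stem) else "pe")


print(add_locative("óga"))   # prints ógape
print(add_locative("tetã"))  # prints tetãme
```

In the real analyzer this choice is expressed declaratively as a two-level rule and compiled into the transducer, rather than written as procedural code.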

In addition, we have done syntactic tree annotation (https://github.com/ana-kuznetsova/apertium-grn/blob/master/texts/eval1.txt) for some of the texts from our wiki corpora, although it is at an initial stage and requires corrections.


Sources

Here are some of the sources that we used to construct the morphological dictionary.

  1. Discubrir Corrientes. La Enciclopedia Virtual Correntina ;
  2. Low Resource Language Dictionaries.

Grammar reference

  1. Estigarribia, B. (2017). Guarani linguistics in the 21st century. 1st ed. BRILL, p.420.
  2. Krivoshein de Canese, N. and Decoud Larrosa, R. (1983). Gramatica de la lengua guarani. Asuncion: Nemity Krivoshein de Canese.

Bilingual Dictionary

The bilingual dictionary (or bidix) was constructed from the lexicons used for the morphological dictionary mentioned above. A bidix entry looks like this:

<e><p><l>mbaʼapo<s n="n"/></l><r>tarea<s n="n"/><s n="f"/></r></p><par n="n_n"/></e>
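To show how the parts of this entry fit together, the following sketch reads it with Python's standard XML parser and extracts the lemma and tags on each side. The function name `parse_bidix_entry` is hypothetical, and the sketch covers only simple single-word entries like the one above (no multiword lemmas or other child elements).

```python
import xml.etree.ElementTree as ET


def parse_bidix_entry(entry_xml: str) -> dict:
    """Extract (lemma, tags) for the left and right side of one <e> entry."""
    entry = ET.fromstring(entry_xml)
    pair = entry.find("p")

    def side(name):
        node = pair.find(name)
        lemma = node.text or ""                       # text before the first <s/>
        tags = [s.get("n") for s in node.findall("s")]
        return lemma, tags

    paradigm = entry.find("par")
    return {
        "left": side("l"),    # Guarani side
        "right": side("r"),   # Spanish side
        "paradigm": paradigm.get("n") if paradigm is not None else None,
    }


entry = ('<e><p><l>mbaʼapo<s n="n"/></l>'
         '<r>tarea<s n="n"/><s n="f"/></r></p><par n="n_n"/></e>')
parsed_entry = parse_bidix_entry(entry)
print(parsed_entry["left"])   # prints ('mbaʼapo', ['n'])
print(parsed_entry["right"])  # prints ('tarea', ['n', 'f'])
```

The left side carries the Guarani lemma with its noun tag, and the right side carries the Spanish lemma with the noun tag plus the feminine gender tag that the Spanish generator needs.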

With this dictionary we can translate left to right, from Guarani to Spanish.

 
echo "Ou omba'apo hag̃ua" | apertium -d . grn-spa
#Venir #el tarea para
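The markers in this output follow the usual Apertium debugging conventions: `*` marks a word unknown to the analyzer, `@` a lemma missing from the bilingual dictionary, and `#` a form the generator could not produce. The sketch below, with a hypothetical function name `diagnose`, labels each token of a translation accordingly; it assumes whitespace-separated tokens with at most a leading marker.

```python
# Conventional Apertium debug markers and what they signal.
MARKERS = {
    "*": "unknown to the analyzer",
    "@": "missing from the bilingual dictionary",
    "#": "could not be generated",
}


def diagnose(translation: str):
    """Return (word, status) pairs for each whitespace-separated token."""
    report = []
    for token in translation.split():
        status = MARKERS.get(token[0], "ok")
        report.append((token.lstrip("*@#"), status))
    return report


for word, status in diagnose("#Venir #el tarea para"):
    print(f"{word}: {status}")
```

For the output above, "Venir" and "el" are reported as not generated, which is consistent with the lack of transfer rules to supply the right inflection tags.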

Transfer rules

We cannot yet correctly reproduce the structure of the Spanish sentence or generate word forms with the right inflections, because we still have few transfer rules.

Things to do

  1. Increase coverage ;
  2. Write transfer rules ;
  3. Morphological disambiguation.