User:Lguyogiro/GSoC2023Proposal

Contact Information

Name: Robert Pugh
Email: <last_name>rob at iu dot edu
Github: https://github.com/lguyogiro


Why I am interested in Apertium

I am interested in Apertium's rule-based MT approach for a number of reasons:

- It is quite likely the only viable option for many of the languages I am interested in supporting, since most languages lack the digital data (parallel corpora, etc.) that alternative, data-hungry approaches need in order to work well.
- Rule-based machine translation makes results much easier to interpret than popular methods such as gigantic deep networks with millions of parameters. I like the idea that, if there is an issue with the system, a developer can track it down and fix it, instead of simply blaming bad training data.
- It has a relatively low environmental impact. With popular NLP models generating carbon emissions equivalent to flying around the world in a commercial jet, there is a lot to be said for a more minimal approach to language technology, especially when the performance is still good!
- As a linguist, I truly enjoy spending time analyzing and understanding languages. Working with Apertium tools (based on my experience developing monolingual repos for a few languages) allows the engineer to develop a good sense of how the languages behave and how to account for interesting linguistic patterns.

Which of the published tasks are you interested in? What do you plan to do?

Adopt an unreleased language pair. My idea is to develop a language pair for Highland Puebla Nahuatl (`azz`) and Western Sierra Puebla Nahuatl (`nhi`). Both are endangered variants of Nahuatl (`nah`, but different branches), and are in contact where they are spoken. `azz` is a higher-resource language with publications and government materials, whereas `nhi` is substantially more endangered, with only some short stories available. There are monolingual repositories for both, so I'm interested in developing the bilingual dictionaries and transfer rules. This would be really nice to have, especially since speakers of both languages often live in the same or neighboring areas, and there is interest in being able to translate materials from one variant to the other.

My Proposal

Mentors/Experienced members in Contact

Francis Tyers, Daniel Swanson

Brief of deliverables

  • Improvements (namely, lexicon expansion) in the monolingual repos for nhi and azz.
  • Creation of an azz-nhi bilingual dictionary
  • Lexical selection and transfer rules for nhi-azz and azz-nhi
  • Translator for nhi-azz and azz-nhi with WER < 20%
  • Translated text validated via native speaker interviews
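
The WER target above can be checked with a word-level edit distance. What follows is a minimal sketch under stated assumptions, not the evaluation script for this project: the names `edit_distance` and `wer` are illustrative, tokenization is plain whitespace splitting, and punctuation has been dropped from the example sentences (the azz sentence and its nhi translation from the coding challenge). Scoring the untranslated azz sentence against the nhi reference gives a rough baseline that the translator would need to beat.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Example sentences from the coding challenge, with punctuation removed
# (an assumption made here to keep tokenization trivial).
AZZ_SOURCE = "onka ome taman tein ipa nojpal huan tein mochihua tepejxikuako"
NHI_REFERENCE = "catqui ome tlamantl tlen ipah nopal huan tlen mochihua tipehxicuaco"

# Baseline: how far the untranslated azz text is from the nhi reference.
baseline = wer(NHI_REFERENCE, AZZ_SOURCE)
```

On this single sentence, 7 of the 10 reference words differ, so leaving the text untranslated gives a WER of 70%. A real evaluation would average over a held-out test set rather than one sentence.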


Why Google and Apertium should sponsor it

`nhi` and `azz` are two related but quite different languages spoken in adjacent and overlapping areas, where mestizos (Spanish speakers) are the majority in the urban centers. Typical treatment of Nahuatl ignores important dialect distinctions, and resources are frequently available in at most a few variants. To contribute to better representation of Nahuatl speakers in these communities, maximizing resources and encouraging communication in Nahuatl between these two language communities should be prioritized. Furthermore, official documents released to Nahuatl speakers are typically published in only one of these Nahuatl varieties, whereas ideally any official document in Nahuatl released in this region would be translated into both prominent Nahuatl variants. An Apertium MT system for azz-nhi could enable the sharing of available resources and the development of better, tailored language technology, and could support outreach and linguistic landscaping efforts (see below).

How and who it will benefit in society

The Apertium community is committed to under-resourced and minoritized/marginalized languages, and Apertium was designed in particular with related languages in mind. The case of Nahuatl variants has been entirely neglected in the field of MT, and often even in human translation. The Mexican government and language activist organizations typically only produce materials in a small subset of Nahuatl variants. Being able to adapt texts into a neighboring variant would increase the total volume of textual resources available to Nahuatl speakers/readers/writers. Furthermore, since nhi and azz are spoken in adjacent and somewhat overlapping areas, materials for community engagement and linguistic landscaping (e.g. signs or public announcements in Nahuatl) could be quickly translated to reach a larger part of the community. I am hopeful that this project will also lead to future development of other Nahuatl language pairs (there are 30 formally recognized Nahuatl variants). Finally, as a side effect, from an academic perspective the project would be one of the first thorough, computationally-oriented explorations of morphosyntactic variation between Nahuatl variants, and will contribute to an overall better understanding of Nahuatl dialectology.


Workplan

Phase (dates), description of work, and deliverable:

Community bonding period: familiarize myself better with the platform (May 4th - May 24th)
- Read carefully through all Apertium docs related to language pair development
- Harmonize tagsets in nhi and azz repos
- Start expanding monolingual repo coverage by adding stems

Community bonding period: expand monolingual coverage (May 25th - May 28th)
- Add stems to nhi from recent transcription project
- Get formal permission to use mesolex dictionary for azz lexicon
Deliverable: improvements (namely, lexicon expansion) in the monolingual repos for nhi and azz

Week 1: closed categories (May 29th - June 5th)
- Add closed categories (e.g. prn, det, cnj, scnj, etc.) to bilingual dict
- Add lexical selection rules for closed categories

Week 2: nouns to bidix (June 5th - June 12th)
- Add nouns to bilingual dict

Week 3: more nouns & adjectives to bidix (June 12th - June 19th)
- Add adjectives to bilingual dict
- Add relational nouns to bilingual dict
- Lexical selection rules for nouns

Week 4: adverbs to bidix (June 19th - June 26th)
- Add adverbs to bilingual dict
- Lexical selection rules for adverbs

Week 5: Verbs (June 26th - July 3rd)
- Start adding verbs to bilingual dict
- Lexical selection rules for verbs

Week 6: Verbs II (July 3rd - July 10th)
- Continue adding verbs to bilingual dict
- Lexical selection rules for verbs
Deliverable: bilingual dictionary complete (with respect to available data and coverage requirements); lexical selection rules

Week 7: Transfer rules (closed categories, nouns, adjectives) (July 10th - July 17th)
- Explore syntactic variation in available corpora
- Add transfer rules for closed categories, nouns, and adjectives

Week 8: Transfer rules (adverbs, relational nouns) (July 17th - July 24th)
- Transfer rules for adverbs and relational nouns

Week 9: Transfer rules (verbs)
- Transfer rules for verbs
Deliverable: transfer rules

Week 10: Evaluation pt 1
- Identify participants to evaluate translated texts in nhi and azz

Week 11: Evaluation pt 2
- Host interviews to elicit feedback from speakers re intelligibility

Week 12: Evaluation pt 3
- Aggregate results of interviews, offer holistic qualitative evaluation of translation system
Deliverable: validated translated text

Skills

- Programming: Python (13 years), bash scripting (13 years), SQL (10 years), JavaScript.

- comfortable using git.

- operating systems: Linux

- languages: English, Spanish, Nahuatl, Maya

- experience: 8 years in industry as a natural language processing engineer; developed and maintain `apertium-nhi`, contributed to `apertium-azz`, and currently develop `apertium-yua`.

Coding challenge

I have set up an azz-nhi repo [1].

As a test case, I am translating a sentence from a book written in azz: "onka ome taman : tein ipa nojpal huan tein mochihua tepejxikuako" ("there are two types: that which is the cactus his/her medicine and that which grows on the precipice of the mountain"), with nhi translation: "catqui ome tlamantl: tlen ipah nopal huan tlen mochihua tipehxicuaco."

Non-Summer-of-Code plans for the Summer

- vacation during May

- possible conference in July