User:Eden/GSoC2019Report

Introduction

The goal of this project was to start the English-Lingala pair and write a usable version that produces intelligible output.

Evaluation of Work Done

Morphological Analyzer

Code is here. (I committed everything directly into the repo.)
Before GSoC 2019, the Lingala transducer was already fairly well developed: it could accurately recognize and classify most parts of speech. My work mainly consisted of adding more vocabulary and missing morphology. The transducer had ~700 stems before GSoC; as of now, it contains ~1,500. The original goal was ~7,000 stems, but due to a lack of digitized resources I could only get so far. The Wikipedia dump was great because it provided a lot of vocabulary and was also useful for diacritic restoration, but unfortunately it also contained a lot of French, Portuguese, and English words.

I also added missing morphology for adjectives and pronouns to handle the 'old' Lingala orthography, which increased coverage by about 4% at the time. My mentor, Jonorthwash, also added more spell relax rules (thanks again, btw).

Current state of the transducer:

  • Stems: 1,524
  • Wikipedia naïve coverage: 77.29%
  • Bible naïve coverage: 93.72%

Bilingual Dictionary

Code is here.
The bilingual dictionary was written from scratch; vocabulary came mainly from dictionaries and personal knowledge. The bilingual dictionary (apertium-eng-lin) contains a lot of one-to-many entries because of the ambiguous nature of Lingala. I also wrote transfer rules for both directions. Many rules and macros were recycled from more mature pairs (eng-fra, eng-cat), which keeps the code cleaner and makes it easier to add rules later on. Transfer rules were limited to first-level (.t1x) rules because the other levels weren't yet necessary. Given the ambiguity of Lingala, I found lexical selection rules very effective at resolving some of these ambiguities.
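
To illustrate the one-to-many situation, here is a minimal, hypothetical sketch of what such entries look like in the standard Apertium bidix format. The lemmas, glosses, and srl marks are illustrative and not copied from apertium-eng-lin:

  <!-- Hypothetical fragment: one Lingala noun (mbula, glossed here as both
       "rain" and "year") paired with two English lemmas. With English on the
       left, the srl attribute labels each English alternative and " D" marks
       the default used when no lexical selection rule fires
       (apertium-lex-tools convention). -->
  <e srl="rain D"><p><l>rain<s n="n"/></l><r>mbula<s n="n"/></r></p></e>
  <e srl="year">  <p><l>year<s n="n"/></l><r>mbula<s n="n"/></r></p></e>

When translating lin-eng, the bidix lookup emits both English candidates and a lexical selection rule (sketched after the statistics below) picks one from context.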

Something to note is that Lingala has several dialects, each with grammar rules and an orthography that differ slightly from the others. The two main varieties are Literary and Spoken Lingala; you can read more about them in this PDF. The transfer rules and the Lingala transducer work best with Spoken Lingala. The Wikipedia corpus mostly contains Literary Lingala, while other texts (Bible, Quran) are written in Spoken Lingala, which is why the Bible translation has a much more intelligible output than the Wikipedia translation.
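
For reference, the first-level rules follow Apertium's standard .t1x format. Below is a minimal, illustrative sketch of an eng-lin rule that moves an adjective after its noun (adjectives normally follow the noun in Lingala); the category and attribute names are my own, not taken from the pair, and the rule deliberately skips noun-class agreement, which is left to second-level rules (see Future Work):

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative eng-lin .t1x fragment, not from the repository:
       reorder "adjective + noun" into "noun + adjective". -->
  <transfer>
    <section-def-cats>
      <def-cat n="adj">
        <cat-item tags="adj"/>
        <cat-item tags="adj.*"/>
      </def-cat>
      <def-cat n="nom">
        <cat-item tags="n.*"/>
      </def-cat>
    </section-def-cats>
    <section-def-attrs>
      <!-- placeholder definition; real rule files define the attributes they use -->
      <def-attr n="a_pos">
        <attr-item tags="adj"/>
        <attr-item tags="n"/>
      </def-attr>
    </section-def-attrs>
    <section-def-vars>
      <!-- placeholder variable, unused in this sketch -->
      <def-var n="blank"/>
    </section-def-vars>
    <section-rules>
      <rule comment="RULE: adj + noun -> noun + adj (no class agreement yet)">
        <pattern>
          <pattern-item n="adj"/>
          <pattern-item n="nom"/>
        </pattern>
        <action>
          <out>
            <lu><clip pos="2" side="tl" part="whole"/></lu>
            <b pos="1"/>
            <lu><clip pos="1" side="tl" part="whole"/></lu>
          </out>
        </action>
      </rule>
    </section-rules>
  </transfer>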

Current state of the bilingual dictionary:

  • Stems: 1,802
  • Wikipedia naïve coverage: 72.61%
  • Bible naïve coverage: 90.50%
  • WER of a story (lin-eng): 47.15%
  • WER of the same story (eng-lin): 50.93%
  • Lexical selection rules: ~30
  • Testvoc (Wikipedia corpus, lin-eng):
    • Number of tokenised words in the corpus: 589,666
    • Number of tokenised words unknown to analyser: 147,573 — 25.0% of tokens had *
    • Tokenised words unknown to bidix: 0 — 0.0% of tokens had @
    • Tokenised words w/transfer errors or unknown to generator: 12,171 — 2.1% of tokens had #
    • Error-free coverage of analyser only: 442,093 — 75.0% of tokens had no *
    • Error-free coverage of analyser and bidix: 442,093 — 75.0% of tokens had no */@
    • Error-free coverage of the full translator: 429,922 — 72.9% of tokens had no */@/#
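
The lexical selection rules are written in the standard apertium-lex-tools .lrx format. The following is a minimal, hypothetical lin-eng rule, reusing the illustrative mbula example from the bidix sketch above; the context pattern and lemmas are mine, not taken from the actual rule file:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Hypothetical lin-eng lexical selection rule (illustrative only):
       when the ambiguous noun "mbula" is followed by a numeral, pick the
       English translation "year"; otherwise the bidix default applies. -->
  <rules>
    <rule>
      <match lemma="mbula" tags="n.*">
        <select lemma="year"/>
      </match>
      <match tags="num.*"/>
    </rule>
  </rules>

Rules like this are compiled with lrx-comp and applied with lrx-proc after the bidix lookup, before structural transfer.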

Translating verbs proved to be the most difficult part (it accounted for most of the # errors): Lingala verbs encode person, tense, mood, number, and animacy within a single form.

Future Work

  • Transfer rules for verbs that deal with tense, mood, compound forms, and radical extensions.
  • Second-level rules (.t2x) for alliterative agreement, which will result in a more literary Lingala translation.
  • Using offline resources to gather vocabulary.

Acknowledgments

I would like to thank the whole Apertium community, and specifically my mentors, Jonathan Washington, Mikel L. Forcada, and Анастасия Кузнецова, for their support, mentorship, and patience.