User:Mathematic-alpha/gsoc-report

From Apertium

PROJECT: Adopt an unreleased language pair with a minimal user interface.

STATS and REPOs

Medumba progress
Date        byv lexicon size  Corpus size  Coverage
2019-06-12  189               8674         ~0.37
2019-06-20  317               8674         ~0.39
2019-07-05  317               8674         ~0.56
2019-07-10  1186              6604         ~0.83
2019-07-21  1703              22627        ~65.21%
2019-07-25  1705              22637        ~69.19%
2019-07-25  17594             22624        ~77.77%
2019-08-05  27896             35035        ~79.62%
2019-08-21  12245             14823        ~90.01% (w/o bible text)
2019-08-26  22517             19223        ~85.37% (w/ bible text)
Medumba-French progress
Date        byv-fra lexicon size  Coverage  WER, PER
2019-06-12  1592                  ~0.10     Something
2019-07-05  1592                  ~0.20     Something
2019-07-10  1592                  ~0.35     Something
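
The WER and PER columns refer to word error rate and position-independent error rate. As a rough illustration of how such figures can be computed, here is a minimal Python sketch. It assumes whitespace tokenisation and the common definitions (word-level edit distance divided by reference length for WER, bag-of-words mismatch divided by reference length for PER). The function names and the French example sentences are made up for illustration, and this is not necessarily how the project's own figures would be produced.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # Number of words shared by both sentences, counted as multisets.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Illustrative (made-up) reference translation and system output.
reference = "le chat dort ici"
hypothesis = "chat le dort"
print(wer(reference, hypothesis), per(reference, hypothesis))
```

PER is never higher than WER for the same sentence pair, because it does not penalise reordering; a large gap between the two suggests that word order, rather than lexical choice, is the main source of errors.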


Monodix (Apertium-byv)
-My commits

Bidix (Apertium-byv-fra)
-My commits

Interface





Project description

The goal of this project, the adoption of the Medumba-French language pair in Apertium, was to create a machine translation system between Medumba and French. Because Medumba is a low-resource language, a translation system for it is unlikely to be built by any method other than rule-based machine translation. As evidence of this, our only sources were about 12 texts from the Wikipedia incubator (links in Manifest) and virtually no aligned Medumba-French text.

Generally, the project consisted of three main parts:

  • Morphological analyzer for Medumba
  • Medumba-French bilingual dictionary (bidix)
  • Transfer rules

A detailed work plan for the project can be found here.

Morphological Analyzer

We had to develop the morphological analyzer almost from scratch. The most challenging part from the outset was finding properly organized word lists, Medumba dictionaries, or even an online corpus. By the end of the Community Bonding period, after 2 weeks of work, we could analyze less than 20% of the words in the wiki corpora. By the end of the Google Summer of Code program, coverage had reached almost 90%, excluding the Bible text, which is written in the old orthography.
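
Coverage here means the share of corpus tokens that receive at least one analysis. As a minimal sketch of how this can be measured, the Python snippet below counts lexical units in analyzer output, assuming the Apertium stream format in which each unit looks like ^surface/analysis$ and unknown words are marked with an asterisk (^word/*word$). The function name and the sample tokens are hypothetical placeholders, not real Medumba data, and escaped characters inside lexical units are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return (known, total, coverage) for analyzer output.

    A token counts as known when its analysis does not start with '*',
    the marker used for unknown words.
    """
    known = 0
    total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    coverage = known / total if total else 0.0
    return known, total, coverage


# Made-up placeholder tokens, not real Medumba words or tags.
sample = "^foo/foo<n>$ ^bar/*bar$ ^baz/baz<vblex>/baz<n>$ ^./.<sent>$"
print(naive_coverage(sample))
```

Whether punctuation counts towards the total is a design choice; this sketch counts every lexical unit, so figures may differ slightly from those of other coverage scripts.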

The morphological analyzer is accompanied by a spellrelax file with rules for transcribing the old orthography into the new one.


Bilingual Dictionary

Development of the bidix was put on hold in favor of the morphological analyzer for Mə̀dʉ̂mbɑ̀.

Transfer rules

Because we still have few transfer rules, we cannot yet correctly reproduce the structure of the French sentence or generate word forms with the right French inflections. So far, transfer rules have been developed for only a handful of nouns, but we plan to extend them in the near future.

Plans for future

  1. Increase coverage as much as possible;
  2. Write transfer rules;
  3. Develop the interface.

Acknowledgements and impressions of GSoC program

First of all, I would like to express my deep gratitude to my mentor Anastasia Kuznetsova, who was always helpful and supportive. Throughout the GSoC program she was always available and helped me whenever I was stuck. I also thank my mentor Jonathan Washington, who put great effort into making this project a success. Hopefully, this project will motivate others and grow into a larger community effort.