Apertium cat-srd and ita-srd/GSoC 2017

Google Summer of Code 2017 Gianfranco Fronteddu Final report

Work and Commits

You can see my work including the code and a full list of commits here: https://apertium.projectjj.com/gsoc2017/gfro3d.html.

Student Information

Name: Gianfranco Fronteddu

Location: Casteddu, Sardigna

E-mail: gfro3d@gmail.com

IRC: gianfranco

SourceForge: gfro3d

Telegram: gianfro4moros

Skype: gianfranco.fronteddu88

Description

The project for the Google Summer of Code 2017 program with Apertium was the development of a rule-based machine translation system from Catalan to Sardinian and the continuation of last year's project, apertium ita-srd. The idea comes from the desire to develop another tool to support the Sardinian language, as was done last year.

As can be seen in the "Work Plan", there were two phases: a longer one, in June and July, devoted to Catalan-Sardinian, and a shorter one, in August, to start preparing a new Sardinian-Italian translator.

Beyond the "Work Plan", it was decided to also spend part of August improving the Italian-to-Sardinian translator, provided that the objectives of the first, longer phase were achieved on schedule. Once we had verified at the end of July that the results for cat-srd were good, we spent some time on the Italian-to-Sardinian translator.

First phase: Apertium cat-srd (May 29th - July 29th)

The first coding phase, which lasted until the second GSoC evaluation, concerned the Catalan-Sardinian translator. Thanks to the work done by Francis Tyers, the translator was already in the "staging" section, and it started with 2645 entries in the bilingual dictionary, a coverage of about 77% and a word error rate (WER) of 34.8%. The goal was to reach 90% coverage and to lower the WER to less than 15%.

Unlike last year, when it was necessary to develop almost the whole Sardinian morphological dictionary and also to improve aspects of the Italian morphological analyzer, this year we started from two languages already developed on the Apertium platform. We could therefore focus mainly on the transfer from one language to the other, i.e. words and morphological and syntactic structures.

Following the GSoC 2017 timeline, in May and the first week of June ("Community Bonding") a great deal of contrastive analysis between Catalan and Sardinian was done in order to create "pending tests". The cat-srd "pending tests" highlighted structural differences in numerals, possessive forms, expressions of obligation, continuous tenses, past tenses, the future, the conditional and clitics.

Sardinian morphological dictionary

Regarding the Sardinian morphological dictionary, we already had a dictionary of 51,800 words (including the CROS lemmas), developed in the previous year. 15,500 more words have been added: c. 1300 nouns, 800 adjectives, 300 adverbs, 250 verbs and 12,500 proper names. This is, for the most part, scientific and technical terminology and socio-political vocabulary. Except for proper names, for which other criteria were followed, the words to be introduced were selected from a Wikipedia corpus.

The Sardinian dictionary was then cleaned up by removing words that do not conform to the LSC standard and correcting mistakes in the assignment of paradigms (especially the gender assigned to some nouns).

Catalan morphological dictionary

Almost 10,000 proper names have also been added to the Catalan morphological dictionary.

Morphological disambiguation in Catalan

Fifteen new morphological disambiguation rules for Catalan have been written, and a few existing ones have been modified.

Lexical selection rules

The translator has 274 lexical selection rules. These are rules that choose between two or more possible translations in a given context.

Transfer rules

The translator has 78 transfer rules. These are rules that modify the structure of the sentence in Catalan to fit the structure needed in Sardinian. For example:

  • El meu llibre > Su libru meu
  • Vaig menjar > Apo mandigadu
  • He anat > So andadu
  • Vull saludar-lo > Lu chèrgio saludare
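
The first example above (possessive reordering) can be illustrated with a small Python sketch. It is only an illustration of the idea, not the translator's actual rule: real Apertium transfer rules are written in Apertium's own rule format, and the tag names `det`, `pos` and `n` used here are assumptions made for the example. The input is assumed to be already translated word by word, so only the word order changes.

```python
from typing import List, Tuple

# A token is a (word, tag) pair; the tag names are hypothetical.
Token = Tuple[str, str]


def reorder_possessives(tokens: List[Token]) -> List[Token]:
    """Rewrite every 'det pos n' sequence as 'det n pos'."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if [tag for _, tag in window] == ["det", "pos", "n"]:
            det, pos, noun = window
            # Sardinian places the possessive after the noun.
            out.extend([det, noun, pos])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    phrase = [("su", "det"), ("meu", "pos"), ("libru", "n")]
    print(" ".join(word for word, _ in reorder_possessives(phrase)))
```

A pattern-and-action shape like this (match a fixed sequence of categories, then emit the words in a new order) is the general idea behind structural transfer; the other examples in the list also involve changing auxiliaries or moving clitics, which this sketch does not cover.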

Quality

Quality assessment is used to see how the translator works in practice.

Word Error Rate (WER) indicates the proportion of words that have to be changed before the translated text can be published. According to the Work Plan, the goal was a WER below 15%. The translator reaches 13.9% (calculated on about 600 words of text taken at random from Wikipedia).
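
As a rough illustration of how such a figure can be obtained, the sketch below computes WER as the word-level edit distance between the machine translation and its post-edited version, divided by the length of the post-edited text. This is a minimal sketch under that assumption, not the evaluation script actually used for the report; the function name and the sample strings are chosen only for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    post_edited = "Su libru meu"
    machine_output = "Su meu libru"
    print(f"WER: {word_error_rate(post_edited, machine_output):.1%}")
```

In this toy case two of the three words must be changed, so the rate is two thirds; on a real evaluation the same ratio is computed over the whole post-edited sample.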

Coverage (the percentage of words recognized by the translator) is 94%, measured on a large Wikipedia corpus.
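
A coverage figure of this kind can be sketched as follows, assuming that unknown words are marked with a leading asterisk, as Apertium does in its output. The function name and the sample string (including the placeholder unknown word `*xyz`) are invented for the example, and the tokenization is deliberately naive; this is not the procedure actually used for the report.

```python
def coverage(marked_text: str) -> float:
    """Share of whitespace-separated tokens not marked as unknown with '*'."""
    tokens = marked_text.split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    unknown = sum(1 for token in tokens if token.startswith("*"))
    return (len(tokens) - unknown) / len(tokens)


if __name__ == "__main__":
    sample = "su libru meu *xyz"
    print(f"Coverage: {coverage(sample):.0%}")
```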

Text in Catalan (chosen randomly)

L'Acròpoli d'Atenes és l'acròpoli grega més important. L'Acròpoli era, literalment, la “ciutat alta” i estava present a la majoria de ciutats gregues, amb una doble funció: defensiva i com a seu dels principals llocs de culte. L'Acròpoli d'Atenes està situada sobre un turó a uns 165 metres per sobre del nivell de la ciutat. També és coneguda com a Cecròpia en honor del llegendari home serp, Cècrops, rei d'Atenes.

Machine translation to Sardinian

S'Acròpoli de Atene est s'acròpoli grega prus importante. S'Acròpoli fiat, literalmente, sa “tzitade arta” e fiat presente a sa majoria de tzitades gregas, cun una dòpia funtzione: difensora e comente a sede de sos printzipales logos de cultu. S'Acròpoli de Atene est situada subra unu montigru a unos 165 metros in subra de su livellu de sa tzitade. Puru est connota comente a Cecròpia in onore de su legendàriu òmine colovra, Cècrops, re de Atene.

Second phase: Apertium srd-ita (July 29th - August 29th)

In the second phase of the project we worked on preparing a new srd-ita translator. What we could do was start a manual morphological disambiguation of the corpora, which should help the translator recognize the correct morphology of each word, especially in texts that do not follow the standard LSC orthography. We worked on two corpora: a journalistic one, closer to dialect, and another taken directly from literary texts written in perfect LSC. 6,000 words from the first and 11,800 from the second were added.

Nine new transfer rules were added and some of the existing ones were corrected. Tenses are now translated correctly from Sardinian to Italian. The translation of possessives and of some cases of enclitics also improved (Sardinian allows up to three enclitics, whereas Italian allows no more than two).

We also did some work on the Italian dictionary, adding 4 morphological disambiguation rules (a particularly important one disambiguates "sono" between the readings that correspond to Sardinian "so" and "sunt"). Additionally, we added a list of the countries of the world (provided by Diegu Corràine) and derived the corresponding demonyms. Since the beginning of GSoC 2017, 1400 words have been added to the ita-srd bilingual dictionary. Cleaning up the mistakes in the Sardinian dictionary and adding new entries will make it possible to quickly develop a new version of the ita-srd dictionary.

Resources

Future plans

The work of the second phase of the project will be used for the creation of a more accurate and updated version of ita-srd.