Apertium cat-srd and ita-srd/GSoC 2017

Google Summer of Code 2017: final report by Gianfranco Fronteddu

Work and Commits

My work, including the code and a full list of commits, is available here: https://apertium.projectjj.com/gsoc2017/gfro3d.html

Student Information

Name: Gianfranco Fronteddu

Location: Casteddu, Sardigna

E-mail: gfro3d@gmail.com

IRC: gianfranco

SourceForge: gfro3d

Telegram: gianfro4moros

Skype: gianfranco.fronteddu88

Description

The project for Google Summer of Code 2017 with Apertium was the development of a rule-based machine translation system from Catalan to Sardinian, together with the continuation of last year's project, apertium ita-srd. The idea came from the desire to build another tool to support the Sardinian language, as was done last year.

As can be seen in the "Work Plan", there were two phases: a longer one, running through June and July, devoted to Catalan-Sardinian, and a shorter one, in August, to start preparing a new Sardinian-Italian translator.

To complete the "Work Plan", it was also decided to spend part of August improving the Italian-to-Sardinian translator, provided that the objectives of the first, longer phase were achieved on schedule. Once we had verified that the cat-srd results at the end of July were good, we spent some time on the Italian-to-Sardinian translator.

First phase: Apertium cat-srd (May, 29th - July, 29th)

The first coding phase, which lasted until the second GSoC evaluation, concerned the Catalan-Sardinian translator. Thanks to the work done by Francis Tyers, the translator was already in the "staging" section and started with 2645 entries in the bilingual dictionary, a coverage of about 77% and a word error rate (WER) of 34.8%. The goal was to reach 90% coverage and to bring the WER below 15%.

Unlike last year, when developing the translator required building almost the whole morphological dictionary and also improving parts of the Italian morphological analyzer, this year we started from two languages already developed on the Apertium platform. We could therefore focus mainly on the transfer from one language to the other: words, morphology and syntactic structures.

Following the GSoC 2017 timeline, in May and in the first week of June ("Community Bonding") a great deal of contrastive analysis between Italian and Sardinian was done in order to create "pending tests". In addition, drawing on the "pending tests" of ita-srd, structural differences were highlighted in numerals, possessive forms, expressions of obligation, continuous tenses, past tenses, the future, the conditional and clitics.

Sardinian morphological dictionary

For Sardinian, we already had a morphological dictionary of 51,800 words (including the CROS lemmas, developed previously). 15,500 more words were added: 1300 nouns, 800 adjectives, 300 adverbs, 250 verbs and 12,500 proper nouns. For the most part, this is scientific and technical terminology and socio-political vocabulary. Except for proper nouns, for which other criteria were followed, the words to be introduced were selected from Wikipedia. The dictionary was then cleaned up by removing many words that were not normative and by correcting mistakes in the assignment of paradigms (especially the gender assigned to some nouns).

Catalan morphological dictionary

Proper nouns were also added to the Catalan morphological dictionary, almost 10,000 of them.

Morphological disambiguation in Catalan

For Catalan, 15 morphological disambiguation rules were written and some existing ones were modified.

Lexical selection rules

The translator has 274 lexical selection rules. These are rules that choose which of two or more possible translations is most appropriate in a given context.

Transfer rules

The translator has 78 transfer rules. These are rules that modify the structure of the sentence in Catalan to fit the structure needed in Sardinian. For example:

El meu llibre > Su libru meu

Vaig menjar > Apo mandigadu

He anat > So andadu

Vull saludar-lo > Lu chèrgio saludare
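The first example above (El meu llibre > Su libru meu) can be sketched in Python as a toy reordering rule. This is only an illustration of the idea, not Apertium's transfer rule format: the function name `postpose_possessive` and the simplified tags `det`, `pos` and `n` are assumptions made for this sketch, and the words are assumed to have already been translated by the bilingual dictionary.

```python
from typing import List, Tuple

# A token is a (word, tag) pair. The tags are simplified, hypothetical labels:
# "det" = determiner, "pos" = possessive, "n" = noun.
Token = Tuple[str, str]


def postpose_possessive(tokens: List[Token]) -> List[Token]:
    """Rewrite every det + pos + n sequence as det + n + pos."""
    result: List[Token] = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if [tag for _, tag in window] == ["det", "pos", "n"]:
            det, pos, noun = window
            # Sardinian places the possessive after the noun.
            result.extend([det, noun, pos])
            i += 3
        else:
            result.append(tokens[i])
            i += 1
    return result


# Words already replaced by their Sardinian equivalents, still in Catalan order.
phrase = [("su", "det"), ("meu", "pos"), ("libru", "n")]
print(" ".join(word for word, _ in postpose_possessive(phrase)))
```

A real transfer rule also handles agreement and chunking, which this sketch leaves out.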

Quality

The quality assessment shows how the translator works in practice. There are many ways to do it, and the choice of texts depends on how the translator will be used. Put simply, you count how many words have to be changed before the translated text can be published.

Word Error Rate (WER) is the indicator that measures the proportion of words that need to be changed before the text can be published. According to the Work Plan, the goal was a WER below 15%. The error rate of the translator is 13.9% (WER calculated on randomly chosen texts of 600 words from Wikipedia).

Coverage (the percentage of recognized words) is 94% (measured on a large Wikipedia corpus).
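As a rough illustration of how such a figure can be computed, the sketch below calculates WER as the word-level edit distance between the machine translation and its post-edited version, divided by the length of the post-edited text. It is a minimal sketch and not the evaluation script actually used in the project; the function name `word_error_rate` and the placeholder sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference words
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Placeholder tokens: one substitution and one missing word over 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Here the reference is the post-edited text and the hypothesis is the raw output of the translator.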
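Coverage can be estimated from the output of the morphological analyzer, where Apertium marks unknown words with an asterisk (for example `^xyzzy/*xyzzy$`). The following is a minimal sketch under that assumption; the function name `coverage`, the sample string and its tags are illustrative only, and escaped characters in the stream format are not handled.

```python
import re

# One lexical unit in the analyzer output: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysis: str) -> float:
    """Fraction of lexical units whose analysis is not marked unknown ('*')."""
    units = LEXICAL_UNIT.findall(analysis)
    if not units:
        raise ValueError("no lexical units found")
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyzer output; the tags shown are assumptions for this sketch.
sample = (
    "^la/el<det><def><f><sg>$ ^ciutat/ciutat<n><f><sg>$ "
    "^xyzzy/*xyzzy$ ^alta/alt<adj><f><sg>$"
)
print(coverage(sample))  # 0.75
```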

Text in Catalan (chosen randomly)

L'Acròpoli d'Atenes és l'acròpoli grega més important. L'Acròpoli era, literalment, la “ciutat alta” i estava present a la majoria de ciutats gregues, amb una doble funció: defensiva i com a seu dels principals llocs de culte. L'Acròpoli d'Atenes està situada sobre un turó a uns 165 metres per sobre del nivell de la ciutat. També és coneguda com a Cecròpia en honor del llegendari home serp, Cècrops, rei d'Atenes.

Machine translation into Sardinian

S'Acròpoli de Atene est s'acròpoli grega prus importante. S'Acròpoli fiat, literalmente, sa “tzitade arta” e fiat presente a sa majoria de tzitades gregas, cun una dòpia funtzione: difensora e comente a sede de sos printzipales logos de cultu. S'Acròpoli de Atene est situada subra unu montigru a unos 165 metros in subra de su livellu de sa tzitade. Puru est connota comente a Cecròpia in onore de su legendàriu òmine colovra, Cècrops, re de Atene.

Second phase: Apertium srd-ita (July, 29th - August, 29th)

In the second phase of the project we worked on preparing a new srd-ita translator. What we could do was start a manual morphological disambiguation of the corpora, which will help the translator recognize the correct morphology of each word, especially in texts that do not follow the LSC orthographic standard. We treated two corpora: one journalistic and closer to dialect, the other taken directly from literary texts written in perfect LSC. 6000 words from the first and 11,800 from the second were added.

Nine new transfer rules were added and some of the existing ones were corrected. Tenses are now translated correctly from Sardinian to Italian. The translation of possessives and of some cases of enclitics has also improved (Sardinian allows up to three enclitics, whereas Italian allows no more than two).

We also worked on the Italian dictionary, adding 4 morphological disambiguation rules (a very important one disambiguates "sono" as "so" or "sunt"). Additionally, we added a list of the countries of the world (provided by Diegu Corràine) and derived the corresponding demonyms. Since the beginning of GSoC 2017, 1400 words have been added to the ita-srd bilingual dictionary. Cleaning the Sardinian dictionary of mistakes and adding new entries will make it possible to quickly develop a new version of the ita-srd dictionary.

Resources

Future plans

The work of the second phase of the project will be used for the creation of a more accurate and updated version of ita-srd.