Difference between revisions of "Apertium cat-srd and ita-srd/GSoC 2017"

From Apertium
 
(23 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Google Summer of Code 2017 Gianfranco Fronteddu==
== Google Summer of Code 2017 Gianfranco Fronteddu Final report==


===Modified files and Commits===
===Work and Commits===
You can see a full list of commits and modified files [https://apertium.projectjj.com/gsoc2017/gfro3d.html here].
You can see my work including the code and a full list of commits here: https://apertium.projectjj.com/gsoc2017/gfro3d.html.

In this repository you can find apertium cat-srd https://svn.code.sf.net/p/apertium/svn/trunk/apertium-cat-srd


== Student Information ==
== Student Information ==
Line 21: Line 23:


==Description==
==Description==
The project for participation in the Google Summer of Code 2017 program with Apertium was the development of a Ruled-Based Machine Translation between Catalan and Sardinian and the continuation of last year's project, apertium ita-srd. This idea comes from the desire to develop another tool to help Sardinian language, following the same way last year.
The project for participation in the Google Summer of Code 2017 program with Apertium was the development of a Ruled-Based Machine Translation from Catalan to Sardinian and the continuation of last year's project, apertium ita-srd. This idea comes from the desire to develop another tool to help Sardinian language, as done in the last year.


As can be seen in the [http://wiki.apertium.org/wiki/Catalan_and_Sardinian/Work_plan "Work Plan"], there were two phases: one longer, which lasted during June and July, devoted to Catalan Sardinian and the other, shorter, held in August, to start preparing a new Sardinian-Italian translator.
As can be seen in the [http://wiki.apertium.org/wiki/Catalan_and_Sardinian/Work_plan "Work Plan"], there were two phases: one longer, which lasted during June and July, devoted to Catalan Sardinian and another one, shorter, held in August, to start preparing a new Sardinian-Italian translator.

To complete the [http://wiki.apertium.org/wiki/Catalan_and_Sardinian/Work_plan "Work Plan"], it was decided to do something even in the month of August for srd-ita, assuming that the objectives of the first longest phase were achieved on schedule. When we could verify that the results for cat-srd at the end of July were good, we decided to dedicate ourselves to srd-ita.
To complete the [http://wiki.apertium.org/wiki/Catalan_and_Sardinian/Work_plan "Work Plan"], it was decided to do something even in the month of August for improving the translator from Italian to Sardinian, assuming that the objectives of the first longest phase were achieved on schedule. When we could verify that the results for cat-srd at the end of July were good, we spent some time in the translator from Italian to Sardinian.


==First phase: Apertium cat-srd (May, 29th - July, 29th)==
==First phase: Apertium cat-srd (May, 29th - July, 29th)==
The first coding phase lasting until the second evaluation of GSoC concerned Catalan-Sardinian translator. The translator, thanks to the work done by Francis Tyers, was initially in the "staging" section and started from a number of 2645 in the bilingual dictionary, a trimmed coverage of about 77% and a WER error rate of 34.8 %. The goal was to get 90% coverage and to lower the WER to less than 15%.
The first coding phase, lasting until the second GSoC evaluation, concerned Catalan-Sardinian translator. The translator, thanks to the work done by Francis Tyers, was initially in the [https://svn.code.sf.net/p/apertium/svn/staging/apertium-cat-srd/ "staging"] section and had 2645 lemma in the bilingual dictionary, a [http://wiki.apertium.org/wiki/Calculating_coverage Coverage] of about 77% and a [http://wiki.apertium.org/wiki/WER WER] error rate of 34.8 %. The goal was to get 90% coverage and to lower the WER to less than 15%.


Unlike last year, during which to develop the translator it was necessary to develop almost all the morphological dictionary and also improve aspects of the Italian morphological analyzer, this year started from two languages already developed on the Apertium platform. We could focus mainly on transferring from one language to another, speaking of words, morphological and syntactic structures.
Unlike last year, in which it was necessary to develop almost the whole Sardinian morphological dictionary and also to improve aspects of the Italian morphological analyzer, this year we started from two already developed languages on the Apertium platform. We could focus mainly on transferring from one language to another, i.e. words, morphological and syntactic structures.


In observing the dates of the GSoC 2017 program, in May and in the first week of June ("Community Bounding") a great deal of contrasting analysis between Italian and Sardinian has been done to create "pending tests". Also, referring to the "pending test" of ita-srd, structural differences in numerals, possessive forms, duty formulas and continuous tences, past tences, future, conditional and clitic were highlighted.
In observing the dates of the GSoC 2017 program, in May and in the first week of June ("Community Bounding") a great deal of contrasting analysis between Catalan and Sardinian has been done to create "pending tests". Also, referring to the "pending test" of cat-srd, structural differences in numerals, possessive forms, duty formulas and continuous tenses, past tenses, future, conditional and clitic were highlighted.


===Sardinian morphological dictionary===
===Sardinian morphological dictionary===
Regarding the Sardinian morphological dictionary, we already had a dictionary of 51,800 words (including the [http://www.sardegnacultura.it/cds/cros-lsc/CROS] lemmas, developed during the previous [https://apertium.projectjj.com/gsoc2016/gfro3d.html/ GSoC 2016]. There were added 15,500 more words: 1300 nouns, 800 adjectives, 300 adverbs 250 verbs and 12,500 own names. It is, for the most part, scientific and technical terminology, and socio-political vocabulary. Except for proper names, for which other criteria were followed, the selection of the words to be introduced was made by the [https://ca.wikipedia.org/wiki/Portada/Catalan Wikipedia].
Regarding the Sardinian morphological dictionary, we already had a dictionary of 51,800 words (including the [http://www.sardegnacultura.it/cds/cros-lsc/ CROS] lemmas, developed in the previous year. 15,500 more words have been added: c. 1300 nouns, 800 adjectives, 300 adverbs 250 verbs and 12,500 proper names. This is, for the most part, scientific and technical terminology, and socio-political vocabulary. Except for proper names, for which other criteria were followed, the selection of the words to be introduced was made from a [https://ca.wikipedia.org/wiki/Portada/Catalan Wikipedia] corpus.

Then the dictionary was adjusted, removing many words that were not normative and correcting mistakes in the assignment of paradigms (especially speaking of the genre assigned to some nouns).
Then the Sardinian dictionary have been adjusted, removing [https://sc.wikipedia.org/wiki/Limba_Sarda_Comuna non-standard (LSC)] words and correcting mistakes in the assignment of paradigms (especially speaking of the genre assigned to some nouns).


===Catalan morphological dictionary===
===Catalan morphological dictionary===
Also in the Catalan morphological dictionary there was an addition of proper names, almost 10,000.
Also in the Catalan morphological dictionary proper names have been added, almost 10,000.


===Morphological disambiguation in catalan===
===Morphological disambiguation in Catalan===
In Catalan dictionary they were written 15 of morphological disambiguation rules and someone else has been modified.
15 new morphological disambiguation rules for Catalan have been written and a few more have been modified.


===Lexical selection rules===
===Lexical selection rules===
The translator has 274 lexical selection rules. These are rules that choose which of two or more possible translations is most appropriate in a given context.
The translator has 274 lexical selection rules. These are rules that choose between two or more possible translations in a given context.


===Transfer rules===
===Transfer rules===
The translator has 78 transfer rules. These are rules that modify the structure of the sentence in Catalan to fit the structure needed in Sardinian. For example:
The translator has 78 transfer rules. These are rules that modify the structure of the sentence in Catalan to fit the structure needed in Sardinian. For example:


El meu llibre > Su libru meu
<li>El '''meu''' llibre > Su libru '''meu'''
Vaig menjar > Apo mandigadu
He anat > So andadu
Vull saludar-lo > Lu chèrgio saludare


<li>'''Vaig''' menjar > '''Apo''' mandigadu
===Quality===
The quality assessment is used to see how the translator works in practice. There are many ways to do it, and the choice of the texts depends on how the translator will be used: simply, you need to calculate how many words you have to change in order to publish the text.


<li>'''He''' anat > '''So''' andadu
''Word Error Rate (WER)'' is the indicator that indicates the words that need to be changed in order to publish the text. According to the Work Plan, the goal was to get a lower percentage of 15%. The rate of errors in translation is 13.9% (number obtained by WER indicator calculated on texts taken randomly of 600 words from Wikipedia).


<li>Vull saludar-'''lo''' > '''Lu''' chèrgio saludare
''Translator coverage'' (percentage of recognized words) is 94% (number obtained from a large corpus of Wikipedia).


===Quality===
Quality assessment is used to see how the translator works in practice.


''[http://wiki.apertium.org/wiki/WER Word Error Rate (WER)]'' is an indicator that shows the words that need to be changed in order to publish the text. According to the Work Plan, the goal was to get a percentage lower than 15%. The rate of errors in translation is 13.9% (number obtained by the WER indicator calculated on randomly taken texts from Wikipedia - 600 words).
===Text in Catalan (chosen randomly)===

The ''[http://wiki.apertium.org/wiki/Calculating_coverage Coverage]'' (percentage of recognized words) is 94% (number obtained from a large Wikipedia corpus).

====Text in Catalan (chosen randomly)====
L'Acròpoli d'Atenes és l'acròpoli grega més important. L'Acròpoli era, literalment, la “ciutat alta” i estava present a la majoria de ciutats gregues, amb una doble funció: defensiva i com a seu dels principals llocs de culte. L'Acròpoli d'Atenes està situada sobre un turó a uns 165 metres per sobre del nivell de la ciutat. També és coneguda com a Cecròpia en honor del llegendari home serp, Cècrops, rei d'Atenes.
L'Acròpoli d'Atenes és l'acròpoli grega més important. L'Acròpoli era, literalment, la “ciutat alta” i estava present a la majoria de ciutats gregues, amb una doble funció: defensiva i com a seu dels principals llocs de culte. L'Acròpoli d'Atenes està situada sobre un turó a uns 165 metres per sobre del nivell de la ciutat. També és coneguda com a Cecròpia en honor del llegendari home serp, Cècrops, rei d'Atenes.


===Machine-translation to Sardinian===
====Machine translation to Sardinian====
S'Acròpoli de Atene est s'acròpoli grega prus importante. S'Acròpoli fiat, literalmente, sa “tzitade arta” e fiat presente a sa majoria de tzitades gregas, cun una dòpia funtzione: difensora e comente a sede de sos printzipales logos de cultu. S'Acròpoli de Atene est situada subra unu montigru a unos 165 metros in subra de su livellu de sa tzitade. Puru est connota comente a Cecròpia in onore de su legendàriu òmine colovra, Cècrops, re de Atene.
S'Acròpoli de Atene est s'acròpoli grega prus importante. S'Acròpoli fiat, literalmente, sa “tzitade arta” e fiat presente a sa majoria de tzitades gregas, cun una dòpia funtzione: difensora e comente a sede de sos printzipales logos de cultu. S'Acròpoli de Atene est situada subra unu montigru a unos 165 metros in subra de su livellu de sa tzitade. Puru est connota comente a Cecròpia in onore de su legendàriu òmine colovra, Cècrops, re de Atene.


==Second phase: Apertium srd-ita (August, 29th)==
==Second phase: Apertium srd-ita (July, 29th - August, 29th)==
In the second phase of the project we worked in preparation for a new translator srd-ita. What we could do was start a manual morphological disambiguation of the corpora that has to help the translator to recognize the correct morphology of each word, especially in texts which don't respect the standard orthographic LSC.
In the second phase of the project we have put the basis of a new translator from Sardinian to Italian. We started a manual morphological disambiguation of the corpora that will help the translator to recognize the correct morphology of each word.

We have treated two corpora: one journalist and more dialect, and other taken directly from literary texts written in perfect LSC. Of the first, 6000 words were added, of the second 11800.
We have treated two corpora: one journalistic and more dialectal, and other taken directly from literary texts written in model LSC. From the first one, 6000 words were disambiguated, from the second one 11800.


They have been added 9 new transfer rules and correct some of those that were already there. Now tenses from the Sardinian to Italian are translated correctly. It also improved the translation of possessive and is some cases of enclitics (Sardinian there may be up to three enclitics, whereas in Italian can not there be more than two).
9 new transfer rules have been added and we have corrected some previously written. Now verb tenses are translated correctly from Sardinian to Italian. The translation of possessives and enclitics has been also improved (Sardinian has up to three enclitics, whereas in Italian there cannot be more than two).


We did something even in the Italian dictionary, adding 4 rules of morphological disambiguation (a very important was the disambiguation of "sono" as "so" and "sunt"). Additionally, we have added a list of countries in the world (who gave us Diegu Corràine) and we have obtained the corresponding gentiles. Since the beginning of GSoC 2017 1400 words have been added to the bilingual dictionary ita-srd. The cleaning of the Sardinian dictionary from the mistakes and adding new entries will quickly develop a new version of ita-srd dictionary.
By the way, we also improved a bit the Italian morphological analyzer, adding 4 morphological disambiguation rules (for disambiguating "sono" as "so" or "sunt"). Additionally, we have added a list of countries in the world (given by Diegu Corràine) and we have obtained the corresponding denonyms. Since the beginning of GSoC 2017 1400 words have been added to the bilingual dictionary ita-srd. The cleaning of the Sardinian dictionary from the mistakes and adding new entries will permit to create soon a new version of ita-srd translator.


==Resources==
==Resources==


* [http://www.sardegnacultura.it/cds/cros-lsc/ Correctore ortogràficu LSC]
* [http://www.sardegnacultura.it/cds/cros-lsc/ Curretore ortogràficu LSC]
* [http://www.sardegnacultura.it/documenti/7_81_20080107092727.pdf Dizionario universale della lingua di Sardegna Italiano-Sardo-Italiano, Edes, 2006, Cagliari]
* [http://www.sardegnacultura.it/documenti/7_81_20080107092727.pdf Dizionario universale della lingua di Sardegna Italiano-Sardo-Italiano, Edes, 2006, Cagliari]
* [http://www.sardegnacultura.it/documenti/7_25_20060427093224.pdf Normativa ortografica Limba Sarda Comuna]
* [http://www.sardegnacultura.it/documenti/7_25_20060427093224.pdf Normativa ortografica Limba Sarda Comuna]

Latest revision as of 15:56, 2 September 2017

Google Summer of Code 2017 Gianfranco Fronteddu Final report

Work and Commits

You can see my work including the code and a full list of commits here: https://apertium.projectjj.com/gsoc2017/gfro3d.html.

The apertium cat-srd code can be found in this repository: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-cat-srd

Student Information

Name: Gianfranco Fronteddu

Location: Casteddu, Sardigna

E-mail: gfro3d@gmail.com

IRC: gianfranco

SourceForge: gfro3d

Telegram: gianfro4moros

Skype: gianfranco.fronteddu88

Description

The project for the Google Summer of Code 2017 program with Apertium was the development of a rule-based machine translation system from Catalan to Sardinian and the continuation of last year's project, apertium ita-srd. The idea comes from the desire to develop another tool to support the Sardinian language, as was done last year.

As can be seen in the "Work Plan", there were two phases: a longer one, in June and July, devoted to Catalan-Sardinian, and a shorter one, in August, to start preparing a new Sardinian-Italian translator.

To complete the "Work Plan", it was decided to also spend part of August improving the translator from Italian to Sardinian, provided that the objectives of the first, longer phase were achieved on schedule. Once we had verified that the results for cat-srd at the end of July were good, we spent some time on the translator from Italian to Sardinian.

First phase: Apertium cat-srd (May 29th - July 29th)

The first coding phase, lasting until the second GSoC evaluation, concerned the Catalan-Sardinian translator. The translator, thanks to the work done by Francis Tyers, was initially in the "staging" section and had 2645 lemmas in the bilingual dictionary, a coverage of about 77% and a word error rate (WER) of 34.8%. The goal was to reach 90% coverage and to lower the WER to less than 15%.

Unlike last year, when it was necessary to develop almost the whole Sardinian morphological dictionary and also to improve aspects of the Italian morphological analyzer, this year we started from two languages already developed on the Apertium platform. We could therefore focus mainly on the transfer from one language to the other: words, morphological structures and syntactic structures.

In keeping with the dates of the GSoC 2017 program, in May and in the first week of June ("Community Bonding") a great deal of contrastive analysis between Catalan and Sardinian was done to create "pending tests". With reference to the cat-srd "pending tests", structural differences were highlighted in numerals, possessive forms, obligation constructions, continuous tenses, past tenses, the future, the conditional and clitics.

Sardinian morphological dictionary

Regarding the Sardinian morphological dictionary, we already had a dictionary of 51,800 words (including the CROS lemmas), developed in the previous year. 15,500 more words have been added: c. 1300 nouns, 800 adjectives, 300 adverbs, 250 verbs and 12,500 proper names. These are, for the most part, scientific and technical terminology and socio-political vocabulary. Except for proper names, for which other criteria were followed, the words to be added were selected from a Wikipedia corpus.

Then the Sardinian dictionary was cleaned up, removing non-standard words (those not conforming to LSC) and correcting mistakes in the assignment of paradigms (especially the gender assigned to some nouns).

Catalan morphological dictionary

Proper names have also been added to the Catalan morphological dictionary, almost 10,000 of them.

Morphological disambiguation in Catalan

15 new morphological disambiguation rules for Catalan have been written and a few more have been modified.

Lexical selection rules

The translator has 274 lexical selection rules. These are rules that choose between two or more possible translations in a given context.

Transfer rules

The translator has 78 transfer rules. These are rules that modify the structure of the sentence in Catalan to fit the structure needed in Sardinian. For example:

  • El meu llibre > Su libru meu
  • Vaig menjar > Apo mandigadu
  • He anat > So andadu
  • Vull saludar-lo > Lu chèrgio saludare
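
The first example in the list (possessive moved after the noun) can be illustrated with a toy sketch. This is not how the translator is implemented: real Apertium transfer rules are written in XML and operate on morphologically analysed lexical units. The mini-lexicon, the tag names and the function `transfer` are hypothetical and were built only from the example "El meu llibre > Su libru meu".

```python
# Toy illustration only; not the actual cat-srd transfer rules.

# Hypothetical mini-lexicon: Catalan form -> (Sardinian form, coarse tag).
# The three entries come from the example "El meu llibre > Su libru meu".
LEXICON = {
    "el": ("su", "det"),
    "meu": ("meu", "poss"),
    "llibre": ("libru", "noun"),
}


def transfer(words):
    """Translate word by word, then move a possessive after its noun."""
    units = [LEXICON[w.lower()] for w in words]
    out = []
    i = 0
    while i < len(units):
        tags = [tag for _, tag in units[i:i + 3]]
        if tags == ["det", "poss", "noun"]:
            # Pattern matched: det + poss + noun -> det + noun + poss
            det, poss, noun = (form for form, _ in units[i:i + 3])
            out.extend([det, noun, poss])
            i += 3
        else:
            # No pattern matched: copy the translated word unchanged
            out.append(units[i][0])
            i += 1
    return " ".join(out)


print(transfer(["El", "meu", "llibre"]))
```

The sketch only shows the general idea of a transfer rule: match a pattern of categories and emit the target words in the order Sardinian requires.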

Quality

Quality assessment is used to see how the translator works in practice.

Word Error Rate (WER) is an indicator of the proportion of words that need to be changed in order to publish the text. According to the Work Plan, the goal was a rate lower than 15%. The measured error rate is 13.9% (WER calculated on 600 words of randomly chosen Wikipedia texts).
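
As a minimal sketch of how such a figure can be computed, the function below takes WER to be the word-level edit distance between the raw machine translation and its post-edited version, divided by the length of the post-edited text. This definition and the name `word_error_rate` are assumptions made for illustration; the report does not say which tool produced the 13.9% figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    reference:  the post-edited (publishable) translation
    hypothesis: the raw machine translation
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing in hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one missing word against a 4-word reference
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 13.9% on a 600-word sample corresponds to about 83 word edits.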

Coverage (the percentage of recognized words) is 94% (measured on a large Wikipedia corpus).
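
Coverage can be sketched in the same spirit. The example assumes analyser output in the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word has an analysis that starts with `*`. The function name `coverage` and the sample string (including its tags) are illustrative assumptions, not output from the cat-srd analyser.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed):
    """Fraction of lexical units that received at least one analysis."""
    units = UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found")
    known = 0
    for unit in units:
        parts = unit.split("/", 1)
        # An unknown word is assumed to look like ^xyzzy/*xyzzy$
        if len(parts) == 2 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Illustrative input: three analysed words and one unknown word
sample = ("^la/el<det><def><f><sg>$ ^ciutat/ciutat<n><f><sg>$ "
          "^xyzzy/*xyzzy$ ^alta/alt<adj><f><sg>$")
print(coverage(sample))
```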

Text in Catalan (chosen randomly)

L'Acròpoli d'Atenes és l'acròpoli grega més important. L'Acròpoli era, literalment, la “ciutat alta” i estava present a la majoria de ciutats gregues, amb una doble funció: defensiva i com a seu dels principals llocs de culte. L'Acròpoli d'Atenes està situada sobre un turó a uns 165 metres per sobre del nivell de la ciutat. També és coneguda com a Cecròpia en honor del llegendari home serp, Cècrops, rei d'Atenes.

Machine translation to Sardinian

S'Acròpoli de Atene est s'acròpoli grega prus importante. S'Acròpoli fiat, literalmente, sa “tzitade arta” e fiat presente a sa majoria de tzitades gregas, cun una dòpia funtzione: difensora e comente a sede de sos printzipales logos de cultu. S'Acròpoli de Atene est situada subra unu montigru a unos 165 metros in subra de su livellu de sa tzitade. Puru est connota comente a Cecròpia in onore de su legendàriu òmine colovra, Cècrops, re de Atene.

Second phase: Apertium srd-ita (July 29th - August 29th)

In the second phase of the project we laid the foundations for a new translator from Sardinian to Italian. We started a manual morphological disambiguation of corpora, which will help the translator recognize the correct morphology of each word.

We have treated two corpora: one journalistic and more dialectal, the other taken directly from literary texts written in model LSC. In the first, 6000 words were disambiguated; in the second, 11,800.

9 new transfer rules have been added and some existing ones have been corrected. Verb tenses are now translated correctly from Sardinian to Italian. The translation of possessives and enclitics has also been improved (Sardinian allows up to three enclitics, whereas Italian allows no more than two).

We also slightly improved the Italian morphological analyzer, adding 4 morphological disambiguation rules (for disambiguating "sono" as "so" or "sunt"). Additionally, we have added a list of the countries of the world (provided by Diegu Corràine) and derived the corresponding demonyms. Since the beginning of GSoC 2017, 1400 words have been added to the ita-srd bilingual dictionary. Cleaning the Sardinian dictionary of mistakes and adding new entries will soon make it possible to release a new version of the ita-srd translator.

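Resources

  • Curretore ortogràficu LSC (http://www.sardegnacultura.it/cds/cros-lsc/)
  • Dizionario universale della lingua di Sardegna Italiano-Sardo-Italiano, Edes, 2006, Cagliari (http://www.sardegnacultura.it/documenti/7_81_20080107092727.pdf)
  • Normativa ortografica Limba Sarda Comuna (http://www.sardegnacultura.it/documenti/7_25_20060427093224.pdf)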

Future plans

The work of the second phase of the project will be used to create a more accurate and up-to-date version of ita-srd.