Apertium cat-srd and ita-srd/GSoC 2017
Latest revision as of 15:56, 2 September 2017
Google Summer of Code 2017 Gianfranco Fronteddu Final report
Work and Commits
You can see my work, including the code and a full list of commits, here: https://apertium.projectjj.com/gsoc2017/gfro3d.html.
The apertium-cat-srd repository is at https://svn.code.sf.net/p/apertium/svn/trunk/apertium-cat-srd
Student Information
Name: Gianfranco Fronteddu
Location: Casteddu, Sardigna
E-mail: gfro3d@gmail.com
IRC: gianfranco
SourceForge: gfro3d
Telegram: gianfro4moros
Skype: gianfranco.fronteddu88
Description
The project for the Google Summer of Code 2017 program with Apertium was the development of a rule-based machine translation system from Catalan to Sardinian, together with the continuation of last year's project, apertium ita-srd. The idea comes from the desire to develop another tool to support the Sardinian language, as was done last year.
As can be seen in the "Work Plan" (http://wiki.apertium.org/wiki/Catalan_and_Sardinian/Work_plan), there were two phases: a longer one, lasting through June and July, devoted to Catalan-Sardinian, and a shorter one, held in August, to start preparing a new Sardinian-Italian translator.
To round out the "Work Plan", we decided to use August to improve the translator from Italian to Sardinian as well, provided that the objectives of the first, longer phase were met on schedule. Once we had verified at the end of July that the results for cat-srd were good, we spent some time on the translator from Italian to Sardinian.
First phase: Apertium cat-srd (May 29 - July 29)
The first coding phase, lasting until the second GSoC evaluation, concerned the Catalan-Sardinian translator. Thanks to the work done by Francis Tyers, the translator was initially in the "staging" section (https://svn.code.sf.net/p/apertium/svn/staging/apertium-cat-srd/) and had 2,645 lemmas in the bilingual dictionary, a coverage (http://wiki.apertium.org/wiki/Calculating_coverage) of about 77% and a word error rate (WER, http://wiki.apertium.org/wiki/WER) of 34.8%. The goal was to reach 90% coverage and to lower the WER to less than 15%.
Unlike last year, when it was necessary to develop almost the whole Sardinian morphological dictionary and also to improve aspects of the Italian morphological analyzer, this year we started from two languages already developed on the Apertium platform. We could therefore focus mainly on transfer from one language to the other, i.e. on words and on morphological and syntactic structures.
Following the GSoC 2017 timeline, in May and in the first week of June (the "Community Bonding" period) a great deal of contrastive analysis between Catalan and Sardinian was done to create "pending tests". The cat-srd pending tests highlighted structural differences in numerals, possessive forms, expressions of obligation, continuous tenses, past tenses, the future, the conditional and clitics.
Sardinian morphological dictionary
For the Sardinian morphological dictionary, we already had a dictionary of 51,800 words (including the CROS lemmas, http://www.sardegnacultura.it/cds/cros-lsc/), developed in the previous year. About 15,500 more words have been added: c. 1,300 nouns, 800 adjectives, 300 adverbs, 250 verbs and 12,500 proper names. This is, for the most part, scientific and technical terminology and socio-political vocabulary. Except for proper names, for which other criteria were followed, the words to be introduced were selected from a Wikipedia corpus.
The Sardinian dictionary was then cleaned up by removing words that do not conform to the LSC standard (https://sc.wikipedia.org/wiki/Limba_Sarda_Comuna) and by correcting mistakes in the assignment of paradigms (especially the gender assigned to some nouns).
Catalan morphological dictionary
Proper names have also been added to the Catalan morphological dictionary, almost 10,000 in total.
Morphological disambiguation in Catalan
Fifteen new morphological disambiguation rules for Catalan have been written and a few existing ones have been modified.
Lexical selection rules
The translator has 274 lexical selection rules. These are rules that choose between two or more possible translations in a given context.
Transfer rules
The translator has 78 transfer rules. These are rules that modify the structure of the Catalan sentence to fit the structure needed in Sardinian. For example, the possessive moves after the noun:
El meu llibre > Su libru meu
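The reordering behind a possessive rule of this kind (Catalan "El meu llibre" becoming Sardinian "Su libru meu") can be illustrated with a small Python sketch. This is not the project's actual rule: real Apertium transfer rules are written in XML transfer files, and the tag names (`det`, `pos`, `n`), the `BIDIX` table and the function name below are assumptions made only for this illustration. The three word pairs come from the example in this report.

```python
# Toy word-for-word table; the three pairs are taken from the example
# "El meu llibre > Su libru meu". Everything else here is hypothetical.
BIDIX = {"el": "su", "meu": "meu", "llibre": "libru"}


def transfer_possessive(tagged):
    """Reorder determiner + possessive + noun into determiner + noun + possessive.

    `tagged` is a list of (word, tag) pairs. Words outside the pattern are
    passed through unchanged, and words missing from BIDIX are left as they are.
    """
    out = []
    i = 0
    while i < len(tagged):
        window_tags = [tag for _, tag in tagged[i:i + 3]]
        if window_tags == ["det", "pos", "n"]:
            det, pos, noun = tagged[i:i + 3]
            out.extend([det, noun, pos])  # the possessive moves after the noun
            i += 3
        else:
            out.append(tagged[i])
            i += 1
    return [(BIDIX.get(word, word), tag) for word, tag in out]


catalan = [("el", "det"), ("meu", "pos"), ("llibre", "n")]
sardinian = transfer_possessive(catalan)
print(" ".join(word for word, _ in sardinian))  # su libru meu
```

The sketch matches on a fixed window of three tags, which is roughly how a pattern-action rule works: the pattern identifies the structure, and the action emits the same words in the target order.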
Quality
Quality assessment is used to see how the translator works in practice.
Word Error Rate (WER, http://wiki.apertium.org/wiki/WER) is an indicator of the proportion of words that need to be changed before the text can be published. According to the Work Plan, the goal was to get below 15%. The measured error rate is 13.9%, calculated with the WER indicator on randomly chosen Wikipedia texts (600 words).
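As a rough illustration of how such a figure is obtained, WER is commonly computed as the word-level edit distance (substitutions, insertions and deletions) between the machine output and a post-edited reference, divided by the number of words in the reference. The sketch below assumes that definition; it is not the evaluation script used in the project, and the function name and placeholder tokens are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution and one missing word against 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition, and assuming the 600 words are the reference length, a WER of 13.9% corresponds to roughly 83 word edits.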
Coverage (the percentage of recognized words) is 94%, measured on a large Wikipedia corpus, which is above the 90% goal.
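A simple way to picture the coverage figure: in the Apertium stream format, each analysed token appears as `^surface/analysis$`, and a word the analyser does not know is marked with an asterisk, as in `^word/*word$`. The sketch below counts the share of tokens without that mark. It is only an illustration under those assumptions: the function name is hypothetical, the tags in the sample are invented, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit: ^surface/analyses$ (escaped characters are ignored here).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of lexical units that received a real analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found")
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Invented sample: two recognized words and one unknown word.
sample = "^el/el<det>$ ^meu/meu<pos>$ ^xyzzy/*xyzzy$"
print(f"{coverage(sample):.1%}")  # 66.7%
```

In practice the analysed text would be the output of the morphological analyser run over the whole corpus, so punctuation and numbers are counted as tokens too unless they are filtered out first.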
Text in Catalan (chosen randomly)
L'Acròpoli d'Atenes és l'acròpoli grega més important. L'Acròpoli era, literalment, la “ciutat alta” i estava present a la majoria de ciutats gregues, amb una doble funció: defensiva i com a seu dels principals llocs de culte. L'Acròpoli d'Atenes està situada sobre un turó a uns 165 metres per sobre del nivell de la ciutat. També és coneguda com a Cecròpia en honor del llegendari home serp, Cècrops, rei d'Atenes.
Machine translation to Sardinian
S'Acròpoli de Atene est s'acròpoli grega prus importante. S'Acròpoli fiat, literalmente, sa “tzitade arta” e fiat presente a sa majoria de tzitades gregas, cun una dòpia funtzione: difensora e comente a sede de sos printzipales logos de cultu. S'Acròpoli de Atene est situada subra unu montigru a unos 165 metros in subra de su livellu de sa tzitade. Puru est connota comente a Cecròpia in onore de su legendàriu òmine colovra, Cècrops, re de Atene.
Second phase: Apertium srd-ita (July 29 - August 29)
In the second phase of the project we laid the foundations for a new translator from Sardinian to Italian. We started a manual morphological disambiguation of the corpora, which will help the translator recognize the correct morphology of each word.
We have treated two corpora: one journalistic and more dialectal, and another taken directly from literary texts written in the LSC model. From the first one, 6,000 words were disambiguated; from the second one, 11,800.
Nine new transfer rules have been added and some previously written ones have been corrected. Verb tenses are now translated correctly from Sardinian to Italian. The translation of possessives and enclitics has also been improved (Sardinian has up to three enclitics, whereas in Italian there cannot be more than two).
We also made some improvements to the Italian morphological analyzer, adding 4 morphological disambiguation rules (for disambiguating "sono" as "so" or "sunt"). Additionally, we have added a list of the countries of the world (provided by Diegu Corràine) and derived the corresponding demonyms. Since the beginning of GSoC 2017, 1,400 words have been added to the ita-srd bilingual dictionary. Cleaning the mistakes out of the Sardinian dictionary and adding new entries will soon make it possible to create a new version of the ita-srd translator.
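Resources
- Curretore ortogràficu LSC (http://www.sardegnacultura.it/cds/cros-lsc/)
- Dizionario universale della lingua di Sardegna Italiano-Sardo-Italiano, Edes, 2006, Cagliari (http://www.sardegnacultura.it/documenti/7_81_20080107092727.pdf)
- Normativa ortografica Limba Sarda Comuna (http://www.sardegnacultura.it/documenti/7_25_20060427093224.pdf)
- Analitzadore hunspell
- Glossàriu italianu-sardu
- Limbanaztiones.com (http://limbasnatziones.tempusnostru.it/home.page/)
- Vocabolario Sardo Logudorese-Italiano (http://vocabolariocasu.isresardegna.it/)
- Institut d'Estudis Catalans: Diccionari de la llengua catalana (http://dlc.iec.cat/)
- Sa Gazeta (http://www.sagazeta.info/)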
Future plans
The work of the second phase of the project will be used for the creation of a more accurate and updated version of ita-srd.