Difference between revisions of "Sardinian and Italian/Final Report"

From Apertium
==Commit==
The following link opens a page that gives access to the skeleton of the translator produced in this project, together with the timeline of commits made by Gianfranco Fronteddu and his mentors, Hèctor Alòs i Font and Francis Tyers, over the course of the project, in line with the schedule and deadlines of the Google Summer of Code program.
https://apertium.projectjj.com/gsoc2016/gfro3d.html


==Description==
This project aims to create a rule-based machine translation engine from Italian to Sardinian. It is a collaboration between the Autonomous University of Barcelona and Prompsit, funded by Google through the ''Google Summer of Code'' program.
The characteristics of Sardinian make a rule-based system particularly suitable, for two reasons. First, it is a language in the process of standardization, so both linguistic resources (written documents and reference works) and technological ones (corpora, publishing products) are scarce. Second, the lack of texts written according to the spelling and vocabulary rules of the new standard form (Limba Sarda Comuna) makes it necessary to opt for a rule-based machine translation system.
Apertium, which is based on transfer rules and dictionaries written in a markup language, is well suited to translating between languages of the same family (here, the Romance languages), such as Sardinian and Italian. This work lays the foundation for translating, in the near future, other language pairs such as Sardinian-Catalan and Sardinian-Spanish.

==Sardinian Language==
Sardinian is a neo-Latin language spoken in Sardinia, which has an area of 24,100 km<sup>2</sup> and is the second largest island in the Mediterranean Sea. Sardinian has about a million speakers. It has developed in ways that have given it its own distinctive characteristics. However, the successive presence of various peoples over the centuries means that Sardinian still shows the influence of languages such as Catalan, Spanish and Italian. Recently, it has been recognized by UNESCO as an endangered minority language. Given the great linguistic fragmentation of the language, it was decided to use the proposed orthographic standard LSC (Limba Sarda Comuna), created and recognized by the Autonomous Region of Sardinia in 2006. During the "Coding Challenge", held in March and April, the skeleton of the new Sardinian dictionary was created from the existing Italian dictionary: part of the vocabulary was imported, and morphological information on how the words inflect (paradigms) was added.
To build the new Sardinian dictionary it was necessary to draw on various resources available on the web. For lexical analysis and contrastive selection, it proved invaluable to build corpora of texts written in the LSC variant, taken from online magazines such as "Limbanatziones", "Sa Gazeta" and "Sa limba sarda", as well as from the Sardinian-language Wikipedia. Particularly useful was CROS (the regional online spell checker, "Curretore ortogràficu sardu"), which, besides acting as a spell checker, provided a lexically consistent database in LSC and a valid model for creating and assigning paradigms.

==Resources==


* [http://www.sardegnacultura.it/cds/cros-lsc/ Correctore ortogràficu LSC]
* [http://www.sardegnacultura.it/documenti/7_81_20080107092727.pdf Dizionario universale della lingua di Sardegna Italiano-Sardo-Italiano, Edes, 2006, Cagliari]
* [http://www.sardegnacultura.it/documenti/7_25_20060427093224.pdf Normativa ortografica Limba Sarda Comuna]
* [http://www.sardegnacultura.it/cds/cros-lsc/cros.oxt Analitzadore hunspell]
* [http://www.sardegnacultura.it/documenti/7_108_20090205130512.pdf Italian-Sardinian glossary]
* [http://limbasnatziones.tempusnostru.it/home.page/ Limbanaztiones.com]
* [http://vocabolariocasu.isresardegna.it/ Vocabolario Sardo Logudorese-Italiano]

==Italian Language==
As for Italian, Apertium already had an Italian dictionary, but it underwent revision and updating. A great deal of finishing work was needed on the closed categories, along with the creation and reassignment of some paradigms, especially verbal ones. A particularly significant contribution came from Prompsit, specifically from Gema Ramírez-Sánchez and Marina Loffredo, who happened to be working at the same time on the Italian-Spanish translator and were able to develop and deliver, in July and August, a morphological disambiguation system for Italian. We contributed to its development by adding 30 disambiguation rules.

==Bilingual dictionary==
To compile the bilingual dictionary, several dictionaries were consulted, including Antonino Rubattu's Universal Dictionary Italian-Sardinian and Mario Casu's Logudorese-Italian vocabulary, together with an in-depth analysis of parallel corpora, which allowed us to determine, case by case, which translation had the greatest number of occurrences.
The goal was to reach at least 20,000 entries. Currently, the dictionary has 25,484 entries, an achievement of which we are proud.

==Lexical selection rules==
In the last phase, lexical selection was carried out so that the most commonly used terms are preferred: 1127 translation options were marked as "not preferable" in the dictionary, and 35 bilingual lexical selection rules were created.
To determine which terms were most in use, it was necessary to consult the corpora mentioned above and run small statistical analyses based on the number of occurrences of each lemma.
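The occurrence counts described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the project: the function names are hypothetical, the sample words are placeholders rather than Sardinian vocabulary, and it counts surface forms, whereas a real analysis would presumably lemmatize the corpus first.

```python
import re
from collections import Counter


def count_candidates(corpus_text, candidates):
    """Count whole-word occurrences of each candidate translation in a corpus.

    Assumption: plain surface-form matching, case-insensitive.
    """
    tokens = re.findall(r"\w+", corpus_text.lower())
    freq = Counter(tokens)
    return {cand: freq[cand.lower()] for cand in candidates}


def split_preferred(corpus_text, candidates):
    """Return (preferred, not_preferable) based on corpus frequency.

    The most frequent candidate is preferred; ties keep the input order
    because Python's sort is stable.
    """
    counts = count_candidates(corpus_text, candidates)
    ranked = sorted(candidates, key=lambda cand: counts[cand], reverse=True)
    return ranked[0], ranked[1:]


if __name__ == "__main__":
    # Placeholder corpus and candidates, for illustration only.
    corpus = "alfa beta alfa gamma alfa beta"
    print(count_candidates(corpus, ["beta", "alfa", "delta"]))
    print(split_preferred(corpus, ["beta", "alfa", "delta"]))
```

In this sketch, the candidates returned in the second position correspond to the translation options that would be marked as "not preferable".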

==Transfer rules==
The first phase in creating the transfer rules was compiling the "[[Sardo_e_italiano/Pending_tests|pending tests]]", in which a contrastive analysis brought out the structural differences between Italian and Sardinian. Among these, the ones that required most attention concerned, for example, Sardinian verbs, which in the [[Sardo_e_italiano/Regression_tests#Cunditzionale|conditional]], the [[Sardo_e_italiano/Regression_tests#Passatu_remotu|remote past]] and the [[Sardo_e_italiano/Regression_tests#Futuru|future indicative]] differ from Italian ones mainly in their greater use of auxiliaries and periphrasis (for example: "I will do" → "deo apo a fàghere"; "I would do" → "deo dia fàghere"). Another interesting case was that of [[Sardo_e_italiano/Pending_tests#N.C3.B9meros_ordinales|ordinal numbers]], which in Italian are expressed by a single word, while Sardinian uses the formula "su de .." ("third" → "su de tres"). Here problems arose in translating cases where [[Sardo_e_italiano/Pending_tests#Possessivos|possessive adjectives]] appear alongside ordinal numbers, especially regarding word order within the phrase ("My third home." → "Sa de tres de sas domos meas.").

The final result was the creation of 89 transfer rules.
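The two structural differences discussed in this section can be illustrated with a minimal Python sketch. It is not how Apertium transfer rules are written (those are XML rule files) and it is not taken from the project's code. The function names are hypothetical, and it assumes the Sardinian forms "deo apo a fàghere" (future), "deo dia fàghere" (conditional) and "su/sa de tres" (ordinal), covering only the first person singular.

```python
# Periphrastic patterns, first person singular only.
# Assumption: auxiliary forms "apo a" (future) and "dia" (conditional).
PERIPHRASIS_1SG = {
    "future": "apo a {inf}",
    "conditional": "dia {inf}",
}


def periphrastic_1sg(infinitive, tense, pronoun="deo"):
    """Build a Sardinian periphrastic verb form from an infinitive."""
    if tense not in PERIPHRASIS_1SG:
        raise ValueError(f"unsupported tense: {tense}")
    return f"{pronoun} {PERIPHRASIS_1SG[tense].format(inf=infinitive)}"


def ordinal(cardinal, article="su"):
    """Build an ordinal as article + "de" + cardinal (e.g. "su de tres")."""
    return f"{article} de {cardinal}"


if __name__ == "__main__":
    print(periphrastic_1sg("fàghere", "future"))
    print(periphrastic_1sg("fàghere", "conditional"))
    print(ordinal("tres"))
    print(ordinal("tres", article="sa"))
```

A single Italian word thus maps to several Sardinian words in both cases, which is why these constructions need transfer rules rather than dictionary entries alone.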

Latest revision as of 11:34, 23 August 2016
