Sardinian and Italian/Final Report
Commit
The following link opens a page showing the skeleton of the translator produced in the project and the timeline of the commits made by Gianfranco Fronteddu and his mentors, Hèctor Alòs i Font and Francis Tyers, over the course of the project, following the schedule and deadlines of the Google Summer of Code program. https://apertium.projectjj.com/gsoc2016/gfro3d.html
Description
The project described here aims to create a rule-based machine translation engine from Italian to Sardinian. It is a collaboration between the Autonomous University of Barcelona and Prompsit, funded by Google through the Google Summer of Code program. Sardinian is a particularly suitable candidate for a rule-based machine translation system, for two reasons. First, it is a language in the process of standardization, so both linguistic resources (written documents and reference works) and technological resources (corpora, publishing products) are scarce. Second, the lack of texts written in accordance with the spelling and vocabulary rules of the new standard form (Limba Sarda Comuna) makes it necessary to opt for a rule-based system rather than one that learns from existing texts. Apertium, which is based on a system of transfer rules and dictionaries written in a markup language, is a platform well suited to translating between languages of the same family (here, the Romance languages), such as Sardinian and Italian. This work also lays the foundation for other language pairs in the near future, such as Sardinian-Catalan and Sardinian-Spanish.
Sardinian Language
Sardinian is a neo-Latin language spoken in Sardinia, which has an area of 24,100 km² and is the second largest island in the Mediterranean Sea. It has about a million speakers. Sardinian has followed its own line of development, which has given it its distinctive characteristics. However, the presence of the various peoples that have succeeded one another on the island over the centuries means that Sardinian, even today, shows the influence of languages such as Catalan, Spanish and Italian. Recently, it has been recognized by UNESCO as an endangered minority language.

Given the great linguistic fragmentation of the language, it was decided to use the proposed orthographic standard LSC (Limba Sarda Comuna), created and recognized by the Autonomous Region of Sardinia in 2006.

During the "Coding Challenge", held in March and April, the skeleton of the new Sardinian dictionary was created on the basis of the existing Italian dictionary: part of the vocabulary was imported, and morphological information on how the forms of each word are built (paradigms) was added. To proceed with the creation of the new Sardinian dictionary, it was necessary to take advantage of the various resources available on the web. For lexical analysis and contrastive lexical selection, it proved invaluable to build corpora of texts written in the LSC variant, taken from online magazines such as "Limbanatziones", "Sa Gazeta" and "Sa limba sarda", and from the Sardinian-language Wikipedia. Particularly useful was CROS (Curretore regionale ortogràficu sardu online), which, besides acting as a spell checker, provided us with a lexically consistent database in LSC and a valid model for creating and assigning paradigms.
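To illustrate what a paradigm contributes to the dictionary, here is a minimal Python sketch of the idea, not the format Apertium actually uses (Apertium dictionaries are written in XML). The paradigm name `dom/o__n`, the tag strings and the helper `expand` are assumptions made for this example and are not taken from the project's dictionary; the only linguistic fact relied on is that the noun "domo" has the plural "domos".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    """A named set of (ending, tags) pairs shared by many words."""
    name: str
    lemma_ending: str
    endings: tuple


# Hypothetical paradigm for nouns like "domo" (plural "domos").
# The name and tag strings imitate Apertium conventions but are
# assumptions, not entries copied from the project's dictionary.
DOMO_N = Paradigm(
    name="dom/o__n",
    lemma_ending="o",
    endings=(("o", "<n><f><sg>"), ("os", "<n><f><pl>")),
)


def expand(stem, paradigm):
    """Map every surface form of a stem to its lemma plus tags."""
    lemma = stem + paradigm.lemma_ending
    return {stem + ending: lemma + tags for ending, tags in paradigm.endings}


for surface, analysis in expand("dom", DOMO_N).items():
    print(f"{surface} -> {analysis}")
```

Because the endings live in the paradigm rather than in each entry, adding a new word of the same type only requires its stem and the paradigm name, which is why importing vocabulary and assigning paradigms could be treated as separate tasks.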
Resources
- Correctore ortogràficu LSC
- Dizionario universale della lingua di Sardegna Italiano-Sardo-Italiano, Edes, 2006, Cagliari
- Normativa ortografica Limba Sarda Comuna
- Analitzadore hunspell
- Glossari italià-sard
- Limbanaztiones.com
- Sardo Logudorese-Italiano
Italian Language
Apertium already had an Italian dictionary, which has nevertheless been revised and updated. A great deal of finishing work was needed on the closed categories, along with the creation and reassignment of some paradigms, especially verbal ones. A particularly significant contribution came from Prompsit, specifically from Gema Ramírez-Sánchez and Marina Loffredo, who happened to be working alongside us on the Italian-Spanish translator and were able to develop and deliver, in July and August, a morphological disambiguation system for Italian. We contributed to its development by adding 30 disambiguation rules.