Grfro3d/proposal apertium cat-srd and ita-srd
Contents
- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in Apertium?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 My proposal
- 6 Include time needed to think, to program, to document and to disseminate.
- 7 List your skills and give evidence of your qualifications.
- 8 List any non-Summer-of-Code plans you have for the Summer.
- 9 Coding Challenge
Contact Information
Name: Gianfranco Fronteddu
Location: Casteddu, Sardigna
E-mail: gfro3d@gmail.com
IRC: gianfranco
SourceForge: gfro3d
Telegram: gianfro4moros
Skype: gianfranco.fronteddu88
Why is it you are interested in machine translation?
I am a Translation student, and Computational Linguistics has fascinated me throughout my university studies. I approached this field through the courses “Theory and techniques of translating” and “Applied linguistics” at the University of Cagliari.
As a support for translation, MT serves two main purposes. The first is "assimilation": using machine translation to understand the general meaning of a text in a foreign language. The second is "dissemination", in which MT is an intermediate step in the production of a document in the target language (TL) that will be published. To facilitate this process, it is common to adopt controlled languages, that is, to restrict texts to sentences whose structures are not too complex.
An important aspect is that of free/open-source software, which may be used for any purpose, examined, and adapted to create new applications. Open-source software can therefore be redistributed and improved. Open-source rule-based MT (RBMT) is thus very useful for a language, because it leads to the creation of linguistic data such as monolingual dictionaries, bilingual dictionaries, grammars and structural transfer rule files. RBMT systems consist of an engine (coding and decoding), linguistic data, and support tools that convert the data into a form compatible with the engine. Although most RBMT systems are proprietary and built for commercial purposes, open-source RBMT systems make it possible not only to use the MT engine but also to modify the code and change the rules.
Finally, the main advantage of building an RBMT system is the growth in linguistic resources: the information collected to develop a machine translation system is easily reusable in other projects and related technologies. In this way, MT can become a real support, especially for minoritised languages in danger of extinction.
Why is it that you are interested in Apertium?
The fact that Apertium is an open-source project means that anyone can contribute to its development. This raises an interesting point about the involvement of minoritised language communities. As a speaker of a minoritised language, Sardinian, I would like to contribute so that my language can become part of the language combinations offered by this tool.
Sardinian is a Romance language, derived from Latin, spoken by approximately one million people on the island of Sardinia. Unfortunately, according to Ethnologue, Sardinian is in danger of extinction; it is also listed in the UNESCO Atlas of the World's Languages in Danger [1]. Linguistic fragmentation and the differences between the various dialects have led to a gradual abandonment of Sardinian in favour of the national language, Italian. It survives as the primary language only in some areas of Sardinia, for example the central ones. The Limba Sarda Comuna (LSC) has been proposed as the standard form for all varieties of Sardinian. It is an evolved version of the Limba Sarda Unificada (LSU), which was in turn the work of a committee of experts convened by the Sardinian government in 2001. In 2006, the Sardinian government adopted the LSC as a co-official language for the publication of official documents. The LSC is also the form chosen by several publishing houses, journals and websites.
However, other Romance languages are also spoken in Sardinia: Tabarchino Ligurian (on the islands of San Pé and Sant'Antióccu), Algherese Catalan (in the city of L'Alguer), Sassarese (in the city of Sassari) and Gallurese Corsican (in Gaddùra). Sardinian and the other minoritised languages of Sardinia are recognised by regional law n. 26 of 15 October 1997 [2] and by Article 6 of the Italian constitution, "La Repubblica tutela con apposite norme le minoranze linguistiche" ("The Republic safeguards linguistic minorities by means of appropriate measures"). In addition, National Law n. 482 of 15 December 1999 [3], "Rules on protection of historical linguistic minorities" [4], makes it possible for regional governments to use local languages at school.
Catalan is spoken in the Sardinian city of Alghero by about 33,000 speakers, 8,600 active and 25,000 passive. According to a study by the Generalitat de Catalunya, Catalan is understood by 60% of the population of Alghero but spoken by only 20%. The dominant presence of Catalan in Alghero dates back to the 14th century, when the Sardinian population was expelled by the Catalan-Aragonese. Later, Catalan acquired a position of prestige in Sardinia. In 1952, Rafael Sari founded the Centre d'Estudis Algueresos for the dissemination and teaching of standard Catalan in Sardinia. Among the important people who directed the institute are Rafael Catardi and Antoni Simon Mossa. The "Escola de Alguerés Pascual Scanu" was founded by Josep Sanna and offers courses in Catalan language and literature. Among the most important magazines is L'Alguer, published only in Catalan.
I am interested in Apertium because it is an open-source platform and because it is well suited to closely related Romance languages. Cat-Srd and Ita-Srd are ideal cases that meet this requirement. These languages have influenced one another: Sardinian still shows many influences from Catalan, owing to their coexistence during the period of the Catalan-Aragonese occupation of Sardinia. All these languages are part of the Sardinian linguistic heritage. This project would give speakers a valuable tool to improve their skills in the standard language and to build new bridges between Sardinia and other linguistic and socio-political realities.
Which of the published tasks are you interested in? What do you plan to do?
I am interested in adopting an unreleased language pair.
Given my background in translation and my knowledge of MT and translation technologies, I plan to improve the language pair cat-srd, which is currently in the staging section [5], and to keep improving ita>srd, which is currently in the trunk section [6].
My proposal
Title
Proposal apertium cat-srd and ita-srd
Reasons why Google and Apertium should sponsor it
Sardinia is an island of great linguistic wealth: five minority languages are spoken there. MT, and an RBMT system such as Apertium in particular, could give the island additional tools for preserving these languages, which are in danger of extinction. Releasing apertium cat-srd would facilitate linguistic exchange both within Sardinia and between Sardinia and the Catalan Countries. There is a wealth of material on sociolinguistics and language activism available only in Catalan, which would be extremely useful for Sardinian language activists, either to increase their own knowledge of these issues or to disseminate it among the population. There is also a huge number of studies in Catalan on Sardinia's history, owing to the long period during which Sardinia belonged to the Crown of Aragon. Besides, the Catalan Wikipedia has more than 500,000 articles. Apertium translators are often used to extend and improve minoritised-language Wikipedias. The problem is that translating articles from the majority/hegemonic language (e.g. Italian) into the minoritised one (e.g. Sardinian) only provides information that is already available to minority language speakers. Translating from another language, such as Catalan, would make the Sardinian Wikipedia more interesting by adding content that is not easily available to the vast majority of Sardinian speakers. Moreover, because of the structure of the Apertium platform, data is easily interchangeable, so this work would benefit all three languages. It would also help Sardinian speakers become familiar with the corresponding standard language.
Catalan to Sardinian (apertium-cat-srd)
The project has already started, and the pair is currently in the staging section. Catalan is definitely the language with the most resources on the Apertium platform. The cat-srd bidix has 2,645 words, with a trimmed coverage of 77.7% and a WER of 34.8%; my goal is to bring coverage up to 86.5% and WER down to 15%. The work I plan to do is to increase bidix coverage by manually adding about 2,000 Catalan-to-Sardinian entries per week, up to 18,000 words. I would also work on transfer rules (for tenses, NPs with possessives, proclitics, enclitics, ordinals, superlatives, etc.) and on lexical selection. It would be better to spend more time on apertium cat>srd than on apertium ita>srd (which is easier), to increase the chances of having a new translator ready, or almost ready, for release at the end of the project.
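As an illustration of what the coverage figure measures, here is a minimal Python sketch of "naive coverage": the share of lexical units in analysed text that receive at least one analysis. It assumes the usual Apertium stream format, in which each unit is written `^surface/analysis$` and an unknown word is echoed back with a leading asterisk. The function name, the sample string and its tags are hypothetical, and the trimmed coverage quoted above may have been computed with a different script.

```python
import re


def naive_coverage(analysed: str) -> float:
    """Return the fraction of lexical units that have at least one analysis."""
    # Each lexical unit is delimited by ^ and $, e.g. ^casa/casa<n><f><sg>$
    units = re.findall(r"\^(.*?)\$", analysed)
    if not units:
        return 0.0
    # Unknown words are assumed to appear as ^word/*word$
    known = [unit for unit in units if "/*" not in unit]
    return len(known) / len(units)


# Hypothetical analyser output: three known units and one unknown word.
sample = (
    "^la/el<det><def><f><sg>$ ^casa/casa<n><f><sg>$ "
    "^xyzzy/*xyzzy$ ^gran/gran<adj><mf><sg>$"
)
print(f"{naive_coverage(sample):.1%}")
```

On a real corpus, the same count would be run over the analyser output for the whole text rather than over a single sentence.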
Apertium Italian to Sardinian (apertium-ita-srd)
Last year, thanks to the Google Summer of Code program, the first Italian-Sardinian translator, apertium ita-srd, was released. The project was a success. All the goals were achieved: the creation of corpora in LSC; the monodixes (apertium-srd-srd.dix: 51,743 words; apertium-ita-ita.dix: 35,099 words); the bidix (apertium-srd-ita.srd-ita.dix): 25,484 words (20,000 in the work plan); coverage: 89.3% and WER: 10.79% [7]. The pending tests page served as the basis for creating 89 transfer rules and 35 lexical selection rules.
At present, on the Italian-to-Sardinian side of the translator, the bidix has 25,500 words, coverage is 89.1% and WER is 31.9%. There are 35 transfer rules (most of them untested) and no lexical selection rules. No work has yet been done on morphological disambiguation in Sardinian.
For apertium ita>srd the goal is to focus on transfer rules. Building on the structural differences highlighted by completing the pending tests, this time the pair requires new transfer rules to handle enclitics and proclitics (Sardinian has words with three enclitics where Italian has two, for example ita portagliene > srd bati·nche·nde·li), ordinals, superlatives and NPs with possessives, as well as work on lexical selection and morphological disambiguation.
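Dictionary sizes like those above can be approximated by counting the `<e>` entries in a .dix file. The following is a minimal sketch, assuming the standard Apertium dictionary XML layout. The two sample entries are illustrative and not taken from the real apertium-srd-ita bidix, the function name is hypothetical, and the word counts reported in this proposal may have been obtained with a different method.

```python
import xml.etree.ElementTree as ET

# Illustrative bidix fragment (Sardinian on the left, Italian on the right).
SAMPLE_BIDIX = """<dictionary>
  <alphabet/>
  <sdefs><sdef n="n"/><sdef n="f"/></sdefs>
  <section id="main" type="standard">
    <e><p><l>domo<s n="n"/><s n="f"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
    <e><p><l>limba<s n="n"/><s n="f"/></l><r>lingua<s n="n"/><s n="f"/></r></p></e>
  </section>
</dictionary>"""


def count_entries(dix_xml: str) -> int:
    """Count every <e> element (one dictionary entry each) in a .dix document."""
    root = ET.fromstring(dix_xml)
    return sum(1 for _ in root.iter("e"))


print(count_entries(SAMPLE_BIDIX))
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring` in the same function.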
Workplan
Week | Dates | Goals | Bidix | WER / PER | Coverage |
---|---|---|---|---|---|
First period | 4 April - 29 May | | | | |
1 | 30 May - 4 June | | ~4,000 | initial: 34.8% | ~79.00% |
2 | 5 June - 11 June | | ~6,000 | | ~81.00% |
3 | 12 June - 18 June | | ~8,000 | | ~82.5% |
4 | 19 June - 25 June | | ~10,000 | | ~83.5% |
5 | 26 June - 2 July | Mentors and students Phase 1 evaluations | ~12,000 | ~25% | ~84.5% |
6 | 3 July - 9 July | | ~14,000 | | ~85.4% |
7 | 10 July - 16 July | | ~16,000 | | ~86.5% |
8 | 17 July - 23 July | | ~18,000 | | ~86.5% |
9 | 24 July - 30 July | | ~18,000 | ~15% | ~87.5% |
10 | 31 July - 6 August | Mentors and students Phase 2 evaluations | 25,500 | initial: 31.9% | 89.1% |
11 | 7 August - 13 August | | ~25,500 | | 89.1% |
12 | 14 August - 20 August | | ~25,500 | ~20% | 89.1% |
13 | 21 August - 27 August | Final evaluation | Words | WER/PER | Coverage |
Project complete: August 29th 2017.
Include time needed to think, to program, to document and to disseminate.
I will need the whole period before the announcement of accepted projects (4 April to 29 May 2017) to become familiar with the transfer rules that must be written to improve translation quality. During this period I will work on the pending tests in both language pairs.
List your skills and give evidence of your qualifications.
Although I do not have a programming background, I am strongly determined to carry out this project and to compensate for my limited knowledge of computational linguistics with maximum dedication. I feel, however, that my skills are adequate for this project for the following reasons.
Firstly, because my native language is Sardinian. I have always spoken Sardinian and I know its characteristics in depth. I am aware of the progress made towards a standard language, and I know the phonetics, grammar and literature of the varieties of my language. I took part in the creation of apertium ita>srd, which uses the LSC (Limba Sarda Comuna) standard exclusively. I then helped to create the Sardware group [8], which localises open-source software into Sardinian, and I collaborated on the localisation of the open-source messaging software "Telegram" into Sardinian. Localising Telegram allowed me to improve my command of the LSC standard and also of Catalan, since we compared our English-to-Sardinian translation choices with the Catalan version of the program. My publications in LSC include an article about Apertium published in issue 80 of the Sardinian magazine Lácanas [9], and the translation from Catalan into Sardinian of an article published in the journal Tradumàtica (UAB): prof. Martín-Mor, “Sa localizatzione de su programma de messagìstica Telegram a su sardu: s’esperièntzia de Sardware e un’aplicatzione de dotzèntzia” [10].
Secondly, because of my educational background. In 2015 I completed my Bachelor's Degree in Foreign Languages. During my studies I took various exams in translation and interpreting from Spanish and English into Italian. In the academic year 2011-2012 I was awarded an Erasmus scholarship, which allowed me to attend courses at the University of the Basque Country. During this experience I participated in an audiovisual translation and subtitling project, which later became the subject of my thesis. I also took a course called "Computers for translators", through which I acquired skills in the use of translation memories and CAT tools (SDL Trados and Wordfast). I have also received specific training in computational linguistics at the University of Cagliari, where I learned to use markup languages (XML and HTML) to create linguistic corpora. At present, I am studying for a Master’s Degree in Translation of specialized texts at the University of Cagliari, where I have been trained in translation technologies and localisation.
Last year I won an Erasmus+ Traineeship scholarship, thanks to which I did a three-month internship at the Tradumàtica Research Group at the UAB (Universitat Autònoma de Barcelona). During my stay in Barcelona I carried out translation and localisation tasks with the aid of CAT tools, focusing especially on free software and minoritised languages. In July 2016 I also attended the Tradumàtica Summer School at the Faculty of Translation and Interpretation of the UAB, which allowed me to develop these skills further. Most importantly, I participated in Google Summer of Code 2016 with Apertium, creating the language pair apertium-ita-srd. My mentors were prof. Adrià Martín of the Tradumàtica Research Group and, from Apertium, Hèctor Alós Font and Francis Tyers. Last year's project was a success: all the goals were achieved [11]. Through this experience I learned more about GNU/Linux, the Apertium platform, XML, and the monodix and bidix formats. I compiled the pending tests page that served as the basis for creating 89 transfer rules and 35 lexical selection rules. I learned to use the svn version control system, and I have since kept in touch with the whole Apertium community through the IRC channel and the Apertium mailing list. From 11 to 22 July 2016 I attended the Rule-based Machine Translation Summer School at the University of Alacant, where I met the Apertium community, attended lessons on four actively developed rule-based machine translation systems (Apertium, Grammatical Framework, Matxin and TectoMT), learned about many other projects and presented my own.
List any non-Summer-of-Code plans you have for the Summer.
I can guarantee 30 hours per week of work on this project. I will finish my studies at the University of Cagliari in July, and my final dissertation will be based on my Google Summer of Code 2016 project, apertium ita-srd. This will give me more theoretical knowledge of the Apertium platform and of RBMT, which will be useful in developing my new project.
Coding Challenge
To check the current status of the two language pairs, I calculated WER and PER.
Apertium cat-srd: to evaluate this pair, a text of about 300 words was taken from the Catalan Wikipedia, translated and post-edited. The WER was calculated with dwdiff [12], giving a result of 34.8%. (https://apertium.projectjj.com/trac/changeset?old_path=%2Fstaging%2Fapertium-cat-srd&old=77481&new_path=%2Fstaging%2Fapertium-cat-srd&new=77481)
Apertium ita-srd: to evaluate the WER, a 300-word LSC text was taken, translated and post-edited. The WER was calculated with the apertium-eval-translator [13] script, with these results:
Evaluation: Results when removing unknown-word marks (stars)
Edit distance: 92
Word error rate (WER): 31.94 %
Number of position-independent correct words: 206
Position-independent word error rate (PER): 28.47 %
(https://apertium.projectjj.com/trac/changeset/77482/trunk/apertium-srd-ita)
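For readers unfamiliar with these metrics, the sketch below shows one common way to compute WER (word-level edit distance divided by the reference length) and PER (the same idea, ignoring word order). It is a simplified Python illustration under my own assumptions, not the code of apertium-eval-translator or dwdiff, whose exact definitions may differ. The function names and the toy tokens are hypothetical.

```python
from collections import Counter


def edit_distance(hyp: list[str], ref: list[str]) -> int:
    """Word-level Levenshtein distance between MT output and post-edited text."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(hyp: list[str], ref: list[str]) -> float:
    """Word error rate: edit distance normalised by the reference length."""
    return edit_distance(hyp, ref) / len(ref)


def per(hyp: list[str], ref: list[str]) -> float:
    """Position-independent error rate, based on bag-of-words overlap."""
    correct = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - correct) / len(ref)


# Toy example with placeholder tokens: two words swapped and one word wrong.
reference = "w1 w2 w3 w4 w5".split()
hypothesis = "w1 w3 w2 w4 w6".split()
print(f"WER {wer(hypothesis, reference):.0%}, PER {per(hypothesis, reference):.0%}")
```

Under this definition, the figures reported above are consistent with a post-edited reference of 288 words (92 / 288 ≈ 31.94% and (288 - 206) / 288 ≈ 28.47%). That length is inferred from the numbers and is not stated in the script output.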
Here you can find details about my participation in GSoC 2016 [14].