Difference between revisions of "Grfro3d/proposal apertium cat-srd and ita-srd"
Revision as of 22:19, 1 April 2017
Contact Information
Name: Gianfranco Fronteddu
Location: Casteddu, Sardigna
E-mail: gfro3d@gmail.com
IRC: gianfranco
SourceForge: gfro3d
Telegram: gianfro4moros
Skype: gianfranco.fronteddu88
Why is it you are interested in machine translation?
I am a Translation student and have been fascinated by Computational Linguistics throughout my university studies. We approached this field of Linguistics through the courses “Theory and techniques of translating” and “Applied linguistics” at the University of Cagliari. As a support to translation, MT has two main uses. One is "assimilation": using machine translation to understand the general meaning of a text in a foreign language. The other is "dissemination", in which MT is an intermediate step in producing a document in the target language (TL) that will then be published. To facilitate this process, it is usual to adopt controlled languages, that is, to restrict the text to phrases whose structures are not too complex. An important aspect is free/open-source software, which can be used for any purpose, examined, and adapted to create new applications. Open-source software can therefore be redistributed and improved. Open-source RBMT is thus very useful for a language, because it leads to the creation of linguistic data such as morphological dictionaries, bilingual dictionaries, grammars, and structural transfer rule files. RBMT systems consist of an engine (coding and decoding), linguistic data, and support tools that convert the data and make them compatible with the engine. Although most RBMT systems are proprietary and were created for commercial purposes, open-source RBMT makes it possible both to take advantage of the MT engine and to work on the code to modify the rules. Finally, the main advantage of creating an RBMT system is the growth of linguistic resources: the information collected while developing a machine translation system is easily reusable for other projects and related technologies. In this way, MT can truly become a good support, especially for minoritised languages in danger of extinction.
Why is it that you are interested in Apertium?
Because Apertium is an open-source project, anyone can contribute to its development. This raises an interesting point about the involvement of minoritised language communities. As a speaker of a minoritised language, Sardinian, I would like to contribute so that my language becomes part of the language combinations offered by this tool. Sardinian is a Romance language, derived from Latin, spoken by approximately one million people on the island of Sardinia. Unfortunately, according to Ethnologue, the Sardinian language is in danger of extinction. Linguistic fragmentation and the differences between the various dialects have led to a gradual abandonment of Sardinian in favour of the national language, Italian. It survives as the primary language only in some areas of Sardinia, for example the central ones. See also the UNESCO Atlas of the World's Languages in Danger (http://www.unesco.org/languages-atlas/index.php).
The Limba Sarda Comuna (LSC) has been proposed as the standard form for all varieties of Sardinian. It is an evolved version of the Limba Sarda Unificada (LSU), which was in turn the result of the work of an experts' committee convened by the Sardinian government in 2001. In 2006, the Sardinian government adopted the LSC as a co-official language for the publication of official documents. The LSC is also the form chosen by several publishing houses, journals and online sites. However, other Romance languages are also spoken in Sardinia: Tabarchino Ligurian (on the islands of San Pé and Sant'Antióccu), Algherese Catalan (in the city of L'Alguer), Sassarese (in the city of Sassari) and Gallurese Corsican (in Gaddùra). Sardinian and the other minoritised languages of Sardinia are recognised by regional law n. 26 of 15 October 1997 [1] and by the Italian constitution (according to Article 6, "La Repubblica tutela con apposite norme le minoranze linguistiche", that is, "The Republic safeguards linguistic minorities by means of appropriate measures"). Law n. 482 of 15 December 1999 (http://www.camera.it/parlam/leggi/99482l.html), "Rules on protection of historical linguistic minorities", makes it possible for regional governments to use local languages at school.
Catalan is spoken in the Sardinian city of Alghero by about 33,000 speakers, 8,600 active and 25,000 passive. According to a study by the Generalitat de Catalunya, Catalan is understood by 60% of the population of Alghero but spoken by only 20%. The dominant presence of Catalan in Alghero dates back to the 14th century, when the Sardinian population was expelled by the Catalan-Aragonese. Later, Catalan assumed a position of prestige in Sardinia. In 1952, Rafael Sari founded the Center d'Estudis Algueresos for the dissemination and teaching of the standard Catalan language in Sardinia. Among the important people who directed the institute are Rafael Catardi and Antoni Simon Mossa. The "Escola de Alguerés Pascual Scanu" was founded by Josep Sanna and offers courses in Catalan language and literature. Among the most important magazines is L'Alguer, published only in Catalan.
I am interested in Apertium because it is an open-source platform and because it is well suited to closely related Romance languages. Cat-srd and ita>srd are perfect cases that meet this requirement. Given the influence these languages have had on one another, it is easy to see that Sardinian still shows the influence of Catalan in many respects, because of their coexistence during the Catalan-Aragonese occupation of Sardinia. The same goes for the language pair co-eng and, in this case, the similarity is greater. All these languages are part of the Sardinian linguistic heritage.
This project would give speakers a valuable tool to improve their skills in the standard language and to build new bridges between Sardinia and the linguistic and socio-political realities.
Which of the published tasks are you interested in? What do you plan to do?
I am interested in adopting an unreleased language pair.
Considering my background in translation studies and my knowledge of MT and translation technologies, I plan to improve the language pair cat-srd, which is currently in the staging section [2], and to keep improving ita>srd, which is currently in the trunk section [3].
My proposal
Title: Apertium cat>srd; Apertium ita>srd
Sardinia is an island with great linguistic wealth: five minority languages are spoken there. MT, and especially an RBMT system such as Apertium, could provide additional tools for the preservation of these languages, which are in danger of extinction. The release of a version of apertium cat-srd would facilitate linguistic exchanges, both within Sardinia and between Sardinia and the Catalan Countries. It would allow Sardinian speakers to draw closer to their local variants, but also to become familiar with the corresponding standard language. Moreover, because of the structure of the Apertium platform, data is easily interchangeable, and this kind of work would surely benefit all 3 languages. It could also become a valuable resource for teaching Algherese Catalan in schools.
Apertium cat-srd: The project has already started and the bidix is currently in the Trunk section. Catalan is definitely the language with the most resources in the Apertium platform. The cat-srd bidix has 2,645 words, with a trimmed coverage of 77.7% and a WER of 34.8%; my goal is to bring coverage up to 86.5% and WER down to 15%. The work I plan to do is to increase bidix coverage by manually adding Catalan-to-Sardinian entries at a rate of 2,000 words/week, up to 18,000 words, including toponyms and family names. I would also work on transfer rules (for tenses, NPs with possessives, proclitics, enclitics, ordinals, superlatives) and on lexical selection. My intention is to spend ⅔ of the available time (from April 4, 2017 to June 28, 2017) on cat-srd *.
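The WER figures quoted in this proposal can be read as a word-level edit distance between the raw MT output and a post-edited reference, divided by the length of the reference. The following is a minimal Python sketch under that assumption; the function name `word_error_rate` and the placeholder sentences are illustrative only and are not part of any Apertium tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace, which is
    a simplification of what a real evaluation script would do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Catalan or Sardinian text.
print(word_error_rate("a b c d", "a x c"))
```

In this reading, lowering WER from 34.8% to 15% means that a post-editor would need to change roughly 15 words in every 100 words of output instead of about 35.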
Apertium srd-ita/ita-srd: Last year, thanks to the Google Summer of Code program, the first Italian-Sardinian translator, apertium ita-srd, was released. The project was a success and all the goals were achieved: the creation of corpora in LSC; a Sardinian bidix (apertium-srd-ita.srd-ita.dix) of 25,484 words (20,000 in the work plan); monodixes: apertium-srd-srd.dix with 51,743 words and apertium-ita-ita.dix with 35,099 words; coverage of 89.3% and WER of 10.79% [4]. The pending tests page served as the basis for creating 89 transfer rules and 35 lexical selection rules.
Now coverage is 89.1% and WER is 31.9%. Bidix: 25,500 words.
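A coverage figure like the one above can be approximated as the share of tokens that receive at least one analysis. The sketch below assumes analyser output in the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word is marked with an asterisk (`^word/*word$`). The function name `naive_coverage` and the sample string, including its tags, are made up for illustration, and escaped `$` characters are not handled.

```python
import re

# One lexical unit: everything between '^' and the next '$'.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Fraction of lexical units whose first analysis is not '*unknown'."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found in input")

    known = 0
    for unit in units:
        # Split 'surface/analysis[/analysis...]' at the first slash.
        _surface, _sep, analyses = unit.partition("/")
        if not analyses.startswith("*"):
            known += 1
    return known / len(units)


# Illustrative input only: the tags are not taken from a real dictionary.
sample = (
    "^una/unu<det><ind><f><sg>$ ^domo/domo<n><f><sg>$ "
    "^manna/mannu<adj><f><sg>$ ^xyzzy/*xyzzy$"
)
print(naive_coverage(sample))
```

Three of the four units in the sample have an analysis, so the sketch reports a coverage of 0.75 for it.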
As for apertium ita>srd, the goal is to focus on transfer rules: given the structural differences highlighted by the completion of the pending tests, this time srd>it requires new transfer rules to handle enclitics and proclitics (in Sardinian there are words with 3 enclitics where Italian has 2, for example it portagliene > srd bati·nche·nde·li), ordinals, superlatives and NPs with possessives.
Lexical selection
* It would be better to spend more time on apertium cat-srd, to improve the chances of having a new translator ready for release at the end of the project.
Workplan
Week | Dates | Goals | Bidix | WER / PER | Coverage |
---|---|---|---|---|---|
First period | 4 April - 29 May | | | | |
1 | 30 May - 4 June | | ~4,000 | initial: ~34.8% | ~79.00% |
2 | 5 June - 11 June | | ~6,000 | | ~81.00% |
3 | 12 June - 18 June | | ~8,000 | | ~82.5% |
4 | 19 June - 25 June | | ~10,000 | | ~83.5% |
5 | 26 June - 2 July | Deliverable #1 | ~12,000 | ~25% | ~84.5% |
6 | 3 July - 9 July | | ~14,000 | | ~85.4% |
7 | 10 July - 16 July | | ~16,000 | | ~86.5% |
8 | 17 July - 23 July | | ~18,000 | | ~86.5% |
9 | 24 July - 30 July | Mentors and students Phase 2 evaluations | ~18,000 | ~15% | ~87.5% |
10 | 31 July - 6 August | | ~65,000 | ~22% | ~90.7% |
11 | 7 August - 13 August | | ~25,500 | ~31.9% | ~81.9% |
12 | 14 August - 20 August | | ~25,500 | ~31.9% | ~81.1% |
13 | 21 August - 27 August | Final evaluation | ~25,500 | ~31.9% | ~81.1% |
Project complete: August 29th 2017.