Difference between revisions of "Sardu abbarra bivu!"

From Apertium

Latest revision as of 00:41, 4 June 2018

Name: Gianfranco Fronteddu

E-mail address: gfro3d@gmail.com

Other information that may be useful to contact you:

Telegram username: gianfro4moros Skype: gianfranco.fronteddu88

Why is it you are interested in Machine Translation?
I am a Translation student and have been fascinated by Computational Linguistics throughout my university studies. I approached this field through the courses “Theory and techniques of translating” and “Applied linguistics” at the University of Cagliari. These gave me a general understanding of Machine Translation (MT) and its history, from the first engines of the 1950s, based on bilingual dictionaries and word-for-word translation, to the present day, including some significant advances in the discipline: the IBM system of the 1950s, with 250 words and 6 grammar rules, and, in 1983, the first automatic translation program for PC, which was immediately adopted by many big companies such as IBM.
There are two main approaches to MT: Statistical (SMT) and Rule-Based (RBMT). RBMT includes translation based on the principle of transfer, in which words are translated from a purely linguistic point of view by choosing the appropriate linguistic equivalent. Many well-known MT systems are rule-based; among the most popular are Apertium and Lucy Translator. The other main approach, SMT, relies on parallel corpora containing real texts and their translations, and generates a translation using statistical methods based on bilingual and monolingual corpora. More recent approaches include neural MT and context-based MT, which chooses the best translation of a word by considering the words that surround it. Context-based MT has a major advantage over corpus-based MT: adding new languages is very easy. Creating a new language pair does not require corpora with millions of words, as statistical methods do; it takes only two smaller corpora and a dictionary containing rules to conjugate verbs and to make nouns and adjectives agree.
As for the effects of MT on professional translation, MT allows translators to increase their output (more words per hour) and to offer a broader range of services to clients. However, MT is often criticised because it allegedly produces poorer results; in my opinion, quality must be achieved through the cooperation of the MT engine and the translator. In other words, no one is better placed than a translator trained in the use of MT to prepare texts and to correct MT output, since he or she will be aware, even before the translation takes place, of the problems that might arise during the translation phase.
Finally, as described below, MT, together with other translation technologies, has proved to be a crucial tool for the survival of endangered languages.

Why is it that you are interested in Apertium?
Apertium originated as one of the MT engines in the OpenTrad project, funded by the Spanish Government. It was designed primarily to translate between similar languages, although it has recently been expanded to handle more divergent language pairs. New language pairs can be added by creating dictionaries and rules containing linguistic data in XML format. Because Apertium is an open-source project, anyone can contribute to its development. This raises an interesting point about the involvement of minoritised language communities. Being myself a speaker of a minoritised language, Sardinian, I would like to contribute so that my language can become one of the language combinations offered by this tool.
Sardinian is a Romance language, derived from Latin, spoken on the island of Sardinia. After the Roman Empire, the island was subjected over the centuries to domination by various peoples: Vandals, Pisans and Genoese, Aragonese and Spanish, and finally Piedmontese and Italian. The Sardinian language has survived every domination, although it has been influenced by the language of each period. Nowadays, Sardinian is a language system with two main variants: Campidanese (https://www.ethnologue.com/language/sro), spoken in central and southern Sardinia, and Logudorese (https://www.ethnologue.com/language/src), spoken in central and northern Sardinia. According to Ethnologue, unfortunately, the Sardinian language is in danger of extinction. Linguistic fragmentation and the differences between the various dialects have led to a gradual abandonment of Sardinian in favour of the national language, Italian. It survives as the primary language only in some areas of Sardinia, for example the central ones.
The UNESCO Atlas of the World's Languages in Danger (http://www.unesco.org/languages-atlas/index.php) reports that Logudorese is spoken mainly in the central part of Sardinia by about 400,000 people, whereas Campidanese is spoken in the south of Sardinia by about 900,000 people. To prevent the extinction of the language, a standardisation project was initiated in order to create a new grammar and a new spelling, valid for everyone, which took the name of LSC (Common Sardinian Language). An MT engine would be of great use to the language: in particular, an RBMT system such as Apertium would ease written production, which is essential to complete the process of standardisation. In neighbouring cases such as that of Catalonia, MT has increased the presence of the Catalan language at various levels. For instance, since 1997, when a Catalan newspaper first started publishing a bilingual daily edition (Catalan/Spanish), at least three other newspapers have taken the same step: El Periódico, El País and La Vanguardia, at least two of them in print.

Which of the published tasks are you interested in? What do you plan to do?
I am interested in adopting an unreleased language pair. Considering my background in translation studies and my knowledge of MT and translation technologies, I plan to build the Italian-Sardinian language pair. It must be acknowledged that some work on the Sardinian language has already been carried out, as can be seen in the Apertium Incubator (http://wiki.apertium.org/wiki/Incubator). Specifically, some dictionaries are available for the language pairs Catalan-Sardinian, Portuguese-Sardinian and Italian-Sardinian.

My proposal.
Title: Sardu, abbarra vivu! (Sardinian, keep yourself alive!)
The project I intend to carry out is the creation of an MT engine for the language pair Italian-Sardinian based on the Apertium platform. As pointed out above, MT is crucial for the survival of minoritised languages. Apertium, having led the development of RBMT engines in recent years, provides an excellent framework for language pairs of the same linguistic family without the need for linguistic corpora. The experience of Apertium with several minoritised languages such as Occitan, Asturian or Maltese proves that such a project is viable.
Google and Apertium would benefit from this project, not only because it would contribute to open-source software and minority languages, but especially because it would have a great impact on Sardinian society, since at present there is no MT system for the Sardinian language, either from Google or from Apertium. As for the beneficiaries, the examples given above for similar cases (such as that of the Catalan language) show that the outcome of this project might have a commercial impact as well: media such as newspapers, magazines and websites, as stakeholders in written production, could be interested in including the MT system in their publication workflows and therefore in hiring experts to customise and improve the engine. Furthermore, such a tool could also have an educational impact, because new generations could gain access to the Sardinian language.

Include time needed to think, to program, to document and to disseminate.
Since I have no previous experience in building language pairs or in the workings of Apertium, I estimate that the first four weeks (from April 22nd 2016 to May 22nd 2016) will be spent acquiring knowledge and understanding of the Apertium framework.

Work plan
From May 23rd to August 23rd 2016
Week 1: From May 23rd to May 30th 2016. 30 hours. Look for existing Italian dictionaries. Install lttoolbox (>= 3.3.0), apertium (>= 3.3.0) and a text editor. Set up the basic XML skeleton files for the Sardinian and Italian morphological dictionaries (wget; python3 apertium-init.py ita; python3 apertium-init.py sc; python3 apertium-init.py ita-sc).
Week 2: From May 31st to June 6th 2016. 30 hours. Create the three directories apertium-ita, apertium-sc and apertium-ita-sc. Start creating the Sardinian morphological dictionary (Alphabet, Symbols, Paradigms, Standard sections). For spelling rules I will refer to the LSC (Common Sardinian Language) (http://www.regione.sardegna.it/documenti/1_72_20060418160308.pdf).
Week 3: From June 7th to June 14th 2016. 40 hours. Work on the Sardinian morphological dictionary.
Week 4: From June 15th to June 19th 2016. 20 hours. Work on the Sardinian morphological dictionary.
From June 20th to June 25th: pause due to academic commitments.
Deliverable #1, June 27th 2016 (Google midterm evaluation): Italian and Sardinian morphological dictionaries.
Week 5: From June 27th to July 4th 2016. 30 hours. Learn how bilingual dictionaries are generated. Create the bilingual dictionary file apertium-ita-sc.sc-ita.dix.
Week 6: From July 5th to July 11th 2016. 30 hours. Start generating the bilingual dictionary: create the basic XML skeleton and add a first entry to translate between Italian and Sardinian words.
Week 7: From July 12th to July 19th. 30 hours. Work on the bilingual dictionary.
Week 8: From July 20th to July 27th. 30 hours. Work on the bilingual dictionary.
Deliverable #2, July 28th 2016: Bilingual dictionary.
Week 9: From July 29th to August 5th. 30 hours. Create the transfer rule file. Set up the basic skeleton and, in particular, the input/output rules for grammatical symbols.
Week 10: From August 6th to August 11th 2016. 30 hours. Work on the transfer rule file, defining categories and symbols. I will also try to reuse the work already done for existing language pairs (http://wiki.apertium.org/wiki/Incubator).
Week 11: From August 12th to August 19th. Review and finalisation of the project.
Week 12: From August 19th to August 26th. Submission of the project to the mentors for the final evaluation.
Project complete: August 29th 2016.
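As a rough sketch of the dictionary work described in the plan (alphabet, symbols, paradigms and a standard section, followed by one bilingual entry), the following Python snippet builds a tiny example using only the standard library. The element names follow the Apertium dictionary format as I currently understand it. The helper names build_monodix and bidix_entry, the paradigm name, the tag set and the sample word pair (Italian "casa", Sardinian "domo") are hypothetical and chosen only for illustration. The real dictionaries would be written by hand, would be far larger and would follow the Apertium documentation.

```python
import xml.etree.ElementTree as ET


def build_monodix(alphabet, symbols, paradigms, entries):
    """Build a minimal monolingual dictionary tree.

    paradigms maps a paradigm name to (surface ending, lemma ending, tags) tuples;
    entries is a list of (lemma, stem, paradigm name) tuples.
    """
    root = ET.Element("dictionary")
    ET.SubElement(root, "alphabet").text = alphabet

    # Symbol definitions: every grammatical tag used below is declared here.
    sdefs = ET.SubElement(root, "sdefs")
    for name in symbols:
        ET.SubElement(sdefs, "sdef", n=name)

    # Paradigms: each form pairs a surface ending with a lemma ending plus tags.
    pardefs = ET.SubElement(root, "pardefs")
    for par_name, forms in paradigms.items():
        pardef = ET.SubElement(pardefs, "pardef", n=par_name)
        for surface_ending, lemma_ending, tags in forms:
            pair = ET.SubElement(ET.SubElement(pardef, "e"), "p")
            ET.SubElement(pair, "l").text = surface_ending
            right = ET.SubElement(pair, "r")
            right.text = lemma_ending
            for tag in tags:
                ET.SubElement(right, "s", n=tag)

    # Standard section: each entry is a stem that points to a paradigm.
    section = ET.SubElement(root, "section", id="main", type="standard")
    for lemma, stem, par_name in entries:
        entry = ET.SubElement(section, "e", lm=lemma)
        ET.SubElement(entry, "i").text = stem
        ET.SubElement(entry, "par", n=par_name)
    return root


def bidix_entry(source_lemma, target_lemma, tags):
    """Build one bilingual entry pairing a source lemma with a target lemma."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", source_lemma), ("r", target_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            ET.SubElement(node, "s", n=tag)
    return entry


# Illustrative data only: one Sardinian noun with a singular and a plural form.
monodix = build_monodix(
    alphabet="abcdefghijklmnopqrstuvwxyz",
    symbols=["n", "f", "sg", "pl"],
    paradigms={
        "dom/o__n": [
            ("o", "o", ["n", "f", "sg"]),
            ("os", "o", ["n", "f", "pl"]),
        ]
    },
    entries=[("domo", "dom", "dom/o__n")],
)
pair_entry = bidix_entry("casa", "domo", ["n"])

print(ET.tostring(monodix, encoding="unicode"))
print(ET.tostring(pair_entry, encoding="unicode"))
```

The output is deliberately left unindented, on the assumption that extra whitespace inside the left and right sides of an entry could change the forms; in practice the generated XML would still need to be checked with the lttoolbox tools before use.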

List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.
Although I do not have a programmer's profile, I am strongly determined to carry out this project and to compensate for my lack of knowledge of computational linguistics with the maximum dedication. I feel, however, that my skills are adequate for this project, for the following reasons.
Firstly, my native language is Sardinian. I have always spoken Sardinian and I have a deep knowledge of its characteristics. I am aware of the advances that have been made towards a standard language, and I know the phonetic, grammatical and literary features of the variants of my tongue.
Secondly, my educational background. In 2015 I completed my Bachelor's Degree in Foreign Languages. In the course of my studies I took various exams in translation and interpreting from Spanish and English into Italian. In the academic year 2011-2012 I was awarded an Erasmus scholarship, which allowed me to attend courses at the University of the Basque Country. During this experience I participated in a project on audiovisual translation and subtitling, which then became the subject of my thesis. I also took a course called "Computers for translators", through which I acquired skills in the use of translation memories and CAT tools (SDL Trados and Wordfast). I have also received specific training in computational linguistics at the University of Cagliari, where I learned to use markup languages (XML and HTML) for the creation of linguistic corpora. At present, I am studying for a Master’s Degree in Translation of specialised texts at the University of Cagliari, where I have been trained in translation technologies and localisation.
My passion for translation technologies led the University of Cagliari to select me to teach a course on Computer Translation during the month of March, in which I taught Audiovisual Translation and CAT tools for 40 hours. Finally, I have recently won a scholarship, provided by the Erasmus+ Traineeship programme, thanks to which I will do an internship over the next three months at the Tradumàtica Research Group at the UAB (Universitat Autònoma de Barcelona). During my internship in Barcelona, I will carry out tasks related to translation and localisation with the aid of CAT tools, focusing especially on free software and minoritised languages. I am convinced that what I learn with the Tradumàtica Research Group at the UAB will allow me to carry out the project I am submitting successfully.
Although I have never developed an open-source project myself, I have participated in open-source projects that involved modifying the source code of software in order to translate it into other languages. For instance, I have participated in the localisation of the open-source instant messaging client Telegram into Sardinian, for both the iPhone and Android platforms, and I plan to complete these translations over the next few weeks.

List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.
I can guarantee 30 hours per week to work on this project. I will finish my studies at the University of Cagliari in June, and during my internship in Barcelona I will be able to work on this project. My obligations, therefore, leave me plenty of hours, especially at weekends, to devote to it. My stay in Barcelona will end on July 31st 2016. There will be a pause from June 19th to June 25th 2016, due to academic commitments. I plan to increase the number of working hours during the preceding and following weeks so that this pause will not affect the overall calendar.