Difference between revisions of "User:Nemo bis/English and Italian"

Latest revision as of 15:20, 21 August 2014

I'm interested in English and Italian. This page is structured as a stub of a GSoC application because I thought of applying in 2014, but I won't [1]: Nikerabbit prefers me to be the primary/official mentor of a MediaWiki GSoC project, as I had previously promised.

Personals

Name
Federico Leva
E-mail address
FirstLastname@tiscali.it
Other information that may be useful to contact you
Email is a safe bet with me (unless it bounced back to you, of course): if it is clear that I have to reply or act on it, then it's in the queue. I enable email notifications on all wikis where it's possible, so you also have a few hundred talk pages available, depending on the topic.
Why is it you are interested in machine translation? 
Because I'm interested in language and i18n/l10n and I've been working on it since 2005 or so. I'm translatewiki.net's "pokemaster" (as Nikerabbit once called me) and I'm active as a MediaZilla triager (in the all-time top 10 for some activity metrics), as well as doing some i18n code tweaking. Plus other Wikimedia stuff you can find by following the links from my user page, too much to list. Machine translation has always been a hot topic in Wikimedia.
Why is it that you are interested in the Apertium project? 
I first came across it on translatewiki.net around 2009–2010, I guess; my interest was revived when Niklas took a course on it by Francis Tyers and Tommi Pirinen in 2013 (On course to machine translation).
Studies
undergraduate, maths at unimi.it

Project

English and Italian!

  • The pair is not released yet, so the GSoC project would actually be two in one: Adopt a language pair + Make a language pair state-of-the-art.
  • Why this pair?
    • I want to contribute to Apertium, also because I want to contribute to the Translate extension and to the projects using Translate, and I want to do so in a way that is special, doing something that nobody else is able to do or interested in doing. Providing Apertium, and hence Translate, with a translation pair seems to be the best possible way for me.
    • I tried asking around about who else in Italy could be interested in developing this Apertium pair, but the university professors I reached out to couldn't think of anyone interested in any kind of work in this area in Italy. There are also very few MediaWiki/Wikimedia developers (zero) and FLOSS developers (including GSoC students: google-opensource.blogspot.it/2013/08/google-summer-of-code-full-of-stats.html) from Italy. I guess the environment is not favourable and we may never find anyone interested. Despite everything, and until the imminent implosion of our decaying culture, Italian is still an important language, and having such a pair would certainly be something to be proud of for Apertium and Google, if we succeeded.
    • I have an interest in multilingual Wikimedia projects, for a number of reasons. The Italian statistics at Wiktionary:Meta:Special:LanguageStats/it are not good at all; Italian translation is not in the top 20 by activity despite Italian language projects being in the top 7 by visits: this depresses me. However, I already put too much time into Wikimedia, and one of the few rules I manage to respect is that I don't do translations myself unless there is some important text to proofread, otherwise it would be a full-time job on its own. Instead, I can provide the existing translators with a tool to make their work easier. We can't use Google Translate on Wikimedia projects due to licensing and privacy policy, but we could use Apertium. Our translation memory is fantastic, but it helps only to a degree.
    • I'm a language geek, but I'm not that good at learning new languages. I know Italian extremely well due to extensive literature reading and intense grammar studying/discussing; and I'm good enough at English thanks to many years of daily written use and a passion for Edward Morgan Forster and Virginia Woolf in the original language... but I don't know other languages in a useful way. I can understand Milanese and Venetian for family reasons, but not speak them; and they would not be useful for point (1).
  • How is it going to work?
    • User:Francis Tyers indicates as the first requirement for a successful pair "Not in Google or can get better quality than in Google". Google seems to be at around 20-25 % WER (word error rate, where lower is better), so the europarl/moses baseline is probably around 25-30 %, so we'd need to get down to 30-35 % WER to be useful and have a reasonable starting point. It doesn't seem impossible.
    • Most of the current stub in the incubator comes from the work on Spanish. English and Spanish is a released pair, so it should be possible to reach the same status for Italian and probably to reuse some more of the work done on that pair since 2010, when the last changes to en-it were made.
  • What are your qualifications? Why do you think you can manage?
    • The following skills are requested by the project description:
      • XML: well, this is "my bread", as we say in Italian; I have handled terabytes of XML by exporting wikis with wikiteam;
      • a scripting language (Python, Perl): I don't call myself a coder, but I've been the de facto maintainer of the wikiteam scripts since about 2012 and I've worked with pywikibot since 2006 (especially grammar fixes and other bot replacements; a love story with regex);
      • good knowledge of the language pair adopted: see above; as past experience, I translated phpBB in 2005, when I was a moderator of one of the biggest Italian studies forums.[2]
    • Dedication: I rarely give up on a project I take up, even though sometimes it takes much longer than expected. For instance, I started a project to import a public domain Italian dictionary into the Italian Wiktionary around 2009. It proved much harder than expected and, in practice, nobody ever helped me, but I never gave up. Slowly, it keeps progressing (in bursts). Experience with Apertium should also help me when the time comes to import the transcribed dictionary into Wiktionary, as I probably won't find a bot owner to delegate this task to.
    • Coding challenge: ... (I expect to be able to do a variant of both in less than a day of work; will surely try at some point just for fun)
    • Other
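
For reference, the WER percentages quoted under "How is it going to work?" are word error rates. This is a minimal sketch of the usual definition (word-level edit distance divided by the number of words in the reference translation), not the evaluation tooling Apertium actually uses; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenisation is plain
    whitespace splitting, which real evaluation tools may refine.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "il gatto dorme sul divano"
    hypothesis = "il gatto dorme su divano"
    # One substituted word out of five reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

With this definition, a 30 % WER means that roughly three words in every ten of the machine output would have to be inserted, deleted or replaced to match the reference translation.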

Schedule

I'd make one if I actually applied! It would be interesting to know how much relative effort each part would require, though. Is there a case study on how much work was put into the various phases of the en-es pair?

I don't plan any full-time job this summer; I hope to attend a summer course in Finnish in July-August, depending on available time and accommodation; the main exam sessions in my department are in June, July and September; if this project takes more time than expected, I can always reduce the huge pool of hours committed to Wikimedia/MediaWiki stuff (it would be nice to learn once again how to limit that ;), I manage only when forced by other commitments).