User:Spiegelian

From Apertium
Revision as of 17:54, 21 March 2014 by Spiegelian

GSoC application: apertium hbs-eng, adopting a language pair

Barbara Dujmić

bdujmic@ffzg.hr

spigelian on IRC

Why is it you are interested in machine translation?

One of my majors is Linguistics; the other is English Language and Literature. Machine translation gives quick and easy access and insight into a great number of better- and lesser-known languages that I have no formal knowledge of, which appeals to me as a linguist. Beyond that, I have always been interested in programming and was intrigued by the more computer-oriented (and logical) aspects of contemporary linguistics. This led me to take courses related to computational linguistics so that I could possibly specialize in this field for my MA. I also have a more practical reason, related to my other major: translation is another possibility for my MA and future work, and machine translation -- when done properly and coupled with considerable empirical and theoretical knowledge of translating between languages -- can be a great aid, providing quick reference and speeding up the process, resulting in better overall productivity.

Why is it that you are interested in the Apertium project?

The goals of Apertium are well aligned with my interests. I will be able to contribute my knowledge of linguistics and to improve my programming skills by scripting various tasks.

Which of the published tasks are you interested in?

Adopting the Serbo-Croatian<->English language pair

Why should Google and Apertium sponsor it?

English and Serbo-Croatian aren't closely related, which has made machine translation (MT) between the two difficult. Since Apertium's groundwork for hbs-eng has already been laid, this is a great opportunity to build on it and take a significant step towards an efficient MT system.

How and whom it will benefit in society?

Besides benefiting those who are learning Serbo-Croatian or English (hbs-eng in either direction), I would introduce my professors and colleagues to this project. I believe the project would make a good addition to our linguistics courses -- having to think in terms of algorithms forces one to grasp the matter better. In particular, it would benefit students who wish to delve deeper into the subject of MT.


What do you plan to do?

While the hbs-eng.eng.dix and hbs-eng.hbs.dix files contain a lot of data, hbs-eng.hbs-eng.dix contains only a handful of pairs. I would primarily work on expanding that, dealing with other tasks as they appear.
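As a rough illustration of how expanding the bilingual dictionary could be partly scripted, here is a minimal Python sketch that turns a word list into entries in the usual Apertium bilingual dictionary form (`<e><p><l>…</l><r>…</r></p></e>`). The word list, the function name `bidix_entry` and the tags chosen are assumptions made for the example only; real entries would need whatever tags the hbs and eng monolingual dictionaries actually use (gender on nouns, for instance).

```python
from xml.sax.saxutils import escape


def bidix_entry(left, right, left_tags, right_tags):
    """Build one bilingual dictionary <e> element as a string.

    Hypothetical helper for illustration; not part of Apertium itself.
    """
    def side(lemma, tags):
        # Escape XML special characters, then write spaces inside a
        # multiword lemma as <b/>, as Apertium dictionaries do.
        text = escape(lemma).replace(" ", "<b/>")
        return text + "".join('<s n="%s"/>' % tag for tag in tags)

    return "<e><p><l>%s</l><r>%s</r></p></e>" % (
        side(left, left_tags),
        side(right, right_tags),
    )


# Assumed sample data: (hbs lemma, eng lemma, part-of-speech tag).
pairs = [
    ("kuća", "house", "n"),
    ("pas", "dog", "n"),
    ("brz", "fast", "adj"),
]

for hbs, eng, pos in pairs:
    print(bidix_entry(hbs, eng, [pos], [pos]))
```

The generated lines would still have to be checked by hand and pasted into the right section of the .dix file; the script only saves the typing.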


Work to do

Right now

- continue working on the coding challenge over the weekend

Before the coding period:

- meet the organization, study Apertium
- set exact goals according to needs
- study ways and resources which could automate significant portions of the task

The coding period:

As I learned about GSoC less than three days ago, I have not been able to get acquainted with Apertium well enough to provide clear goals. I am, however, confident about being able to perform well on this project. Even though an entire day was spent on installation (courtesy of my archaic computer), I managed to grasp the basic principles of Apertium and expand the .dix files to (slightly) improve the translation I was given. Assuming I will be at least five times as productive once I am properly acquainted, hundreds of entries per working day should not be a problem.

Skills and qualifications

I am a 3rd-year (BA) Linguistics and English Language and Literature major at the Faculty of Humanities and Social Sciences in Zagreb, Croatia. This gives me a good knowledge of linguistics, as well as a fair knowledge of computational linguistics, as I have taken courses related to this field during my studies (such as Algebraic Linguistics and Constructed Languages, where we covered topics such as regular grammars, finite automata, transducers, regular expressions, local grammars, context-sensitive and context-free grammars, as well as working with Intex).

Croatian is my native language, and I am quite fluent in English (successfully finishing the 1st year of English Language at my faculty ensures C2-level competence according to CEFR). This semester I am also taking a course focused exclusively on translating between English and Croatian in both directions (which we have also done before in other courses); I believe this makes me extremely well suited for working on this particular language pair.

I think this project would suit me well because it doesn't require specific programming knowledge; I am much better at linguistics than at programming (I am a dabbling Python programmer, but it has already helped me automate some of the legwork of the coding challenge).

Non-GSoC plans

Exams at my faculty are scheduled to take place from June 16th to July 7th. Depending on the particular schedules of my courses, I will have to spend less than 30 hours per week on this project for perhaps two weeks. I plan to maintain a minimum of 20 hours/week during that period while keeping the overall average at 30 hours/week or more.