User:Spiegelian

From Apertium
Revision as of 15:28, 16 May 2014 by Spiegelian (talk | contribs)

GSoC application: apertium hbs-eng, adopting a language pair

Barbara Dujmić

bdujmic@ffzg.hr

spigelian on IRC

Why is it you are interested in machine translation?

One of my majors is Linguistics; the other is English Language and Literature. Machine translation gives quick and easy access to a great number of better- and lesser-known languages that I have no formal knowledge of, which appeals to me as a linguist. Beyond that, I have always been interested in programming and have been intrigued by the more computer-oriented (and logical) aspects of contemporary linguistics. This led me to take courses related to computational linguistics so that I could possibly specialize in this field for my MA. I also have a more practical reason, related to my other major: translation is another possibility for my MA and future work, and machine translation -- when done properly and coupled with considerable empirical and theoretical knowledge of translating between languages -- can be a great aid, providing quick reference and speeding up the process, which improves overall productivity.

Why is it that you are interested in the Apertium project?

The goals of Apertium are well aligned with my interests. I will be able to contribute my knowledge of linguistics and to improve my programming skills by scripting various tasks.

Which of the published tasks are you interested in?

Adopting the Serbo-Croatian<->English language pair

Why should Google and Apertium sponsor it?

English and Serbo-Croatian aren't closely related, which has posed difficulties for machine translation (MT) between the two. Since Apertium's groundwork for hbs-eng has already been laid, this is a great opportunity to build on it and take a significant step towards an efficient MT system.

How and whom will it benefit in society?

Besides benefiting speakers of one of the two languages who are learning the other, I would introduce my professors and colleagues to this project. I believe that the project would make a good addition to our linguistics courses -- having to think in terms of algorithms forces one to better grasp the matter. In particular, it would benefit those students who wish to delve deeper into the subject of MT.

What do you plan to do?

While the hbs-eng.eng.dix and hbs-eng.hbs.dix files contain a lot of data, the bilingual dictionary hbs-eng.hbs-eng.dix contains only a handful of pairs. I would primarily work on expanding the bilingual dictionary, dealing with other tasks as they appear.
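
One way to find candidates for expanding the bilingual dictionary is to list the lemmas that a monolingual dictionary has but the bidix lacks. The following is only a minimal sketch under stated assumptions: the two XML strings are tiny made-up samples in the Apertium .dix format (not real project data), the function names are hypothetical, and identity entries or multiword groups in a real bidix would need extra handling.

```python
import xml.etree.ElementTree as ET

# Made-up sample data in the .dix XML format; real files are far larger.
HBS_MONODIX = """
<dictionary>
  <section id="main" type="standard">
    <e lm="kuća"><i>kuć</i><par n="kuć/a__n"/></e>
    <e lm="mačka"><i>mačk</i><par n="mačk/a__n"/></e>
  </section>
</dictionary>
"""

HBS_ENG_BIDIX = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>kuća<s n="n"/><s n="f"/></l><r>house<s n="n"/></r></p></e>
  </section>
</dictionary>
"""


def monodix_lemmas(xml_text):
    """Collect the lm attribute of every entry inside a <section>."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iterfind("section/e") if e.get("lm")}


def bidix_left_lemmas(xml_text):
    """Collect the lemma text on the left (<l>) side of every bidix pair."""
    root = ET.fromstring(xml_text)
    lemmas = set()
    for left in root.iterfind("section/e/p/l"):
        # The lemma precedes the first <s> tag; <b/> stands for a space.
        parts = [left.text or ""]
        for child in left:
            if child.tag == "b":
                parts.append(" " + (child.tail or ""))
        lemmas.add("".join(parts).strip())
    return lemmas


# Lemmas known to the monodix but not yet paired in the bidix.
missing = sorted(monodix_lemmas(HBS_MONODIX) - bidix_left_lemmas(HBS_ENG_BIDIX))
print(missing)
```

With real files, the same functions could be fed the contents of hbs-eng.hbs.dix and hbs-eng.hbs-eng.dix read from disk; the resulting list would still need manual review before any pairs are added.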

Work to do

Before the coding period:

  • Meet the organization, study Apertium
  • Set exact goals according to needs
  • Study ways and resources which could automate significant portions of the task
  • Go through the documentation to get more familiar with the system and to see what it is capable of
  • Gather material such as grammars, dictionaries and corpora
  • Go through the entries in the bidix and fix them
  • Begin adding words from the bidix to the monodixes

The coding period:

Evening up (~18k entries in hbs.dix, ~50k in eng.dix)

  Week 1  19.5.-25.5.	- even up noun entries in the dixes (~7k hbs, ~50k eng)
  Week 2  26.5.-1.6.	- continue evening up nouns
  Week 3  2.6.-8.6.	- continue evening up nouns
                        - testvoc nouns
  Week 4  9.6.-15.6.	- even up verbs (~2k hbs, ~1800 eng)
                        - even up adjectives and adverbs (adj: ~2k hbs, ~3k eng; adv: ~500 hbs, ~1400 eng)
                        - work on transfer rules
  Week 5  16.6.-22.6.	- continue evening up verbs, adjectives and adverbs
                        - testvoc
                        - work on transfer rules
  Week 6  23.6.-29.6.	- wrap up work on nouns, verbs, adjectives, adverbs
                        - 23.-27. mid-term evaluation
                        - deliverable 1: evened-up nouns, verbs, adjectives and adverbs in the dixes
  Week 7  30.6.-6.7.	- even up remaining categories
                        - clean testvoc for all categories
                        - work on transfer rules
                        - deliverable 2: evened-up dixes, clean testvoc
  Week 8  7.7.-13.7.	- work on disambiguation and transfer rules
  Week 9  14.7.-20.7.	- work on disambiguation and transfer rules
  Week 10 21.7.-27.7.	- work on disambiguation and transfer rules
                        - begin testing on corpora
  Week 11 28.7.-3.8.	- work on disambiguation and transfer rules
  Week 12 4.8.-10.8.	- deliverable 3: language pair ready for or close to trunk
                        - documentation
  11.8.-17.8.	- suggested pencils down
  18.8.		- firm pencils down
  22.8.		- final evaluation deadline
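
Per-category counts like the estimates in the schedule above could be tracked with a short script. This is a rough sketch, assuming that each monodix entry references a paradigm whose name ends in `__` followed by the category (for example `house__n`), which is a common naming convention rather than a guarantee; the sample XML and the function name are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up sample monodix; real dictionaries also contain <pardefs>, <sdefs>, etc.
ENG_MONODIX = """
<dictionary>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
    <e lm="walk"><i>walk</i><par n="accept__vblex"/></e>
  </section>
</dictionary>
"""


def count_by_category(xml_text):
    """Count section entries by the suffix of their paradigm name."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for entry in root.iterfind("section/e"):
        par = entry.find("par")
        name = par.get("n", "") if par is not None else ""
        # Assumption: the part after the last "__" names the category.
        category = name.rsplit("__", 1)[-1] if "__" in name else "(unknown)"
        counts[category] += 1
    return counts


counts = count_by_category(ENG_MONODIX)
for category, total in sorted(counts.items()):
    print(category, total)
```

Running this on both monodixes at the end of each week would give a simple measure of how far the categories have been evened up.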

Skills and qualifications

I am a 3rd-year (BA) Linguistics and English Language and Literature major at the Faculty of Humanities and Social Sciences in Zagreb, Croatia. This gives me a good knowledge of linguistics, as well as a fair knowledge of computational linguistics, as I have taken courses related to this field during my studies (such as Algebraic Linguistics and Constructed Languages, where we covered topics such as regular grammars, finite automata, transducers, regular expressions, local grammars, context-sensitive and context-free grammars, as well as working with Intex).

Croatian is my native language, and I am quite fluent in English (having successfully completed the 1st year of English Language at my faculty, which ensures C2-level competence according to CEFR). This semester I am also taking a course focused exclusively on translation between English and Croatian in both directions (which we have also practised in other courses); I believe this makes me very well suited for working on this particular language pair.

I think this project would suit me well because it doesn't require specialized programming knowledge; I am much better at linguistics than at programming. I only dabble in Python, but it has already helped me automate some of the legwork of the coding challenge.

Non-GSoC plans

Exams at my faculty are scheduled to take place from June 16th to July 7th. Depending on the particular schedules of my courses, I may have to spend less than 30 hours per week on this project for about two weeks. I plan to maintain a minimum of 20 hours/week during that period while keeping the overall average at 30 hours/week or more.