
GSoC application: apertium sh-mk, implementation of a new language pair

Hrvoje Peradin,

krvoje on IRC: #apertium

Why is it you are interested in machine translation?

It's a perfect combination of computer science and linguistics. I am fascinated by languages, both natural and artificial. With MT, it fascinates me to see a natural message transferred across a language barrier purely by a process of computation.

Why is it that you are interested in the Apertium project?

Apertium was simply my best choice on GSoC. Since I am aiming for my graduate thesis to be about natural language processing, and my university currently doesn't offer a course on that particular subject, this is a unique chance to work seriously and first-hand with the tools and methods of that area. Also, I would be happy to contribute to the open source community and, at the same time, to slightly improve the status of my mother tongue in the world of machine translation.

Which of the published tasks are you interested in?

Contributing to an orphaned language pair. I intend to target apertium-sh-mk.

Why should Google and Apertium sponsor it?

The language I intend to target encompasses three standard languages, which are altogether spoken by over 15 million people. The morphological dictionary developed in this project will be a solid foundation for future work on expanding the vocabulary, and for other language pairings.

How and whom it will benefit in society?

Hopefully, it will help improve machine translation of the Bosnian, Croatian and Serbian standard languages, and in the future help build new translation systems. For instance, when finished, it can be easily paired with Slovene, since the grammars are quite close.

What do you plan to do?

I plan to reimplement the sh half of the apertium-sh-mk language pair, the bilingual dictionary and transfer rules towards Macedonian. There is some previous work done on the sh part, including some handy methods for handling the difference between the three standards. However, some linguistic paradigms are missing, and the documentation on the entire implementation is quite scarce. To be more productive, I will reimplement the entire dictionary from scratch, using the old code as a rough guideline, and recycling some parts of it.

The Macedonian language morphological dictionary is already efficiently implemented, and I plan to rely on it while implementing the transfer rules. As some of Macedonian syntax makes it difficult to implement translation rules for the opposite direction, and such a task would exceed the time frame I have at my disposal, I will implement only rules for sh->mk.

As a source of parallel corpora I intend to use . It contains news in all four languages involved. I also intend to investigate some other potential sources.

I will also be using the portals (it contains grammatical descriptions of 116,516 lemmata and will be an invaluable resource), the site, grammars of the Croatian language (Silić/Pranjković, also Barić et al.), and some grammar of the Macedonian language (some are available online, like Macedonian by Victor Friedman, and there are also a few high-quality local publications). There are also some additional potentially useful online resources on Croatian which I intend to investigate.

I will try my best to cover a general lexicon and grammar of all three standards. However, since I am effectively only passively proficient in Bosnian and Serbian discourse (the differences can be quite subtle), I will focus mainly on the Croatian module. I expect and hope for additional future work to be added in the remaining two modules.

Luckily, most of the differences are predictable, like the reflex of the short and long yat (in Ekavian it is almost always e, while in Ijekavian it gives je/ije respectively, but sometimes also e or i). However, nouns can also have a different gender (like večer f vs. veče n), and there are differences in orthography (for instance, future I "gledat ću" vs. "gledaću"), etc. There are already some general solutions in the existing work on the language, and I will try to build upon them.
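
To make the future I example concrete, here is a minimal Python sketch of how such a predictable orthographic difference could be generated from a single infinitive. The function name `future_i` and the `fused` flag are hypothetical and are not part of Apertium or of the existing sh-mk code. The sketch assumes a regular infinitive in -ti and deliberately refuses -sti and -ći infinitives, which need additional rules.

```python
# Hypothetical helper for illustration only; not part of Apertium.
# It shows one predictable orthographic difference: future I written
# as two words ("gledat ću") or as one fused word ("gledaću").

AUX_CLITICS = ("ću", "ćeš", "će", "ćemo", "ćete")


def future_i(infinitive: str, clitic: str, fused: bool) -> str:
    """Build a future I form from an infinitive in -ti and an auxiliary clitic.

    fused=False gives the separated spelling, fused=True the fused one.
    """
    if clitic not in AUX_CLITICS:
        raise ValueError(f"unknown auxiliary clitic: {clitic!r}")
    # Assumption: only plain -ti infinitives are handled here; -sti and -ći
    # infinitives behave differently and are outside this sketch.
    if not infinitive.endswith("ti") or infinitive.endswith("sti"):
        raise ValueError(f"unsupported infinitive: {infinitive!r}")
    if fused:
        # Drop the whole -ti ending and attach the clitic directly.
        return infinitive[:-2] + clitic
    # Drop only the final -i and keep the clitic as a separate word.
    return infinitive[:-1] + " " + clitic


print(future_i("gledati", "ću", fused=False))  # gledat ću
print(future_i("gledati", "ću", fused=True))   # gledaću
```

In an actual Apertium dictionary this kind of variation would more likely be encoded in paradigms than in a script; the sketch only illustrates that the two spellings can be derived by rule.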

Work already done

Community bonding period

- installation of the Apertium environment, and reading of the materials

- a thorough study of previous work on the sh-mk pair, mostly by going through code, and experimenting with apertium-viewer and command line

- compiled a rudimentary checklist of the most important books and websites to be used as resources

- preparing a preliminary morphological dictionary to start writing from scratch (expanding on the tutorial, and experimenting)

Work to do

Before the coding period:

- familiarize myself some more with Apertium's tools, with a more thorough study of the specification

- do an update on the New Language pair HOWTO

- experiment with some other MT tools, to see what can potentially be used

- think out scripting or programming solutions which could speed up my work

- search for some more resources online to aid me

The coding period:

- Weeks 1-3: Implementation of the morphology paradigms, along with some lemmas they cover. As I add the lemmas, I will also create entries for them in the bidix.

- Deliverable #1 : A complete morphology, with some differences in local standard languages built in.

- Weeks 4-7: Continue with the input of lemmata, in parallel adding them to the bidix. Writing the transfer rules, and some preliminary design of disambiguation rules.

- Week 8: Continue on the disambiguation rules. Part-of-speech tagger training using manually tagged corpora, or some other method if available.

- Deliverable #2 : Transfer rules + POS tagger completed

- Weeks 9-12: Testing with testvoc, and other types of proofing. Brushing up parts if necessary. Writing the documentation.

- Deliverable #3 : morphology + bilingual dictionary + POS tagger + transfer rules + vocabulary + documentation

Non-GSoC activities

- Classes till the 1st of July, but available 30 h per week

- Final exams in June, they end 1st of July

- GF Summer school 15-26 August


I am an undergraduate student of Computer Science at the Faculty of Science, University of Zagreb.

During my courses I have worked with C/C++, C#, Java, JavaScript, PHP + CSS + HTML, XML, SQL... I also have a basic knowledge of Perl and Ruby, and recently I have started learning some functional programming, particularly Haskell and GF.

Regarding the technologies used in machine translation, I have taken courses covering finite-state machines and context-free grammars (implementation of a parser using yacc+flex). Right now I'm attending a course on machine learning, which I believe would be a great complement to working with Apertium. Currently I'm also constructing a grammar of the Croatian language in GF, as part of the assignment for the GF summer school.

I have never been involved in any open source projects. However, I would in any case like to do some work on Apertium, even if I am not selected for GSoC.