From Apertium
Revision as of 11:30, 7 April 2012 by Krvoje (talk | contribs)

GSoC application: apertium sh-sl, implementation of a new language pair

Hrvoje Peradin,

krvoje on IRC: #apertium

Why is it you are interested in machine translation?

It's a perfect combination of computer science and linguistics. I am fascinated by languages, both natural and artificial. With MT, it fascinates me to see a natural-language message transferred across a language barrier purely by a process of computation.

Why is it that you are interested in the Apertium project?

I worked on a language pair in last year's GSoC, and it gave me great insight into rule-based NLP. It gave me an invaluable chance to do real-life work on an immensely interesting topic, and to create an open-source resource. It was a great experience that taught me a lot about software development and NLP, and it also gave me the topic for my master's thesis. So I could say Apertium has a special place in my heart, and I would love to continue working on it.

Which of the published tasks are you interested in?

Starting a new language pair, Slovene - Serbo-Croatian

Why should Google and Apertium sponsor it?

The languages I intend to target are quite close grammatically and syntactically, which makes them quite appropriate for constructing a rule-based MT pair. A result that is useful in practice can be obtained with relatively uncomplicated rules. Moreover, the CG disambiguation rules can be easily expanded and adapted for use on both languages, so the result of this work will be usable very quickly.

How and whom will it benefit in society?

In spite of the high degree of mutual intelligibility between Slovene and the standard languages of Bosnia, Serbia and Croatia, there is a small language barrier that a correct MT system would help overcome. Work on this pair will also improve the Serbo-Croatian morphological analyser and disambiguation rules, providing a great free resource for three languages, altogether spoken by over 15 million people.

What do you plan to do?

I plan to increase coverage of the Serbo-Croatian morphological analyser to match coverage of the Slovenian analyser, write a bidix for the language pair and transfer rules, as well as improve the current Constraint Grammar disambiguation rules for Serbo-Croatian, and, if time permits, do minor adaptations to apply the rules for disambiguating Slovene.

The morphological analyser for Slovene was written for last year's GSoC by Aleš Horvat, and it currently contains around 20700 entries. The analyser for Serbo-Croatian was authored by me, also for last year's GSoC, and it contains around 12200 entries. My first (and most time-consuming) task will be to close this gap and to compose a bidix. Of course, the process will be mostly unidirectional (i.e. on the sh side), because I have sufficient proficiency only in Serbo-Croatian. With the dictionaries evened up and coverage improved, I will proceed to writing disambiguation and transfer rules.

For resources on Serbo-Croatian I will be using Hrvatski jezični portal [1], a great source on the lexicon of Croatian. As for parallel corpora, there is an aligned Multext-East corpus of Orwell's 1984 [2]; however, I intend to use online newspaper articles as the main source for my testing. While working on the sh-mk language pair I obtained a Wikipedia corpus from the Serbo-Croatian, Bosnian, Croatian and Serbian Wikipedias. If needed, I'll use the same method to extract a corpus for Slovene. For the Slovene literary language there is an abundance of resources [3], in addition to which I will obtain a printed grammar of the language. As a bilingual dictionary I will use EUdict [4].

For the morphological disambiguation rules I will build on my previous year's work on Constraint Grammar for Serbo-Croatian. There are slight differences in morphological ambiguities between Slovene and Serbo-Croatian (notably the structure of declension, which is more complicated for plural and dual in Slovene); however, I believe that with minor adaptations it will also be possible to apply the CG rules efficiently to Slovene.

The transfer rules will be fairly straightforward to write. The main differences between the grammars are in the dual number (which does not exist in Serbo-Croatian, save for rudimentary remnants in number phrases) and the future tense (the Slovene future tense is a cognate of the Serbo-Croatian future II, while Serbo-Croatian forms its ordinary future with a different compound construction). Beyond these two points, most of the transfer rules will deal with clitic and word reordering.
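To make the dual-number point concrete, here is a minimal Python sketch of the kind of tag rewriting such a transfer step would perform. It is only an illustration, not the actual Apertium transfer mechanism, which is written as XML rules. The function name `map_number_tags`, the `NUMBER_MAP` table, the simplified lexical-unit pattern and the sample tags (`du`, `pl`, `vbser` and so on) are all assumptions chosen for the example, and mapping every dual to plural is a simplification.

```python
import re

# Hypothetical mapping for the sl -> sh direction: Serbo-Croatian has no
# productive dual, so this sketch simply renders dual as plural.
NUMBER_MAP = {"du": "pl"}

# Simplified pattern for one lexical unit: ^lemma<tag><tag>...$
LU_PATTERN = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def map_number_tags(stream: str) -> str:
    """Rewrite number tags in every lexical unit of an Apertium-style stream."""

    def rewrite(match: re.Match) -> str:
        lemma, tags = match.group(1), match.group(2)
        names = re.findall(r"<([^>]+)>", tags)
        # Tags not listed in NUMBER_MAP pass through unchanged.
        mapped = "".join(f"<{NUMBER_MAP.get(name, name)}>" for name in names)
        return f"^{lemma}{mapped}$"

    return LU_PATTERN.sub(rewrite, stream)


print(map_number_tags("^mesto<n><nt><du><nom>$ ^biti<vbser><pres><p3><du>$"))
```

A real rule set would also have to handle agreement and the number phrases mentioned above, which a flat tag substitution like this one ignores.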

Work already done

Community bonding period

- written a stream processor for the coding challenge

- a study on the work done on apertium-sl-es, extracting statistics and reading the language-pair report

- done a rudimentary checklist of the most important books and websites to be used as resources

Work to do

Before the coding period:

- obtain a printed grammar of Slovene

- make a skeleton language pair apertium-sh-sl in the incubator

- study the differences in tagset conventions to minimize complications later

The coding period:

- Week 1: Even up the coverage of closed word categories, define a tagset mapping to be added to transfer later. Start with open word categories: even up the coverage of verbs (SL: 3500 / SH: 1387)

- Week 2-3: Even up the coverage of nouns (SL: 6681 / SH: 2333), and proper nouns (SL: 530 / SH: 1776)

- Week 4-5: Even up the coverage of adjectives (SL: 4440 / SH: 2898)

- Week 6-7: Even up the coverage of adverbs (SL: 2186 / SH: 1659)

(The words will be added to the bidix in parallel to the work on the morphological lexicon)

- Deliverable #1 : Evened up morphological lexicons, a bidix, and tagset mapping defined for transfer.

- Week 8-9: Write additional disambiguation rules, and adapt some for the Slovene side

- Week 10-11: Write transfer rules for word reordering and verb tenses

- Deliverable #2 : Transfer and disambiguation rules

- Week 12: Final cleanup and testing on a corpus, final touch on the documentation

- Deliverable #3 : A complete sh - sl bidirectional translation system, with disambiguation and documentation
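The per-category entry counts quoted in the weekly plan above can be turned into a rough estimate of how much lexicon work each block of weeks involves. The short Python sketch below only restates the numbers from the plan. The names `COUNTS` and `coverage_gaps` are introduced here for illustration and do not correspond to any Apertium tool.

```python
# Entry counts quoted in the plan, as (Slovene, Serbo-Croatian) pairs.
COUNTS = {
    "verbs": (3500, 1387),
    "nouns": (6681, 2333),
    "proper nouns": (530, 1776),
    "adjectives": (4440, 2898),
    "adverbs": (2186, 1659),
}


def coverage_gaps(counts):
    """Return, per category, which side is behind and by how many entries."""
    gaps = {}
    for category, (sl, sh) in counts.items():
        behind = "sh" if sh < sl else "sl"
        gaps[category] = (behind, abs(sl - sh))
    return gaps


for category, (behind, missing) in coverage_gaps(COUNTS).items():
    print(f"{category}: {behind} is behind by {missing} entries")
```

Nouns account for the largest gap on the Serbo-Croatian side, while proper nouns are the one category where the Slovene lexicon is the smaller one.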

Non-GSoC activities

Lectures for 12 hours a week, no other fixed commitments at present.


I am a graduate student of Computer Science and Mathematics at the Faculty of Science, University of Zagreb.

During my courses I have worked with C/C++, Python, C#, Java, JavaScript, PHP + CSS + HTML, XML, SQL, Coq... Besides Coq, I also have a basic knowledge of functional programming through Haskell and the GF formalism. Currently I am writing my master's thesis on disambiguation for the Croatian language with Constraint Grammar.

I worked on the language pair apertium-sh-mk for GSoC 2011, and was a mentor in Google Code-In 2011 for several tasks involving that and similar language pairs.

Regarding the technologies used in machine translation, I have taken courses on finite-state machines, context-free grammars (including the implementation of a parser using yacc + flex), and machine learning.