User:Krvoje/Application2012

GSoC application: Rule-based finite-state disambiguation

Hrvoje Peradin

hperadin@gmail.com

krvoje on IRC: #apertium

Why is it you are interested in machine translation?

It's a perfect combination of Computer Science and Linguistics. I am fascinated by languages, both natural and artificial. With MT, it fascinates me to see a natural-language message transferred across a language barrier purely by computation. While the results are rarely perfect, MT takes down communication barriers and opens up new opportunities for learning and communication.

Why is it that you are interested in the Apertium project?

I worked on a language pair in last year's GSoC, and it gave me great insight into rule-based NLP. It gave me an invaluable chance to do real-life work on an immensely interesting topic, and to create an open-source resource. It was a great experience that taught me a lot about software development and NLP - and it also gave me the theme for my master's thesis. So, I could say Apertium has a special place in my heart, and I would love to continue working on it.

Which of the published tasks are you interested in?

Writing the module for rule-based finite-state disambiguation.

Why should Google and Apertium sponsor it?

The module is intended to supplement the current bigram tagger and Constraint Grammar by implementing constraint-based disambiguation in a finite-state manner. Since the most common disambiguation rules can be expressed in a finite-state way, this should greatly improve the speed of disambiguation, which will be beneficial when working with large texts.

How and whom it will benefit in society?

It will provide a fast tool for rule-based disambiguation, which will enable faster processing of larger corpora and potentially help improve translation quality in any language pair in Apertium.

What do you plan to do?

My plan is to design an XML formalism for writing disambiguation rules, a validator for it, a compiler for representing the rules as a finite-state transducer that integrates with the lttoolbox API, and a processor which applies the rules to an Apertium input stream.

The XML formalism will in effect contain a subset of Constraint Grammar rules, as much as can be expressed with finite-state transducers. I am currently writing my master's thesis on disambiguation with Constraint Grammar and I am quite familiar with the principles of morphological disambiguation, so I will base the design of this formalism on my experience with Constraint Grammar.

The compiler and the processor will be written in C++, based on the designs of Apertium's transfer module, and the lexical selection module, and will use the lttoolbox API.
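To illustrate the kind of input the processor would read, here is a minimal sketch of splitting Apertium stream text into ambiguous lexical units of the form `^surface/analysis1/analysis2$`. The names `LexicalUnit`, `parseLexicalUnit` and `parseStream` are hypothetical, not part of lttoolbox, and the sketch assumes the input contains no backslash-escaped `^`, `/` or `$` characters.

```cpp
#include <string>
#include <vector>

// Hypothetical container for one lexical unit of the stream:
// ^surface/analysis1/analysis2$
struct LexicalUnit {
    std::string surface;                 // the word as it appears in the text
    std::vector<std::string> analyses;   // candidate readings, e.g. "kuća<n><f><sg><dat>"
};

// Split the text between '^' and '$' on '/'.
// Assumption: no backslash escapes occur inside the unit.
LexicalUnit parseLexicalUnit(const std::string& body) {
    LexicalUnit lu;
    std::size_t start = 0;
    bool first = true;
    while (true) {
        std::size_t slash = body.find('/', start);
        std::string part = (slash == std::string::npos)
                               ? body.substr(start)
                               : body.substr(start, slash - start);
        if (first) {
            lu.surface = part;           // the first field is the surface form
            first = false;
        } else {
            lu.analyses.push_back(part); // every later field is one analysis
        }
        if (slash == std::string::npos) break;
        start = slash + 1;
    }
    return lu;
}

// Collect every ^...$ unit found in a piece of stream text;
// anything outside the units (blanks, punctuation) is skipped.
std::vector<LexicalUnit> parseStream(const std::string& text) {
    std::vector<LexicalUnit> units;
    std::size_t pos = 0;
    while ((pos = text.find('^', pos)) != std::string::npos) {
        std::size_t end = text.find('$', pos);
        if (end == std::string::npos) break;  // unterminated unit: stop
        units.push_back(parseLexicalUnit(text.substr(pos + 1, end - pos - 1)));
        pos = end + 1;
    }
    return units;
}
```

A unit with more than one entry in `analyses` is ambiguous, and those are the units that disambiguation rules would act on.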

Work already done

Community bonding period

- written a program for the coding challenge

- started familiarising myself with lttoolbox, written a small program that composes strings and regexes into an FST

Work to do

Before the coding period:

- explore the API

- write a simple prototype, that implements a simple hardcoded rule (e.g. preposition-based case disambiguation for Serbo-Croatian)
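A hardcoded rule of this kind could look roughly like the sketch below. It is an illustration only: the `Token` type and the function names are hypothetical, the tag names `<pr>` and `<gen>` are assumed to follow the usual Apertium conventions, and the single rule shown (the preposition "iz" selects genitive readings of the following word) stands in for a fuller preposition-to-case table.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory token: a surface form plus its candidate analyses.
struct Token {
    std::string surface;
    std::vector<std::string> analyses;
};

// True if an analysis string carries the given tag, e.g. "<gen>".
bool hasTag(const std::string& analysis, const std::string& tag) {
    return analysis.find(tag) != std::string::npos;
}

// Hardcoded rule: if a token is unambiguously the preposition "iz",
// keep only the genitive readings of the token that follows it.
// If the following token has no genitive reading, it is left untouched,
// so the rule never removes the last remaining reading.
void applyIzGenitiveRule(std::vector<Token>& sentence) {
    for (std::size_t i = 0; i + 1 < sentence.size(); ++i) {
        const Token& prev = sentence[i];
        bool isIz = prev.analyses.size() == 1 &&
                    prev.analyses[0].compare(0, 3, "iz<") == 0 &&  // lemma is "iz"
                    hasTag(prev.analyses[0], "<pr>");              // tagged as preposition
        if (!isIz) continue;

        Token& next = sentence[i + 1];
        std::vector<std::string> kept;
        for (const std::string& analysis : next.analyses) {
            if (hasTag(analysis, "<gen>")) kept.push_back(analysis);
        }
        if (!kept.empty()) next.analyses = kept;
    }
}
```

For example, given "iz" followed by "kuće" with the readings genitive singular, nominative plural and accusative plural, only the genitive singular reading would remain after the rule is applied.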

The coding period:

- Week 1: A thorough design of the XML formalism, taking Constraint Grammar as a basis, and determining how much of it can be expressed as finite-state rules (perhaps equivalent to CG-2). I will also look into other finite-state NLP systems like IceNLP, LanguageTool, and Apertium's apertium-lex-tools and transfer modules. I will write simple programs that test examples of hard-coded rules on the input stream.

The system's syntax will be based on lex-tools...

- Deliverable #1 : A complete XML formalism for expressing finite-state disambiguation rules, along with preliminary documentation.

- Week 2-8: Writing a compiler and a stream processor, and integrating them with lttoolbox.

- Deliverable #2 : A compiler and processor for the complete formalism

- Week 9-12: Testing and polishing the system, writing the documentation, along with use-case examples on various languages.

- Deliverable #3 : The complete disambiguation system, with a compiler, a processor, and the documentation.

Non-GSoC activities

TODO:

Bio

I am a graduate student of Computer Science and Mathematics at the Faculty of Science, University of Zagreb.

During my courses I have worked with C/C++, Python, C#, Java, JavaScript, PHP + CSS + HTML, XML, SQL, Coq... Besides Coq, I also have a basic knowledge of functional programming through Haskell and the GF formalism. Currently I am writing my master's thesis on disambiguation for the Croatian language with Constraint Grammar.

I worked on the language pair apertium-sh-mk for GSoC 2011, and was a mentor in Google Code-In 2011 for several tasks involving that and similar language pairs.

Regarding the technologies used in machine translation, I have taken courses in finite-state machines, context-free grammars (implementation of a parser using yacc+flex), and machine learning.