User:David Nemeskey/GSOC proposal 2013

Note: this proposal has been accepted. The progress of the project is documented here.

Contact

Name: Dávid Márk Nemeskey
E-mail address: nemeskey.david@sztaki.mta.hu
Other information that may be useful to contact you: Savior on #apertium

Why is it you are interested in machine translation?

Machine translation is a very complex problem that depends on almost all fields of natural language processing. As such, it is a very "enabling" field that can benefit from improvements in any component of the NLP pipeline. It is also a task with a very practical and immediately perceivable result: as opposed to e.g. POS tagging, which is mostly of interest to linguists, machine translation can benefit everyone.

Why is it that you are interested in the Apertium project?

In our group, we usually use (and develop) statistical tools based on maximum entropy models and HMMs. While statistical modeling is more robust, rule-based formalisms have their own advantages: they give the designer more control over the system, and they are usually much faster than their ML-based counterparts. I am very interested in delving into this "complementary" world of NLP. I am especially partial to finite-state methods, and I would like to learn more in that field.

Which of the published tasks are you interested in?

Rule-based finite-state disambiguation[1][2]

What do you plan to do?

Reasons why Google and Apertium should sponsor it

Apertium supports two types of part-of-speech (POS) taggers: one based on bigrams/trigrams, and another based on constraint grammar (CG). The reason for this is twofold: first, for some languages, n-gram-based disambiguation does not work well; this is where the CG formalism comes in. Second, a well-written CG can achieve higher accuracy than n-gram models, especially for large tagsets, which are more affected by the sparseness problem[3]. However, the library currently used to parse and execute CG rules, VISL CG-3, is an order of magnitude slower than the other method[4]. This project aims to alleviate this performance problem by converting CG rules to finite-state transducers (FSTs). According to [3], an FST-based implementation offers three benefits:

  1. worst-case parsing time in the number of words is quadratic, not cubic;
  2. the number of rules can be reduced by joining similar FSTs;
  3. new debugging facilities, such as detection of vacuous rules.

Representing CG rules as FSTs, while not a novel idea[5], has only been seriously considered recently[3][6]; as such, this field could benefit greatly from further research.
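
For illustration, here are two made-up rules in the raw VISL CG-3 notation the compiler takes as input (invented examples, not taken from an actual Apertium grammar); each such rule would be compiled into one transducer:

  # Discard verb readings if the preceding cohort is unambiguously
  # a determiner (the C marks a "careful" context):
  REMOVE (vblex) IF (-1C (det)) ;

  # Select the noun reading directly after a determiner:
  SELECT (n) IF (-1 (det)) ;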

How and who it will benefit in society

Faster disambiguation means better overall performance. For the languages that already use CG, it enables the translation of longer texts; for languages that do not, it allows a transition to a potentially more expressive formalism with no discernible drop in parsing speed. Furthermore, n-gram models require large, manually annotated corpora to train on. For many (e.g. minority) languages, the cost of creating such a resource might be out of reach. In this case, developing a CG grammar and using it in Apertium might be a faster and more accessible alternative.

Detailed work plan

The task can be broken down into the following parts:

  1. A CG -> binary FST compiler. I plan to use Mans Hulden's proof-of-concept code[7], and extend it as required.
  2. The disambiguator program: an executable that links libfoma and applies the rule FSTs to an Apertium stream.
  3. The implementation will be tested on several language pairs to ensure that it produces the same result as VISL CG-3. The first of these will be Welsh--English; as it is rather basic, Breton--French and Serbo-Croatian--Slovenian will follow later, so that the more advanced CG features can be tested as well.
  4. An XML language for the description of the CG grammar (the current compiler uses a raw CG description format, which is what I plan to use in steps 1-3).
  5. Performance testing and tweaking of the compiler to find out how to reduce the number of rules, and speed up parsing. Basically, we have to determine which rule-FSTs can be combined into one without excessive growth in states and transitions.
  6. It might be possible to decrease the number of rules that we try to apply in an iteration.
    • For example, rules with careful contexts need not be tried if no cohort could pass their test. The outline of the implementation is as follows (here G denotes the number of rules, n the number of cohorts in the window, and k the maximum number of readings per cohort):
      • Create a {tag/set: rule} map (as well as its reverse) that pairs careful contexts with rules; an O(G) operation, but it can be built at start-up time.
      • Read the input and enable only those "careful" rules that have a chance of succeeding. This step can be completed in O(nk).
      • In each iteration, check whether more "careful" rules have become applicable according to the conditions above. I hope this can be completed in O(k) time, but it might only be possible in O(nk) -- however, since one iteration is O(Gnk), even then the overhead should be negligible.
    • Another option is to filter out rules that cannot apply to the sentence (window) in question at all, e.g. because a tested tag is missing from all readings; a C++ sketch of this filter follows the list. This operation can be completed in O(G) (start-up) + O(nk) (runtime) time.
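
As a rough illustration of the second option, the tag-presence filter could look like the C++ sketch below. All type and function names here are hypothetical, invented for this proposal; the real implementation would operate on fomacg's compiled rules.

  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  // One disambiguation rule; tested_tags collects every tag the rule's
  // target and contexts test for.
  struct Rule {
      int id;
      std::set<std::string> tested_tags;
  };

  // Built once at start-up in O(G): tag -> rules that test for that tag.
  std::map<std::string, std::vector<const Rule*> > build_index(
          const std::vector<Rule>& rules) {
      std::map<std::string, std::vector<const Rule*> > index;
      for (const Rule& rule : rules)
          for (const std::string& tag : rule.tested_tags)
              index[tag].push_back(&rule);
      return index;
  }

  // A window is n cohorts, each with at most k readings (a reading is a
  // set of tags). Collecting the tags present is O(nk); a rule stays
  // enabled only if every tag it tests for occurs in the window.
  std::set<int> enabled_rules(
          const std::map<std::string, std::vector<const Rule*> >& index,
          const std::vector<std::vector<std::set<std::string> > >& window) {
      std::set<std::string> present;
      for (const auto& cohort : window)
          for (const auto& reading : cohort)
              present.insert(reading.begin(), reading.end());
      std::set<int> enabled;
      for (const auto& entry : index) {
          if (!present.count(entry.first)) continue;
          for (const Rule* rule : entry.second) {
              bool all_present = true;
              for (const std::string& t : rule->tested_tags)
                  if (!present.count(t)) { all_present = false; break; }
              if (all_present) enabled.insert(rule->id);
          }
      }
      return enabled;
  }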

Rough timeline

Community bonding period: study the fomacg implementation. Literature review (CG, grammars as FSTs): [5], references in [3]. Play around with Apertium.

Week 1: test fomacg on all CG grammars used in Apertium, and correct any errors found (currently it cannot parse the es-base.rle file distributed with it).

Week 2-3: writing the disambiguator program. Basic implementation:

  • reads the binary transducers created by fomacg
  • loads them via libfoma
  • runs the rule FSTs sequentially
  • applies the ruleset repeatedly, until no rule in the set applies (a minimal sketch of this loop follows below).
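
A minimal C++ sketch of this basic loop, assuming one binary transducer per rule produced by fomacg and a sentence that has already been encoded into the string representation the rule FSTs expect (the foma calls below are from libfoma's fomalib.h; error handling and the actual stream parsing are omitted, and fomacg's real encoding is more involved):

  #include <cstdio>
  #include <string>
  #include <vector>

  extern "C" {
  #include <fomalib.h>   // libfoma's C API
  }

  int main(int argc, char** argv) {
      // Load one compiled rule FST per command-line argument.
      std::vector<struct fsm*> rules;
      std::vector<struct apply_handle*> handles;
      for (int i = 1; i < argc; ++i) {
          struct fsm* net = fsm_read_binary_file(argv[i]);
          if (net == NULL) {
              std::fprintf(stderr, "cannot load %s\n", argv[i]);
              return 1;
          }
          rules.push_back(net);
          handles.push_back(apply_init(net));
      }

      // In the real program this string would be built from an Apertium
      // stream cohort such as
      //   ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$
      std::string sentence = "...";  // placeholder

      // Apply the rules sequentially, and the whole set repeatedly,
      // until a full pass leaves the sentence unchanged (a fixpoint).
      bool changed = true;
      while (changed) {
          changed = false;
          for (size_t i = 0; i < handles.size(); ++i) {
              char* out = apply_down(handles[i],
                                     const_cast<char*>(sentence.c_str()));
              if (out != NULL && sentence != out) {
                  sentence = out;   // the rule discarded some reading(s)
                  changed = true;
              }
          }
      }
      std::printf("%s\n", sentence.c_str());

      for (size_t i = 0; i < rules.size(); ++i) {
          apply_clear(handles[i]);
          fsm_destroy(rules[i]);
      }
      return 0;
  }

Running the rules one by one like this is exactly the naive O(Gnk) baseline; the week 6-7 improvements would then cut down the number of apply_down calls.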

Deliverable #1: the disambiguator program.

Week 4-5: integration of the program into the Apertium pipeline, and initial performance testing (the results may not be so good at this point, due to the simplicity of the implementation above).

Week 6-7: improve the disambiguator:

  • parallel execution of rules
  • come up with heuristics (remove rules from the list that are not applicable, etc.)
  • profiling

Week 8-9: improve the compiler:

  • detect and remove vacuous rules
  • come up with a method that finds rules that can be composed without excessive growth of the resulting network (a sketch follows below)
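
One possible (untested) criterion, sketched with libfoma calls: compose copies of two rule transducers and keep the combined network only if it does not grow past a size threshold. The 2x factor below is an arbitrary placeholder, not a tuned value; fsm_compose() and fsm_minimize() consume their arguments, which is why copies are composed.

  extern "C" {
  #include <fomalib.h>
  }

  // Try to merge two rule FSTs into one. The originals are destroyed
  // only once the combined network has been accepted.
  struct fsm* try_compose(struct fsm* a, struct fsm* b) {
      int before = a->statecount + b->statecount;
      struct fsm* combined =
          fsm_minimize(fsm_compose(fsm_copy(a), fsm_copy(b)));
      if (combined->statecount <= 2 * before) {   // acceptable growth
          fsm_destroy(a);
          fsm_destroy(b);
          return combined;
      }
      fsm_destroy(combined);   // too big: keep the rules separate
      return NULL;
  }

Whether the composed network also preserves the semantics of applying the two rules in sequence would of course have to be verified on the test language pairs.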

Deliverable #2: optimized compiler and disambiguator, integrated into the Apertium pipeline.

Week 10-11: designing and writing the XML format for the CG, with a tool that converts from/to raw CG files.
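
Purely as a hypothetical illustration (designing the actual element names and structure is the point of this step), a raw rule such as SELECT (n) IF (-1 (det)) ; might be encoded as:

  <rule type="select">
    <target><tag>n</tag></target>
    <context position="-1">
      <tag>det</tag>
    </context>
  </rule>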

Deliverable #3: an XML language for CG rules.

Week 12: buffer for item 6 in the list above, vacation and remaining work (esp. since "improving the compiler" might be open-ended).

I intend to write the documentation as I go.

Short self-introduction

I am a PhD student at the Informatics Doctoral School, Eötvös Loránd University, Budapest, Hungary. My fields of study are natural language processing and information retrieval, and especially the intersection of the two. I am also working at the Institute for Computer Science and Control, Hungarian Academy of Sciences in the Data Mining and Search Research Group. I have an MSc with distinction from the Budapest University of Technology and Economics in technical informatics.

I consider myself fluent in C++, Java and Python. I am a Sun Certified Programmer (SCJP) and Developer (SCJD) for the Java 5.0 Platform. I have read the Finite State Morphology book[8] and hence I am familiar with both the xfst and the lexc formalisms.

I am also interested in natural languages. Aside from English and Hungarian (my mother tongue), I speak Japanese, and I am currently trying to refresh my German.

As for open source experience, aside from filing numerous bug reports for Ubuntu and KDE, I have contributed small changes to NLTK (see http://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/hunpos.py) and gensim (https://github.com/DavidNemeskey/gensim/commits/master). I also played around with foma in the past and reported two issues. Finally, I successfully participated in GSoC 2011 with the project LUCENE-2959: Implementing State of the Art Ranking for Lucene. My changes have been merged into Lucene Core 4.0.

I will do a short internship in the first week of June. I also plan to have a one-week vacation sometime during the summer. Aside from that, I am free to work on my project.

Coding challenge

http://pastebin.com/9vRAymA0

References

  1. http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code
  2. http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Rule-based_finite-state_disambiguation
  3. Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.
  4. http://wiki.apertium.org/wiki/Apertium_and_Constraint_Grammar
  5. Tapanainen, P. 1996. The Constraint Grammar Parser CG-2. Publications 27, Department of General Linguistics, University of Helsinki.
  6. Peltonen, J. 2011. Rajoitekielioppien toteutuksesta äärellistilaisin menetelmin [On the Implementation of Constraint Grammars Using Finite State Methods]. MA Thesis, University of Helsinki.
  7. https://apertium.svn.sourceforge.net/svnroot/apertium/branches/fomacg/
  8. Beesley, Kenneth R. and Karttunen, Lauri. 2003. Finite State Morphology. CSLI Publications.