User:MitchJ/Application

GSoC'11: Rule-based finite-state disambiguation

Mitchell JEFFREY, Monash University, Melbourne, Australia (UTC+10). IRC: mitch-j on #apertium

Introduction

Machine translation facilitates international communication, and is especially useful in supporting the continued use of minority languages for which human translators are in short supply. Apertium is both an academic exercise and a practical software development project whose goal is to provide an accurate, universal automated translation engine and accompanying language-specific datasets.

My bilingual experience has taught me that each language provides an insightful, valuable and, most importantly, unique way of interpreting and conceptualising the world around us. The subtle tools of expression encapsulated within a language are extremely valuable in terms of both cultural heritage and linguistic history. A sad fact of the modern world is that less widely spoken minority languages are slowly dying out under the pressures of convenience and necessity in increasingly globalised communication. Language extinction is occurring all around the world, including in my native Australia, which once boasted almost 1000 indigenous languages.

Machine translation is one of many tools which could help slow the decline of minority languages, not only as a stand-in for human translators but also as an educational tool. Being an open source project, Apertium is in a position to support minority languages which are not commercially viable to maintain.

On a personal level, machine translation projects such as this represent a unique opportunity to combine my interests in computer science, natural language and mathematics/logic. I am interested in how humans communicate and convey meaning; investigating how machines could process human language (for instance with a translation package such as Apertium) would not only be an insightful venture in itself, but would also inform my understanding of the condition of human communication.

Synopsis

Disambiguation is an essential part of the MT process; its aim is to correctly identify the meaning and function of each word of the input text. Apertium currently uses a bigram/trigram part-of-speech tagger, in which any given word is disambiguated based entirely on the categories of the one or two words preceding it. This project seeks to implement a complementary disambiguation framework suited to broader constraints, where rules can span entire sentences rather than just adjacent words. So-called constraint grammar (CG) rules would be processed before the input text is passed on to the existing bigram/trigram apertium-tagger. In broad terms, the CG parser would be implemented as a language-independent finite state transducer (FST), which uses a language-specific constraint grammar to process text.
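To make the notion of an ambiguous word concrete, the following minimal Python sketch parses a simplified form of the lexical-unit notation used in the Apertium stream, `^surface/reading/reading$`, into cohorts with their candidate readings. It is an illustration only: it assumes no escaped characters inside a unit, the names `Reading`, `Cohort`, `parse_reading` and `parse_stream` are hypothetical, and the sample tags are chosen for the example rather than taken from any particular language pair.

```python
import re
from dataclasses import dataclass

# Simplified pattern for one lexical unit: ^surface/reading/reading...$
# Assumption: units contain no escaped characters (real streams may).
UNIT = re.compile(r"\^([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


@dataclass
class Reading:
    lemma: str
    tags: tuple


@dataclass
class Cohort:
    surface: str
    readings: list


def parse_reading(text):
    """Split 'book<n><pl>' into the lemma 'book' and the tags ('n', 'pl')."""
    lemma = text.split("<", 1)[0]
    return Reading(lemma, tuple(TAG.findall(text)))


def parse_stream(stream):
    """Collect every ^...$ unit in the stream as a Cohort."""
    cohorts = []
    for match in UNIT.finditer(stream):
        surface, *analyses = match.group(1).split("/")
        cohorts.append(Cohort(surface, [parse_reading(a) for a in analyses]))
    return cohorts


# Illustrative input: "books" may be a plural noun or a 3rd person verb form.
sample = "^the/the<det><def>$ ^books/book<n><pl>/book<vblex><pres><p3><sg>$"
cohorts = parse_stream(sample)

# A cohort with more than one reading still needs disambiguation.
ambiguous = [c.surface for c in cohorts if len(c.readings) > 1]
```

Here `ambiguous` holds the surface forms that carry more than one reading; these are the words that the CG rules and the tagger would have to resolve.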

Benefits to Apertium

Apertium currently offers CG disambiguation with vislcg3, which does not operate on an FST model and suffers from poor performance as a result. Introducing the option of efficient constraint grammar disambiguation benefits the entire Apertium community, as the parser is language-independent (although language-specific grammars will need to be written to take advantage of it). CG disambiguation would allow the addition of new language pairs which are unsuitable for the current bigram disambiguation methods. It also gives existing language pairs additional flexibility by allowing rules with a broader scope, up to an entire sentence. Ultimately a CG parser would increase the accuracy of translations.

Details

The aim of the CG parser is to take natural language input text and functionally identify (disambiguate) each word of input. CG parsing is introduced at stage four of the overall translation process:

  • Preprocessing - case conversion, sentence delimitation (handled by lttoolbox)
  • Lexicon updating - identification of unknown words
  • Morphological analysis - attaching a list of possible morphological readings to each wordform (also handled by lttoolbox)
  • Local morphological disambiguation - some readings may be discarded immediately on simple inspection. Here we introduce the CG parser, which applies constraint rules in three simultaneous phases: (1) application of disambiguation constraints; (2) assignment of clause boundaries; (3) assignment of grammatical labels such as ‘finite main verb’ or ‘subject’. Clause boundaries are identified and the rules are applied iteratively until no rule in the grammar can make any further change.
  • Final processing by the existing bigram tagger, which is guaranteed to leave only one reading.
  • Postprocessing (e.g. reintroduction of formatting) with lttoolbox
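The iterative rule application in stage four can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the proposed FST implementation: it covers only the disambiguation constraints (not clause boundaries or grammatical labels), the names `Cohort`, `RemoveRule` and `disambiguate` are hypothetical, and the two rules are invented for the example. A rule is never allowed to remove the last remaining reading of a word, so anything left ambiguous can still be passed on to the tagger.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    surface: str
    readings: list  # each reading is a set of tags, e.g. {"n"}


@dataclass
class RemoveRule:
    """Discard readings carrying `target` when the cohort at relative
    position `offset` carries `context` in every one of its readings."""
    target: str
    offset: int
    context: str

    def apply(self, sentence, i):
        j = i + self.offset
        if not 0 <= j < len(sentence):
            return False  # context position falls outside the sentence
        if not all(self.context in r for r in sentence[j].readings):
            return False  # context is absent or still ambiguous
        cohort = sentence[i]
        kept = [r for r in cohort.readings if self.target not in r]
        if not kept or len(kept) == len(cohort.readings):
            return False  # would delete the last reading, or changes nothing
        cohort.readings = kept
        return True


def disambiguate(sentence, rules):
    """Apply all rules at all positions until a full pass changes nothing.
    Returns the number of passes made, including the final unchanged one."""
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for i in range(len(sentence)):
            for rule in rules:
                if rule.apply(sentence, i):
                    changed = True
    return passes


# Toy sentence "the old man sleeps" with invented ambiguity.
sentence = [
    Cohort("the", [{"det"}]),
    Cohort("old", [{"adj"}, {"n"}]),
    Cohort("man", [{"n"}, {"vblex"}]),
    Cohort("sleeps", [{"vblex"}]),
]

# Toy rules (assumptions, not taken from a real grammar):
rules = [
    RemoveRule(target="n", offset=1, context="n"),          # no noun before a sure noun
    RemoveRule(target="vblex", offset=1, context="vblex"),  # no verb before a sure verb
]

passes = disambiguate(sentence, rules)
```

In this example the first pass resolves "man" (because "sleeps" is unambiguously a verb), which in turn lets the second pass resolve "old"; a third pass confirms that nothing further changes. This is why a single left-to-right sweep is not enough when rules depend on context to the right.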

Deliverables and Schedule

Bio