Ideas for Google Summer of Code/Rule-based finite-state disambiguation
Revision as of 15:25, 4 March 2012 by Francis Tyers (talk | contribs)
Apertium currently has only a bigram/trigram part-of-speech tagger. The objective of this task is to implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer. One option is to express the disambiguation as constraint rules written in a new XML-based file format.
Tasks
- Define an XML format for writing finite-state constraint rules.
- Write a compiler which turns these rules into a binary finite-state representation.
- Write a processor which applies these rules to an Apertium input stream.
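To make the three tasks concrete, here is a rough Python sketch of how they might fit together. Everything in it is an assumption made for illustration: the element and attribute names (rules, rule, context, action, tag, position) are hypothetical and not an agreed format, the "compiler" only produces a list of tuples, and the rules are applied by a plain interpreter rather than by a finite-state transducer.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: remove the verb reading when the previous
# word is unambiguously a determiner. Names are illustrative only.
RULES_XML = """
<rules>
  <rule action="remove" tag="vblex">
    <context position="-1" tag="det"/>
  </rule>
</rules>
"""

def compile_rules(xml_text):
    """Turn the XML into (action, tag, offset, context_tag) tuples."""
    rules = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        ctx = rule.find("context")
        rules.append((rule.get("action"), rule.get("tag"),
                      int(ctx.get("position")), ctx.get("tag")))
    return rules

def apply_rules(rules, sentence):
    """sentence is a list of words, each a list of reading strings."""
    out = [list(readings) for readings in sentence]
    for i, readings in enumerate(out):
        for action, tag, offset, ctx_tag in rules:
            j = i + offset
            if action != "remove" or not 0 <= j < len(out):
                continue
            # The context matches only if every reading of the
            # neighbouring word carries the context tag.
            if all(f"<{ctx_tag}>" in r for r in out[j]):
                kept = [r for r in readings if f"<{tag}>" not in r]
                if kept:  # never remove the last remaining reading
                    readings[:] = kept
    return out
```

For example, given the readings the<det><def> for the first word and walk<n><sg> plus walk<vblex><inf> for the second, this sketch keeps only walk<n><sg> for the second word. A real implementation would compile the rules to a binary transducer instead of interpreting them.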
Coding challenge
- Write a stream processor (see Apertium stream format) for the output of lt-proc that parses character by character, respecting superblanks.
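As a starting point for the coding challenge, the following is a minimal Python sketch of such a parser, not a reference implementation. It assumes the usual stream conventions: lexical units are written as ^surface/reading1/reading2$, superblanks are enclosed in square brackets, and a backslash escapes the next character. The function name parse_stream and the token tuples it returns are hypothetical choices made for this example.

```python
def parse_stream(text):
    """Split a stream into ("blank", text) and ("lu", surface, readings)."""
    tokens = []
    blank = []            # text between lexical units, superblanks included
    parts, part = [], []  # fields of the lexical unit being read
    in_unit = in_superblank = escaped = False

    for ch in text:
        target = part if in_unit else blank
        if escaped:                 # character after a backslash is literal
            target.append(ch)
            escaped = False
        elif ch == "\\":
            target.append(ch)       # keep the escape so output round-trips
            escaped = True
        elif in_superblank:
            blank.append(ch)        # ^, / and $ are not special in here
            if ch == "]":
                in_superblank = False
        elif in_unit:
            if ch == "/":
                parts.append("".join(part))
                part = []
            elif ch == "$":
                parts.append("".join(part))
                tokens.append(("lu", parts[0], parts[1:]))
                parts, part, in_unit = [], [], False
            else:
                part.append(ch)
        elif ch == "[":
            blank.append(ch)
            in_superblank = True
        elif ch == "^":
            if blank:
                tokens.append(("blank", "".join(blank)))
                blank = []
            in_unit = True
        else:
            blank.append(ch)

    # An unterminated lexical unit at end of input is silently dropped.
    if blank:
        tokens.append(("blank", "".join(blank)))
    return tokens
```

Because superblanks are copied through verbatim, a string such as [</b> ^x$] stays a single blank token even though it contains characters that would otherwise start or end a lexical unit.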