Ideas for Google Summer of Code/Rule-based finite-state disambiguation
Revision as of 12:43, 14 March 2013
Apertium currently has only a bigram/trigram part-of-speech tagger. The objective of this task is to implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer. It might be a good idea to write the disambiguation rules as constraint rules, in a new XML-based file format.
For some languages, bigram/trigram POS disambiguation really doesn't work, especially when you want to disambiguate morphology (e.g. number, case) along with part-of-speech. So far we've been using Constraint Grammar for some of these languages. Constraint Grammar is great and powerful, but it is also pretty slow. It would be a good idea to look at LanguageTool,[1] IceParser,[2] and Apertium's own apertium-lex-tools to get ideas on how this might be accomplished.
Tasks
- Define an XML format for writing finite-state constraint rules.
- Write a compiler which turns these rules into a binary finite-state representation.
- Write a processor which applies these rules to an Apertium input stream.
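To make the idea of a constraint rule concrete, here is a minimal Python sketch of what a single rule and its application might look like. Everything in it is a hypothetical illustration, not a proposed design: the RemoveRule class, its two fields, and the apply_rules function are invented for this example, and the rule is interpreted directly rather than compiled to a finite-state transducer as the task requires. Readings are assumed to be plain strings in the usual lemma<tag> notation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RemoveRule:
    """Hypothetical rule: remove readings carrying target_tag when every
    reading of the preceding unit carries left_context_tag."""
    target_tag: str
    left_context_tag: str


def has_tag(reading: str, tag: str) -> bool:
    # A reading looks like "walk<n><sg>"; tags are wrapped in angle brackets.
    return f"<{tag}>" in reading


def apply_rules(sentence: List[List[str]], rules: List[RemoveRule]) -> List[List[str]]:
    """Return a copy of the sentence with the rules applied left to right.

    Each element of `sentence` is the list of readings for one lexical unit.
    """
    result = [list(readings) for readings in sentence]
    for i in range(1, len(result)):
        prev = result[i - 1]
        for rule in rules:
            # The context must hold for all readings of the previous unit.
            if not prev or not all(has_tag(r, rule.left_context_tag) for r in prev):
                continue
            kept = [r for r in result[i] if not has_tag(r, rule.target_tag)]
            if kept:  # never remove the last remaining reading
                result[i] = kept
    return result


sentence = [["the<det><def><sp>"], ["walk<n><sg>", "walk<vblex><inf>"]]
print(apply_rules(sentence, [RemoveRule("vblex", "det")]))
# [['the<det><def><sp>'], ['walk<n><sg>']]
```

The guard that keeps the last remaining reading mirrors common practice in Constraint Grammar; whether the new format should behave the same way is an open design question.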
Coding challenge
- Write a stream processor (see Apertium stream format) for the output of lt-proc that parses character by character, respecting superblanks.
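As a starting point for the coding challenge, the following Python sketch reads a stream one character at a time. It is an illustration under stated assumptions rather than a reference implementation: the function name parse_stream and the token tuples are invented here, superblanks are assumed to appear only between lexical units (as in lt-proc output), and a backslash is assumed to escape the character that follows it. Lexical units are delimited by ^ and $, with / separating the surface form from each analysis; superblanks are enclosed in square brackets and passed through untouched.

```python
from typing import Iterator, List, Optional, Tuple


def parse_stream(text: str) -> Iterator[Tuple]:
    """Yield ("blank", text) and ("unit", surface, [analyses]) tokens."""
    blank: List[str] = []                       # text outside lexical units
    fields: Optional[List[List[str]]] = None    # None while outside a unit
    in_superblank = False
    escaped = False

    for ch in text:
        target = fields[-1] if fields is not None else blank
        if escaped:
            # The character after a backslash is always literal.
            target.append(ch)
            escaped = False
        elif ch == "\\":
            target.append(ch)                   # keep the escape in the output
            escaped = True
        elif in_superblank:
            # Inside [...]: ^, $ and / have no special meaning.
            blank.append(ch)
            if ch == "]":
                in_superblank = False
        elif fields is None:
            if ch == "[":
                in_superblank = True
                blank.append(ch)
            elif ch == "^":
                if blank:
                    yield ("blank", "".join(blank))
                    blank = []
                fields = [[]]                   # start collecting the surface form
            else:
                blank.append(ch)
        elif ch == "/":
            fields.append([])                   # next analysis
        elif ch == "$":
            parts = ["".join(f) for f in fields]
            yield ("unit", parts[0], parts[1:])
            fields = None
        else:
            target.append(ch)

    if blank:
        yield ("blank", "".join(blank))


for token in parse_stream("[<p>]^the/the<det><def><sp>$ ^walk/walk<n><sg>/walk<vblex><inf>$[</p>]"):
    print(token)
```

Keeping blanks as tokens in their own right means a later disambiguation step can rewrite the units and then reproduce the stream with its formatting intact.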