Difference between revisions of "Ideas for Google Summer of Code/Rule-based finite-state disambiguation"

From Apertium
   
 

Revision as of 15:30, 4 March 2012

Apertium currently has only a bigram/trigram part-of-speech tagger. The objective of this task is to implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer. A good approach might be to write the disambiguation rules as constraint rules in a new XML-based file format.

For some languages, bigram/trigram POS disambiguation works poorly, especially when morphology (e.g. number, case) has to be disambiguated along with part of speech. So far we have been using Constraint Grammar for some of these languages. Constraint Grammar is powerful, but it is also slow. For ideas on how this might be accomplished, look at LanguageTool,[1] IceParser,[2] and Apertium's own apertium-lex-tools.

Tasks

  • Define an XML format for writing finite-state constraint rules.
  • Write a compiler which turns these rules into a binary finite-state representation.
  • Write a processor which applies these rules to an Apertium input stream.
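
To make the three tasks concrete, here is a minimal Python sketch of one constraint rule being read from XML and applied to ambiguous lexical units in the Apertium stream format (^surface/analysis/analysis$). Everything in it is an assumption made for illustration: the element and attribute names (rules, rule, action, target, left, tag) are invented, not a proposed format, and the rules are interpreted directly rather than compiled to a binary finite-state representation. The sketch also ignores escaped characters and superblanks, and rejoins units with a single space.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file; element and attribute names are invented for this sketch.
RULES_XML = """
<rules>
  <!-- After an unambiguous determiner, discard verb readings. -->
  <rule action="remove" target="vblex">
    <left tag="det"/>
  </rule>
</rules>
"""

# One lexical unit: ^surface/analysis1/analysis2$ (escapes are not handled).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def load_rules(xml_text):
    """Return a list of (action, target tag, left-context tag) triples."""
    root = ET.fromstring(xml_text)
    return [
        (rule.get("action"), rule.get("target"), rule.find("left").get("tag"))
        for rule in root.findall("rule")
    ]


def has_tag(analysis, tag):
    """True if the analysis string carries the given tag, e.g. <det>."""
    return "<" + tag + ">" in analysis


def disambiguate(stream, rules):
    """Apply 'remove' rules left to right over the lexical units in stream."""
    units = [(m.group(1), m.group(2).split("/"))
             for m in LEXICAL_UNIT.finditer(stream)]
    out = []
    for i, (surface, analyses) in enumerate(units):
        for action, target, left in rules:
            if action != "remove" or i == 0:
                continue
            previous = out[i - 1][1]  # left neighbour, already disambiguated
            # Fire only if every reading of the left neighbour has the tag.
            if all(has_tag(a, left) for a in previous):
                kept = [a for a in analyses if not has_tag(a, target)]
                if kept:  # never remove the last remaining reading
                    analyses = kept
        out.append((surface, analyses))
    return " ".join("^" + s + "/" + "/".join(a) + "$" for s, a in out)


if __name__ == "__main__":
    rules = load_rules(RULES_XML)
    print(disambiguate(
        "^the/the<det><def><sp>$ ^wound/wound<n><sg>/wound<vblex><inf>$",
        rules))
```

In a real implementation the compiler would replace the rule loop with a precompiled transducer, so that the cost per token does not grow with the number of rules; the loop here only shows the intended input and output.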

Coding challenge

Frequently asked questions

Previous GSOC projects

Notes

  1. http://www.languagetool.org/
  2. http://nlp.ru.is/projects.htm