Difference between revisions of "Ideas for Google Summer of Code/Rule-based finite-state disambiguation"
Latest revision as of 00:53, 24 March 2013
Currently Apertium only has a bigram/trigram part-of-speech tagger. The objective of this task would be to implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer. It might be a good idea to express this as constraint rules, in a novel XML-based file format.
For some languages, bigram/trigram POS disambiguation does not work well, especially when morphology (e.g. number, case) has to be disambiguated along with part of speech. So far we have been using Constraint Grammar for some of these languages, but although Constraint Grammar is powerful, it is also slow. It would be a good idea to look at LanguageTool (http://www.languagetool.org/), IceParser (http://nlp.ru.is/projects.htm) and Apertium's own apertium-lex-tools to get ideas on how this might be accomplished.
Tasks
- Define an XML format for writing finite-state constraint rules.
- Write a compiler which turns these rules into a binary finite-state representation.
- Write a processor which applies these rules to an Apertium input stream.
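To make the idea of a constraint rule concrete, here is a minimal Python sketch of one rule being applied to a sentence of ambiguous readings. It is only an illustration: the `Rule` class, its `remove` and `after` fields, and the `apply_rule` function are hypothetical names, not part of Apertium, and the rule is interpreted directly rather than compiled to a finite-state transducer as the tasks above require.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical constraint: drop readings carrying the tag `remove`
    when the previous word is unambiguously tagged `after`."""
    remove: str
    after: str


def tags(reading: str) -> List[str]:
    """Extract tag names from a reading such as 'book<n><pl>'."""
    return re.findall(r"<([^>]+)>", reading)


def apply_rule(rule: Rule, sentence: List[List[str]]) -> List[List[str]]:
    """Apply one rule left to right; each word is a list of readings."""
    out = [list(readings) for readings in sentence]
    for i in range(1, len(out)):
        prev = out[i - 1]
        # The context only matches when the previous word has one reading.
        if len(prev) == 1 and rule.after in tags(prev[0]):
            kept = [r for r in out[i] if rule.remove not in tags(r)]
            if kept:  # never delete the last remaining reading
                out[i] = kept
    return out


sentence = [
    ["the<det><def><sp>"],
    ["book<n><pl>", "book<vblex><pri><p3><sg>"],
]
result = apply_rule(Rule(remove="vblex", after="det"), sentence)
```

The guard that keeps the last reading mirrors the usual Constraint Grammar convention that a word is never left without an analysis; whether the new formalism should keep that convention is an open design choice.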
Coding challenge
- Write a stream processor (see Apertium stream format) for the output of lt-proc that parses character by character, respecting superblanks.
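As a rough starting point for the coding challenge, the following Python sketch reads a stream one character at a time. It assumes the usual conventions of the Apertium stream format: lexical units are written `^surface/analysis/...$`, superblanks are written `[...]` and appear only between lexical units, and a backslash escapes the following character. The function name `parse_stream` and the token tuples it yields are hypothetical choices for this example.

```python
from typing import Iterator, List, Optional


def parse_stream(text: str) -> Iterator[tuple]:
    """Yield ("blank", text) and ("lu", surface, [analyses]) tokens."""
    blank = ""                           # text outside any lexical unit
    fields: Optional[List[str]] = None   # None outside ^...$
    in_superblank = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Escaped character: copy both characters verbatim so the
            # second one is never interpreted as a delimiter.
            pair = ch + text[i + 1]
            if fields is not None:
                fields[-1] += pair
            else:
                blank += pair
            i += 2
            continue
        if in_superblank:
            blank += ch
            if ch == "]":
                in_superblank = False
        elif fields is None:
            if ch == "[":
                in_superblank = True
                blank += ch
            elif ch == "^":
                if blank:
                    yield ("blank", blank)
                    blank = ""
                fields = [""]            # start collecting the surface form
            else:
                blank += ch
        else:
            if ch == "/":
                fields.append("")        # next analysis
            elif ch == "$":
                yield ("lu", fields[0], fields[1:])
                fields = None
            else:
                fields[-1] += ch
        i += 1
    if blank:
        yield ("blank", blank)


example = "[<p>]^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$[</p>]"
tokens = list(parse_stream(example))
```

Superblank contents are passed through untouched, so formatting such as `[<p>]` survives even when it contains characters like `/` or `^` that would otherwise be delimiters.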
Frequently asked questions
- none yet, ask us something! :)
See also
- User:Krvoje/Application2012
- User:Krvoje/Foma script for testing finite-state disambiguation (partially working implementation using Foma)