Difference between revisions of "Ideas for Google Summer of Code/Rule-based finite-state disambiguation"

Latest revision as of 00:53, 24 March 2013

Currently Apertium only has a bigram/trigram part-of-speech tagger. The objective of this task is to implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer. One option is to write the disambiguation knowledge as constraint rules in a new XML-based file format.

For some languages, bigram/trigram POS disambiguation really doesn't work, especially when you want to disambiguate morphology (e.g. number, case) along with part-of-speech. So far we've been using Constraint Grammar for some of these languages. Constraint Grammar is great and powerful, but it is also pretty slow. It would be a good idea to look at LanguageTool,^[1] IceParser,^[2] and Apertium's own apertium-lex-tools to get ideas on how this might be accomplished.

Tasks

Define an XML format for writing finite-state constraint rules.
Write a compiler which turns these rules into a binary finite-state representation.
Write a processor which applies these rules to an Apertium input stream.
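To make the idea of a constraint rule concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `<remove tag="..." after="..."/>` element is a hypothetical rule format (the project is meant to define the real one), and the rules are interpreted directly rather than compiled into a finite-state transducer, which is the actual goal of the task. The rule removes readings carrying one tag when the preceding word is unambiguously tagged with another.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: drop verb readings right after an unambiguous determiner.
RULES_XML = """
<rules>
  <remove tag="vblex" after="det"/>
</rules>
"""


def load_rules(xml_text):
    """Read (tag_to_remove, required_previous_tag) pairs from the hypothetical format."""
    root = ET.fromstring(xml_text)
    return [(rule.get("tag"), rule.get("after")) for rule in root.iter("remove")]


def has_tag(reading, tag):
    """True if an Apertium-style reading such as 'book<n><sg>' carries <tag>."""
    return f"<{tag}>" in reading


def apply_rules(sentence, rules):
    """sentence is a list of (surface, [readings]); returns a disambiguated copy."""
    result = []
    for i, (surface, readings) in enumerate(sentence):
        kept = list(readings)
        if i > 0:
            previous = result[i - 1][1]
            for tag, after in rules:
                # The context must hold for every remaining reading of the previous word.
                if previous and all(has_tag(r, after) for r in previous):
                    filtered = [r for r in kept if not has_tag(r, tag)]
                    if filtered:  # never remove the last remaining reading
                        kept = filtered
        result.append((surface, kept))
    return result


sentence = [
    ("the", ["the<det><def>"]),
    ("book", ["book<n><sg>", "book<vblex><inf>"]),
]
print(apply_rules(sentence, load_rules(RULES_XML)))
```

A real implementation would compile such rules into a binary transducer and run it over the stream; the sketch only shows what a single rule is expected to do to the readings.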

Coding challenge

Write a stream processor (see Apertium stream format) for the output of lt-proc that parses character by character, respecting superblanks.
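As a starting point for the challenge, the following Python sketch reads the stream one character at a time. It assumes the usual conventions of the Apertium stream format: lexical units look like `^surface/analysis1/analysis2$`, superblanks are enclosed in `[` and `]`, and a backslash escapes the next character. The function name `parse_stream` and the token tuples it returns are choices made for this example, not part of any Apertium API.

```python
def parse_stream(text):
    """Split lt-proc style output into ('blank', str) and ('lu', surface, analyses) tokens."""
    tokens = []
    blank = []             # characters seen outside lexical units
    fields = None          # list of character lists while inside ^...$, else None
    in_superblank = False  # True between '[' and the matching ']'
    escaped = False        # True right after a backslash

    for ch in text:
        target = fields[-1] if fields is not None else blank
        if escaped:
            target.append(ch)          # escaped character is taken literally
            escaped = False
        elif ch == "\\":
            target.append(ch)          # keep the backslash so output can be re-emitted
            escaped = True
        elif in_superblank:
            blank.append(ch)           # '^', '/' and '$' are not special in here
            if ch == "]":
                in_superblank = False
        elif fields is None:
            if ch == "[":
                in_superblank = True
                blank.append(ch)
            elif ch == "^":
                if blank:
                    tokens.append(("blank", "".join(blank)))
                    blank = []
                fields = [[]]          # start a lexical unit with an empty surface form
            else:
                blank.append(ch)
        elif ch == "/":
            fields.append([])          # next analysis
        elif ch == "$":
            parts = ["".join(f) for f in fields]
            tokens.append(("lu", parts[0], parts[1:]))
            fields = None
        else:
            fields[-1].append(ch)

    if blank:
        tokens.append(("blank", "".join(blank)))
    return tokens


for token in parse_stream("[<b>]^the/the<det><def>$ ^book/book<n><sg>/book<vblex><inf>$[</b>]"):
    print(token)
```

Keeping blanks and superblanks as tokens, rather than discarding them, means a processor built on this can write them back unchanged and only modify the analyses inside lexical units.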

Frequently asked questions

None yet; ask us something! :)

Notes

[1] http://www.languagetool.org/

[2] http://nlp.ru.is/projects.htm


@@ Line 1: / Line 1: @@
 {{TOCD}}
+Currently Apertium only has a bigram/trigram part-of-speech tagger. The objective of this task would be to implement a disambiguation framework for Apertium that can be expressed as a finite-state transducer. It might be a good idea to express this as constraint rules, in a novel XML-based file format.
+For some languages, bigram/trigram POS disambiguation really doesn't work, especially when you want to disambiguate morphology (e.g. number, case) along with part-of-speech. So far we've been using constraint grammar for some of these languages. But although Constraint Grammar is great and powerful, it is also pretty slow. It would be a good idea to look at LanguageTool,<ref>http://www.languagetool.org/</ref> and IceParser<ref>http://nlp.ru.is/projects.htm</ref> and Apertium's own [[apertium-lex-tools]] to get ideas on how this might be accomplished.
 ==Tasks==
+* Define an XML format for writing finite-state constraint rules.
+* Write a compiler which turns these rules into a binary finite-state representation.
+* Write a processor which applies these rules to an Apertium input stream.
 ==Coding challenge==
+* Write a stream processor (see [[Apertium stream format]]) for the output of <code>lt-proc</code> that parses character by character, respecting [[superblanks]].
 ==Frequently asked questions==
+* none yet, ''[[contact|ask us]] something!'' :)
-==Previous GSOC projects==
+==See also==
+* [[User:Krvoje/Application2012]]
+* [[User:Krvoje/Foma script for testing finite-state disambiguation]] (Partially working implementation using [[Foma]])
+==Notes==
+<references/>
 [[Category:Ideas for Google Summer of Code|Rule-based finite-state disambiguation]]
