Ideas for Google Summer of Code/Interface for creating tagged corpora

From Apertium
Revision as of 21:43, 20 March 2013 by Francis Tyers

Most released Apertium language pairs, and those still to come, need better part-of-speech (POS) taggers. In my experience, training supervised taggers has never been a waste of time; quite the opposite: it improves translation quality and, at the same time, creates invaluable linguistic resources such as disambiguated tagged corpora.

Tasks

  • Making an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of size X for a given language from Wikipedia
  • The interface should be able to take a tagged corpus that has not yet been disambiguated and let the user disambiguate it manually
  • It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)
  • And, a user-friendly interface to train a supervised tagger
  • Also, some way to evaluate performance of a .prob file
  • Including a way to incorporate constraint grammar rules would also be nice.
  • A way to automatically take new multiwords and changes in tokenisation into account.
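
As a rough illustration of two of the tasks above (finding the words that still need manual disambiguation, and evaluating a tagger's output against a hand-disambiguated corpus), here is a minimal Python sketch. It assumes the corpus is in the Apertium stream format, where each lexical unit looks like ^surface/analysis1/analysis2$. The names LexicalUnit, parse_stream and tagging_accuracy are hypothetical and not part of any Apertium tool, the example tags are only illustrative, and the sketch ignores escaped characters and multiword subtleties.

```python
import re
from dataclasses import dataclass

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
# Simplification (assumption): escaped characters such as \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)((?:/[^/$^]*)*)\$")


@dataclass
class LexicalUnit:
    """Hypothetical container for one surface form and its analyses."""
    surface: str
    analyses: list[str]

    @property
    def ambiguous(self) -> bool:
        # More than one analysis means a human (or a tagger) must still choose.
        return len(self.analyses) > 1


def parse_stream(text: str) -> list[LexicalUnit]:
    """Extract all lexical units from a string in Apertium stream format."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface, rest = match.group(1), match.group(2)
        # rest looks like "/analysis1/analysis2"; drop the empty first piece.
        analyses = rest.split("/")[1:] if rest else []
        units.append(LexicalUnit(surface, analyses))
    return units


def tagging_accuracy(gold: list[LexicalUnit], predicted: list[LexicalUnit]) -> float:
    """Fraction of lexical units whose analysis matches the gold standard."""
    if len(gold) != len(predicted):
        raise ValueError("corpora are tokenised differently")
    if not gold:
        return 0.0
    correct = sum(g.analyses == p.analyses for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    raw = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
    todo = [unit.surface for unit in parse_stream(raw) if unit.ambiguous]
    print(todo)  # ['books']

    gold = parse_stream("^the/the<det><def><sp>$ ^books/book<n><pl>$")
    pred = parse_stream("^the/the<det><def><sp>$ ^books/book<vblex><pri><p3><sg>$")
    print(tagging_accuracy(gold, pred))  # 0.5
```

A real interface would also have to cope with tokenisation differences between the gold corpus and the tagger output (the multiword problem mentioned in the last task), which this sketch simply reports as an error.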

Coding challenge

Frequently asked questions

See also