Difference between revisions of "Ideas for Google Summer of Code/Interface for creating tagged corpora"
Revision as of 21:24, 5 April 2013
Most released Apertium language pairs, and the ones still to come, need better part-of-speech (POS) taggers. In my experience, training supervised taggers has never been a waste of time, quite the opposite: we improve quality and, at the same time, create invaluable linguistic resources such as disambiguated tagged corpora.
Tasks
- Build an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of X size for a given language from Wikipedia
- The interface should be able to take a non-disambiguated tagged corpus and let the user disambiguate it manually
- It should also have a user-friendly system for improving the TSX file (refining coarse tags, writing rules)
- A user-friendly interface for training a supervised tagger
- Some way to evaluate the performance of a .prob file
- A way to incorporate constraint grammar rules would also be nice
- A way to automatically take into account new multiwords or a different tokenisation
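As a rough illustration of the disambiguation and evaluation tasks above, here is a minimal Python sketch. It assumes the usual Apertium stream format, where an analysed lexical unit looks like ^surface/analysis1/analysis2$ and a tagged one keeps a single analysis as its last field. The function names are hypothetical, escaped characters inside units are ignored, and accuracy is simply the share of units whose chosen analysis matches the hand-tagged corpus.

```python
import re

# One lexical unit in (assumed) Apertium stream format: ^field1/field2/...$
# Simplification: escaped "/", "^" or "$" inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def parse_units(text):
    """Return the '/'-separated fields of every lexical unit in `text`."""
    return [m.group(1).split("/") for m in LEXICAL_UNIT.finditer(text)]


def ambiguous_units(analysed_text):
    """List (index, surface, analyses) for units with more than one analysis.

    These are the units an annotator would have to disambiguate by hand.
    """
    result = []
    for index, fields in enumerate(parse_units(analysed_text)):
        surface, analyses = fields[0], fields[1:]
        if len(analyses) > 1:
            result.append((index, surface, analyses))
    return result


def accuracy(gold_text, tagged_text):
    """Fraction of units whose last field (the chosen analysis) agrees."""
    gold = parse_units(gold_text)
    tagged = parse_units(tagged_text)
    if len(gold) != len(tagged):
        raise ValueError("corpora contain different numbers of lexical units")
    if not gold:
        return 0.0
    correct = sum(1 for g, t in zip(gold, tagged) if g[-1] == t[-1])
    return correct / len(gold)


# Made-up two-word example, not taken from any real corpus.
ANALYSED = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
GOLD = "^the/the<det><def><sp>$ ^books/book<n><pl>$"
TAGGED = "^the<det><def><sp>$ ^book<vblex><pri><p3><sg>$"
```

With the made-up data above, ambiguous_units(ANALYSED) picks out only "books", and accuracy(GOLD, TAGGED) counts one agreement out of two units. A real interface would also need to handle tokenisation mismatches between the two files, which this sketch treats as an error.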
Coding challenge
- Install Apertium
- Train a tagger in an unsupervised manner for a language pair of your choice.
- Create a manually tagged corpus for this story (http://www.unilang.org/ulrview.php?res=394,387) in one language of a language pair of your choice. Make sure that language already has a morphological analyser!
- Now train the tagger in a supervised manner from the corpus you just tagged.
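The two training steps in the challenge can be sketched as command lines built from Python. This assumes the classic apertium-tagger argument layout (-t for unsupervised and -s for supervised training, each followed by an iteration count and the data files); check apertium-tagger --help for your installed version. All file names and iteration counts below are hypothetical placeholders.

```python
import shlex


def unsupervised_command(iterations, dic, crp, tsx, prob):
    """Argument list for unsupervised training (assumed layout:
    apertium-tagger -t N DIC CRP TSX PROB)."""
    return ["apertium-tagger", "-t", str(iterations), dic, crp, tsx, prob]


def supervised_command(iterations, dic, crp, tsx, prob, tagged, untagged):
    """Argument list for supervised training (assumed layout:
    apertium-tagger -s N DIC CRP TSX PROB TAGGED UNTAGGED)."""
    return ["apertium-tagger", "-s", str(iterations),
            dic, crp, tsx, prob, tagged, untagged]


if __name__ == "__main__":
    # Hypothetical file names; replace them with the ones in your language pair.
    files = ["xx.dic", "xx.crp", "xx.tsx", "xx.prob"]
    print(shlex.join(unsupervised_command(8, *files)))
    print(shlex.join(supervised_command(0, *files, "xx.tagged", "xx.untagged")))
    # To actually run a command: subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to inspect or log the exact command before starting a training run.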
Frequently asked questions
- none yet, ask us something! :)