Ideas for Google Summer of Code/Interface for creating tagged corpora

Latest revision as of 21:26, 5 April 2013

Most released Apertium language pairs, and those still to come, need better part-of-speech (POS) taggers. In my experience, training supervised taggers has never been a waste of time; quite the opposite: it improves quality and, at the same time, produces invaluable linguistic resources such as disambiguated tagged corpora.

Tasks

  • Make an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of a given size for a given language from Wikipedia.
  • The interface should be able to take a non-disambiguated tagged corpus and let the user disambiguate it manually.
  • It should also have a user-friendly system for improving the TSX file (refining coarse tags, writing rules).
  • It should provide a user-friendly interface for training a supervised tagger.
  • It should offer some way to evaluate the performance of a .prob file.
  • A way to incorporate constraint grammar rules would also be nice.
  • It should automatically take into account new multiwords and changes in tokenisation.
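One way to approach the evaluation task is to compare the tagger's output with a hand-tagged reference, unit by unit. The following is only a minimal sketch, not part of Apertium: it assumes both texts are in Apertium stream format with one analysis left per lexical unit (`^surface/analysis$`), that both are tokenised identically, and it ignores escaped characters and superblanks. The function names and the example tags are made up for illustration.

```python
import re

# Matches one lexical unit in Apertium stream format: ^ ... $
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def analyses(text):
    """Return the chosen analysis of every lexical unit in `text`.

    Assumes each unit looks like ^surface/analysis$ (one analysis left
    after disambiguation); the part after the last '/' is taken.
    """
    return [unit.rsplit("/", 1)[-1] for unit in LEXICAL_UNIT.findall(text)]


def tagging_accuracy(reference, hypothesis):
    """Fraction of lexical units where the tagger agrees with the reference."""
    ref, hyp = analyses(reference), analyses(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("corpora are tokenised differently; align them first")
    if not ref:
        return 0.0
    correct = sum(r == h for r, h in zip(ref, hyp))
    return correct / len(ref)


# Illustrative data: the tagger picks the verb reading of "books" by mistake.
reference = "^the/the<det><def><sp>$ ^books/book<n><pl>$"
hypothesis = "^the/the<det><def><sp>$ ^books/book<vblex><pri><p3><sg>$"
accuracy = tagging_accuracy(reference, hypothesis)
print(accuracy)
```

A real evaluation would also need to handle differences in tokenisation (for example multiwords), which is why the sketch refuses to compare corpora of different lengths rather than guessing an alignment.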

Coding challenge

  • Install Apertium
  • Install a language pair of your choice.
  • Train a tagger in an unsupervised manner for one of the languages in your pair.
  • For one of the languages in the pair, create a manually tagged corpus of this story (http://www.unilang.org/ulrview.php?res=394,387). Make sure the language already has a morphological analyser!
  • Now train the tagger in a supervised manner from the corpus you just tagged.
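To give an idea of what manual tagging involves, the sketch below reduces each ambiguous lexical unit in morphological-analyser output to a single analysis. It is a simplified illustration, not an existing Apertium tool: it assumes the stream format `^surface/analysis1/analysis2$`, ignores escaped characters and superblanks, and the `choose` callback (a hypothetical name) stands in for whatever the annotator selects in a real interface. The example tags are illustrative only.

```python
import re

# Matches one lexical unit in Apertium stream format: ^ ... $
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def split_unit(unit):
    """Split 'surface/analysis1/analysis2' into (surface, [analyses])."""
    surface, *candidates = unit.split("/")
    return surface, candidates


def disambiguate(text, choose):
    """Rewrite `text` keeping one analysis per lexical unit.

    `choose(surface, candidates)` returns the index of the analysis to keep;
    in an interactive tool this would ask the annotator.
    """
    def replace(match):
        surface, candidates = split_unit(match.group(1))
        if len(candidates) <= 1:
            return match.group(0)  # already unambiguous, leave untouched
        kept = candidates[choose(surface, candidates)]
        return "^" + surface + "/" + kept + "$"

    return LEXICAL_UNIT.sub(replace, text)


ambiguous = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
# Stand-in for the annotator: always keep the first analysis.
tagged = disambiguate(ambiguous, lambda surface, candidates: 0)
print(tagged)
```

Keeping the surface form and the `^...$` delimiters means the tagged text stays aligned, unit for unit, with the untagged analyser output it was made from.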

Frequently asked questions

  • none yet, ask us something! :)

See also

  • Tagger training