Difference between revisions of "Ideas for Google Summer of Code/Sliding-window part-of-speech tagger"
Revision as of 18:15, 13 March 2013
The idea is to implement the unsupervised sliding-window part-of-speech tagger (http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging) as a drop-in replacement for the current hidden-Markov-model tagger. It should support unknown words and also "forbid" descriptions (which are not described in the paper). The tagger has a very intuitive interpretation (believe me, even if you find the maths a bit daunting). I am available for questions (I invented the tagger, so I should be able to remember!).
Task
Coding challenge
Write a filter that reads in the output of the Apertium morphological analyser and, for each surface form, writes out one of its lexical forms, chosen at random, in a new format, respecting superblanks.
The new format would convert things as follows:
^I/I<num><mf><sg>/I<prn><subj><p1><mf><sg>$ ^have/have<vbhaver><inf>/have<vbhaver><pres>/have<vblex><inf>/have<vblex><pres>$ ^a/a<det><ind><sg>$ ^saw/saw<n><sg>/saw<vblex><inf>/saw<vblex><pres>/see<vblex><past>$^../..<sent>$
To, for instance:
I.prn.subj.p1.mf.sg have.vbhaver.inf a.det.ind.sg saw.n.sg ..sent
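To make the conversion concrete, here is a minimal sketch of such a filter in Python. It is only one possible reading of the challenge, not a reference solution. The names `pick_random`, `to_dotted` and `split_unescaped` are made up for this illustration. It assumes lexical units are delimited by `^` and `$`, analyses are separated by `/`, superblanks are enclosed in `[` and `]` and copied through unchanged, and a backslash escapes the following character. It does not try to handle joined analyses (`+`) or multiwords with tags in the middle.

```python
import random
import re

# One escaped character, one lexical unit (^surface/analysis/...$), or one
# superblank ([...]). Trying the escaped-character alternative first keeps an
# escaped "^" or "[" in running text from being read as a delimiter.
TOKEN = re.compile(
    r"\\.|\^((?:\\.|[^\\$])*)\$|\[((?:\\.|[^\\\]])*)\]", re.DOTALL
)
TAG = re.compile(r"<([^<>]+)>")


def split_unescaped(text, sep="/"):
    """Split text on sep, ignoring separators preceded by a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape sequence intact
            i += 2
        elif text[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def to_dotted(lexical_form):
    """Turn 'saw<n><sg>' into 'saw.n.sg'."""
    first_tag = lexical_form.find("<")
    if first_tag == -1:
        # No tags at all (for example an unknown word): leave it as it is.
        return lexical_form
    lemma = lexical_form[:first_tag]
    tags = TAG.findall(lexical_form[first_tag:])
    return ".".join([lemma] + tags)


def pick_random(stream, rng=random):
    """Replace every lexical unit by one randomly chosen analysis."""
    def replace(match):
        if match.group(1) is None:
            # Escaped character or superblank: copy through unchanged.
            return match.group(0)
        surface, *analyses = split_unescaped(match.group(1))
        if not analyses:
            return surface
        return to_dotted(rng.choice(analyses))

    return TOKEN.sub(replace, stream)
```

Wrapped as a command-line filter, this would amount to reading standard input and writing `pick_random(...)` of it to standard output. Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient for testing.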