Ideas for Google Summer of Code/Sliding-window part-of-speech tagger
{{TOCD}}
The idea is to implement the unsupervised part-of-speech tagger ([http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging as described here]) as a drop-in replacement for the current hidden-Markov-model tagger. It should have support for unknown words, and also for "forbid" descriptions (not described in the paper). The tagger has a very intuitive interpretation (believe me, even if you find the maths a bit daunting). I ([[User:Mlforcada|Mlforcada]]) am available for questions (I invented the tagger, I should be able to remember!).
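The core idea of the sliding-window approach can be sketched in a few lines: a word's tag is chosen from the tags seen most often in the same local context (here, one tag to each side). This is a minimal ''supervised'' toy, purely illustrative — the tagger in the paper is trained unsupervised on ambiguity classes, and the tag names below are invented for the example:

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count how often each tag occurs between a given pair of
    neighbouring tags (a window of one tag on each side)."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        for i in range(1, len(tags) - 1):
            counts[(tags[i - 1], tags[i + 1])][tags[i]] += 1
    return counts

def disambiguate(counts, left, right, candidates):
    """Pick the candidate tag most frequent in this context;
    fall back to the first candidate for unseen contexts."""
    ctx = counts.get((left, right))
    if not ctx:
        return candidates[0]
    return max(candidates, key=lambda t: ctx[t])

# Toy corpus: "saw" is a noun after a determiner, a verb after a pronoun.
corpus = [
    [("I", "prn"), ("saw", "vblex"), ("it", "prn")],
    [("a", "det"), ("saw", "n"), ("cuts", "vblex")],
]
model = train(corpus)
print(disambiguate(model, "det", "vblex", ["n", "vblex"]))  # -> n
```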
==Task==
* Implement a supervised training algorithm.
* Implement the tagger described in the paper.
* Come up with an XML-based format for writing forbid rules.
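One possible shape for such a format — loosely modelled on the <code>&lt;forbid&gt;</code> section of the TSX definition files used by the existing HMM tagger, though the names below are only illustrative, not a fixed DTD:

```xml
<forbid>
  <!-- a tag sequence that may never occur in this order -->
  <label-sequence>
    <label-item label="det"/>
    <label-item label="vblex"/>
  </label-sequence>
</forbid>
```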
 
==Coding challenge==

Write a filter that reads the output of the Apertium morphological analyser and writes out either a random one (<code>-r</code>) or the first one (<code>-f</code>) of the lexical forms for each surface form in a new format, respecting [[superblanks]].
 
The new format would convert things as follows:

<pre>
^I/I<num><mf><sg>/I<prn><subj><p1><mf><sg>$ ^have/have<vbhaver><inf>/have<vbhaver><pres>/have<vblex><inf>/have<vblex><pres>$
^a/a<det><ind><sg>$ ^saw/saw<n><sg>/saw<vblex><inf>/saw<vblex><pres>/see<vblex><past>$^../..<sent>$
</pre>
To, for instance:

<pre>
I.prn.subj.p1.mf.sg  have.vbhaver.inf a.det.ind.sg saw.n.sg ..sent
</pre>
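A rough sketch of such a filter in Python — an assumption-laden starting point, not a reference implementation: it handles only the plain <code>^surface/form/…$</code> pattern, and real Apertium stream handling (escaped characters, superblanks containing stream symbols) needs more care:

```python
import random
import re

def pick(lu, first=False):
    """lu is the inside of ^...$: surface/lf1/lf2/...
    Return one lexical form as a dot-separated string."""
    forms = lu.split("/")[1:]            # drop the surface form
    form = forms[0] if first else random.choice(forms)
    # "see<vblex><past>" -> "see.vblex.past"
    return form.replace("><", ".").replace("<", ".").replace(">", "")

def run(text, first=False):
    # Rewrite every ^...$ unit, leaving blanks and superblanks intact.
    return re.sub(r"\^([^$]*)\$", lambda m: pick(m.group(1), first), text)

# As a command-line filter this would read sys.stdin and check sys.argv
# for -r/-f; shown here on a small inline example instead:
print(run("^a/a<det><ind><sg>$ ^saw/saw<n><sg>/saw<vblex><inf>$", first=True))
# -> a.det.ind.sg saw.n.sg
```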
   
 
==Frequently asked questions==
* none yet, ''[[contact|ask us]] something!'' :)
 
==See also==
Latest revision as of 00:37, 6 April 2013