Task ideas for Google Code-in/Morphologically disambiguating text

From Apertium
Revision as of 12:16, 26 September 2016 by Rcrowther (talk | contribs)

This page describes how to morphologically disambiguate (tag) text so that it can be used as training input for the Apertium part-of-speech tagger.

Why do we want to do this? Because the default (unsupervised) way of training (see tagger training) is not very accurate. This is acceptable when translating between closely related languages, but it causes problems when translating between less related languages (e.g. English to anything).

Example of an error made by the English tagger on the sentence "Where do you come from?". The tagger picks the present-tense reading of "come", where the infinitive is the correct one:

^Where/Where<adv><itg>$ ^do/do<vbdo><pres>$ ^you/you<prn><subj><p2><mf><sp>$ ^come/come<vblex><pres>$ ^from/from<pr>$ ^?/?<sent>$
                                                                             |______________________|
                                                                                       ERROR

Input:

The input is the output of the morphological analyser (e.g. lt-proc). Each word is followed by all of its possible analyses, separated by slashes:

^Where/Where<adv><itg>/Where<rel><adv>$
^do/do<vbdo><pres>/do<vblex><inf>/do<vblex><pres>$
^you/you<prn><subj><p2><mf><sp>/you<prn><obj><p2><mf><sp>$
^come/come<vblex><inf>/come<vblex><pres>/come<vblex><pp>$
^from/from<pr>$
^?/?<sent>$ 
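
To make the format above concrete, here is a minimal Python sketch that splits analyser output into surface forms and their candidate analyses. The names `LEXICAL_UNIT` and `parse_units` are hypothetical and not part of Apertium, and the pattern is a simplification that assumes no escaped `/`, `^` or `$` characters inside a unit.

```python
import re

# One lexical unit looks like: ^surface/analysis1/analysis2$
# Simplifying assumption: no escaped "/", "^" or "$" inside a unit.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_units(text):
    """Return a list of (surface, [analyses]) pairs found in text."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface = match.group(1)
        # group(2) starts with "/", so drop the empty first piece.
        analyses = [a for a in match.group(2).split("/") if a]
        units.append((surface, analyses))
    return units


sample = "^do/do<vbdo><pres>/do<vblex><inf>/do<vblex><pres>$\n^from/from<pr>$"
for surface, analyses in parse_units(sample):
    # A word with more than one analysis still needs disambiguating.
    print(surface, len(analyses), analyses)
```

Counting the analyses per word in this way shows at a glance which words need attention: in the sample, "do" has three candidate analyses and "from" has only one.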

Output:

You then edit the analyser output, removing the impossible analyses for each word and ideally leaving just one valid analysis (although this may not always be possible).

^Where/Where<adv><itg>$ 
^do/do<vbdo><pres>$ 
^you/you<prn><subj><p2><mf><sp>$ 
^come/come<vblex><inf>$ 
^from/from<pr>$ 
^?/?<sent>$
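
Hand editing is error-prone, so it can help to check the result against the original analyser output. The following is a rough sketch only: `parse_units` and `check_disambiguation` are hypothetical helper names, not Apertium tools, and the parsing again assumes no escaped `/`, `^` or `$` characters.

```python
import re

# Simplified pattern for ^surface/analysis1/analysis2$ (no escapes handled).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_units(text):
    """Return a list of (surface, [analyses]) pairs found in text."""
    return [(m.group(1), [a for a in m.group(2).split("/") if a])
            for m in LEXICAL_UNIT.finditer(text)]


def check_disambiguation(analysed_text, tagged_text):
    """Compare analyser output with the hand-edited text; return problems."""
    problems = []
    analysed = parse_units(analysed_text)
    tagged = parse_units(tagged_text)
    if len(analysed) != len(tagged):
        problems.append("unit count differs: %d vs %d"
                        % (len(analysed), len(tagged)))
    pairs = zip(analysed, tagged)
    for index, ((surface, options), (kept_surface, kept)) in enumerate(pairs, 1):
        if surface != kept_surface:
            problems.append("unit %d: surface form changed from %r to %r"
                            % (index, surface, kept_surface))
        # Every analysis that was kept must have been offered by the analyser.
        for analysis in kept:
            if analysis not in options:
                problems.append("unit %d (%s): %s is not in the analyser output"
                                % (index, kept_surface, analysis))
        if len(kept) != 1:
            problems.append("unit %d (%s): %d analyses left"
                            % (index, kept_surface, len(kept)))
    return problems


analysed = "^come/come<vblex><inf>/come<vblex><pres>/come<vblex><pp>$ ^from/from<pr>$"
tagged = "^come/come<vblex><inf>$ ^from/from<pr>$"
for problem in check_disambiguation(analysed, tagged):
    print(problem)
```

With the example above nothing is printed, because every word keeps exactly one analysis that the analyser had offered. A word reported as still having several analyses is not necessarily a mistake, since a single valid analysis is not always possible.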