Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
Revision as of 15:12, 4 March 2012
Objectives
- Create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
  - Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair
  - English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
First steps
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN
Coding challenge
- Write a stream processor for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
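
As a starting point, here is a minimal sketch of such a processor in Python. It is not an official solution. It assumes the tagger output follows the usual Apertium stream format: lexical units sit between ^ and $, superblanks sit between [ and ], and a backslash escapes the character after it. The function name parse_stream, the token kinds and the sample input are invented for illustration.

```python
import io


def parse_stream(stream):
    """Read a text stream one character at a time and yield (kind, text) pairs.

    kind is "lu" (text between ^ and $, delimiters dropped), "superblank"
    (text between [ and ], brackets kept) or "blank" (anything else).
    Assumption: a backslash escapes the next character in every state.
    """
    state = "blank"
    buf = []
    while True:
        ch = stream.read(1)
        if ch == "":  # end of input
            break
        if ch == "\\":
            # Keep the escape sequence intact so ^, $, [ and ] are not
            # mistaken for delimiters.
            buf.append(ch + stream.read(1))
            continue
        if state == "blank":
            if ch == "[" or ch == "^":
                if buf:
                    yield ("blank", "".join(buf))
                    buf = []
                if ch == "[":
                    state = "superblank"
                    buf.append(ch)
                else:
                    state = "lu"
            else:
                buf.append(ch)
        elif state == "superblank":
            # Inside a superblank, ^ and $ are ordinary characters.
            buf.append(ch)
            if ch == "]":
                yield ("superblank", "".join(buf))
                buf = []
                state = "blank"
        else:  # state == "lu"
            if ch == "$":
                yield ("lu", "".join(buf))
                buf = []
                state = "blank"
            else:
                buf.append(ch)
    if buf:
        # Unterminated trailing material is reported under its current state.
        yield (state, "".join(buf))


# Invented sample input, not real tagger output.
sample = io.StringIO("[<b>]^the/the<det><def><sp>$ ^cat/cat<n><sg>$[</b>]")
for kind, text in parse_stream(sample):
    print(kind, repr(text))
```

To run it on real output, pass sys.stdin instead of the StringIO object. Keeping superblanks as separate tokens means a later reordering step can move lexical units around without losing or corrupting the formatting they carry.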