Ideas for Google Summer of Code/Discontiguous multiwords
Objectives
- Create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, and Uralic languages
  - Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair
  - English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
First steps
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN
Coding challenge
- Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
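As a starting point, the challenge could be approached along the following lines. This is only a minimal sketch, not a reference solution: it assumes the usual stream conventions (lexical units wrapped in ^...$, superblanks wrapped in [...], and a backslash escaping the next character), and the names parse_stream and tokens_from_text are hypothetical helpers made up for this illustration. It does not interpret the contents of a lexical unit; it only separates units, blanks, and superblanks while reading one character at a time.

```python
import io
from typing import Iterator, List, TextIO, Tuple

Token = Tuple[str, str]  # (kind, text); kind is "blank", "superblank" or "unit"


def parse_stream(stream: TextIO) -> Iterator[Token]:
    """Read the stream one character at a time and yield tokens.

    Assumed conventions (not taken from this page):
      ^...$  delimits a lexical unit
      [...]  delimits a superblank, whose contents are never interpreted
      \\x     escapes the character x in any state
    """
    state = "blank"
    buf: List[str] = []

    while True:
        ch = stream.read(1)
        if ch == "":                      # end of input
            break

        if ch == "\\":                    # keep the escape and the escaped char together
            buf.append(ch + stream.read(1))
            continue

        if state == "blank":
            if ch in "[^":
                if buf:                   # flush any ordinary blank text first
                    yield ("blank", "".join(buf))
                    buf = []
                if ch == "[":
                    state = "superblank"
                    buf.append(ch)        # keep the brackets as part of the superblank
                else:
                    state = "unit"        # the ^ and $ delimiters are dropped
            else:
                buf.append(ch)
        elif state == "superblank":
            buf.append(ch)
            if ch == "]":
                yield ("superblank", "".join(buf))
                buf = []
                state = "blank"
        else:  # state == "unit"
            if ch == "$":
                yield ("unit", "".join(buf))
                buf = []
                state = "blank"
            else:
                buf.append(ch)

    if buf:                               # unterminated trailing material, emitted as-is
        yield (state, "".join(buf))


def tokens_from_text(text: str) -> List[Token]:
    """Convenience wrapper for trying the parser on a string."""
    return list(parse_stream(io.StringIO(text)))
```

To run it on real tagger output, parse_stream(sys.stdin) could be used in place of the string wrapper. Keeping superblanks as opaque tokens means that any ^ or $ inside formatting is passed through untouched, which is the property the challenge asks for.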