Ideas for Google Summer of Code/Discontiguous multiwords
< Ideas for Google Summer of Code
Jump to navigation
Jump to search
Revision as of 12:15, 5 March 2012 by Francis Tyers (talk | contribs)
The task will be to develop, or adapt a module to deal with these kind of contiguous multiword expressions, for example, taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example 'liggja ekki fyrir' in Icelandic should be translated in English as 'to be not clear', but we cannot have 'liggja fyrir' as a traditional multiword because of the extra 'adverb', or it could even be a whole NP.
Objectives
- Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
- Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair.
- English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
First steps
<spectei> so, for that i would recommend you install the is-en language pair <spectei> and the multiword-reorder module <spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder <spectei> and start "playing" <spectei> some of the features work <spectei> but it is poorly documented <spectei> and isn't yet included in any language pair <spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en <spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN
Coding challenge
- Write a stream processor (see Apertium stream format) for the output of
apertium-tagger -p -g
that parses character by character, respecting superblanks.