Ideas for Google Summer of Code/Discontiguous multiwords
< Ideas for Google Summer of Code
Jump to navigation
Jump to search
Revision as of 12:19, 14 March 2013 by Francis Tyers (talk | contribs)
Contents |
The task will be to develop, or adapt a module to deal with these kind of contiguous multiword expressions, for example, taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example 'liggja ekki fyrir' in Icelandic should be translated in English as 'to be not clear', but we cannot have 'liggja fyrir' as a traditional multiword because of the extra 'adverb', or it could even be a whole NP.
Tasks
- Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
- Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair.
- English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
Coding challenge
- Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
- Write a stream processor (see Apertium stream format) for the output of
apertium-tagger -p -g
that parses character by character, respecting superblanks.