Ideas for Google Summer of Code/Discontiguous multiwords

From Apertium
Jump to navigation Jump to search

The task will be to develop, or adapt a module to deal with these kind of contiguous multiword expressions, for example, taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.

In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example 'liggja ekki fyrir' in Icelandic should be translated in English as 'to be not clear', but we cannot have 'liggja fyrir' as a traditional multiword because of the extra 'adverb', or it could even be a whole NP.

Tasks

  • Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
    • Separable/phrasal verbs
  • Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  • Include support for discontiguous multiwords in an existing language pair.
    • English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...

Coding challenge

  • Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
  • Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.

See also