Ideas for Google Summer of Code/Discontiguous multiwords

From Apertium

Objectives

  • Create a typology of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
    • For example, separable/phrasal verbs
  • Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  • Include support for discontiguous multiwords in an existing language pair
    • English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...

First steps

<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN

Coding challenge

  • Write a stream processor for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
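
As a starting point for the challenge, here is a minimal sketch in Python of a character-by-character tokeniser. It assumes the usual Apertium stream conventions: lexical units are delimited by ^ and $, superblanks by [ and ], and a backslash escapes the following character. The names tokenize and tokenize_string are hypothetical, and the sample strings used with it are made up for illustration rather than real apertium-tagger output.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# (kind, text); kind is "lu" (lexical unit), "superblank" or "blank".
Token = Tuple[str, str]


def tokenize(stream: TextIO) -> Iterator[Token]:
    """Yield tokens from an Apertium-style stream, one character at a time.

    Assumed conventions: ^...$ is a lexical unit, [...] is a superblank,
    and a backslash escapes the next character. Delimiters are kept in the
    token text, so joining all token texts reproduces the input.
    """
    state = "blank"  # kind of token currently being read
    buf: List[str] = []
    while True:
        ch = stream.read(1)
        if ch == "":  # end of input
            break
        if ch == "\\":
            # Escaped character: copy it verbatim, never treat it as a delimiter.
            buf.append(ch + stream.read(1))
            continue
        if state == "blank":
            if ch == "^" or ch == "[":
                if buf:
                    yield ("blank", "".join(buf))
                buf = [ch]
                state = "lu" if ch == "^" else "superblank"
            else:
                buf.append(ch)
        elif state == "lu":
            buf.append(ch)
            if ch == "$":
                yield ("lu", "".join(buf))
                buf, state = [], "blank"
        else:  # inside a superblank: ^ and $ are ordinary characters here
            buf.append(ch)
            if ch == "]":
                yield ("superblank", "".join(buf))
                buf, state = [], "blank"
    if buf:
        # Unterminated trailing material is emitted under its current kind.
        yield (state, "".join(buf))


def tokenize_string(text: str) -> List[Token]:
    """Convenience wrapper for experimenting with in-memory strings."""
    return list(tokenize(io.StringIO(text)))
```

To try it on real data, the generator could be fed from standard input, e.g. by piping the tagger output into a script that loops over tokenize(sys.stdin). Keeping the delimiters inside each token means a later reordering step can move lexical units around and still write the superblanks back out unchanged.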

See also