Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"

From Apertium
Line 4: Line 4:
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example 'liggja ekki fyrir' in Icelandic should be translated in English as 'to be not clear', but we cannot have 'liggja fyrir' as a traditional multiword because of the extra 'adverb', or it could even be a whole NP.


==Objectives==
==Tasks==


* Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
Line 11: Line 11:
* Include support for discontiguous multiwords in an existing language pair.
** English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...

==First steps==

<pre>
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN
</pre>


==Coding challenge==


* Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
* Write a stream processor (see [[Apertium stream format]]) for the output of <code>apertium-tagger -p -g</code> that parses character by character, respecting [[superblanks]].



Revision as of 12:19, 14 March 2013

The task will be to develop or adapt a module to deal with this kind of discontiguous multiword expression, for example by taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.

In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot treat 'liggja fyrir' as a traditional multiword because of the intervening adverb ('ekki'); the intervening material could even be a whole NP.

Tasks

  • Create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
    • Separable/phrasal verbs
  • Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  • Include support for discontiguous multiwords in an existing language pair.
    • English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
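
To make the reordering task concrete, here is a minimal sketch in Python of the 'liggja ekki fyrir' → 'liggja# fyrir ekki' step. It is only an illustration under stated assumptions: it works on a plain list of lemmas rather than on the Apertium stream, it uses a simple linear scan rather than the FST-based approach the task asks for, and the names `reorder`, `multiwords` and `max_gap` are hypothetical, not part of any existing Apertium module.

```python
def reorder(tokens, multiwords, max_gap=3):
    """Move the second part of a discontiguous multiword next to its head.

    tokens     -- list of lemmas, e.g. ['liggja', 'ekki', 'fyrir']
    multiwords -- set of (head, particle) pairs (hypothetical lexicon)
    max_gap    -- assumed limit on how many tokens may intervene
    """
    out = list(tokens)
    i = 0
    while i < len(out):
        for head, particle in multiwords:
            if out[i] != head:
                continue
            # Look for the particle with at least one intervening token;
            # the contiguous case is left to ordinary multiword handling.
            last = min(i + max_gap + 2, len(out))
            hits = [j for j in range(i + 2, last) if out[j] == particle]
            if hits:
                j = hits[0]
                gap = out[i + 1:j]
                # Mark the head with '#' and put the gap after the particle.
                out[i:j + 1] = [head + '#', particle] + gap
                break
        i += 1
    return out


lexicon = {('liggja', 'fyrir')}
print(' '.join(reorder(['liggja', 'ekki', 'fyrir'], lexicon)))
```

A real module would also need to carry tags and blanks along with each token, and would have to decide what may legitimately appear in the gap (an adverb, a whole NP, and so on); the fixed `max_gap` window here is just a stand-in for that decision.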

Coding challenge

  • Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
  • Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
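
As a starting point for the coding challenge, the following Python sketch reads a stream one character at a time and splits it into blanks and lexical units. It assumes the usual conventions of the Apertium stream format: lexical units are delimited by `^` and `$`, superblanks are enclosed in `[` and `]`, and a backslash escapes the following character. The function name `parse_stream` and the sample input are illustrative; the sample is hand-written and is not actual tagger output.

```python
def parse_stream(text):
    """Yield ('blank', s) or ('lu', s) items from an Apertium-style stream.

    Superblanks ([...]) are kept inside the surrounding blank, so any
    '^' or '$' that occurs within them is not treated as a delimiter.
    """
    buf = []
    state = 'blank'      # one of: 'blank', 'superblank', 'lu'
    escaped = False
    for ch in text:
        if escaped:                  # character after a backslash: literal
            buf.append(ch)
            escaped = False
        elif ch == '\\':
            buf.append(ch)
            escaped = True
        elif state == 'blank':
            if ch == '[':
                state = 'superblank'
                buf.append(ch)
            elif ch == '^':
                if buf:
                    yield ('blank', ''.join(buf))
                    buf = []
                state = 'lu'
            else:
                buf.append(ch)
        elif state == 'superblank':
            buf.append(ch)
            if ch == ']':
                state = 'blank'
        else:                        # state == 'lu'
            if ch == '$':
                yield ('lu', ''.join(buf))
                buf = []
                state = 'blank'
            else:
                buf.append(ch)
    if buf:
        yield ('blank', ''.join(buf))


sample = r'[<b>^x$]^liggja/liggja<vblex>$ ^ekki/ekki<adv>$[</b>]'
for kind, value in parse_stream(sample):
    print(kind, repr(value))
```

Because blanks are yielded alongside lexical units, a processor built on this can rearrange the units and still write every blank and superblank back out unchanged. To process real input, the text could be read from standard input (for example `sys.stdin.read()`) at the end of a pipeline that runs `apertium-tagger -p -g`.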

See also