Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
Revision as of 12:38, 14 March 2013
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot treat 'liggja fyrir' as a traditional multiword because of the intervening adverb. The intervening material could even be a whole NP.
The task is to develop, or adapt, a module to deal with this kind of discontiguous multiword expression, for example by taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
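The reordering described here can be sketched on plain tokens. This is only an illustration under stated assumptions: the lexicon MULTIWORDS, the MAX_GAP limit and the function name reorder are hypothetical and not part of Apertium, and a real module would match on lemmas and tags in the stream rather than on surface forms.

```python
# Hypothetical lexicon of (head, particle) pairs; not taken from any Apertium dictionary.
MULTIWORDS = {("liggja", "fyrir"), ("take", "out")}
MAX_GAP = 4  # assumed upper bound on the number of intervening tokens


def reorder(tokens, multiwords=MULTIWORDS, max_gap=MAX_GAP):
    """Move a separated particle next to its head and mark the head with '#'."""
    out = list(tokens)
    i = 0
    while i < len(out):
        head = out[i]
        # Start at i + 2: at least one token must intervene, since adjacent
        # pairs are already covered by traditional multiwords.
        for j in range(i + 2, min(i + 2 + max_gap, len(out))):
            if (head, out[j]) in multiwords:
                particle = out.pop(j)
                out[i] = head + "#"
                out.insert(i + 1, particle)
                i += 1  # skip over the particle that was just placed
                break
        i += 1
    return out


print(" ".join(reorder(["liggja", "ekki", "fyrir"])))  # liggja# fyrir ekki
```

A surface-form lookup like this cannot tell a real multiword from an accidental co-occurrence of the same words, which is why the test set in the coding challenge includes both kinds of sentence.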
Tasks
- Create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
- Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair.
- English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
Coding challenge
- Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
- Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses the stream character by character, respecting superblanks.
- From a corpus, extract a test set of different sentences containing discontiguous multiwords, for example:
To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back* right before vegetables are done .
Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of the contact .
He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it would >take years to clear away< .
It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable .
When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .
The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less than the technical and weight penalty required to *throw half of them away* mid-flight .
- When you extract sentences from the corpus, you should find that there are different kinds: some contain real discontiguous multiwords (marked with * *) and some do not (marked with > <).
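As a starting point for the stream processor in the coding challenge, here is a minimal sketch of a character-by-character parser. It assumes the usual Apertium stream conventions: lexical units are delimited by ^ and $, superblanks by [ and ], and a backslash escapes the following character. The function name parse_stream and the sample input are made up for illustration, and the exact contents of each unit depend on the language pair and tagger options.

```python
def parse_stream(text):
    """Yield ('blank', s) and ('lu', s) items from an Apertium-style stream."""
    buf = []
    in_unit = False        # True between ^ and $
    in_superblank = False  # True between [ and ]
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # An escaped character is always literal; keep the backslash too.
            buf.append(ch)
            buf.append(next(chars, ""))
        elif in_superblank:
            # Inside a superblank, ^ and $ have no special meaning.
            buf.append(ch)
            if ch == "]":
                in_superblank = False
        elif in_unit:
            if ch == "$":
                yield ("lu", "".join(buf))
                buf = []
                in_unit = False
            else:
                buf.append(ch)
        elif ch == "[":
            in_superblank = True
            buf.append(ch)
        elif ch == "^":
            if buf:
                yield ("blank", "".join(buf))
            buf = []
            in_unit = True
        else:
            buf.append(ch)
    if buf:
        # Trailing text (including an unterminated unit) is emitted as a blank.
        yield ("blank", "".join(buf))


sample = "[<b>]^take/take<vblex><inf>$ ^it/prpers<prn><obj>$[</b>] ^out/out<adv>$"
for kind, value in parse_stream(sample):
    print(kind, repr(value))
```

Keeping blanks and superblanks as separate items means a later reordering step can move lexical units while writing the formatting back out unchanged.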