Ideas for Google Summer of Code/Discontiguous multiwords
The task is to develop or adapt a module to deal with this kind of discontiguous multiword expression, for example by taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot have 'liggja fyrir' as a traditional multiword because of the intervening adverb 'ekki'. The intervening material could even be a whole NP.
Another example: in Norwegian, "bryta seg inn" means "break in", while "bryta saman" means "collapse". Both these can have an NP between the main verb and the other parts. So:
- Tjuvane braut seg inn i huset = The thieves broke into the house
- Natt til i går braut tjuvane seg inn i huset = The night before yesterday, the thieves broke into the house
- Brua braut saman = The bridge collapsed
- Natt til i går braut brua saman = The night before yesterday, the bridge collapsed
When the parts of the phrasal verb are adjacent, we can easily translate them as one unit, so we get the right translation (broke into vs collapsed). When they are separated by an NP, we have no general method for translating them as one unit, so we might end up outputting something bad like "Night to yesterday broke the bridge together".
- Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
- Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Multiwords would be specified in a handwritten dictionary given as input to the module
- This module would go between apertium-pretransfer and "lt-proc -b" (or between apertium-pretransfer and apertium-transfer for older pairs that don't use "lt-proc -b")
- It would reorder the words before bilingual dictionary lookup, so that e.g. "bryte<vblex> bru<n> saman<adv>" turns into "bryte# saman<vblex> bru<n>"
- Include support for discontiguous multiwords in an existing language pair.
- English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
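The reordering step described above can be illustrated with a minimal Python sketch. It is not the FST-based module the task asks for: it uses a plain set lookup, handles only two-part expressions (so not "bryta seg inn"), and works on simplified (lemma, tags) pairs rather than the real stream. The names `MULTIWORDS`, `MAX_GAP`, `reorder` and `render` are hypothetical and exist only for illustration.

```python
from typing import List, Set, Tuple

Unit = Tuple[str, str]  # (lemma, tag string), e.g. ("bryte", "<vblex>")

# Hypothetical hand-written dictionary: (head lemma, particle lemma) pairs.
MULTIWORDS: Set[Tuple[str, str]] = {("bryte", "saman"), ("liggja", "fyrir")}

# Assumed limit on how many units may stand between head and particle.
MAX_GAP = 4


def reorder(units: List[Unit],
            multiwords: Set[Tuple[str, str]] = MULTIWORDS,
            max_gap: int = MAX_GAP) -> List[Unit]:
    """Attach a separated particle to its head as 'head# particle'."""
    out = list(units)
    i = 0
    while i < len(out):
        lemma, tags = out[i]
        # Look ahead at most max_gap intervening units for a known particle.
        for j in range(i + 1, min(i + 2 + max_gap, len(out))):
            particle = out[j][0]
            if (lemma, particle) in multiwords:
                out[i] = (f"{lemma}# {particle}", tags)  # head keeps its tags
                del out[j]                               # particle is consumed
                break
        i += 1
    return out


def render(units: List[Unit]) -> str:
    """Print units in the simplified notation used on this page."""
    return " ".join(f"{lemma}{tags}" for lemma, tags in units)


if __name__ == "__main__":
    sentence = [("bryte", "<vblex>"), ("bru", "<n>"), ("saman", "<adv>")]
    print(render(reorder(sentence)))
```

A fixed look-ahead window is only one possible design. A real module would probably need constraints on what may intervene (for example only an NP or an adverb), which is where the typology from the first task comes in.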
- Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
- Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
- From a corpus, extract a test set of different sentences that contain discontiguous multiwords.
To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back* right before vegetables are done .
Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of the contact .
He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it would >take years to clear away< .
It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable .
When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .
The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less than the technical and weight penalty required to *throw half of them away* mid-flight .
- When you try to extract from the corpus, you should find that there are different kinds of sentences: some contain real discontiguous multiwords (marked with * on both sides) and some do not (marked with > and <).
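As a starting point for the stream processor in the coding challenge, here is a minimal Python sketch of a character-by-character parser. It assumes the usual Apertium stream conventions: lexical units are delimited by ^ and $, superblanks by [ and ], and a backslash escapes the following character. It also assumes each unit looks like surface/lemma<tags>, but does not split on the slash. The function name `parse_stream` is hypothetical.

```python
from typing import Iterator, List, Tuple


def parse_stream(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (preceding blank, lexical unit) pairs from an Apertium stream.

    The unit is returned without its ^ and $ delimiters. Trailing blank
    material, if any, is yielded last with an empty unit.
    """
    blank: List[str] = []
    unit: List[str] = []
    in_unit = False
    in_superblank = False
    escaped = False
    for ch in text:
        if escaped:
            # Character after a backslash is always literal.
            (unit if in_unit else blank).append(ch)
            escaped = False
        elif ch == "\\":
            (unit if in_unit else blank).append(ch)
            escaped = True
        elif in_superblank:
            # Inside [...] everything is blank, including ^ and $.
            blank.append(ch)
            if ch == "]":
                in_superblank = False
        elif in_unit:
            if ch == "$":
                yield "".join(blank), "".join(unit)
                blank, unit, in_unit = [], [], False
            else:
                unit.append(ch)
        elif ch == "[":
            blank.append(ch)
            in_superblank = True
        elif ch == "^":
            in_unit = True
        else:
            blank.append(ch)
    if blank:
        yield "".join(blank), ""


if __name__ == "__main__":
    sample = "[<b>]^Brua/bru<n>$ ^braut/bryte<vblex>$[ ^x$ ]^saman/saman<adv>$"
    for blank, unit in parse_stream(sample):
        print(repr(blank), repr(unit))
```

Keeping the blanks alongside each unit means the original stream can be written back out unchanged, which matters if the module later reorders units but must leave formatting in superblanks intact.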
Frequently asked questions
- none yet, ask us something! :)