Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
(21 intermediate revisions by 2 users not shown)

Earlier revision, as first created:

Here is a cheap hack for how to deal with analysing discontiguous multiword units when translating from Germanic languages.
For example,

vísa manninum frá landinu -> vísa# frá manninum landinu
'deport the man from the country'

vísa ekki frá -> vísa# frá ekki
'deport not'

The idea is to distinguish verbs which can be parts of discontiguous multiwords, and particles/adverbs which can also be. For example:
1) vísa/=vísa manninum frá/~frá landinu .
2) vísa/=vísa manninum undan/~undan landinu .
3) vísa/=vísa manninum upp/~upp landinu .
We will use constraint grammar rules to select the appropriate particle if a verb exists.

LIST VISAPART = ~frá ~upp ;
REMOVE ("=vísa") (NOT 1* VISAPART);
SELECT ("=vísa") (1* VISAPART);
etc.

We will then use a mode of pretransfer (I suggest -m) to join the two parts thus:

=vísa manninum ~frá landinu -> vísa# frá manninum landinu
The rule is: "If an LU starts with =, read and buffer until a ~ or a . is found."

The '.<sent>' will be considered a hard delimiter, so that if no particle is found in the sentence, the buffered part is output without the initial '='.

An initial ~ or = found without its matching other part will be stripped.
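The joining step described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the pretransfer implementation: tokens are plain whitespace-separated strings rather than real lexical units, the function name `join_marked_multiwords` is hypothetical, and a second `=` seen while a verb is already pending is simply stripped of its marker (the text above does not say what should happen in that case).

```python
def join_marked_multiwords(tokens):
    """Join '=verb ... ~particle' into 'verb# particle ...'.

    Assumption: tokens are plain strings; '.' stands in for '.<sent>'.
    """
    out = []
    verb = None      # pending verb with its '=' removed, or None
    buffered = []    # tokens read since the pending verb

    for tok in tokens:
        if verb is None:
            if tok.startswith("=") and len(tok) > 1:
                verb = tok[1:]                 # start buffering
            elif tok.startswith("~") and len(tok) > 1:
                out.append(tok[1:])            # particle without a verb: strip '~'
            else:
                out.append(tok)
        elif tok.startswith("~") and len(tok) > 1:
            # Particle found: emit the joined multiword, then the buffer.
            out.append(verb + "# " + tok[1:])
            out.extend(buffered)
            verb, buffered = None, []
        elif tok == ".":
            # Hard delimiter: no particle found, so drop the '=' marker.
            out.append(verb)
            out.extend(buffered)
            out.append(tok)
            verb, buffered = None, []
        elif tok.startswith("=") and len(tok) > 1:
            buffered.append(tok[1:])           # assumption: strip a nested '='
        else:
            buffered.append(tok)

    if verb is not None:                       # input ended while buffering
        out.append(verb)
        out.extend(buffered)
    return out


print(" ".join(join_marked_multiwords("=vísa manninum ~frá landinu .".split())))
```

With the example from the text, `=vísa manninum ~frá landinu .` comes out as `vísa# frá manninum landinu .`; a sentence with `=vísa` but no particle is passed through with the `=` removed.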
Benefits: Can be implemented now in a backwards compatible way.

Drawbacks: Might be too simple? Creates more dependencies on CG?

==See also==
* [[Módulo de procesamiento de expresiones separables]]
* [[Separable verbs]]
* [[Discontiguous multiwords]]

[[Category:Ideas for Google Summer of Code|Discontiguous multiwords]]
Latest revision as of 12:33, 4 March 2016
The task will be to develop or adapt a module to deal with these kinds of discontiguous multiword expressions, for example, taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot have 'liggja fyrir' as a traditional multiword because of the intervening adverb, which could even be a whole NP.
Another example: in Norwegian, "bryta seg inn" means "break in", while "bryta saman" means "collapse". Both of these can have an NP between the main verb and the other parts. So:
- Tjuvane braut seg inn i huset = The thieves broke into the house
- Natt til i går braut tjuvane seg inn i huset = The night before yesterday, the thieves broke into the house
- Brua braut saman = The bridge collapsed
- Natt til i går braut brua saman = The night before yesterday, the bridge collapsed
When the whole phrasal verb is together, we can easily translate it as one unit, so we get the right translation (broke into vs collapsed); but when its parts are separated by an NP, we don't have a general method to translate them as one unit, so we might end up outputting something bad like "Night to yesterday broke the bridge together".
Tasks
- Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
  - Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  - Multiwords would be specified in a handwritten dictionary given as input to the module
  - This module would go between apertium-pretransfer and "lt-proc -b" (or between apertium-pretransfer and apertium-transfer for older pairs that don't use "lt-proc -b")
  - It would reorder the words before bilingual dictionary lookup, so that e.g. "bryte<vblex> bru<n> saman<adv>" turns into "bryte# saman<vblex> bru<n>"
- Include support for discontiguous multiwords in an existing language pair.
  - English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
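To make the reordering step concrete, here is a rough Python sketch. It is not the proposed FST-based module: it uses a plain dictionary lookup instead of a transducer, it handles only one particle per verb (so "bryta seg inn" would need more work), and it treats lexical units as bare `lemma<tags>` strings without the `^...$` delimiters of the real stream. The names `reorder`, `split_lu`, `multiwords` and `max_gap` are all hypothetical.

```python
import re

# A simplified lexical unit: a lemma followed by zero or more <tag> groups.
LU_RE = re.compile(r"^([^<]+)((?:<[^>]+>)*)$")


def split_lu(lu):
    """Split 'bryte<vblex>' into ('bryte', '<vblex>')."""
    match = LU_RE.match(lu)
    if match is None:
        raise ValueError("not a simplified lexical unit: %r" % lu)
    return match.group(1), match.group(2)


def reorder(lus, multiwords, max_gap=3):
    """Move a particle next to its verb, giving 'verb# particle<tags>'.

    multiwords maps a verb lemma to the set of particle lemmas it can
    combine with (a stand-in for the handwritten dictionary).
    max_gap is the largest number of units allowed between verb and particle.
    """
    result = []
    used = set()  # indices of particles that were already moved
    for i, lu in enumerate(lus):
        if i in used:
            continue
        lemma, tags = split_lu(lu)
        particles = multiwords.get(lemma, set())
        joined = None
        for j in range(i + 1, min(len(lus), i + 2 + max_gap)):
            if j in used:
                continue
            particle_lemma, _ = split_lu(lus[j])
            if particle_lemma in particles:
                # Keep the verb's tags; the particle's own tags are dropped.
                joined = "%s# %s%s" % (lemma, particle_lemma, tags)
                used.add(j)
                break
        result.append(joined if joined is not None else lu)
    return result


print(reorder(["bryte<vblex>", "bru<n>", "saman<adv>"], {"bryte": {"saman"}}))
```

On the example from the task list this gives `['bryte# saman<vblex>', 'bru<n>']`. The `max_gap` limit is one possible way to avoid pairing a verb with an unrelated particle much later in the sentence; a real module would need a better-motivated constraint.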
Coding challenge
- Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
- Write a stream processor (see Apertium stream format) for the output of "apertium-tagger -p -g" that parses character by character, respecting superblanks.
- From a corpus, extract a test set of different sentences that contain discontiguous multiwords.
To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back* right before vegetables are done .
Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of the contact .
He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it would >take years to clear away< .
It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable .
When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .
The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less than the technical and weight penalty required to *throw half of them away* mid-flight .
- When you try to extract from the corpus, you should find that there are different kinds of sentences: some contain real discontiguous multiwords (marked with * above) and some do not (marked with > < above).
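As a starting point for the stream processor, the following Python sketch walks the input one character at a time and separates lexical units from blanks. It assumes the usual conventions of the Apertium stream format (lexical units between `^` and `$`, superblanks between `[` and `]`, backslash as the escape character); the function name `parse_stream` and the `(kind, text)` output shape are hypothetical choices, and the exact contents of each unit produced by "apertium-tagger -p -g" are not modelled here.

```python
def parse_stream(text):
    """Split a stream into ('lu', ...) and ('blank', ...) pieces.

    Superblanks ([...]) are kept inside blanks, so a '^' or '$' inside
    them is never mistaken for a lexical-unit delimiter.
    """
    units = []
    buf = []
    state = "blank"      # one of "blank", "superblank", "lu"
    escaped = False      # True right after a backslash

    for ch in text:
        if escaped:
            buf.append(ch)           # escaped character: always literal
            escaped = False
        elif ch == "\\":
            buf.append(ch)           # keep the backslash in the output
            escaped = True
        elif state == "blank" and ch == "[":
            state = "superblank"
            buf.append(ch)
        elif state == "superblank" and ch == "]":
            state = "blank"
            buf.append(ch)
        elif state == "blank" and ch == "^":
            if buf:
                units.append(("blank", "".join(buf)))
            buf = []
            state = "lu"
        elif state == "lu" and ch == "$":
            units.append(("lu", "".join(buf)))
            buf = []
            state = "blank"
        else:
            buf.append(ch)

    if buf:
        # Leftover text; an unterminated unit is still reported as 'lu'.
        units.append(("lu" if state == "lu" else "blank", "".join(buf)))
    return units


for kind, content in parse_stream("[<b>^x$</b>]^take<vblex>$ ^out<adv>$"):
    print(kind, repr(content))
```

Keeping the blanks in the output, rather than discarding them, means a later reordering step can write the stream back out with its formatting intact.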
Frequently asked questions
- none yet, ask us something! :)