Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"

From Apertium
 
(19 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
{{TOCD}}
Here is a cheap hack for how to deal with analysing
 
  +
The task will be to develop or adapt a module to deal with these kinds of discontiguous multiword expressions, for example, taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
discontiguous multiword units when translating from Germanic languages.
 
<pre>
 
   
  +
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot treat 'liggja fyrir' as a traditional multiword because an adverb (here 'ekki'), or even a whole NP, can come between its parts.
For example,
 
   
  +
Another example: in Norwegian, "bryta seg inn" means "break in", while "bryta saman" means "collapse". Both of these can have an NP between the main verb and the other parts. So:
vísa manninum frá landinu -> vísa# frá manninum landinu
 
  +
* Tjuvane braut seg inn i huset = The thieves broke into the house
'deport the man from the country'
 
  +
* Natt til i går braut tjuvane seg inn i huset = The night before yesterday, the thieves broke into the house
  +
* Brua braut saman = The bridge collapsed
  +
* Natt til i går braut brua saman = The night before yesterday, the bridge collapsed
   
  +
When the parts of the phrasal verb are adjacent, we can easily translate it as one unit, so we get the right translation (broke into vs collapsed); but when they are separated by an NP, we have no general method to translate them as one unit, so we might end up outputting something bad like "Night to yesterday broke the bridge together".
vísa ekki frá -> vísa# frá ekki
 
'deport not'
 
   
  +
==Tasks==
The idea is to distinguish verbs which can be parts of discontiguous
 
multiwords, and particles/adverbs which can also be. For example:
 
   
  +
* Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
1) vísa/=vísa manninum frá/~frá landinu .
 
  +
** Separable/phrasal verbs
  +
* Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  +
** Multiwords would be specified in a handwritten dictionary given as input to the module
  +
** This module would go between apertium-pretransfer and "lt-proc -b" (or between apertium-pretransfer and apertium-transfer for older pairs that don't use "lt-proc -b")
  +
** It would reorder the words before bilingual dictionary lookup, so that e.g. "bryte<vblex> bru<n> saman<adv>" turns into "bryte# saman<vblex> bru<n>"
  +
* Include support for discontiguous multiwords in an existing language pair.
  +
** English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
   
  +
==Coding challenge==
2) vísa/=vísa manninum undan/~undan landinu .
 
   
  +
* Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
3) vísa/=vísa manninum upp/~upp landinu .
 
  +
* Write a stream processor (see [[Apertium stream format]]) for the output of <code>apertium-tagger -p -g</code> that parses character by character, respecting [[superblanks]].
   
  +
* From a corpus, extract a test set of different sentences containing discontiguous multiwords.
We will use constraint grammar rules to select the appropriate particle
 
if a verb exists.
 
   
 
<pre>
LIST VISAPART = ~frá ~upp ;
 
  +
To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back*
  +
right before vegetables are done .
   
  +
Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of
REMOVE ("=vísa") (NOT 1* VISAPART);
 
  +
the contact .
SELECT ("=vísa") (1* VISAPART);
 
   
  +
He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it
etc.
 
  +
would >take years to clear away< .
   
  +
It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable . .
We will then use a mode of pretransfer (I suggest -m) to join the two
 
parts thus:
 
   
  +
When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .
=vísa manninum ~frá landinu -> vísa# frá manninum landinu
 
   
  +
The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less
'If LU starts with =, read buffering until ~ or ."
 
  +
than the technical and weight penalty required to *throw half of them away* mid-flight .
 
</pre>
   
  +
* When you try to extract from the corpus, you should find that there are different kinds of sentences: some contain real discontiguous multiwords (marked with <code>*</code>) and some do not (marked with <code>&gt; &lt;</code>).
The '.<sent>' will be considered a hard delimiter, so that if no
 
particle is found in the sentence, the buffered part is output without
 
the initial '='.
 
   
  +
==Frequently asked questions==
Initial ~ and = found without both parts will be stripped.
 
 
Benefits: Can be implemented now in a backwards compatible way.
 
Drawbacks: Might be too simple ? Creates more dependencies on CG ?
 
</pre>
 
   
  +
* none yet, ''[[contact|ask us]] something!'' :)
   
 
==See also==
 
Line 52: Line 58:
 
* [[Módulo de procesamiento de expresiones separables]]
 
 
* [[Separable verbs]]
 
  +
* [[Discontiguous multiwords]]
  +
  +
[[Category:Ideas for Google Summer of Code|Discontiguous multiwords]]

Latest revision as of 12:33, 4 March 2016

The task will be to develop or adapt a module to deal with these kinds of discontiguous multiword expressions, for example, taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.

In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot treat 'liggja fyrir' as a traditional multiword because an adverb (here 'ekki'), or even a whole NP, can come between its parts.

Another example: in Norwegian, "bryta seg inn" means "break in", while "bryta saman" means "collapse". Both of these can have an NP between the main verb and the other parts. So:

  • Tjuvane braut seg inn i huset = The thieves broke into the house
  • Natt til i går braut tjuvane seg inn i huset = The night before yesterday, the thieves broke into the house
  • Brua braut saman = The bridge collapsed
  • Natt til i går braut brua saman = The night before yesterday, the bridge collapsed

When the parts of the phrasal verb are adjacent, we can easily translate it as one unit, so we get the right translation (broke into vs collapsed); but when they are separated by an NP, we have no general method to translate them as one unit, so we might end up outputting something bad like "Night to yesterday broke the bridge together".

Tasks

  • Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
    • Separable/phrasal verbs
  • Create a new FST-based module for recognising and reordering discontiguous multiword expressions
    • Multiwords would be specified in a handwritten dictionary given as input to the module
    • This module would go between apertium-pretransfer and "lt-proc -b" (or between apertium-pretransfer and apertium-transfer for older pairs that don't use "lt-proc -b")
    • It would reorder the words before bilingual dictionary lookup, so that e.g. "bryte<vblex> bru<n> saman<adv>" turns into "bryte# saman<vblex> bru<n>"
  • Include support for discontiguous multiwords in an existing language pair.
    • English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
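
The reordering step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the proposed FST-based module: the `MULTIWORDS` set stands in for the handwritten dictionary, the function names and the `max_gap` window are hypothetical, and lexical units are given as plain strings such as `bryte<vblex>` rather than read from the Apertium stream.

```python
import re

# Hypothetical stand-in for the handwritten dictionary: (verb lemma,
# particle lemma) pairs that form a discontiguous multiword. A real
# module would compile such entries into an FST.
MULTIWORDS = {("bryte", "saman"), ("bryte", "inn")}

# A lexical unit is a lemma followed by zero or more <tag> groups.
LU_RE = re.compile(r"^([^<]+)((?:<[^>]+>)*)$")


def split_lu(lu):
    """Split 'bryte<vblex>' into ('bryte', '<vblex>')."""
    m = LU_RE.match(lu)
    if m is None:
        raise ValueError(f"not a lexical unit: {lu!r}")
    return m.group(1), m.group(2)


def reorder(units, max_gap=5):
    """Join a verb with a later particle if the pair is in MULTIWORDS.

    The particle is removed from its original position and attached to
    the verb as 'verb# particle<tags>'. Only the first matching particle
    within max_gap units after the verb is considered.
    """
    out = []
    used = set()  # indices of particles already attached to a verb
    for i, lu in enumerate(units):
        if i in used:
            continue
        lemma, tags = split_lu(lu)
        if tags.startswith("<vblex>"):
            for j in range(i + 1, min(len(units), i + 1 + max_gap)):
                if j in used:
                    continue
                p_lemma, _ = split_lu(units[j])
                if (lemma, p_lemma) in MULTIWORDS:
                    lu = f"{lemma}# {p_lemma}{tags}"
                    used.add(j)
                    break
        out.append(lu)
    return out


print(reorder(["bryte<vblex>", "bru<n>", "saman<adv>"]))
# prints ['bryte# saman<vblex>', 'bru<n>']
```

A fixed window is the crudest possible way to limit how far apart the two parts may be; deciding what may legitimately intervene (an adverb, a reflexive, a whole NP) is exactly what the typology task is meant to inform.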

Coding challenge

  • Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
  • Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
  • From a corpus, extract a test set of different sentences containing discontiguous multiwords.
 To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back* 
 right before vegetables are done . 

 Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of 
 the contact . 

 He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it 
 would >take years to clear away< . 

 It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable . . 

 When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .

 The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less 
 than the technical and weight penalty required to *throw half of them away* mid-flight .
  • When you try to extract from the corpus, you should find that there are different kinds of sentences: some contain real discontiguous multiwords (marked with *) and some do not (marked with > <).
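
As a starting point for the stream processor in the challenge, here is a minimal character-by-character sketch in Python. It assumes only the basic conventions of the Apertium stream format: lexical units are delimited by `^` and `$`, superblanks by `[` and `]`, and a backslash escapes the following character. The function name `parse_stream` and the `(kind, text)` chunk representation are hypothetical, and the sample input is made up rather than real tagger output.

```python
def parse_stream(text):
    """Split an Apertium stream into ('blank', s) and ('lu', s) chunks.

    Every lexical unit is preceded by a blank chunk (possibly empty), so
    the original stream can be rebuilt from the chunks. Superblanks are
    kept inside the blank chunk they occur in, brackets included.
    """
    chunks = []
    buf = []
    state = "blank"  # one of: 'blank', 'superblank', 'lu'
    i = 0
    while i < len(text):
        ch = text[i]
        # A backslash escapes the next character in every state.
        if ch == "\\" and i + 1 < len(text):
            buf.append(text[i:i + 2])
            i += 2
            continue
        if state == "blank":
            if ch == "[":
                state = "superblank"
                buf.append(ch)
            elif ch == "^":
                chunks.append(("blank", "".join(buf)))
                buf = []
                state = "lu"
            else:
                buf.append(ch)
        elif state == "superblank":
            # '^' and '$' inside a superblank are ordinary characters.
            buf.append(ch)
            if ch == "]":
                state = "blank"
        else:  # inside a lexical unit
            if ch == "$":
                chunks.append(("lu", "".join(buf)))
                buf = []
                state = "blank"
            else:
                buf.append(ch)
        i += 1
    if state == "lu":
        raise ValueError("unterminated lexical unit at end of input")
    if buf:
        chunks.append(("blank", "".join(buf)))
    return chunks


# Made-up sample input, not real tagger output.
for kind, chunk in parse_stream("[<b>]^take/take<vblex>$ ^out/out<adv>$[</b>]"):
    print(kind, repr(chunk))
```

Keeping blanks as their own chunks means a reordering step can move lexical units around while leaving formatting in place, which is the point of respecting superblanks.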

Frequently asked questions

  • none yet, ask us something! :)

See also