Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"

From Apertium
 
(18 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
{{TOCD}}
  +
The task will be to develop or adapt a module to deal with this kind of discontiguous multiword expression, for example taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.
   
  +
In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot treat 'liggja fyrir' as a traditional multiword because an adverb (here 'ekki'), or even a whole NP, can come between its parts.
==First steps==
 
   
  +
Another example: in Norwegian, "bryta seg inn" means "break in", while "bryta saman" means "collapse". Both of these can have an NP between the main verb and the other parts. So:
<pre>
 
  +
* Tjuvane braut seg inn i huset = The thieves broke into the house
<spectei> so, for that i would recommend you install the is-en language pair
 
  +
* Natt til i går braut tjuvane seg inn i huset = The night before yesterday, the thieves broke into the house
<spectei> and the multiword-reorder module
 
  +
* Brua braut saman = The bridge collapsed
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
 
  +
* Natt til i går braut brua saman = The night before yesterday, the bridge collapsed
<spectei> and start "playing"
 
<spectei> some of the features work
 
<spectei> but it is poorly documented
 
<spectei> and isn't yet included in any language pair
 
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
 
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN
 
</pre>
 
   
  +
When the whole phrasal verb is together, we can easily translate it as one unit and get the right translation (broke into vs. collapsed); but when its parts are separated by an NP, we have no general method for translating them as one unit, so we might end up outputting something bad like "Night to yesterday broke the bridge together".
==Old hack==
 
   
  +
==Tasks==
Here is a cheap hack for how to deal with analysing
 
discontiguous multiword units when translating from Germanic languages.
 
<pre>
 
   
  +
* Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
For example,
 
  +
** Separable/phrasal verbs
  +
* Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  +
** Multiwords would be specified in a handwritten dictionary given as input to the module
  +
** This module would go between apertium-pretransfer and "lt-proc -b" (or between apertium-pretransfer and apertium-transfer for older pairs that don't use "lt-proc -b")
  +
** It would reorder the words before bilingual dictionary lookup, so that e.g. "bryte<vblex> bru<n> saman<adv>" turns into "bryte# saman<vblex> bru<n>"
  +
* Include support for discontiguous multiwords in an existing language pair.
  +
** English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
   
  +
==Coding challenge==
vísa manninum frá landinu -> vísa# frá manninum landinu
 
'deport the man from the country'
 
   
  +
* Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
vísa ekki frá -> vísa# frá ekki
 
  +
* Write a stream processor (see [[Apertium stream format]]) for the output of <code>apertium-tagger -p -g</code> that parses character by character, respecting [[superblanks]].
'deport not'
 
   
The idea is to distinguish verbs which can be parts of discontiguous
+
* From a corpus, extract a test set of different sentences containing discontiguous multiwords.
multiwords, and particles/adverbs which can also be. For example:
 
   
 
<pre>
1) vísa/=vísa manninum frá/~frá landinu .
 
  +
To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back*
  +
right before vegetables are done .
   
  +
Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of
2) vísa/=vísa manninum undan/~undan landinu .
 
  +
the contact .
   
  +
He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it
3) vísa/=vísa manninum upp/~upp landinu .
 
  +
would <take years to clear away> .
   
  +
It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable .
We will use constraint grammar rules to select the appropriate particle
 
if a verb exists.
 
   
  +
When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .
LIST VISAPART = ~frá ~upp ;
 
   
  +
The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less
REMOVE ("=vísa") (NOT 1* VISAPART);
 
  +
than the technical and weight penalty required to *throw half of them away* mid-flight .
SELECT ("=vísa") (1* VISAPART);
 
 
</pre>
   
  +
* When you extract from the corpus, you should find different kinds of sentences: some are real discontiguous multiwords (marked with <code>*</code>) and some are not (marked with <code>&lt; &gt;</code>).
etc.
 
   
  +
==Frequently asked questions==
We will then use a mode of pretransfer (I suggest -m) to join the two
 
parts thus:
 
 
=vísa manninum ~frá landinu -> vísa# frá manninum landinu
 
 
"If LU starts with =, read buffering until ~ or ."
 
 
The '.<sent>' will be considered a hard delimiter, so that if no
 
particle is found in the sentence, the buffered part is output without
 
the initial '='.
 
 
Initial ~ and = found without both parts will be stripped.
 
 
Benefits: Can be implemented now in a backwards compatible way.
 
Drawbacks: Might be too simple? Creates more dependencies on CG?
 
</pre>
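The buffering rule described in the old hack can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it works on whitespace-separated tokens like the examples above rather than on the real Apertium stream, the function name <code>join_marked</code> is made up for this sketch, and the handling of a second <code>=</code> verb before any particle is a guess, since the description does not cover that case.

```python
def is_delimiter(token):
    """Treat a bare "." or a ".<sent>" unit as the hard sentence delimiter."""
    return token == "." or token.startswith(".<sent>")


def join_marked(tokens):
    """Join an '='-marked verb with the next '~'-marked particle.

    ["=vísa", "manninum", "~frá", "landinu"] becomes
    ["vísa# frá", "manninum", "landinu"].
    """
    out = []
    buffer = None  # tokens read since an unmatched '=' verb, verb first

    def flush():
        # No particle was found: output the buffer without the initial '='.
        out.append(buffer[0][1:])
        out.extend(buffer[1:])

    for tok in tokens:
        if buffer is not None and tok.startswith("~"):
            # Particle found: move it next to the verb, then the buffered words.
            out.append(f"{buffer[0][1:]}# {tok[1:]}")
            out.extend(buffer[1:])
            buffer = None
        elif buffer is not None and is_delimiter(tok):
            flush()
            out.append(tok)
            buffer = None
        elif tok.startswith("="):
            # Assumption: a new '=' verb abandons any earlier unmatched one.
            if buffer is not None:
                flush()
            buffer = [tok]
        elif buffer is not None:
            buffer.append(tok)
        elif tok.startswith("~"):
            out.append(tok[1:])  # stray particle marker is stripped
        else:
            out.append(tok)
    if buffer is not None:
        flush()
    return out


print(" ".join(join_marked("=vísa manninum ~frá landinu .".split())))
# vísa# frá manninum landinu .
```

A real implementation would live in pretransfer and operate on lexical units, so the token handling here is only meant to make the buffering logic concrete.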
 
   
  +
* none yet, ''[[contact|ask us]] something!'' :)
   
 
==See also==
 
==See also==
Line 69: Line 58:
 
* [[Módulo de procesamiento de expresiones separables]]
 
* [[Módulo de procesamiento de expresiones separables]]
 
* [[Separable verbs]]
 
* [[Separable verbs]]
  +
* [[Discontiguous multiwords]]
  +
  +
[[Category:Ideas for Google Summer of Code|Discontiguous multiwords]]

Latest revision as of 12:33, 4 March 2016

The task will be to develop or adapt a module to deal with this kind of discontiguous multiword expression, for example taking 'liggja ekki fyrir' and reordering it as 'liggja# fyrir ekki'.

In many languages, such as English, Norwegian and Icelandic, there are discontiguous multiwords, e.g. phrasal verbs, that we cannot easily support. For example, 'liggja ekki fyrir' in Icelandic should be translated into English as 'to be not clear', but we cannot treat 'liggja fyrir' as a traditional multiword because an adverb (here 'ekki'), or even a whole NP, can come between its parts.

Another example: in Norwegian, "bryta seg inn" means "break in", while "bryta saman" means "collapse". Both of these can have an NP between the main verb and the other parts. So:

  • Tjuvane braut seg inn i huset = The thieves broke into the house
  • Natt til i går braut tjuvane seg inn i huset = The night before yesterday, the thieves broke into the house
  • Brua braut saman = The bridge collapsed
  • Natt til i går braut brua saman = The night before yesterday, the bridge collapsed

When the whole phrasal verb is together, we can easily translate it as one unit and get the right translation (broke into vs. collapsed); but when its parts are separated by an NP, we have no general method for translating them as one unit, so we might end up outputting something bad like "Night to yesterday broke the bridge together".

Tasks

  • Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
    • Separable/phrasal verbs
  • Create a new FST-based module for recognising and reordering discontiguous multiword expressions
    • Multiwords would be specified in a handwritten dictionary given as input to the module
    • This module would go between apertium-pretransfer and "lt-proc -b" (or between apertium-pretransfer and apertium-transfer for older pairs that don't use "lt-proc -b")
    • It would reorder the words before bilingual dictionary lookup, so that e.g. "bryte<vblex> bru<n> saman<adv>" turns into "bryte# saman<vblex> bru<n>"
  • Include support for discontiguous multiwords in an existing language pair.
    • English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
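As a rough illustration of the reordering step, here is a minimal Python sketch. It is not the proposed FST-based module: it uses a plain set lookup in place of a compiled dictionary, works on simplified units like the "bryte<vblex>" example above rather than on the full stream format, and the names MULTIWORDS, reorder and max_gap are invented for this sketch.

```python
import re

# Hypothetical hand-written dictionary of (verb lemma, particle lemma) pairs.
MULTIWORDS = {("bryte", "saman")}

# A simplified unit: a lemma followed by zero or more <tag> groups.
UNIT = re.compile(r"^(?P<lemma>[^<]+)(?P<tags>(?:<[^>]+>)*)$")


def split_unit(unit):
    """Split 'bryte<vblex>' into ('bryte', '<vblex>')."""
    match = UNIT.match(unit)
    if match is None:
        raise ValueError(f"not a simplified lexical unit: {unit!r}")
    return match.group("lemma"), match.group("tags")


def reorder(units, multiwords=MULTIWORDS, max_gap=3):
    """Move a listed particle next to its verb, allowing up to max_gap units between."""
    out = list(units)
    i = 0
    while i < len(out):
        lemma, tags = split_unit(out[i])
        if tags.startswith("<vblex>"):
            # Look ahead a bounded distance for a particle listed with this verb.
            for j in range(i + 1, min(len(out), i + 2 + max_gap)):
                particle, _ = split_unit(out[j])
                if (lemma, particle) in multiwords:
                    del out[j]
                    out[i] = f"{lemma}# {particle}{tags}"
                    break
        i += 1
    return out


print(reorder(["bryte<vblex>", "bru<n>", "saman<adv>"]))
# ['bryte# saman<vblex>', 'bru<n>']
```

The bounded look-ahead is one possible way to avoid pairing a verb with an unrelated particle far away in the sentence; a real module would need a more principled restriction.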

Coding challenge

  • Install a language pair where one of the languages has discontiguous multiwords, e.g. English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans.
  • Write a stream processor (see Apertium stream format) for the output of apertium-tagger -p -g that parses character by character, respecting superblanks.
  • From a corpus, extract a test set of different sentences containing discontiguous multiwords.
 To keep the meat juicy , usually a cook would *take the seared meat out* before vegetables are added , and *put the meat back* 
 right before vegetables are done . 

 Contact wearers must usually *take their contact lenses out* every night or every few days , depending on the brand and style of 
 the contact . 

 He saved the building by pointing out that the vast amount of rubble from the demolished building would so clog the streets it 
 would <take years to clear away> . 

 It needed canals only to *take goods in and out* from seagoing ships , where such rivers were unavailable . 

 When she passed out , Rhun tried to *take her wedding ring off* to prove her unfaithfulness .

 The tanks make up such a small percentage of the total booster weight that the weight penalty of lifting them to orbit is less 
 than the technical and weight penalty required to *throw half of them away* mid-flight .
  • When you extract from the corpus, you should find different kinds of sentences: some are real discontiguous multiwords (marked with *) and some are not (marked with < >).
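For the stream processor in the second step, a possible starting point is a small character-by-character state machine. The sketch below is in Python and assumes the usual stream conventions: lexical units are wrapped in ^ and $, superblanks in [ and ], and a backslash escapes the next character. The function name parse_stream is invented here, and the sketch leaves the contents of each unit unparsed.

```python
def parse_stream(text):
    """Yield ("blank", s) and ("lu", s) chunks from an Apertium-style stream.

    Blanks include superblanks ([...]) verbatim; "lu" chunks hold the text
    between ^ and $ without the delimiters. Escaped characters are kept as-is.
    """
    buf = []
    state = "blank"  # one of "blank", "superblank", "lu"
    escaped = False
    for ch in text:
        if escaped:
            buf.append(ch)  # escaped character never acts as a delimiter
            escaped = False
        elif ch == "\\":
            buf.append(ch)
            escaped = True
        elif state == "blank" and ch == "[":
            buf.append(ch)
            state = "superblank"
        elif state == "superblank" and ch == "]":
            buf.append(ch)
            state = "blank"
        elif state == "blank" and ch == "^":
            if buf:
                yield ("blank", "".join(buf))
            buf = []
            state = "lu"
        elif state == "lu" and ch == "$":
            yield ("lu", "".join(buf))
            buf = []
            state = "blank"
        else:
            # Ordinary character, or ^ and $ inside a superblank.
            buf.append(ch)
    if state == "lu":
        raise ValueError("unterminated lexical unit at end of input")
    if buf:
        yield ("blank", "".join(buf))


stream = "[<b>]^take<vblex><inf>$ ^out<adv>$[</b> ^x$]"
for kind, chunk in parse_stream(stream):
    print(kind, repr(chunk))
```

Because blanks are passed through untouched, concatenating the chunks again (wrapping each "lu" chunk in ^ and $) should reproduce the input, which is a convenient sanity check before adding any reordering logic.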

Frequently asked questions

  • none yet, ask us something! :)

See also