Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"

From Apertium
   
 
==Coding challenge==
 
 
==Old hack==
 
 
Here is a cheap hack for analysing discontiguous multiword units when translating from Germanic languages.
 
<pre>
For example,

vísa manninum frá landinu -> vísa# frá manninum landinu
'deport the man from the country'

vísa ekki frá -> vísa# frá ekki
'deport not'

The idea is to mark verbs which can be part of a discontiguous
multiword (prefix =), and particles/adverbs which can also be
(prefix ~). For example:

1) vísa/=vísa manninum frá/~frá landinu .

2) vísa/=vísa manninum undan/~undan landinu .

3) vísa/=vísa manninum upp/~upp landinu .

We will use constraint grammar rules to select the appropriate particle
if a verb exists.

LIST VISAPART = ~frá ~upp ;

REMOVE ("=vísa") (NOT 1* VISAPART);
SELECT ("=vísa") (1* VISAPART);

etc.

We will then use a mode of pretransfer (I suggest -m) to join the two
parts thus:

=vísa manninum ~frá landinu -> vísa# frá manninum landinu

"If an LU starts with =, read and buffer until ~ or ."

The '.<sent>' will be considered a hard delimiter, so that if no
particle is found in the sentence, the buffered part is output without
the initial '='.

An initial ~ or = found without its counterpart will be stripped.

Benefits: Can be implemented now in a backwards-compatible way.
Drawbacks: Might be too simple? Creates more dependencies on CG?
</pre>
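The buffering rule described above can be sketched in Python. This is only an illustration of the idea, not the pretransfer implementation: it assumes plain whitespace-separated tokens rather than the Apertium stream format, the function name `join_discontiguous` is hypothetical, and flushing the buffer when a second `=` appears before any `~` is an assumption the description does not cover.

```python
def join_discontiguous(tokens, delimiter="."):
    """Join '=verb ... ~particle' into 'verb# particle ...' (illustrative sketch)."""
    out = []
    verb = None   # pending verb without its '=' mark, or None when not buffering
    held = []     # tokens buffered since the pending verb

    def flush():
        # No particle was found: emit the verb without '=', then the buffered tokens.
        nonlocal verb, held
        if verb is not None:
            out.append(verb)
            out.extend(held)
        verb, held = None, []

    for tok in tokens:
        if tok.startswith("=") and len(tok) > 1:
            flush()                  # assumption: a second '=' ends the first buffer
            verb = tok[1:]
        elif tok.startswith("~") and len(tok) > 1:
            if verb is not None:
                # Both parts found: move the particle next to the verb.
                out.append(verb + "#")
                out.append(tok[1:])
                out.extend(held)
                verb, held = None, []
            else:
                out.append(tok[1:])  # stray '~' with no pending verb is stripped
        elif tok == delimiter:
            flush()                  # the sentence end is a hard delimiter
            out.append(tok)
        elif verb is not None:
            held.append(tok)
        else:
            out.append(tok)
    flush()                          # end of input also ends any open buffer
    return out


if __name__ == "__main__":
    print(" ".join(join_discontiguous("=vísa manninum ~frá landinu .".split())))
```

Run on the example from the text, `=vísa manninum ~frá landinu .`, the sketch yields `vísa# frá manninum landinu .`. A `=` or `~` that never finds its counterpart is output with the mark removed.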
 
 
   
 
==See also==
 

Revision as of 15:09, 4 March 2012

Objectives

  • Create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
    • Separable/phrasal verbs
  • Create a new FST-based module for recognising and reordering discontiguous multiword expressions
  • Include support for discontiguous multiwords in an existing language pair.
    • English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...

First steps

<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN

Coding challenge

See also