Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
Line 23:
==Coding challenge==

− ==Old hack==
−
− Here is a cheap hack for how to deal with analysing
− discontiguous multiword units when translating from Germanic languages.
− <pre>
−
− For example,
−
− vísa manninum frá landinu -> vísa# frá manninum landinu
− 'deport the man from the country'
−
− vísa ekki frá -> vísa# frá ekki
− 'deport not'
−
− The idea is to distinguish verbs which can be parts of discontiguous
− multiwords, and particles/adverbs which can also be. For example:
−
− 1) vísa/=vísa manninum frá/~frá landinu .
−
− 2) vísa/=vísa manninum undan/~undan landinu .
−
− 3) vísa/=vísa manninum upp/~upp landinu .
−
− We will use constraint grammar rules to select the appropriate particle
− if a verb exists.
−
− LIST VISAPART = ~frá ~upp ;
−
− REMOVE ("=vísa") (NOT 1* VISAPART);
− SELECT ("=vísa") (1* VISAPART);
−
− etc.
−
− We will then use a mode of pretransfer (I suggest -m) to join the two
− parts thus:
−
− =vísa manninum ~frá landinu -> vísa# frá manninum landinu
−
− "If LU starts with =, read buffering until ~ or ."
−
− The '.<sent>' will be considered a hard delimiter, so that if no
− particle is found in the sentence, the buffered part is output without
− the initial '='.
−
− Initial ~ and = found without both parts will be stripped.
−
− Benefits: Can be implemented now in a backwards compatible way.
− Drawbacks: Might be too simple? Creates more dependencies on CG?
− </pre>
−
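The joining step described in the removed "Old hack" text can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the pretransfer implementation: lexical units are plain whitespace-separated strings rather than Apertium stream-format units, the sentence delimiter is the bare token ".", and the function name `join_discontiguous` is hypothetical. It also assumes that a second "=" verb seen before any particle simply replaces the first as the pending one.

```python
def join_discontiguous(tokens):
    """Join '=verb ... ~particle' into 'verb# particle ...' (illustrative sketch)."""
    out = []
    pending = None  # position in `out` of a verb still waiting for its particle
    for tok in tokens:
        if tok.startswith("="):
            # Candidate verb: strip the marker now, remember where it sits.
            pending = len(out)
            out.append(tok[1:])
        elif tok.startswith("~"):
            if pending is not None:
                # Particle found: mark the verb with '#' and move the
                # particle directly after it.
                out[pending] += "#"
                out.insert(pending + 1, tok[1:])
                pending = None
            else:
                # Particle without a verb: just strip the marker.
                out.append(tok[1:])
        else:
            if tok == ".":
                # Hard delimiter: an unmatched verb stays as already stripped.
                pending = None
            out.append(tok)
    return out


print(" ".join(join_discontiguous("=vísa manninum ~frá landinu .".split())))
# prints: vísa# frá manninum landinu .
```

Stripping the "=" marker as soon as the verb is seen means an unmatched verb needs no special clean-up at the sentence boundary, which mirrors the rule that markers found without both parts are stripped.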
==See also==
Revision as of 15:09, 4 March 2012
Objectives
- Create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
  - Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair
  - English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
First steps
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN