Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
==Coding challenge==

==Old hack==

Here is a cheap hack for analysing discontiguous multiword units when translating from Germanic languages.
<pre>
For example,

vísa manninum frá landinu -> vísa# frá manninum landinu
'deport the man from the country'

vísa ekki frá -> vísa# frá ekki
'deport not'

The idea is to distinguish verbs which can be parts of discontiguous
multiwords, and particles/adverbs which can also be. For example:

1) vísa/=vísa manninum frá/~frá landinu .
2) vísa/=vísa manninum undan/~undan landinu .
3) vísa/=vísa manninum upp/~upp landinu .

We will use constraint grammar rules to select the appropriate particle
if a verb exists.

LIST VISAPART = ~frá ~upp ;

REMOVE ("=vísa") (NOT 1* VISAPART);
SELECT ("=vísa") (1* VISAPART);

etc.

We will then use a mode of pretransfer (I suggest -m) to join the two
parts thus:

=vísa manninum ~frá landinu -> vísa# frá manninum landinu

'If LU starts with =, read buffering until ~ or .'

The '.<sent>' will be considered a hard delimiter, so that if no
particle is found in the sentence, the buffered part is output without
the initial '='.

Initial ~ and = found without both parts will be stripped.

Benefits: Can be implemented now in a backwards-compatible way.
Drawbacks: Might be too simple? Creates more dependencies on CG?
</pre>
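The joining step described above can be sketched in Python. This is only an illustration under several assumptions, not the proposed pretransfer mode itself: lexical units are modelled as plain whitespace-separated tokens rather than the Apertium stream format, the hard delimiter is simplified to a bare <code>.</code> token, and the function name <code>join_discontiguous</code> is hypothetical. The behaviour when a second <code>=</code> verb appears before any particle is not specified above; the sketch assumes the first verb is simply flushed.

```python
def join_discontiguous(tokens, delimiter="."):
    """Join '=verb ... ~particle' into 'verb# particle ...'.

    Hypothetical sketch: tokens are plain strings, not real Apertium
    lexical units, and '.' stands in for the '.<sent>' hard delimiter.
    """
    output = []
    verb = None    # pending '=' verb with the marker already stripped
    buffered = []  # tokens read since the pending verb

    def flush():
        # No particle was found: emit the verb without '=' and the
        # buffered tokens in their original order.
        nonlocal verb
        if verb is not None:
            output.append(verb)
            output.extend(buffered)
            buffered.clear()
            verb = None

    for tok in tokens:
        if tok.startswith("=") and len(tok) > 1:
            flush()  # assumption: a new verb abandons the previous one
            verb = tok[1:]
        elif tok.startswith("~") and len(tok) > 1:
            if verb is not None:
                # Both parts found: move the particle next to the verb.
                output.append(verb + "#")
                output.append(tok[1:])
                output.extend(buffered)
                buffered.clear()
                verb = None
            else:
                output.append(tok[1:])  # stray particle: strip the marker
        elif tok == delimiter:
            flush()  # hard delimiter: stop buffering
            output.append(tok)
        elif verb is not None:
            buffered.append(tok)
        else:
            output.append(tok)

    flush()  # end of input behaves like a delimiter
    return output
```

For instance, <code>" ".join(join_discontiguous("=vísa manninum ~frá landinu".split()))</code> gives <code>vísa# frá manninum landinu</code>, matching the example in the block above.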
==See also==
Revision as of 15:09, 4 March 2012
Objectives
- Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic and Uralic languages
  - Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair.
  - English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
First steps
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN