Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
Revision as of 12:14, 26 March 2011
First steps
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN
Old hack
Here is a cheap hack for how to deal with analysing discontiguous multiword units when translating from Germanic languages.
For example:

 vísa manninum frá landinu -> vísa# frá manninum landinu 'deport the man from the country'
 vísa ekki frá -> vísa# frá ekki 'deport not'

The idea is to mark verbs that can be part of a discontiguous multiword (prefix =) and particles/adverbs that can be part of one (prefix ~). For example:

 1) vísa/=vísa manninum frá/~frá landinu .
 2) vísa/=vísa manninum undan/~undan landinu .
 3) vísa/=vísa manninum upp/~upp landinu .

We will use constraint grammar rules to select the appropriate particle if a verb exists:

 LIST VISAPART = ~frá ~upp ;
 REMOVE ("=vísa") (NOT 1* VISAPART);
 SELECT ("=vísa") (1* VISAPART);
 etc.

We will then use a mode of pretransfer (I suggest -m) to join the two parts thus:

 =vísa manninum ~frá landinu -> vísa# frá manninum landinu

The rule is: "If a lexical unit (LU) starts with =, keep reading and buffering until ~ or . is reached." The '.<sent>' will be considered a hard delimiter, so that if no particle is found in the sentence, the buffered part is output without the initial '='. An initial ~ or = that is found without its other part will be stripped.

Benefits: Can be implemented now in a backwards-compatible way.

Drawbacks: Might be too simple? Creates more dependencies on CG?
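The joining step described above can be sketched roughly as follows. This is only an illustration of the buffering rule, not the actual pretransfer code: it works on a plain list of strings rather than the Apertium stream format, and the function name join_discontiguous and its parameters are hypothetical. One case the description leaves open is a second =-marked verb arriving before any particle; the sketch assumes the first verb is then flushed without its marker.

```python
def join_discontiguous(units, verb_mark="=", part_mark="~", delimiter="."):
    """Join '=verb ... ~particle' into 'verb# particle ...' (simplified sketch).

    `units` is a list of plain strings standing in for lexical units;
    the real Apertium stream format (with ^...$ and tags) is not modelled.
    """
    out = []
    verb = None   # pending verb with its marker already stripped
    held = []     # units buffered since the pending verb

    def flush():
        # No particle was found: emit the verb unmarked, then the buffer.
        nonlocal verb, held
        if verb is not None:
            out.append(verb)
            out.extend(held)
        verb, held = None, []

    for unit in units:
        if unit.startswith(verb_mark):
            flush()  # assumption: a new marked verb ends the previous search
            verb = unit[len(verb_mark):]
        elif unit.startswith(part_mark):
            particle = unit[len(part_mark):]
            if verb is None:
                out.append(particle)  # stray particle: just strip the marker
            else:
                out.append(f"{verb}# {particle}")
                out.extend(held)
                verb, held = None, []
        elif unit == delimiter:
            flush()  # hard delimiter: stop looking for a particle
            out.append(unit)
        elif verb is not None:
            held.append(unit)
        else:
            out.append(unit)

    flush()  # end of input also ends the search
    return out


print(" ".join(join_discontiguous("=vísa manninum ~frá landinu .".split())))
# prints: vísa# frá manninum landinu .
```

A sentence with no particle, such as "=vísa manninum .", comes out with the marker removed and the word order unchanged.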