Difference between revisions of "Talk:Ideas for Google Summer of Code/Discontiguous multiwords"
Revision as of 13:58, 18 February 2015
How to match the input
The matching could be agnostic as to whether it uses dependencies: if a language pair has a good dependency analysis, it can use that, and if not, it can fall back to tag-specified rules. So "bryte [<n>|<det>|<adj>]* saman" is one way of matching, and "bryte[id=$id] .* saman[parent=$id]" is another.
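To make the two matching styles concrete, here is a minimal Python sketch. It is only an illustration: the `Token` type, the function names and the token sequences are invented for this example and are not part of any Apertium tool. It assumes each token carries a lemma, a single tag, an id and an optional parent id.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    # Hypothetical token representation, invented for this sketch.
    lemma: str
    tag: str
    id: int
    parent: Optional[int]


def match_by_tags(tokens: List[Token], first: str, second: str,
                  allowed=("n", "det", "adj")) -> Optional[Tuple[int, int]]:
    """Tag-based match, like 'bryte [<n>|<det>|<adj>]* saman'.

    Returns the indices of the two parts, or None if nothing matches.
    """
    for i, tok in enumerate(tokens):
        if tok.lemma != first:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j].lemma == second:
                return (i, j)
            if tokens[j].tag not in allowed:
                break  # a disallowed tag intervenes, give up on this start
    return None


def match_by_dependency(tokens: List[Token], first: str,
                        second: str) -> Optional[Tuple[int, int]]:
    """Dependency-based match, like 'bryte[id=$id] .* saman[parent=$id]'."""
    for i, head in enumerate(tokens):
        if head.lemma != first:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j].lemma == second and tokens[j].parent == head.id:
                return (i, j)
    return None


# Invented example input: only determiners, adjectives and nouns intervene.
np_between = [
    Token("bryte", "vblex", 1, None),
    Token("den", "det", 2, 4),
    Token("gammal", "adj", 3, 4),
    Token("mur", "n", 4, 1),
    Token("saman", "adv", 5, 1),
]

# Invented example input: an adverb intervenes, which the tag rule disallows.
adv_between = [
    Token("bryte", "vblex", 1, None),
    Token("ikkje", "adv", 2, 1),
    Token("saman", "adv", 3, 1),
]
```

With these inputs the tag-based rule matches only the first sequence, while the dependency-based rule matches both, which is the trade-off described above: the tag rule needs no parse but has to enumerate what may intervene.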
Old hack
Here is a cheap hack for how to deal with analysing discontiguous multiword units when translating from Germanic languages.
For example:

vísa manninum frá landinu -> vísa# frá manninum landinu 'deport the man from the country'
vísa ekki frá -> vísa# frá ekki 'deport not'

The idea is to mark verbs that can be part of a discontiguous multiword, and the particles/adverbs that can be part of one as well. For example:

1) vísa/=vísa manninum frá/~frá landinu .
2) vísa/=vísa manninum undan/~undan landinu .
3) vísa/=vísa manninum upp/~upp landinu .

We will use constraint grammar rules to select the appropriate particle if a verb exists:

LIST VISAPART = ~frá ~upp ;
REMOVE ("=vísa") (NOT 1* VISAPART);
SELECT ("=vísa") (1* VISAPART);
etc.

We will then use a mode of pretransfer (I suggest -m) to join the two parts thus:

=vísa manninum ~frá landinu -> vísa# frá manninum landinu

The rule is: "If an LU starts with =, read and buffer until ~ or '.'". The '.<sent>' will be considered a hard delimiter, so that if no particle is found in the sentence, the buffered part is output without the initial '='. An initial ~ or = found without both parts will be stripped.

Benefits: Can be implemented now in a backwards-compatible way.
Drawbacks: Might be too simple? Creates more dependencies on CG?
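The joining step can be sketched in Python as follows. This is not the pretransfer implementation. It works on a plain list of token strings instead of the Apertium stream format, and the function name `join_discontiguous` is invented for this example. One case is not covered by the description above and is handled by assumption: if a second '='-marked verb appears while buffering, the first buffer is flushed with its marker stripped and buffering restarts.

```python
from typing import List, Optional


def join_discontiguous(units: List[str], sentence_end: str = ".") -> List[str]:
    """Join '=verb ... ~particle' into 'verb# particle ...'.

    Markers that never find their partner are stripped.
    """
    out: List[str] = []
    buffer: Optional[List[str]] = None  # [verb, intervening units...] while buffering

    for lu in units:
        if buffer is None:
            if lu.startswith("="):
                buffer = [lu[1:]]        # start buffering at a marked verb
            elif lu.startswith("~"):
                out.append(lu[1:])       # particle with no verb: strip marker
            else:
                out.append(lu)
        elif lu.startswith("~"):
            # Particle found: emit the joined multiword, then the buffered units.
            out.append(buffer[0] + "# " + lu[1:])
            out.extend(buffer[1:])
            buffer = None
        elif lu == sentence_end:
            # Hard delimiter: no particle in this sentence, flush unjoined.
            out.extend(buffer)
            out.append(lu)
            buffer = None
        elif lu.startswith("="):
            # Assumption: a new marked verb flushes the old buffer.
            out.extend(buffer)
            buffer = [lu[1:]]
        else:
            buffer.append(lu)

    if buffer is not None:
        out.extend(buffer)               # end of input while still buffering
    return out


print(" ".join(join_discontiguous(["=vísa", "manninum", "~frá", "landinu", "."])))
```

Buffering only up to the sentence delimiter keeps the memory use bounded and means a missing particle never holds back output past the end of the sentence.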
Grep
grep -e ' blow .* out ' -e ' bring .* down ' -e ' bring .* together ' -e ' buoy .* up ' -e ' check .* out ' -e ' churn .* up ' -e ' coil .* up ' -e ' depend .* on ' -e ' depend .* upon ' -e ' dig .* up ' -e ' dress .* up ' -e ' fill .* in ' -e ' fill .* up ' -e ' fire .* off ' -e ' foul .* up ' -e ' get .* across ' -e ' give .* back ' -e ' give .* off ' -e ' give .* up ' -e ' hike .* up ' -e ' hold .* responsible ' -e ' knock .* down ' -e ' line .* up ' -e ' make .* angry ' -e ' make .* compatible ' -e ' make .* impossible ' -e ' make .* possible ' -e ' make .* up ' -e ' move .* away ' -e ' note .* down ' -e ' patch .* up ' -e ' pay .* out ' -e ' pick .* up ' -e ' piss .* off ' -e ' pull .* down ' -e ' pull .* out ' -e ' put .* aside ' -e ' put .* off ' -e ' roll .* up ' -e ' send .* back ' -e ' serve .* up ' -e ' set .* off ' -e ' set .* up ' -e ' shake .* up ' -e ' shut .* down ' -e ' slow .* down ' -e ' stir .* up ' -e ' take .* away ' -e ' take .* off ' -e ' take .* out ' -e ' throw .* out ' -e ' trace .* back ' -e ' turn .* into ' -e ' turn .* off ' -e ' wall .* in ' -e ' wall .* off ' -e ' wear .* away ' -e ' wear .* out ' -e ' wipe .* away ' -e ' wipe .* off ' -e ' wipe .* out ' -e ' wipe .* up ' -e ' wire .* up '