Talk:Ideas for Google Summer of Code/Discontiguous multiwords

From Apertium

How to match the input

Matching could be kept general as to whether it uses dependencies: if a pair has a good dependency analysis it can use that, and if not it can fall back on rules specified with tags. So "bryte [<n>|<det>|<adj>]* saman" is one way of matching, and "bryte[id=$id] .* saman[parent=$id]" is another.
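
The two matching styles can be sketched roughly as follows. This is only an illustration: the Token class, the function names and the example sentence are made up here, and the tag and dependency values are assumed to come from whatever analysis the pair already has.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    lemma: str
    tag: str                      # first tag only, e.g. "n", "det", "adj"
    id: int                       # position-based id (hypothetical)
    parent: Optional[int] = None  # id of the dependency head, if analysed


def match_by_tags(tokens, head, particle, allowed=("n", "det", "adj")):
    """Tag-based match: head [<n>|<det>|<adj>]* particle."""
    for i, tok in enumerate(tokens):
        if tok.lemma != head:
            continue
        for j in range(i + 1, len(tokens)):
            nxt = tokens[j]
            if nxt.lemma == particle:
                return (i, j)
            if nxt.tag not in allowed:
                break  # something other than n/det/adj intervenes
    return None


def match_by_deps(tokens, head, particle):
    """Dependency-based match: head[id=$id] .* particle[parent=$id]."""
    for i, tok in enumerate(tokens):
        if tok.lemma != head:
            continue
        for j in range(i + 1, len(tokens)):
            nxt = tokens[j]
            if nxt.lemma == particle and nxt.parent == tok.id:
                return (i, j)
    return None


# Made-up analysis of a sentence containing "bryte ... saman".
sentence = [
    Token("bryte", "vblex", 1),
    Token("den", "det", 2, parent=4),
    Token("gammal", "adj", 3, parent=4),
    Token("bru", "n", 4, parent=1),
    Token("saman", "adv", 5, parent=1),
]
```

Both functions return the index pair of the verb and the particle, or None. The tag version gives up as soon as an unlisted tag intervenes, while the dependency version ignores what lies in between and relies on the parent link alone.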

Old hack

Here is a cheap hack for how to deal with analysing discontiguous multiword units when translating from Germanic languages.


For example,

vísa manninum frá landinu -> vísa# frá manninum landinu
                             'deport   the man  from the country'

vísa ekki frá             -> vísa# frá ekki
                             'deport   not'

The idea is to mark the verbs which can be part of a discontiguous
multiword (with =), and the particles/adverbs which can complete one
(with ~). For example:

1) vísa/=vísa manninum frá/~frá landinu .

2) vísa/=vísa manninum undan/~undan landinu .

3) vísa/=vísa manninum upp/~upp landinu .

We will use constraint grammar rules to select the appropriate particle
if a verb exists.

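# Particles that can combine with vísa (quoted, since they are lemmas)
LIST VISAPART = "~frá" "~upp" ;

# Drop the multiword reading of the verb if no such particle follows
REMOVE ("=vísa") (NOT 1* VISAPART);
# Otherwise keep only the multiword reading
SELECT ("=vísa") (1* VISAPART);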

etc.

We will then use a mode of pretransfer (I suggest -m) to join the two
parts thus:

=vísa manninum ~frá landinu -> vísa# frá manninum landinu

'If an LU starts with =, keep buffering until ~ or .'

The '.<sent>' will be considered a hard delimiter, so that if no
particle is found in the sentence, the buffered part is output without
the initial '='.

An initial ~ or = whose other part is missing will be stripped.
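
A rough sketch of the joining step, to make the buffering concrete. It works on plain whitespace-separated tokens rather than the real Apertium stream format, the function name join_discontiguous is made up, and the handling of a second = before any ~ (the first verb is given up and output as-is) is an assumption not covered by the description above.

```python
def join_discontiguous(tokens, delimiter="."):
    """Join '=verb ... ~particle' into 'verb# particle ...'."""
    out = []
    verb = None    # pending verb with '=' already stripped, or None
    buffered = []  # tokens read since the pending verb

    def flush():
        # No particle was found: output the verb without its '='.
        nonlocal verb, buffered
        if verb is not None:
            out.append(verb)
            out.extend(buffered)
        verb, buffered = None, []

    for tok in tokens:
        if tok.startswith("="):
            flush()  # assumption: a new verb abandons the previous one
            verb = tok[1:]
        elif tok.startswith("~"):
            if verb is not None:
                # Both parts found: move the particle next to the verb.
                out.append(f"{verb}# {tok[1:]}")
                out.extend(buffered)
                verb, buffered = None, []
            else:
                out.append(tok[1:])  # stray '~' is stripped
        elif tok == delimiter:
            flush()  # hard delimiter: stop looking for a particle
            out.append(tok)
        elif verb is not None:
            buffered.append(tok)
        else:
            out.append(tok)
    flush()  # input ended without a delimiter
    return out


joined = join_discontiguous("=vísa manninum ~frá landinu .".split())
# joined == ["vísa# frá", "manninum", "landinu", "."]
```
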

Benefits: Can be implemented now in a backwards compatible way.
Drawbacks: Might be too simple? Creates more dependencies on CG?

Grep


grep -e ' blow .* out ' -e ' bring .* down ' -e ' bring .* together ' -e ' buoy .* up ' -e ' check .* out ' -e ' churn .* up ' -e ' coil .* up ' -e ' depend .* on ' -e ' depend .* upon ' -e ' dig .* up ' -e ' dress .* up ' -e ' fill .* in ' -e ' fill .* up ' -e ' fire .* off ' -e ' foul .* up ' -e ' get .* across ' -e ' give .* back ' -e ' give .* off ' -e ' give .* up ' -e ' hike .* up ' -e ' hold .* responsible ' -e ' knock .* down ' -e ' line .* up ' -e ' make .* angry ' -e ' make .* compatible ' -e ' make .* impossible ' -e ' make .* possible ' -e ' make .* up ' -e ' move .* away ' -e ' note .* down ' -e ' patch .* up ' -e ' pay .* out ' -e ' pick .* up ' -e ' piss .* off ' -e ' pull .* down ' -e ' pull .* out ' -e ' put .* aside ' -e ' put .* off ' -e ' roll .* up ' -e ' send .* back ' -e ' serve .* up ' -e ' set .* off ' -e ' set .* up ' -e ' shake .* up ' -e ' shut .* down ' -e ' slow .* down ' -e ' stir .* up ' -e ' take .* away ' -e ' take .* off ' -e ' take .* out ' -e ' throw .* out ' -e ' trace .* back ' -e ' turn .* into ' -e ' turn .* off ' -e ' wall .* in ' -e ' wall .* off ' -e ' wear .* away ' -e ' wear .* out ' -e ' wipe .* away ' -e ' wipe .* off ' -e ' wipe .* out ' -e ' wipe .* up ' -e ' wire .* up '