Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"

From Apertium

Revision as of 15:52, 13 February 2010

Here is a cheap hack for analysing discontiguous multiword units when translating from Germanic languages.


For example,

vísa manninum frá landinu -> vísa# frá manninum landinu
                             'deport   the man  from the country'

vísa ekki frá             -> vísa# frá ekki
                             'deport   not'

The idea is to mark verbs which can be part of a discontiguous
multiword with an initial '=', and particles/adverbs which can be part
of one with an initial '~'. For example:

1) vísa/=vísa manninum frá/~frá landinu .

2) vísa/=vísa manninum undan/~undan landinu .

3) vísa/=vísa manninum upp/~upp landinu .

We will use constraint grammar (CG) rules to keep the multiword reading
of a verb only if an appropriate particle follows it.

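LIST VISAPART = "~frá" "~upp" ;

# Drop the multiword reading of the verb if no matching particle follows.
REMOVE ("=vísa") (NOT 1* VISAPART) ;
# Otherwise keep only the multiword reading.
SELECT ("=vísa") (1* VISAPART) ;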

etc.
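
The effect of the two rules above can be imitated in a few lines of Python. This is only a rough sketch and not how CG itself works internally: it assumes each word is a list of lemma readings (a plain one and a marked one), uses surface forms in place of real lemmas, and the function name `disambiguate` is made up for illustration.

```python
# Hypothetical simulation of the REMOVE/SELECT pair for "=vísa".
VISAPART = {"~frá", "~upp"}

def disambiguate(cohorts):
    """cohorts: list of lists of readings, one list per word."""
    result = []
    for i, readings in enumerate(cohorts):
        # Like CG, never touch a word that has only one reading left.
        if "=vísa" in readings and len(readings) > 1:
            # 1* VISAPART: some word to the right has a reading in the list.
            particle_follows = any(
                VISAPART & set(later) for later in cohorts[i + 1:]
            )
            if particle_follows:
                # SELECT: keep only the multiword reading.
                readings = [r for r in readings if r == "=vísa"]
            else:
                # REMOVE: drop the multiword reading.
                readings = [r for r in readings if r != "=vísa"]
        result.append(list(readings))
    return result

ex1 = [["vísa", "=vísa"], ["manninum"], ["frá", "~frá"], ["landinu"]]
ex2 = [["vísa", "=vísa"], ["manninum"], ["undan", "~undan"], ["landinu"]]
print(disambiguate(ex1)[0])
print(disambiguate(ex2)[0])
```

With example 1 the verb keeps only its '=vísa' reading, because '~frá' is in VISAPART; with example 2 it loses it, because '~undan' is not.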

We will then use a mode of pretransfer (I suggest -m) to join the two
parts thus:

=vísa manninum ~frá landinu -> vísa# frá manninum landinu

"If an LU starts with =, keep reading and buffering until ~ or . is found."

The '.<sent>' will be considered a hard delimiter, so that if no
particle is found in the sentence, the buffered part is output without
the initial '='.

An initial ~ or = found without its counterpart will be stripped.
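
The buffering behaviour described above could look roughly like this. It is a minimal sketch over whitespace-separated tokens, not the real Apertium stream format: the function name `join_multiwords` is hypothetical, a bare '.' stands in for '.<sent>', and flushing an unmatched verb when a second '=' turns up is an assumption, since the notes do not cover that case.

```python
def join_multiwords(text):
    """Join '=verb ... ~particle' into 'verb# particle ...'."""
    out = []
    verb = None   # pending verb without its '=', or None if not buffering
    buffer = []   # tokens read since the pending verb

    def flush():
        # Emit a pending verb unjoined, followed by whatever was buffered.
        nonlocal verb, buffer
        if verb is not None:
            out.append(verb)
            out.extend(buffer)
        verb, buffer = None, []

    for tok in text.split():
        if tok.startswith("="):
            flush()                 # assumption: earlier unmatched verb is output as-is
            verb = tok[1:]
        elif tok.startswith("~"):
            if verb is not None:
                out.append(f"{verb}# {tok[1:]}")   # join the two parts
                out.extend(buffer)
                verb, buffer = None, []
            else:
                out.append(tok[1:])  # particle without a verb: strip '~'
        elif tok == ".":
            flush()                  # hard delimiter: give up on finding a particle
            out.append(tok)
        elif verb is not None:
            buffer.append(tok)
        else:
            out.append(tok)
    flush()                          # input ended without a delimiter
    return " ".join(out)

print(join_multiwords("=vísa manninum ~frá landinu ."))
print(join_multiwords("=vísa manninum landinu ."))
```

The first call joins the verb and particle and moves them in front of the buffered words; the second finds no particle before the delimiter and simply drops the '='.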

Benefits: Can be implemented now in a backwards compatible way.
Drawbacks: Might be too simple? Creates more dependencies on CG?


See also