Difference between revisions of "Ideas for Google Summer of Code/Discontiguous multiwords"
		
		
		
		
		
		
		
				
		
		
		
		
		
		
		
	
| Line 23: | Line 23: | ||
| ==Coding challenge== | ==Coding challenge== | ||
| ==Old hack== | |||
| Here is a cheap hack for analysing discontiguous multiword units | |||
| when translating from Germanic languages. | |||
| <pre> | |||
| For example, | |||
| vísa manninum frá landinu -> vísa# frá manninum landinu | |||
|                              'deport   the man  from the country' | |||
| vísa ekki frá             -> vísa# frá ekki | |||
|                              'deport   not' | |||
| The idea is to mark the verbs that can be part of a discontiguous | |||
| multiword (with =), and the particles/adverbs that can be too (with ~). For example: | |||
| 1) vísa/=vísa manninum frá/~frá landinu . | |||
| 2) vísa/=vísa manninum undan/~undan landinu . | |||
| 3) vísa/=vísa manninum upp/~upp landinu . | |||
| We will use constraint grammar rules to select the appropriate particle | |||
| if a verb exists. | |||
| LIST VISAPART = "~frá" "~upp" ; | |||
| REMOVE ("=vísa") (NOT 1* VISAPART); | |||
| SELECT ("=vísa") (1* VISAPART); | |||
| etc. | |||
| We will then use a mode of pretransfer (I suggest -m) to join the two | |||
| parts thus: | |||
| =vísa manninum ~frá landinu -> vísa# frá manninum landinu | |||
| "If an LU starts with =, keep reading and buffering until ~ or . is reached." | |||
| The '.<sent>' will be considered a hard delimiter, so that if no | |||
| particle is found in the sentence, the buffered part is output without | |||
| the initial '='. | |||
| An initial ~ or = found without its counterpart will be stripped. | |||
| Benefits: Can be implemented now in a backwards-compatible way. | |||
| Drawbacks: Might be too simple? Creates more dependencies on CG? | |||
| </pre> | |||
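The buffering rule described in the old hack can be sketched in Python. This is only an illustration under stated assumptions, not the actual pretransfer code: the function name `join_discontiguous` is hypothetical, lexical units are modelled as plain strings rather than the Apertium stream format, and the `.<sent>` delimiter is simplified to a bare `.`. The text does not say what happens when a second `=` unit appears before any particle; the sketch assumes the pending verb is then output unjoined.

```python
def join_discontiguous(units, delimiter="."):
    """Join '=verb ... ~particle' into 'verb# particle ...' (illustrative sketch)."""
    out = []
    verb = None     # pending '=' unit, stored without its marker
    buffered = []   # units read since the pending verb

    def flush():
        # Output a pending verb unjoined (marker already stripped),
        # followed by everything buffered after it.
        nonlocal verb, buffered
        if verb is not None:
            out.append(verb)
            out.extend(buffered)
        verb, buffered = None, []

    for unit in units:
        if unit.startswith("="):
            flush()                 # assumption: a new verb ends the old buffer
            verb = unit[1:]
        elif unit.startswith("~"):
            particle = unit[1:]
            if verb is None:
                out.append(particle)            # stray '~': strip the marker
            else:
                out.append(f"{verb}# {particle}")
                out.extend(buffered)
                verb, buffered = None, []
        elif unit == delimiter:
            flush()                 # hard delimiter: give up on joining
            out.append(unit)
        elif verb is not None:
            buffered.append(unit)
        else:
            out.append(unit)

    flush()                         # end of input behaves like a delimiter
    return out
```

With the example from the text, `join_discontiguous("=vísa manninum ~frá landinu .".split())` yields the units `vísa# frá`, `manninum`, `landinu`, `.`. If the sentence contains no `~` unit, the verb is output without its `=` and the order is left unchanged.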
| ==See also== | ==See also== | ||
Revision as of 15:09, 4 March 2012
Objectives
- Create a typology of different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, Uralic languages
  - Separable/phrasal verbs
- Create a new FST-based module for recognising and reordering discontiguous multiword expressions
- Include support for discontiguous multiwords in an existing language pair.
  - English-Catalan, English-Spanish, Icelandic-English, Dutch-Afrikaans, ...
First steps
<spectei> so, for that i would recommend you install the is-en language pair
<spectei> and the multiword-reorder module
<spectei> https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2010/skh/multiword-reorder
<spectei> and start "playing"
<spectei> some of the features work
<spectei> but it is poorly documented
<spectei> and isn't yet included in any language pair
<spectei> the link for is-en is: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en
<spectei> http://wiki.apertium.org/wiki/Minimal_installation_from_SVN

