Difference between revisions of "Ideas for Google Summer of Code/Corpus-based lexicalised feature transfer"
Line 1: | Line 1:
  {{TOCD}}
− Make a module that sits somewhere in the Apertium pipeline (somewhere after the lexical selection and before morphological generation) that sets features (
+ Make a module that sits somewhere in the Apertium pipeline (somewhere after the lexical selection and before morphological generation) that sets features (e.g. tags) based on a model generated from a corpus. Sometimes we get really inadequate translations even though you'd never hear stuff like that.
  One of those things is when we output something as definite when it is never used as definite. One way of dealing with this is a lot of rules and lists in transfer, but those are hard to do. So, how about looking at a corpus for information about some features like definiteness, aspect, evidentiality, impersonal/reflexive pronoun use in Romance languages etc.
Line 12: | Line 12:
  ==Frequently asked questions==
−
− ==Previous GSOC projects==
  ==See also==
Line 19: | Line 17:
  * [[Corpus-based definiteness transfer]]
  * [[Pronoun verb combinations in Romance languages]]
+ * [[Choosing genitive structure in English]]
  [[Category:Ideas for Google Summer of Code|Closer integration with HFST]]
Revision as of 14:47, 14 March 2013
Make a module that sits somewhere in the Apertium pipeline (after lexical selection and before morphological generation) that sets features (e.g. tags) based on a model generated from a corpus. Sometimes we get really inadequate translations, containing things you would never actually hear a speaker say.
One of those cases is when we output something as definite when it is never used as definite. One way of dealing with this is to write a lot of rules and lists in transfer, but those are hard to write. So, how about looking at a corpus for information about features such as definiteness, aspect, evidentiality, impersonal/reflexive pronoun use in Romance languages, etc.?
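As a rough illustration of what "a model generated from a corpus" could look like for a single feature, here is a minimal sketch in Python. Everything in it is an assumption made for the example, not part of any existing Apertium module: the function names `train` and `choose`, the `def`/`ind` tag names, and the `min_count` and `threshold` cut-offs are all hypothetical. It counts how often each lemma occurs with each feature value in a tagged corpus, and only overrides the value coming out of transfer when the corpus evidence is overwhelming.

```python
from collections import Counter, defaultdict


def train(units, values=("def", "ind")):
    """Count feature values per lemma.

    `units` is any iterable of (lemma, tags) pairs taken from a tagged
    corpus; `values` lists the tags that express the feature of interest
    (the tag names here are assumed for illustration).
    """
    counts = defaultdict(Counter)
    for lemma, tags in units:
        for tag in tags:
            if tag in values:
                counts[lemma][tag] += 1
    return counts


def choose(counts, lemma, current, min_count=5, threshold=0.95):
    """Return the feature value to emit for `lemma`.

    The value produced by transfer (`current`) is kept unless the lemma
    was seen at least `min_count` times and one value accounts for at
    least `threshold` of those occurrences. Both cut-offs are arbitrary
    placeholders.
    """
    seen = counts.get(lemma)
    if not seen:
        return current
    total = sum(seen.values())
    if total < min_count:
        return current
    best, n = seen.most_common(1)[0]
    return best if n / total >= threshold else current
```

A lemma that only ever appears as indefinite in the corpus would then have a definite tag from transfer replaced, while rare or mixed lemmas are left alone. A real module would probably need context beyond the lemma itself; this only shows the lexicalised counting idea.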
Tasks
Coding challenge
- Make a stream processor (see Apertium stream format) for the output of apertium-transfer (both default/chunk possibilities) that parses character by character.
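One possible starting point for the coding challenge is sketched below in Python. It is a simplified reading of the Apertium stream format, not a reference implementation: it assumes lexical units look like `^lemma<tag>...$`, chunks look like `^name<tag>...{^lemma<tag>$ ...}$`, superblanks are wrapped in `[...]`, and `\` escapes the next character. The names `parse_lu`, `tokenize` and `tokenize_text` are made up for the example. Blanks between the lexical units inside a chunk are dropped, and anything following the tags of a unit is simply appended to the lemma, both of which a real processor would have to handle properly.

```python
import io


def parse_lu(text):
    """Split 'lemma<t1><t2>' into (lemma, [t1, t2]), one character at a time."""
    lemma, tags, cur = [], [], []
    in_tag = escaped = False
    for ch in text:
        if escaped:                      # escaped character: take it literally
            (cur if in_tag else lemma).append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "<" and not in_tag:
            in_tag, cur = True, []
        elif ch == ">" and in_tag:
            tags.append("".join(cur))
            in_tag = False
        elif in_tag:
            cur.append(ch)
        else:
            lemma.append(ch)
    return "".join(lemma), tags


def tokenize(stream):
    """Yield ('blank', text), ('lu', lemma, tags) or
    ('chunk', name, tags, [(lemma, tags), ...]) from a text stream."""
    OUT, SUPER, UNIT, CHUNK, INNER, CHUNK_END = range(6)
    state, buf, head, children = OUT, [], None, []
    while True:
        ch = stream.read(1)
        if ch == "":
            break
        if ch == "\\":                   # keep the escape pair raw for parse_lu
            ch += stream.read(1)
            if state not in (CHUNK, CHUNK_END):
                buf.append(ch)
            continue
        if state == OUT:
            if ch == "^":
                if buf:
                    yield ("blank", "".join(buf))
                buf, state = [], UNIT
            else:
                buf.append(ch)
                if ch == "[":
                    state = SUPER
        elif state == SUPER:             # inside [...]: never start a unit here
            buf.append(ch)
            if ch == "]":
                state = OUT
        elif state == UNIT:
            if ch == "$":
                yield ("lu",) + parse_lu("".join(buf))
                buf, state = [], OUT
            elif ch == "{":              # the unit turns out to be a chunk
                head, buf, children, state = parse_lu("".join(buf)), [], [], CHUNK
            else:
                buf.append(ch)
        elif state == CHUNK:
            if ch == "^":
                state = INNER
            elif ch == "}":
                state = CHUNK_END
            # other characters are blanks inside the chunk (dropped here)
        elif state == INNER:
            if ch == "$":
                children.append(parse_lu("".join(buf)))
                buf, state = [], CHUNK
            else:
                buf.append(ch)
        elif state == CHUNK_END and ch == "$":
            yield ("chunk", head[0], head[1], children)
            state = OUT
    if buf and state in (OUT, SUPER):
        yield ("blank", "".join(buf))


def tokenize_text(text):
    """Convenience wrapper for trying the tokenizer on a string."""
    return list(tokenize(io.StringIO(text)))
```

The same generator handles both output styles: plain lexical units come out as `'lu'` tuples and chunked output as `'chunk'` tuples with their inner units attached. In a pipeline it could be fed `sys.stdin` instead of a `StringIO`.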