Ideas for Google Summer of Code/Corpus-based lexicalised feature transfer

Make a module that sits in the Apertium pipeline, somewhere after lexical selection and before morphological generation, and sets features (e.g. tags) based on a model generated from a corpus. Sometimes we output really inadequate translations: things that you would never hear a speaker say.

One example is when we output something as definite when it is never used as definite. One way of dealing with this is a lot of rules and lists in transfer, but those are hard to write. So, how about looking at a corpus for information about features such as definiteness, aspect, evidentiality, impersonal/reflexive pronoun use in Romance languages, etc.?

Tasks

Make a corpus study of one possible feature, the treatment of which could be improved with target-language information.
Experiment with including a statistical model based on this feature in the Apertium pipeline.
Make a prototype implementation (possibly in Python).
Generalise the prototype to deal with other features.
Come up with an efficient format for storing the model.
Implement the final program efficiently in C++.
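
As a rough illustration of what a first prototype for the tasks above might look like, here is a minimal Python sketch. It is only one possible approach, not a specification: it assumes the corpus has already been reduced to (lemma, feature value) pairs, uses plain relative frequencies as the "model", and the names train and choose, the threshold values and the toy counts are all invented for the example.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Count how often each lemma occurs with each feature value.

    pairs: iterable of (lemma, value) tuples taken from analysed
    target-language text, e.g. ("home", "ind").
    """
    counts = defaultdict(Counter)
    for lemma, value in pairs:
        counts[lemma][value] += 1
    return counts


def choose(counts, lemma, proposed, threshold=0.9, min_count=5):
    """Return the corpus-preferred value if the evidence is strong.

    Otherwise keep the value proposed by transfer. The threshold and
    min_count defaults are arbitrary illustrative choices.
    """
    seen = counts.get(lemma)
    if not seen:
        return proposed  # lemma unseen in the corpus: do not interfere
    total = sum(seen.values())
    value, freq = seen.most_common(1)[0]
    if total >= min_count and freq / total >= threshold:
        return value
    return proposed


# Toy counts, invented for illustration only.
toy_corpus = (
    [("home", "ind")] * 19 + [("home", "def")]
    + [("cat", "def")] * 3 + [("cat", "ind")] * 3
)
model = train(toy_corpus)
print(choose(model, "home", "def"))  # strongly skewed lemma: overridden
print(choose(model, "cat", "def"))   # no clear preference: kept as proposed
```

A real model would probably need context (for example neighbouring lemmas or the source-language value) rather than a per-lemma frequency, which is part of what the corpus study is meant to find out.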

Coding challenge

Make a stream processor (see Apertium stream format) for the output of apertium-transfer that handles both the default and the chunk output and parses the stream character by character.
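
To give an idea of the shape of such a processor, here is a minimal Python sketch, not a reference solution. It assumes lexical units of the form ^lemma<tag>...$, chunks of the form ^name<tag>...{...}$, and backslash escapes; it does not treat [superblanks] specially, and the function names and tuple layout are made up for the example.

```python
def split_lemma_tags(unit):
    """Split 'lemma<t1><t2>' into ('lemma', ['t1', 't2']), resolving escapes."""
    lemma, tags, current = [], [], []
    in_tag = escaped = False
    for ch in unit:
        if escaped:
            (current if in_tag else lemma).append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "<" and not in_tag:
            in_tag, current = True, []
        elif ch == ">" and in_tag:
            in_tag = False
            tags.append("".join(current))
        elif in_tag:
            current.append(ch)
        else:
            lemma.append(ch)
    return "".join(lemma), tags


def parse_units(text):
    """Yield ('blank', s), ('lu', lemma, tags) or ('chunk', name, tags, children)."""
    buf, head, children = [], None, []
    in_unit = in_chunk = escaped = False
    for ch in text:
        if escaped:
            buf.append("\\" + ch)  # keep the escape for the later stages
            escaped = False
        elif ch == "\\":
            escaped = True
        elif not in_unit:
            if ch == "^":
                if buf:
                    yield ("blank", "".join(buf))
                buf, in_unit = [], True
            else:
                buf.append(ch)
        elif ch == "{" and not in_chunk:
            # Everything read so far is the chunk name and its tags.
            head, buf, in_chunk = "".join(buf), [], True
        elif ch == "}" and in_chunk:
            # The chunk body is itself a stream of units and blanks.
            children = list(parse_units("".join(buf)))
            buf, in_chunk = [], False
        elif ch == "$" and not in_chunk:
            if head is None:
                yield ("lu",) + split_lemma_tags("".join(buf))
            else:
                yield ("chunk",) + split_lemma_tags(head) + (children,)
            buf, head, children, in_unit = [], None, [], False
        else:
            buf.append(ch)
    if buf and not in_unit:
        yield ("blank", "".join(buf))


plain = list(parse_units("^the<det><def><sp>$ ^cat<n><sg>$"))
chunked = list(parse_units("^det_nom<SN><sg>{^the<det><def><sp>$ ^cat<n><sg>$}$"))
```

For a real submission the processor would read from standard input one character at a time rather than from a string, and would have to pass blanks and superblanks through unchanged so that formatting survives.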

Frequently asked questions

none yet, ask us something! :)
