User:Mlforcada/sandbox/GSoC

Revision as of 13:00, 19 March 2010 by Mlforcada
Task | Difficulty | Description | Rationale | Requirements | Interested mentors
Easy dictionary maintenance | 2. Hard | Write code that simplifies the maintenance of the single-word part of Apertium monolingual and bilingual dictionaries. This involves building an application that (1) parses the dictionary and reads its open-class (noun, adjective, verb) single-word part into a form amenable to simple, database-like treatment, setting aside the remaining (hard-to-treat) part of the dictionary; (2) lets the user easily add words (together with their inflection paradigms) through a friendly user interface; and (3) combines the extended single-word data with the remaining data into Apertium monolingual and bilingual dictionaries ready to be compiled. Ideas and code from Apertium-dixtools could be useful. | Apertium dictionaries are very heterogeneous, but a large part of developing a language pair consists of adding single words to monolingual and bilingual dictionaries, and work on this part of the dictionaries is crucial for coverage and usefulness. Currently, dictionary maintenance is difficult because it involves editing an XML file. This may be slowing down the development of many language pairs. | Knowledge of XML, XSLT, and one programming language that allows XML processing and easy writing of a user interface | Mikel L. Forcada
Hybrid MT | 2. Hard | Building Apertium-Marclator rule-based/corpus-based hybrids | Both the rule-based machine translation system Apertium and the corpus-based machine translation system Marclator perform some kind of chunking of the input and use a relatively straightforward left-to-right machine translation strategy. Such hybridization has been explored before, but there are other ways to organize it that should be explored (the mentor is happy to discuss). Hybridization may make it easier to adapt Apertium to a particular corpus by using chunk pairs derived from it. | Knowledge of Java, C++, and scripting languages, and appreciation for research-like coding projects | Mlforcada
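
For the Easy dictionary maintenance task, the split, edit, and recombine cycle could be sketched roughly as follows. This is only an illustration under several assumptions, not a proposed design: the sample dictionary is made up and much simpler than real Apertium dictionaries, the paradigm names are assumed to end in `__` plus a category tag, the function names (`is_simple_open_class`, `split_entries`, `build_entry`, `rebuild`) are hypothetical, and only a monolingual dictionary with a single section is handled. The user interface is replaced here by a plain `append` to a list of rows.

```python
import xml.etree.ElementTree as ET

# Made-up monolingual dictionary in a simplified .dix-like layout (assumption).
SAMPLE_DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="walk"><i>walk</i><par n="walk__vblex"/></e>
    <e lm="of"><i>of</i><par n="of__pr"/></e>
    <e lm="of course"><i>of<b/>course</i><par n="of_course__adv"/></e>
  </section>
</dictionary>"""

# Category tags treated as open-class (noun, adjective, verb); assumed names.
OPEN_CLASSES = {"n", "adj", "vblex"}


def is_simple_open_class(entry):
    """True for entries shaped like <e lm="..."><i>stem</i><par n="x__pos"/></e>."""
    children = list(entry)
    if [child.tag for child in children] != ["i", "par"]:
        return False
    stem, par = children
    if len(stem) > 0:  # nested elements such as <b/> indicate a multiword unit
        return False
    category = par.get("n", "").rsplit("__", 1)[-1]
    return category in OPEN_CLASSES


def split_entries(root):
    """Return (rows, rest): table-like rows and the untouched remaining entries."""
    rows, rest = [], []
    section = root.find("section")
    for entry in section.findall("e"):
        if is_simple_open_class(entry):
            rows.append({
                "lemma": entry.get("lm"),
                "stem": entry.find("i").text or "",
                "paradigm": entry.find("par").get("n"),
            })
        else:
            rest.append(entry)
    return rows, rest


def build_entry(row):
    """Turn one table row back into an <e> element."""
    entry = ET.Element("e", lm=row["lemma"])
    ET.SubElement(entry, "i").text = row["stem"]
    ET.SubElement(entry, "par", n=row["paradigm"])
    return entry


def rebuild(root, rows, rest):
    """Replace the section contents with the rows followed by the untouched entries."""
    section = root.find("section")
    for entry in section.findall("e"):
        section.remove(entry)
    for row in rows:
        section.append(build_entry(row))
    for entry in rest:
        section.append(entry)


root = ET.fromstring(SAMPLE_DIX)
rows, rest = split_entries(root)

# Stand-in for the user interface: add a new noun that reuses an existing paradigm.
rows.append({"lemma": "tree", "stem": "tree", "paradigm": "house__n"})

rebuild(root, rows, rest)
new_xml = ET.tostring(root, encoding="unicode")
```

In this sketch the closed-class entry (`of`) and the multiword entry (`of course`) are never converted to rows, so they are written back exactly as they were parsed, which matches the idea of saving the hard-to-treat part of the dictionary and recombining it afterwards.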