Difference between revisions of "User:Skh/Application GSoC 2010"
Revision as of 21:49, 31 March 2010
Google Summer of Code 2010: Improving multiword support in Apertium
This is a first and very rough draft. Comments are always welcome, but a lot is still missing.
About me
Name
Sonja Krause-Harder
Contact information
E-mail: krauseha@gmail.com
IRC: skh on freenode
List your skills and give evidence of your qualifications.
- 7 years at SuSE Linux, Nuernberg (now Novell), without formal qualification, because I wanted to work on Linux and open source
- Software integration, RPM packaging and maintenance
- Also worked on an "internal" (still open source) tool: http://swamp.sf.net/, for which I designed the workflow description language and the core workflow engine
- Now 2nd year undergraduate in Linguistics (two majors actually, historical and computational linguistics).
- Most programming experience in Java and Bash. Also some C++, Perl, Tcl, PHP.
- Always happy to provide references
Motivation
Why is it you are interested in machine translation?
Why is it that you are interested in the Apertium project?
Project
The problem
Apertium already supports multiword lexical units (short: multiwords), but there are some important phenomena that can't be adequately handled yet:
- discontiguous multiwords
- separable verbs (possibly just a weird variation of the above?)
- complex multiwords
Proposed solution
- find a way to describe the various kinds of multiwords in the dictionaries
- if necessary, enhance the DTD for that
- add another step after the morphological analysis, but before the tagger, that recognizes these multiwords and either changes the result of the morphological analysis, or offers the multiword analysis as another option for the tagger
- and the other way round: in the generation phase, expand the multiwords / reorder their parts so that lt-proc -g can handle them
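A minimal sketch of what the recognition step between morphological analysis and the tagger could look like, under several assumptions that are not part of the proposal itself: the input is in the Apertium stream format (`^surface/analysis/...$`), the analyses are heavily simplified, and the lookup table `SEPARABLE_VERBS` as well as all function names are hypothetical. It illustrates only the second variant mentioned above, where the multiword reading is offered to the tagger as an additional analysis.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LU_PATTERN = re.compile(r"\^([^/$^]*)((?:/[^/$^]*)*)\$")

# Hypothetical lookup table: (verb lemma, particle) -> separable-verb lemma.
# In the proposal this information would come from the dictionaries.
SEPARABLE_VERBS = {("rufen", "an"): "anrufen"}


def parse_units(stream):
    """Return a list of [surface, [analyses]] for each lexical unit."""
    units = []
    for match in LU_PATTERN.finditer(stream):
        surface = match.group(1)
        analyses = [a for a in match.group(2).split("/") if a]
        units.append([surface, analyses])
    return units


def add_separable_analyses(units):
    """Offer the multiword reading as an extra analysis on the verb."""
    for i, (_, analyses) in enumerate(units):
        # Iterate over a copy because new analyses are appended below.
        for analysis in list(analyses):
            lemma, _, tags = analysis.partition("<")
            if not tags.startswith("vblex"):
                continue
            # Look for a matching particle anywhere later in the input.
            for later_surface, _ in units[i + 1:]:
                joined = SEPARABLE_VERBS.get((lemma, later_surface.lower()))
                if joined is None:
                    continue
                new_analysis = f"{joined}<{tags}"
                if new_analysis not in analyses:
                    analyses.append(new_analysis)
    return units


def serialize(units):
    """Write the units back out; text between units is reduced to one space."""
    return " ".join("^" + "/".join([s] + a) + "$" for s, a in units)


stream = "^Er/er<prn>$ ^ruft/rufen<vblex><pri>$ ^sie/sie<prn>$ ^an/an<pr>$"
result = serialize(add_separable_analyses(parse_units(stream)))
print(result)
```

For the German input "Er ruft sie an", the verb unit keeps its original analysis `rufen<vblex><pri>` and gains `anrufen<vblex><pri>` as a second option, so the choice is left to the tagger. A real module would also need to respect sentence boundaries and decide what happens to the particle's own unit, both of which this sketch ignores.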
Reasons why Google and Apertium should sponsor it
A description of how and who it will benefit in society
Work plan
Timeline
- Now: read code, work on any language pair (en-de because I know it, nl-de was suggested by Unhammer) to get acquainted with the system and the work of a language pair maintainer, really understand (as in: look at data stream) the various phases and what each binary does
- Community bonding phase: start collecting more examples for multiwords that fit into my three categories, find out if there are more categories (not necessarily to be implemented as well, but to have the broader picture), build testcases / sample dictionaries / sample texts from the examples, ponder and discuss dictionary syntax / DTD changes (if any) on mailing list
- Week 1: Implement changes to DTD and dictionary parsing / compiling in lt-proc
- Week 2: Write a new module to run between lt-proc and apertium-tagger, parse compiled dictionary (?)
- Week 3:
- Week 4:
- Deliverable #1
- Week 5:
- Week 6:
- Week 7: Write detailed documentation on how to use these multiwords
- Week 8:
- Deliverable #2
- Week 9:
- Week 10:
- Week 11:
- Week 12:
- Project completed
List any non-Summer-of-Code plans you have for the Summer
The university summer term runs until July 24th, so for the first ~8 weeks of the program I can realistically offer 20 hours/week. After that I'll be available full-time.

I am currently working 20 hours/week for a small local software company, so I am used to managing my time and handling both university and a job at the same time. If I am accepted into the GSoC program I plan to take unpaid leave from that job for the 12 weeks of programming.