Difference between revisions of "User:Skh/Application GSoC 2010"
Revision as of 07:46, 7 April 2010
Improving multiword support in Apertium
This is a first and very rough draft. Comments are always welcome, but a lot is still missing.
About me
Name
Sonja Krause-Harder
Contact information
- E-mail: krauseha@gmail.com
- IRC: skh on freenode
- Sourceforge: skh
- Apertium wiki: Skh
List your skills and give evidence of your qualifications.
I am studying computational linguistics and Indo-European studies at the University of Erlangen. I'm in my second year of a three-year undergraduate program. My courses so far include formal languages, data structures and algorithms, morphological analysis (with JSLIM, see http://www.linguistik.uni-erlangen.de/clue/en/research/jslim.html) and linguistics.
Before I started studying I worked for 7 years at SuSE Linux / Novell as a Linux packager and software developer. I maintained RPM packages related to Java development (Eclipse, Tomcat, Jakarta project) as well as the Apache webserver, and I helped develop internally used tools.
During the initial launch of the openSUSE project I was involved in concept discussions and community relations. I presented the project externally at conferences and internally to other departments at Novell, to improve the collaboration between the openSUSE community and SuSE / Novell R&D.
Examples of my work:
- A tool to transliterate Devanagari into IAST or Harvard-Kyoto transliteration:
http://www.linguistik.uni-erlangen.de/~sakrause/transliterate
- SWAMP: A workflow management system used internally at SuSE. I worked on the workflow definition language and the core workflow engine.
http://swamp.sf.net
Language skills
- native: German, near-native: English
- some: French, Czech
- little: Italian, Spanish, Dutch, Icelandic
- ancient: Sanskrit, Ancient Greek, some Latin
Any non-Summer-of-Code plans for the Summer
Summer term at my university finishes on July 24th, so until then I have some class work to do. Should I be accepted in the program, I will pause the student programming job (20 hours/week) which I've been doing since I started studying, and spend at least these 20 hours/week on my Google Summer of Code project. After July 24th I have no other plans but GSoC.
Motivation
Why is it you are interested in machine translation?
- practical application of theories learned
- real-world engineering paired with lots of current and very active research
- languages aren't going anywhere and people need to talk to each other; usable machine translation can be a real help in people's lives
Why is it that you are interested in the Apertium project?
- became interested through GSoC (if that's interesting)
- it's open source! There are thousands of open-source editors, IRC clients and Tetris clones, but NLP applications that are of practical use are often closed source and rather expensive, which is bad for humanity.
- I like the architecture: small Unix tools in a chain that each do one thing only and can be used differently for different language pairs
- variety of languages already in the project
- whether intentional or not, starting with the shallow-transfer approach on pairs of very similar languages, and later widening the functionality to cover language pairs that are not as close, seems like a solid approach to me
- honest but friendly, helpful people on IRC and mailing list
Project: Improving multiword support in Apertium
I would like to add support for the following kinds of multiwords to Apertium (listed in order of increasing complexity and grouped into types a, b and c solely for the purpose of this project proposal):
type a:
- complex multiwords which consist of two or more inflected words which agree with each other (adj-noun)
- complex multiwords which consist of two or more inflected words which do not agree with each other (French passé composé) (gender agreement is not possible in generation for the 1st and 2nd person and for proper nouns!)
type b:
- phrasal / particle verbs that are reordered depending on their position in the sentence, like V2 in Icelandic. This also applies to reflexive verbs in Czech, where the reflexive particle needs to be in the 2nd position in the sentence: jmenovat se -- jmenuju se Sonja -- Ona se jmenuje Sonja
- the above also covers some cases of separable verbs, where nothing else stands between the finite verb and the particle, i.e. if the verb is intransitive and the sentence contains no additional element (an adverbial complement or similar): ankommen -- ich komme an
type c:
- phrasal / particle verbs in which something else stands between the finite verb form and the particle -- to make it up
- separable verbs as a special case of the above, where the particle, in some cases, is written together with the verb -- ankommen -- ich komme an -- ich komme am Bahnhof an / ich komme um sieben Uhr am Bahnhof an
(type d:
- any combination of the above
- ambiguous cases, e.g. "the man threw off the dog who bites his hand off":
- "the man threw the dog, who bites his hand off, off." (nesting)
- "the man threw the dog, who bites his hand, off." (commas)
- "the man threw the dog biting his hand off." (no way)
- recognize the first and second, recognize the third but ignore the ambiguity
- generate none of these
- to piss off)
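As an illustration of types b and c, the following is a minimal Python sketch of how a separable verb might be rejoined with its clause-final particle. It is not Apertium code: the `SEPARABLE` lexicon, the `(lemma, tags)` token shape, the `part` tag and the function name `join_separable` are all assumptions made for this example, and the input is assumed to be a single clause without final punctuation.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, List[str]]  # (lemma, tags), hypothetical token shape

# Hypothetical lexicon: (particle, verb lemma) -> multiword lemma
SEPARABLE: Dict[Tuple[str, str], str] = {("an", "kommen"): "ankommen"}


def join_separable(tokens: List[Token]) -> List[Token]:
    """Merge a verb with a clause-final particle into one multiword token.

    Covers the adjacent case (type b: "ich komme an") and the case with
    intervening material (type c: "ich komme um sieben Uhr an").
    """
    if not tokens:
        return []
    last_lemma, _ = tokens[-1]
    for i, (lemma, tags) in enumerate(tokens[:-1]):
        if "vblex" in tags and (last_lemma, lemma) in SEPARABLE:
            joined = (SEPARABLE[(last_lemma, lemma)], tags)
            # Keep everything except the particle at the end of the clause.
            return tokens[:i] + [joined] + tokens[i + 1:-1]
    return list(tokens)


clause = [("ich", ["prn"]), ("kommen", ["vblex"]), ("um", ["pr"]),
          ("sieben", ["num"]), ("Uhr", ["n"]), ("an", ["part"])]
print(join_separable(clause))
```

The sketch only looks at the last token of the clause, so it would not handle the nested or ambiguous cases listed under type d.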
The multiword module will be a separate tool that can be run for languages that need it, and be left out of others, at the discretion of the language pair maintainer.
I would like to start by analysing these types of multiwords in the disambiguated data stream, i.e. after apertium-tagger has run. It is possible that the POS tagger destroys a multiword by assigning one of its constituent words to the wrong category / part of speech. I have not found a good example of this yet, but that does not mean there is none. However, for the sake of simplicity, I would like to start with the disambiguated stream. Also, some constructions can be analysed by the multiword module in different ways. I would like to start with just offering the "best bet", but later add a way to output several possible analyses, and leave it to a later module to decide between them.
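To make the "best bet" versus "several analyses" idea concrete, here is a rough Python sketch. It assumes the disambiguated stream consists of lexical units of the form `^lemma<tag><tag>$`, and that several analyses could be written with the slash-separated notation Apertium already uses for ambiguous units. The function names `parse_stream` and `format_analyses` are invented for this example, and escaped characters in the stream are ignored.

```python
import re
from typing import List, Tuple

Token = Tuple[str, List[str]]  # (lemma, tags)

LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")  # text between ^ and $
TAG = re.compile(r"<([^>]+)>")             # one <tag>


def parse_stream(stream: str) -> List[Token]:
    """Split a disambiguated stream into (lemma, tags) tokens."""
    tokens = []
    for unit in LEXICAL_UNIT.findall(stream):
        lemma = unit.split("<", 1)[0]
        tokens.append((lemma, TAG.findall(unit)))
    return tokens


def format_analyses(analyses: List[Token], best_only: bool = True) -> str:
    """Write analyses (ordered best first) back as one lexical unit.

    With best_only=True only the "best bet" is emitted; otherwise all
    analyses are emitted, separated by "/", for a later module to choose.
    """
    chosen = analyses[:1] if best_only else analyses
    body = "/".join(
        lemma + "".join(f"<{tag}>" for tag in tags) for lemma, tags in chosen
    )
    return f"^{body}$"


tokens = parse_stream("^ich<prn>$ ^kommen<vblex><pri>$ ^an<part>$")
candidates = [("ankommen", ["vblex", "pri"]), ("kommen", ["vblex", "pri"])]
print(tokens)
print(format_analyses(candidates))
print(format_analyses(candidates, best_only=False))
```

Keeping the choice behind a single flag means the first version of the module can emit only the best bet, while the multi-analysis output can be switched on later without changing the stream format.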
Timeline
- Now: read code, work on any language pair (en-de because I know it, nl-de was suggested by Unhammer) to get acquainted with the system and the work of a language pair maintainer, really understand (as in: look at the data stream) the various phases and what each binary does
- Community bonding phase: start collecting more examples for multiwords that fit into my three categories, find out if there are more categories (not necessarily to be implemented as well, but to have the broader picture), build testcases / sample dictionaries / sample texts from the examples, ponder and discuss dictionary syntax / DTD changes (if any) on mailing list
- Week 1: Implement changes to DTD and dictionary parsing / compiling in lt-proc
- Week 2: write new module to run between lt-proc and apertium-tagger, parse compiled dictionary (?)
- Week 3:
- Week 4:
- Deliverable #1
- Week 5:
- Week 6:
- Week 7: Write detailed documentation on how to use these multiwords
- Week 8:
- Deliverable #2
- Week 9:
- Week 10:
- Week 11:
- Week 12:
- Project completed
Reasons why Google and Apertium should sponsor it
Enhanced multiword support will make Apertium usable for more languages. As it is now, some multiword constructs can only be implemented with workarounds in the dictionary, and some, like separable verbs, not at all. Having support for these will improve the translation quality for many languages. Also, a logical and documented way to describe these multiwords and handle them in the engine will make the work of language-pair maintainers easier. This will lead to more language pairs and increase the scope and impact of the Apertium project.
A description of how and who it will benefit in society
The variety of languages currently spoken is an important part of cultural diversity. But still, people need to communicate, and have access to written information that is only available in some languages -- textbooks, manuals, news. Usable, open source machine translation for a broad range of languages will be a real help in people's lives.