Revision as of 06:38, 7 April 2010

Improving multiword support in Apertium

This is a first and very rough draft. Comments are always welcome, but a lot is still missing.

About me

Name

Sonja Krause-Harder

Contact information

E-mail: krauseha@gmail.com
IRC: skh on freenode
Sourceforge: skh
Apertium wiki: Skh

List your skills and give evidence of your qualifications.

I am studying computational linguistics and Indo-European studies at the University of Erlangen. I'm in my second year of a three-year undergraduate program. My courses so far include formal languages, data structures and algorithms, morphological analysis (with JSLIM, see http://www.linguistik.uni-erlangen.de/clue/en/research/jslim.html) and linguistics.

Before I started studying, I worked for 7 years at SuSE Linux / Novell as a Linux packager and software developer. I maintained RPM packages related to Java development (Eclipse, Tomcat, Jakarta project) as well as the Apache web server, and I helped develop internally used tools.

During the initial launch of the openSUSE project I was involved in concept discussions and community relations. I presented the project externally at conferences and internally to other departments at Novell, to improve the collaboration between the openSUSE community and SuSE / Novell R&D.

Examples of my work:

A tool to transliterate Devanagari into IAST or Harvard-Kyoto transliteration:
http://www.linguistik.uni-erlangen.de/~sakrause/transliterate

SWAMP: A workflow management system used internally at SuSE. I worked on the workflow definition language and the core workflow engine.
http://swamp.sf.net

Language skills

native: German, near-native: English
some: French, Czech
little: Italian, Spanish, Dutch, Icelandic
ancient: Sanskrit, Ancient Greek, some Latin

Motivation

Why is it you are interested in machine translation?

practical application of theories learned
real-world engineering paired with lots of current and very active research
languages aren't going anywhere and people need to talk to each other; usable machine translation can be a real help in people's lives

Why is it that you are interested in the Apertium project?

became interested through GSoC (if that's interesting)
it's open source! There are thousands of open-source editors, IRC clients and Tetris clones, but NLP applications that are of practical use are often closed source and rather expensive, which is bad for humanity.
I like the architecture: small unix tools in a chain that do one thing only and can be used differently for different language pairs
variety of languages already in the project
whether intentional or not, starting with the shallow-transfer approach on pairs of very similar languages, and later widening the functionality to cover language pairs that are not as close, seems like a solid approach to me
honest but friendly, helpful people on IRC and the mailing list

Project

The problem

Apertium already supports multiword lexical units (short: multiwords), but there are some important phenomena that can't be adequately handled yet:

discontiguous multiwords
separable verbs (possibly just a weird variation of the above?)
complex multiwords

Proposed solution

find a way to describe the various kinds of multiwords in the dictionaries
if necessary, enhance the DTD for that
add another step after the morphological analysis, but before the tagger, that recognizes these multiwords and either changes the result of the morphological analysis, or offers the multiword analysis as another option for the tagger
and the other way round: in the generation phase, expand the multiwords / reorder their parts so that lt-proc -g can handle them
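The analysis-side step could look roughly like the following minimal sketch. It is only an illustration under stated assumptions, not a design decision: it reads lexical units in the `^surface/analysis1/analysis2$` stream format, ignores escaping and text between units, and offers the multiword reading as an additional analysis for the tagger instead of replacing the existing one. The lookup table `SEPARABLE`, the function names, the look-ahead `window` and the example analyses are all hypothetical and not taken from any existing Apertium dictionary or module.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$ (escaping is ignored here).
LU_RE = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")

# Hypothetical table: (verb lemma, particle) -> lemma of the separable verb.
SEPARABLE = {("rufen", "an"): "anrufen"}


def parse_stream(text):
    """Return a list of (surface, [analyses]) pairs found in the text."""
    units = []
    for match in LU_RE.finditer(text):
        analyses = [a for a in match.group(2).split("/") if a]
        units.append((match.group(1), analyses))
    return units


def split_analysis(analysis):
    """Split 'rufen<vblex><pri>' into ('rufen', '<vblex><pri>')."""
    lemma, sep, tags = analysis.partition("<")
    return lemma, sep + tags


def add_multiword_readings(units, table=SEPARABLE, window=5):
    """Add a separable-verb reading to a verb if its particle follows
    within `window` units. Existing readings are kept, so the tagger
    can still choose between them."""
    result = [(surface, list(analyses)) for surface, analyses in units]
    for i, (_, analyses) in enumerate(units):
        for analysis in analyses:
            lemma, tags = split_analysis(analysis)
            if not tags.startswith("<vblex>"):
                continue
            for later_surface, _ in units[i + 1 : i + 1 + window]:
                joined = table.get((lemma, later_surface.lower()))
                if joined is not None and joined + tags not in result[i][1]:
                    result[i][1].append(joined + tags)
    return result


def format_stream(units):
    """Serialise (surface, [analyses]) pairs back into the stream format."""
    return " ".join("^" + "/".join([s] + a) + "$" for s, a in units)


if __name__ == "__main__":
    # Illustrative input with simplified tags, not real lt-proc output.
    stream = "^ruft/rufen<vblex><pri><p3><sg>$ ^mich/ich<prn>$ ^an/an<pr>$"
    print(format_stream(add_multiword_readings(parse_stream(stream))))
```

In this sketch the particle unit is left untouched; whether it should be removed, marked, or merged into the verb unit is one of the questions the dictionary syntax discussion would have to settle. The generation side would need the reverse operation, splitting `anrufen` back into verb and particle and placing the particle correctly.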

Reasons why Google and Apertium should sponsor it

it will make Apertium usable for more languages
improve translation quality

A description of how and who it will benefit in society

Work plan

Timeline

Now: read code, work on any language pair (en-de because I know it, nl-de was suggested by Unhammer) to get acquainted with the system and the work of a language pair maintainer, really understand (as in: look at the data stream) the various phases and what each binary does
Community bonding phase: start collecting more examples for multiwords that fit into my three categories, find out if there are more categories (not necessarily to be implemented as well, but to have the broader picture), build test cases / sample dictionaries / sample texts from the examples, ponder and discuss dictionary syntax / DTD changes (if any) on the mailing list

Week 1: Implement changes to DTD and dictionary parsing / compiling in lt-proc
Week 2: Write new module to run between lt-proc and apertium-tagger, parse compiled dictionary (?)
Week 3:
Week 4:

Deliverable #1

Week 5:
Week 6:
Week 7: Write detailed documentation on how to use these multiwords
Week 8:

Deliverable #2

Week 9:
Week 10:
Week 11:
Week 12:

Project completed

List any non-Summer-of-Code plans you have for the Summer

The university summer term runs until July 24th, so for the first ~8 weeks of the program I can realistically offer 20 hours/week. After that I'll be available full-time. I am currently working 20 hours/week for a small local software company, so I am used to managing my time and handling both university and a job at the same time. If I am accepted into the GSoC program, I plan to take unpaid leave from that job for the 12 weeks of programming.


Difference between revisions of "User:Skh/Application GSoC 2010"
