Improving multiword support in Apertium

About me

Name

Sonja Krause-Harder

Contact information

E-mail: krauseha@gmail.com
IRC: skh on freenode
Sourceforge: skh
Apertium wiki: Skh

List your skills and give evidence of your qualifications.

I am studying computational linguistics and Indo-European studies at the University of Erlangen. I'm in my second year of a three-year undergraduate program. My courses so far include formal languages, data structures and algorithms, morphological analysis (with JSLIM, see http://www.linguistik.uni-erlangen.de/clue/en/research/jslim.html) and linguistics.

Before I started studying I worked for 7 years at SuSE Linux / Novell as a Linux packager and software developer. I maintained RPM packages related to Java development (eclipse, tomcat, jakarta project) as well as the Apache webserver, and I helped develop tools that were used internally.

During the initial launch of the openSUSE project I was involved in concept discussions and community relations. I presented the project externally at conferences and internally to other departments at Novell, to improve the collaboration between the openSUSE community and SuSE / Novell R&D.

Examples of my work:

A tool to transliterate Devanagari into IAST or Harvard-Kyoto transliteration:
http://www.linguistik.uni-erlangen.de/~sakrause/transliterate

SWAMP: A workflow management system used internally at SuSE. I worked on the workflow definition language and the core workflow engine.
http://swamp.sf.net

Language skills

native: German, near-native: English
some: French, Czech
little: Italian, Spanish, Dutch, Icelandic
ancient: Sanskrit, Ancient Greek, some Latin

Any non-Summer-of-Code plans for the Summer

Summer term at my university finishes on July 24th, so until then I have some class work to do. Should I be accepted in the program, I will pause the student programming job (20 hours/week) which I've been doing since I started studying, and spend at least these 20 hours/week on my Google Summer of Code project. After July 24th I have no other plans but GSoC.

Motivation

Why is it you are interested in machine translation?

I have been interested in languages for a long time, and I had already worked as a programmer, so studying computational linguistics was a logical step. Machine translation appeals to me because doing it successfully requires both current research and real-world engineering, including attention to efficiency.

Also, while there are already translation tools which work well for specific subject areas and languages, there is still pioneering work to be done, especially for languages that don't have that many speakers.

Why is it that you are interested in the Apertium project?

It is one of not too many projects in the field of natural language processing that are completely open source. I really believe in open source, and in my opinion machine translation, as a result of linguistic research, should be accessible and usable for as many people as possible, not only those who can afford to pay for expensive proprietary translation tools.

I like the architecture: small unix tools in a chain that do one thing only and can be used differently for different language pairs.

There is a considerable variety of languages already in the project, and the project is very alive. I've already received lots of help on IRC and the mailing list and feel that Apertium is a mentoring organization that is willing to help its students, and is interested in good and usable results from its GSoC projects.

Project: Improving multiword support in Apertium

Supported multiword constructs

multiwords without inflection

compound multiwords

multiwords with inner inflection

Missing multiword constructs

The multiword module will be a separate tool that can be run for languages that need it, and be left out of others, at the discretion of the language pair maintainer.

I would like to start with analysing these types of multiwords in the disambiguated data stream, i.e. after apertium-tagger has run. It is possible that the POS tagger destroys a multiword by assigning one of its constituent words to a wrong category / part of speech. I have not found a good example of this, but that does not mean there is none. However, for the sake of simplicity, I would like to start with the disambiguated stream. Also, some constructions can be analysed by the multiword module in different ways. I would like to start with just offering the "best bet", but later add a way to output several possible analyses, and leave it to a later module to decide between them.

I would like to take the multiword definitions out of the monolingual dictionaries and put them into a separate dictionary.

I would like to add support for the following kind of multiwords to Apertium:

Complex multiwords (adj-noun)

These multiwords consist of two inflected words, typically an adjective and a noun, which agree with each other in gender, number and case. They are formally not distinguishable from any other adjective-noun pair, but their meaning can't be inferred from the constituent words, and they therefore need a separate entry in the dictionary.

Currently, multiwords of this type can be handled by a brute force approach in the monolingual dictionaries by explicitly defining the adj-noun combination in all its inflected forms. This works fine in languages with little inflection (see the example of dirección general in the wiki), but gets increasingly ugly when a language inflects with more variation, like the Slavic or some Germanic languages (example: Baba Jaga in pl-??, German gelbe Rübe ("carrot"), Serbian airports).
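To make the growth of the brute force approach concrete, here is a small Python sketch that counts how many explicit analyses one adj-noun multiword would need. The category lists and their names are illustrative assumptions, not Apertium tag sets: a Spanish noun phrase with fixed gender varies only in number, while a German one varies in case, number and adjective declension type (strong, weak, mixed).

```python
from itertools import product

# Illustrative category inventories (names are assumptions, not Apertium tags).
NUMBERS = ["sg", "pl"]
CASES = ["nom", "gen", "dat", "acc"]
DECLENSIONS = ["strong", "weak", "mixed"]  # German adjective declension types


def paradigm_cells(*dimensions):
    """Return every combination of values that needs its own explicit entry."""
    return list(product(*dimensions))


# Gender is fixed by the noun, so it does not multiply the number of entries.
spanish_entries = paradigm_cells(NUMBERS)
german_entries = paradigm_cells(DECLENSIONS, NUMBERS, CASES)

print(len(spanish_entries), len(german_entries))
```

Many of the German cells share a surface form, but each still needs its own analysis, which is why a rule with agreement variables scales better than listing every form.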

To process this type of multiword, I propose a definition similar to this one in the multiword dictionary. Please note that the exact syntax still needs to be discussed with the core developers, so please treat the following as pseudocode to illustrate the example:

<e>
  <p>
    <l>gelbe<s n="adj"/><s n="f"/><s n="NUM"/><s n="CASE"/>
       <br />Rübe<s n="n"/><s n="f"/><s n="NUM"/><s n="CASE"/></l>
    <r>gelbe<br />Rübe<s n="np"/><s n="f"/><s n="NUM"/><s n="CASE"/></r>
  </p>
</e>

Upper case tags indicate that the words have to agree in these categories, and that whatever values these tags have need to be preserved.

The monodix will only have entries for the adjective gelb and the noun Rübe on their own. The bidix will have an entry for gelbe<br />Rübe as if it were a multiword in which only one constituent word is inflected.

The multiword module shall accept a disambiguated stream similar to this:

^gelbe/gelb<adj><f><pl><nomgendatacc>$^Rüben/Rübe<n><f><pl><nomgendatacc>$

and output something similar to this:

^gelbe Rüben/gelbe Rübe<np><f><pl><nomgendatacc>$
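A minimal Python sketch of this step, under stated assumptions: every lexical unit has the form ^surface/lemma<tags>$, agreement means that all tags after the first (category) tag are identical, and the multiword dictionary is a plain in-memory mapping. The names `LexicalUnit`, `MULTIWORDS`, `parse_stream`, `merge_adj_noun` and `render` are hypothetical and not part of Apertium; the real tool would use the existing stream libraries.

```python
import re
from dataclasses import dataclass

# One lexical unit of the disambiguated stream: ^surface/lemma<tag><tag>...$
LU_PATTERN = re.compile(r"\^([^/^$]+)/([^<$]+)((?:<[^>]+>)*)\$")


@dataclass
class LexicalUnit:
    surface: str
    lemma: str
    tags: list


# Hypothetical multiword dictionary:
# (adjective lemma, noun lemma) -> (multiword lemma, category tag)
MULTIWORDS = {("gelb", "Rübe"): ("gelbe Rübe", "np")}


def parse_stream(stream):
    """Extract all lexical units; text between units is ignored in this sketch."""
    units = []
    for match in LU_PATTERN.finditer(stream):
        tags = re.findall(r"<([^>]+)>", match.group(3))
        units.append(LexicalUnit(match.group(1), match.group(2), tags))
    return units


def merge_adj_noun(units):
    """Merge an adjacent adj + n pair if it is listed and all other tags agree."""
    merged = []
    i = 0
    while i < len(units):
        cur = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if (
            nxt is not None
            and cur.tags[:1] == ["adj"]
            and nxt.tags[:1] == ["n"]
            and cur.tags[1:] == nxt.tags[1:]  # agreement in gender, number, case
            and (cur.lemma, nxt.lemma) in MULTIWORDS
        ):
            lemma, category = MULTIWORDS[(cur.lemma, nxt.lemma)]
            # The agreeing tags are carried over unchanged to the multiword.
            merged.append(
                LexicalUnit(f"{cur.surface} {nxt.surface}", lemma, [category] + nxt.tags[1:])
            )
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged


def render(units):
    """Serialise lexical units back into stream format (no blanks between units)."""
    return "".join(
        "^{}/{}{}$".format(u.surface, u.lemma, "".join(f"<{t}>" for t in u.tags))
        for u in units
    )


stream = "^gelbe/gelb<adj><f><pl><nomgendatacc>$^Rüben/Rübe<n><f><pl><nomgendatacc>$"
print(render(merge_adj_noun(parse_stream(stream))))
```

A pair whose tags do not agree, or whose lemmas are not in the dictionary, passes through unchanged, so ordinary adjective-noun sequences are not affected.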

Related, but more complicated example:

adj-noun with additional words in it: Polish Umawiające się Strony ("contracting parties"), where Umawiające agrees with Strony, and się is invariable

Type "verb ... particle"

type b:

phrasal / particle verbs that are reordered depending on their position in the sentence, like V2 in Icelandic. This also applies to reflexive verbs in Czech, where the reflexive particle needs to be in the second position in the sentence: jmenovat se -- jmenuju se Sonja -- Ona se jmenuje Sonja
the above also covers some cases of separable verbs, namely those where nothing else stands between the finite verb and the particle because the verb is intransitive and the sentence contains no additional element (such as an adverbial complement): ankommen -- ich komme an

type c:

phrasal / particle verbs in which something else stands between the finite verb form and the particle -- to make it up
separable verbs as a special case of the above, where the particle is, in some forms, written together with the verb -- ankommen -- ich komme an -- ich komme am Bahnhof an / ich komme um sieben Uhr am Bahnhof an
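The recognition side of types b and c can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the input is one clause given as (lemma, category) pairs, the separated particle is the last token of the clause, and `SEPARABLE` is a hypothetical dictionary. The category names used here are placeholders, not Apertium tags.

```python
# Hypothetical dictionary: (verb lemma, particle) -> lemma of the separable verb
SEPARABLE = {("kommen", "an"): "ankommen"}


def join_separable(tokens):
    """Rejoin a finite verb with a clause-final particle, if the pair is listed.

    tokens: list of (lemma, category) pairs for a single clause.
    Returns a new list; the input is not modified.
    """
    if len(tokens) < 2:
        return list(tokens)
    last_lemma, _ = tokens[-1]
    # Only tokens before the last one can be the verb belonging to the particle.
    for i, (lemma, category) in enumerate(tokens[:-1]):
        if category == "verb" and (lemma, last_lemma) in SEPARABLE:
            joined = SEPARABLE[(lemma, last_lemma)]
            # Replace the verb with the full lemma and drop the particle.
            return tokens[:i] + [(joined, "verb")] + tokens[i + 1:-1]
    return list(tokens)


# "ich komme an": nothing between verb and particle (type b)
print(join_separable([("ich", "prn"), ("kommen", "verb"), ("an", "part")]))

# "ich komme am Bahnhof an": other words between verb and particle (type c)
print(join_separable([
    ("ich", "prn"), ("kommen", "verb"),
    ("an", "pr"), ("der", "det"), ("Bahnhof", "n"),
    ("an", "part"),
]))
```

Because only the clause-final token is checked, the preposition an inside "am Bahnhof" is left alone. Finding clause boundaries and handling nested clauses are deliberately left out of this sketch.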

Additional types, not covered unless there's extra time

type d:

any combination of the above
ambiguous cases, e.g. "the man threw off the dog who bites his hand off", which can also be phrased as:
"the man threw the dog, who bites his hand off, off." (nesting)
"the man threw the dog, who bites his hand, off." (commas)
"the man threw the dog biting his hand off." (no way to tell)
recognize the first and second, recognize the third but ignore the ambiguity, generate none of these
two or more inflected words which do not agree with each other (French passé composé)

Timeline

Now: read code, work on any language pair (en-de because I know it, nl-de was suggested on IRC) to get acquainted with the system and the work of a language pair maintainer.

Community bonding phase: Define format of the multiword dictionary

Week 1: create new tool (multiword-transfer?), parse dictionary
Week 2: read disambiguated stream with the help of existing libraries
Week 3: recognize and generate multiwords of type adj-noun
Week 4: recognize and generate multiwords of type être invitée

Deliverable #1: working binary that can analyse and generate multiwords of type a

Week 5 and 6: recognize and generate multiwords of type koma frá and jmenovat se
Week 7 and 8: recognize and generate particle verbs and separable verbs with single words between their parts

Deliverable #2: working binary that can analyse reordering and separating multiwords

Week 9: recognize particle verbs with arbitrarily long passages between verb and particle
Week 10: generate these sentences
Week 11: work on corner cases, nested expressions and ambiguous cases
Week 12: final clean-up and release preparation

Project completed: analyse multiwords of type a, b, c, generate sentences with multiwords of type a, b, and simplified c

Reasons why Google and Apertium should sponsor it

Enhanced multiword support will make Apertium usable for more languages. As it is now, some of the multiword constructs can only be implemented with workarounds in the dictionary, and some, like separable verbs, not at all. Having support for these will improve the translation quality for many languages. Also, a logical and documented way to describe these multiwords and handle them in the engine will make the work of language-pair maintainers easier. This will lead to more language pairs and increase the scope and impact of the Apertium project.

A description of how and who it will benefit in society

The variety of languages currently spoken is an important part of cultural diversity. But still, people need to communicate, and have access to written information that is only available in some languages -- textbooks, manuals, news. Usable, open source machine translation for a broad range of languages will be a real help in people's lives.

User:Skh/Application GSoC 2010
