Latest revision as of 19:57, 8 April 2010

Improving multiword support in Apertium

About me

Name

Sonja Krause-Harder

Contact information

E-mail: krauseha@gmail.com
IRC: skh on freenode
Sourceforge: skh
Apertium wiki: Skh

List your skills and give evidence of your qualifications.

I am studying computational linguistics and Indo-European studies at the University of Erlangen. I'm in my second year of a three-year undergraduate program. My courses so far include formal languages, data structures and algorithms, morphological analysis (with JSLIM, see http://www.linguistik.uni-erlangen.de/clue/en/research/jslim.html) and linguistics.

Before I started studying I worked for 7 years at SuSE Linux / Novell as a Linux packager and software developer. I maintained RPM packages related to Java development (Eclipse, Tomcat, Jakarta project) as well as the Apache web server, and I helped develop internally used tools.

During the initial launch of the openSUSE project I was involved in concept discussions and community relations, presenting the project externally at conferences and internally to other departments at Novell, to improve the collaboration between the openSUSE community and SuSE / Novell R&D.

Examples of my work:

A tool to transliterate Devanagari into IAST or Harvard-Kyoto transliteration:
http://www.linguistik.uni-erlangen.de/~sakrause/transliterate

SWAMP: a workflow management system used internally at SuSE. I worked on the workflow definition language and the core workflow engine.
http://swamp.sf.net

Language skills

native: German, near-native: English
some: French, Czech
little: Italian, Spanish, Dutch, Icelandic
ancient: Sanskrit, Ancient Greek, some Latin

Any non-Summer-of-Code plans for the Summer

Summer term at my university finishes on July 24th, so until then I have some class work to do. Should I be accepted in the program, I will pause the student programming job (20 hours/week) which I've been doing since I started studying, and spend at least these 20 hours/week on my Google Summer of Code project. After July 24th I have no other plans but GSoC.

Motivation

Why is it you are interested in machine translation?

I have been interested in languages for a long time, and I had already been working as a software developer, so studying computational linguistics was the logical next step. Machine translation appeals to me because doing it successfully requires current research, real-world engineering methods, and attention to efficiency.

Also, while there are already translation tools that work well for specific subject areas and languages, there is still pioneering work to be done, especially for languages that don't have that many speakers.

Why is it that you are interested in the Apertium project?

It is one of the few projects in the field of natural language processing that are completely open source. I really believe in open source, and in my opinion machine translation, as a result of linguistic research, should be accessible and usable for as many people as possible, not only those who can afford to pay for expensive proprietary translation tools.

I like the architecture: small unix tools in a chain that do one thing only and can be used differently for different language pairs.

There is a considerable variety of languages already in the project, and the project is very active. I've already received lots of help on IRC and the mailing list and feel that Apertium is a mentoring organization that is willing to help its students, and is interested in good and usable results from its GSoC projects.

Project: Improving multiword support in Apertium

Natural languages can have lexical units which consist of two or more separate words, and which as a unit, following certain composition rules, have a meaning that cannot be inferred from the meanings of their constituent parts. To handle these lexical units in Apertium, the concept of multiwords is used. Because the ways in which languages use multiword constructs are so varied, only some cases can be handled with the current dictionary syntax and implementation in Apertium. This project aims at extending multiword support in Apertium so that two more major types of multiwords can be handled.

Supported multiword constructs

Three kinds of multiword lexical units are already supported, as explained in apertium-documentation. These are:

Multiwords without inflection: short phrases used like adverbs (Example: English at the moment)

Compound multiwords: two words concatenated into one for phonetic or orthographic reasons (Examples: Spanish del < de el, English isn't < is not)

Multiwords with inner inflection: groups of two or more words where one is inflected and the others unchanged (Example: English record player with plural record players, take away with past tense took away)

Missing multiword constructs

I would like to add support for the following kinds of multiwords to Apertium:

Complex multiwords (adj-noun)

These multiwords consist of two inflected words, typically an adjective and a noun, which agree with each other in gender, number and case. They are formally not distinguishable from any other adjective-noun pair, but their meaning can't be inferred from the constituent words, and they therefore need a separate entry in the dictionary.

Currently, multiwords of this type can be handled in the monolingual dictionaries by explicitly defining the adj-noun combination in all its inflected forms. This works fine in languages with little inflection (see the example for dirección general on the Multiwords wiki page), but gets increasingly complicated when a language inflects with more variation, like the Slavic or some Germanic languages.

More examples:

Polish Baba Jaga (proper noun), see apertium/incubator/apertium-en-pl/pldic/pldix-np-ant-mw-lex.xml
German gelbe Rübe ("carrot")
Serbian zračna luka ("airport"), see Multiwords

To process this type of multiword, I propose a definition similar to this one in the multiword dictionary:

pair:
  left : gelbe(adj-f-$num-$case) Rübe(n-f-$num-$case)
  right: gelbe Rübe(np-f-$num-$case)
  agree-on: $num, $case

This is of course pseudocode; please see the discussion part of this page for how this could be expressed in a way similar to the existing Apertium syntax.

The monodix will only have entries for the adjective gelb and the noun Rübe on their own. The bidix will have an entry for gelbe Rübe as defined in the multiword dictionary.

The multiword module shall accept a disambiguated stream similar to this:

^gelbe/gelb<adj><f><pl><nomgendatacc>$^Rüben/Rübe<n><f><pl><nomgendatacc>$

and output something similar to this:

^gelbe Rüben/gelbe Rübe<np><f><pl><nomgendatacc>$
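The merging step above could be sketched roughly as follows in Python. This is only an illustration under assumptions, not the proposed implementation: it assumes every lexical unit in the stream has the form ^surface/lemma<tags>$, the ENTRY dictionary and all function names are hypothetical, and agreement is simplified to "all tags after the part of speech are identical" instead of a per-entry agree-on list.

```python
import re

# One lexical unit: ^surface/lemma<tag1><tag2>...$
UNIT = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")

# Hypothetical multiword entry: lemmas of both parts plus the lemma and
# part-of-speech tag of the merged unit.
ENTRY = {"adj": "gelb", "noun": "Rübe", "lemma": "gelbe Rübe", "pos": "np"}


def parse_stream(stream):
    """Return a list of (surface, lemma, [tags]) tuples."""
    return [(surface, lemma, re.findall(r"<([^>]+)>", tags))
            for surface, lemma, tags in UNIT.findall(stream)]


def merge_adj_noun(units, entry=ENTRY):
    """Merge adjacent adj+noun units that match the entry and agree."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units):
            (s1, l1, t1), (s2, l2, t2) = units[i], units[i + 1]
            if (l1 == entry["adj"] and t1[:1] == ["adj"]
                    and l2 == entry["noun"] and t2[:1] == ["n"]
                    and t1[1:] == t2[1:]):  # simplified agreement check
                out.append((s1 + " " + s2, entry["lemma"],
                            [entry["pos"]] + t2[1:]))
                i += 2
                continue
        out.append(units[i])
        i += 1
    return out


def to_stream(units):
    """Serialize (surface, lemma, [tags]) tuples back into stream format."""
    return " ".join("^%s/%s%s$" % (s, l, "".join("<%s>" % t for t in tags))
                    for s, l, tags in units)
```

Pairs whose tags do not agree are passed through unchanged, so an ordinary adjective-noun sequence with mismatched analyses would not be mistaken for the multiword.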

Related, but more complicated example:

adj-noun with additional words in it: Polish Umawiające się Strony ("contract parties") where Umawiające agrees with Strony, and się is invariable

Discontiguous multiwords (verb ... particle)

These multiwords consist of a verb and a particle, which do not stand next to each other in the sentence. These constructs vary wildly across languages, so flexible ways to define them in the dictionary are necessary.

An example is the English phrasal verb to throw sth. away. With increasing difficulty, sentence structures like the following need to be recognized and generated:

I throw away the letter.: This can already be handled, it is the existing type of multiword with inner inflection.
I throw it away. has a pronoun between verb and particle,
I throw the letter away. has a noun phrase, and
I throw the big red nasty letter from my brother away. still a noun phrase, but a far more elaborate one.

To describe the possible patterns of this example in the dictionary, it needs to be defined what can come between the verb and the particle (in this case, the direct object of the verb, whatever form it takes).

Another example is separable verbs, which are written as one word in their non-finite forms, but as separate words in their finite ones, like German ankommen ("to arrive"):

Ich komme an. ("I arrive.")
Ich komme am Bahnhof an. ("I arrive at the station.")
Ich komme morgen abend am Bahnhof an. ("I arrive at the station tomorrow evening.")
Ich komme, wenn nichts dazwischen kommt, morgen abend am Bahnhof an. ("If nothing unexpected happens, I arrive at the station tomorrow evening.")

The difference from the example above is that the particle always stands at the end of a clause, even if whole subordinate clauses are inserted, and that it is less important what kinds of phrases stand between verb and particle.

As discontiguous multiwords are much more varied than the type of complex multiwords described above, I would like to restrict their treatment in this project to phrasal verbs like to throw away in simple sentences with either a pronoun or a simple noun phrase between the verb and the particle. After the project it can be evaluated whether this approach is usable and extendable to more complicated cases.
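To make the restricted case concrete, here is a minimal Python sketch of how a pattern over part-of-speech tags could recognize a verb and its particle with a pronoun or a simple noun phrase in between. Everything in it is an assumption for illustration: the tag names, the one-letter codes, the definition of "simple noun phrase" (optional determiner, any number of adjectives, one noun) and the function name are hypothetical, and the input is taken to be an already disambiguated list of (lemma, part of speech) pairs.

```python
import re

# Hypothetical one-letter codes for the part-of-speech tags used below.
CODES = {"vblex": "V", "prn": "P", "det": "D", "adj": "A", "n": "N", "adv": "R"}

# Verb, then a pronoun or a simple noun phrase (optional determiner,
# any number of adjectives, one noun), then the particle.
GAP = re.compile(r"V(P|D?A*N)R")


def find_phrasal(tokens, verb="throw", particle="away"):
    """tokens: list of (lemma, pos) pairs.

    Return (verb_index, particle_index) for the first match, or None.
    """
    # One character per token, so string offsets equal token indices.
    codes = "".join(CODES.get(pos, "?") for _, pos in tokens)
    for i, (lemma, pos) in enumerate(tokens):
        if lemma == verb and pos == "vblex":
            m = GAP.match(codes, i)
            if m and tokens[m.end() - 1][0] == particle:
                return i, m.end() - 1
    return None
```

With this pattern, I throw it away and I throw the letter away are recognized, while the elaborate noun phrase in I throw the big red nasty letter from my brother away is deliberately not matched, because the preposition falls outside the simple noun phrase definition.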

Ambiguity

It is also possible to construct many ambiguous cases with discontiguous multiwords. As an example, consider the following sentences:

Unambiguous because of word order: The man threw off the dog who bites his ear off. (1)
Nested, unambiguous because of punctuation: The man threw the dog, who bites his ear off, off. (2)
Two candidate verbs to which the particle may belong, unambiguous because of punctuation: The man threw the dog, who bites his ear, off. (3)
Ambiguous: The man threw the dog biting his ear off. (4)

Sentences like (1) are what I refer to as "simple sentences" in the work plan; these should be recognized and generated correctly. (2) and (3) can in theory be recognized. (4) is inherently ambiguous and would need to be analysed in different ways, passing the ambiguity on to another module that uses other approaches to resolve it.

Also, it might be best, for the more difficult cases, to only attempt to recognize, but not correctly generate them, so not all transformations will be bidirectional.

Cases (2) to (4) are most likely also beyond the scope of this project, but may be worked upon at the end if there's time left.

Additional types of multiwords

particle verbs that sometimes need to be reordered so that the particle comes first (because the finite verb needs to be in sentence-second position)

multiwords that are complex and discontiguous at the same time

verb constructions with auxiliary verbs (French passé composé, German perfect)

It is desired to extend the multiword module in the future so that these and other additional types of multiwords can be handled. They are beyond the main scope of this project, though. Some of them might be included if there's time left at the end.

Implementation

Formally, a finite state transducer can't handle complex multiwords as defined above, because to find out whether the two or more analysed word forms agree in the specified categories, it would need to keep track of the first match, and back-reference it in the second. This is outside regular languages (cf. Wikipedia on regular expressions). Explicitly listing all possible inflected forms works just fine of course, and if this is not desired in the dictionary definition itself, multiword definitions can be expanded automatically.
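The automatic expansion mentioned above could look roughly like this in Python. It is a sketch under assumptions: the feature inventory in FEATURES, the template syntax with $num and $case placeholders, and the function name expand are all hypothetical and only mirror the pseudocode definition given earlier. Each agreement variable is replaced by every value it can take, so that one compact definition becomes a finite list of explicit rules.

```python
from itertools import product
from string import Template

# Hypothetical feature inventories; a real language would supply its own.
FEATURES = {"num": ["sg", "pl"], "case": ["nom", "gen", "dat", "acc"]}


def expand(left, right, agree_on):
    """Expand templates containing $variables into explicit rules.

    left is a list of templates (one per word), right is a single
    template, agree_on lists the variables that must take the same
    value on both sides. Returns a list of (left_forms, right_form).
    """
    rules = []
    for values in product(*(FEATURES[name] for name in agree_on)):
        binding = dict(zip(agree_on, values))
        rules.append(([Template(t).substitute(binding) for t in left],
                      Template(right).substitute(binding)))
    return rules


rules = expand(
    left=["gelb<adj><f><$num><$case>", "Rübe<n><f><$num><$case>"],
    right="gelbe Rübe<np><f><$num><$case>",
    agree_on=["num", "case"],
)
```

For two numbers and four cases this yields eight explicit rules, each of which needs no back-reference and could therefore be compiled like any other fixed entry.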

Discontiguous multiwords can be described by regular expressions, so it should be possible to describe and handle them with the existing FST libraries used in Apertium.

Conceptually, the multiword module could fit well into lttoolbox, and share code for handling dictionaries and processing the data stream.

Integration into the apertium pipeline

Analysis of these two new types of multiwords should happen in a separate module. Initially I would like to run it after the POS tagger so that it can work on a disambiguated data stream. It is possible, however, that in some cases the POS tagger destroys a multiword by tagging one of its constituent words incorrectly. Because of this it might be desirable, at some point, to extend the multiword module and have it work on the data stream that still contains ambiguous analyses from the morphological analyzer. This is also most likely beyond the scope of this project.

Timeline

All tasks include regression tests and user-level documentation in the form of cookbook style examples in the wiki.

Now: read code, work on any language pair (en-de because I know it, nl-de was suggested on IRC) to get acquainted with the system and the work of a language pair maintainer, research regular expression engines

Community bonding phase: Discuss and define format of the multiword dictionary, turn examples into test cases for regression tests

Week 1 and 2: Create new tool, parse dictionary.
Week 3 and 4: read disambiguated stream with help of existing libraries

Deliverable #1: working binary that can parse the multiword dictionary and read in a disambiguated data stream

Week 5 and 6: recognize and generate complex multiwords of type adj-noun
Week 7 and 8: recognize discontiguous multiwords in unambiguous simple sentences

Deliverable #2: working binary that can recognize and generate complex multiwords of type adj-noun and recognize discontiguous multiwords in unambiguous simple sentences

Week 9 and 10: generate discontiguous multiwords in simple sentences
Week 11 and 12: clean up and prepare for release

Project completed

Additional tasks, if there is time left at the end of the coding phase:

handle ambiguous cases of discontiguous multiwords
optionally process streams with ambiguous analyses still present
handle additional types of multiwords
work with language maintainers to use the new multiword types in existing language pairs

Reasons why Google and Apertium should sponsor it

Enhanced multiword support will make Apertium usable for more languages. As it is now, some of the multiword constructs can only be implemented with workarounds in the dictionary, and some, like separable verbs, not at all. Having support for these will improve the translation quality for many languages. Also, a logical and documented way to describe these multiwords and handle them in the engine will make the work of language-pair maintainers easier. This will lead to more language pairs and increase the scope and impact of the Apertium project.

A description of how and who it will benefit in society

The variety of languages currently spoken is an important part of cultural diversity. But still, people need to communicate, and have access to written information that is only available in some languages -- textbooks, manuals, news. Usable, open source machine translation for a broad range of languages will be a real help in people's lives.


Difference between revisions of "User:Skh/Application GSoC 2010"
