User:Aaron Rubin/Application


Introduction

I am a third-year student at the University of Chicago, majoring in linguistics with a minor in computer science. I speak English, Japanese, Arabic, and Hebrew, and I've studied enough Spanish and French to be comfortable looking at the Apertium dictionary files for those languages. I've been looking for an opportunity to contribute to a machine-translation project, so I was very excited to see Apertium listed as one of the participating organizations in Google Summer of Code. Given the nature of my programming experience (mostly C programs that analyze text files), the lint tester for dictionary and transfer-rules files seemed best suited to me. Once completed, this program will allow Apertium developers to find small errors in enormous files that might otherwise be very difficult to detect and debug.

Discussion of Implementation

I have already completed the "Repeated Tags" aspect of the lint tester. I also have a few ideas for how I will implement the "Redundant Entries" tester and the tester for "full lemmas in entries where part of the lemma is specified by the pardef," both for monolingual dictionaries. Below, I describe my proposed implementations for those features in detail, along with a tentative work schedule that includes a few other suggestions for the lint tester for .dix files and the lint tester for transfer rules.

The Redundant Entry Finder will begin by checking whether the entries within one paradigm are a subset of the entries within another paradigm. To do this, the program will first walk over the <pardefs> section and build a list of pointers to paradigm-structs. A paradigm-struct will contain a string (the name) and a list of entries. An entry is a (pointer to) a structure consisting of:

1. The left side (a string)

2. The right side (a pointer to another structure, containing a string corresponding to the dictionary form and a list of tags).
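
As a rough sketch in C (all type and field names here are illustrative choices of mine, not existing Apertium code), those structures might look like this:

    /* Illustrative sketch of the proposed data structures. */

    typedef struct right_side {
        char *dictionary_form;   /* the string inside <r>...</r>, minus the tags */
        char **tags;             /* e.g. {"n", "f", "sg"} */
        int num_tags;
    } right_side;

    typedef struct entry {
        char *left;              /* the string inside <l>...</l> */
        right_side *right;
    } entry;

    typedef struct paradigm {
        char *name;              /* the pardef's name attribute */
        entry **entries;         /* list of pointers to entries */
        int num_entries;
    } paradigm;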

We will also write a function to compare entries, which will involve a series of string comparisons, and a function to check for subsettedness of paradigms. A paradigm is a subset of another paradigm if it has fewer entries than the other paradigm and every entry within it is contained within the other paradigm (this is where we would use the compare_entries function).
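
A minimal sketch of those two functions, assuming the structures above and that every entry stores its tags in a consistent order:

    #include <string.h>

    /* Returns 1 if two entries are identical, 0 otherwise. */
    int compare_entries(const entry *a, const entry *b)
    {
        if (strcmp(a->left, b->left) != 0)
            return 0;
        if (strcmp(a->right->dictionary_form, b->right->dictionary_form) != 0)
            return 0;
        if (a->right->num_tags != b->right->num_tags)
            return 0;
        for (int i = 0; i < a->right->num_tags; i++)
            if (strcmp(a->right->tags[i], b->right->tags[i]) != 0)
                return 0;
        return 1;
    }

    /* Returns 1 if p is a proper subset of q: p has fewer entries than q,
       and every entry of p appears somewhere in q. */
    int paradigm_is_subset(const paradigm *p, const paradigm *q)
    {
        if (p->num_entries >= q->num_entries)
            return 0;
        for (int i = 0; i < p->num_entries; i++) {
            int found = 0;
            for (int j = 0; j < q->num_entries && !found; j++)
                found = compare_entries(p->entries[i], q->entries[j]);
            if (!found)
                return 0;
        }
        return 1;
    }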

With these functions available, the program will walk down the list of paradigm-structs and check, for each one, whether it is a subset of any of the other paradigm-structs or any of the other paradigm-structs are a subset of it. If so, we would build two string_string structures: one with the name of the subset paradigm as the first string, and one with the name of the paradigm of which it is a subset as the first string. For instance, since "absolut/o_adj" is a subset of "abstract/o_adj" in Spanish, we would build one structure containing "absolut/o_adj" as the first string and "abstract/o_adj" as the second, and one structure in which their positions are reversed. We would insert both of those structures into a hash table consisting of an array of (lists of, to handle collisions) string_strings, with hash values determined by a mathematical transformation, using ASCII values, of the first string in each.
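
In C, the table and its insertion function might be sketched as follows (the table size and the particular hash function are placeholders; only the chaining scheme matters here):

    #include <stdlib.h>
    #include <string.h>

    #define NUM_BUCKETS 1024     /* illustrative table size */

    typedef struct string_string {
        char *first;                  /* e.g. "absolut/o_adj" */
        char *second;                 /* e.g. "abstract/o_adj" */
        struct string_string *next;   /* chaining handles collisions */
    } string_string;

    static string_string *subset_table[NUM_BUCKETS];

    /* A simple hash built from the ASCII values of the string. */
    unsigned int hash_string(const char *s)
    {
        unsigned int h = 0;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % NUM_BUCKETS;
    }

    /* Record one direction of the subset relation between two paradigms. */
    void insert_pair(const char *first, const char *second)
    {
        string_string *node = malloc(sizeof *node);
        unsigned int h = hash_string(first);
        node->first = strdup(first);
        node->second = strdup(second);
        node->next = subset_table[h];
        subset_table[h] = node;
    }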

Having constructed that hash table, we would move on to processing the entries. We will build a second hash table, this one consisting of an array of (linked lists of) "lemma_paradigm" structs, each containing a string (the lemma) and a list of strings (the paradigms associated with that lemma). We will insert the lemma and paradigm of each lexical entry into this hash table. When we encounter a collision, we will check, for each lemma_paradigm already in the bucket, whether its lemma matches the lemma we are currently inserting; if so, we will simply cons the paradigm onto that lemma_paradigm's list of paradigms. If not (if the collision is due to identical hash values but a different lemma), we will add the lemma and paradigm as a new lemma_paradigm in the bucket's list.
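
A sketch of that insertion logic, reusing hash_string and NUM_BUCKETS from the previous sketch:

    typedef struct lemma_paradigm {
        char *lemma;
        char **paradigms;             /* paradigms seen with this lemma */
        int num_paradigms;
        struct lemma_paradigm *next;  /* chaining handles collisions */
    } lemma_paradigm;

    static lemma_paradigm *lemma_table[NUM_BUCKETS];

    /* Insert one lexical entry's (lemma, paradigm) pair. */
    void insert_lemma(const char *lemma, const char *pardef)
    {
        unsigned int h = hash_string(lemma);
        for (lemma_paradigm *lp = lemma_table[h]; lp != NULL; lp = lp->next) {
            if (strcmp(lp->lemma, lemma) == 0) {
                /* Same lemma already present: append this paradigm to its list. */
                lp->paradigms = realloc(lp->paradigms,
                                        (lp->num_paradigms + 1) * sizeof *lp->paradigms);
                lp->paradigms[lp->num_paradigms++] = strdup(pardef);
                return;
            }
        }
        /* Either the bucket is empty or the collision is a different lemma
           with the same hash value: add a new node to the bucket's list. */
        lemma_paradigm *lp = malloc(sizeof *lp);
        lp->lemma = strdup(lemma);
        lp->paradigms = malloc(sizeof *lp->paradigms);
        lp->paradigms[0] = strdup(pardef);
        lp->num_paradigms = 1;
        lp->next = lemma_table[h];
        lemma_table[h] = lp;
    }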

After all of the entries are processed, we would traverse the hash table and, for each lemma_paradigm, check whether any paradigm is repeated or is a subset of another (for which we need only consult the hash table built earlier, instead of re-running our subsettedness function). We would then print helpful diagnostic information to standard output. Conveniently, by checking for repeated paradigms in the lemma_paradigm structures, this pass catches not only subsetted paradigms but also simple duplicate entries.
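
A sketch of that final pass, again building on the tables above (since insert_pair recorded both orderings of each subset relation, checking one lookup direction suffices):

    #include <stdio.h>

    /* Returns 1 if the pair (a, b) was recorded in the subset table. */
    int pair_recorded(const char *a, const char *b)
    {
        for (string_string *ss = subset_table[hash_string(a)]; ss != NULL; ss = ss->next)
            if (strcmp(ss->first, a) == 0 && strcmp(ss->second, b) == 0)
                return 1;
        return 0;
    }

    /* Report duplicated paradigms and subset relations for each lemma. */
    void report_redundancies(void)
    {
        for (int i = 0; i < NUM_BUCKETS; i++)
            for (lemma_paradigm *lp = lemma_table[i]; lp != NULL; lp = lp->next)
                for (int a = 0; a < lp->num_paradigms; a++)
                    for (int b = a + 1; b < lp->num_paradigms; b++) {
                        if (strcmp(lp->paradigms[a], lp->paradigms[b]) == 0)
                            printf("duplicate entry: lemma \"%s\" uses paradigm \"%s\" twice\n",
                                   lp->lemma, lp->paradigms[a]);
                        else if (pair_recorded(lp->paradigms[a], lp->paradigms[b]))
                            printf("redundant entry: for lemma \"%s\", paradigm \"%s\" overlaps \"%s\"\n",
                                   lp->lemma, lp->paradigms[a], lp->paradigms[b]);
                    }
    }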

Testing "Full lemmas in entries where part of the lemma is specified by the pardef," would be a somewhat simpler task. The naive approach would be to check whether the entry ends with a sequence contained in the right side of the pardef. However, that obviously turns up false positives in certain theoretically possible cases - for instance, if the indefinite singular feminine form of the word "slette" as used in the example wiki page was actually "slettee" and the stem was "slette." So a more rigorous approach would be to test whether the lemma is identical to the item in the entry and the right side of the paradigm contains any element at all. We would build a hash table of structs consisting of paradigm names and the string within <r></r> (but not the tags). After building that hash table, we would go over the lexical entries and for each entry, compare the lemma with whatever's in <r></r> or . If identical, we would find the entry's paradigm in the hash table and check whether the string on the right side contains any elements. If so, we would print the offending entry to stdout.

I would also check for a few other potential errors in the lint tester for .dix files. Those features, as well as the proposed features of the lint tester for transfer rules, are described in the work plan below (thank you, Kevin, Francis, Jimmy, and everyone else on the mailing list for your suggestions regarding transfer rules in particular). In this work plan, the primary deliverables are a lint tester for .dix files after Week 5 and a lint tester for transfer-rules files at the end of the summer. In every week, some time will be spent writing tests to make sure that week's features successfully catch errors.

Work Plan

Weeks 1-5, .dix files (this program would check whether the dictionary is monolingual or bilingual, since some of these tests only apply to one of the two):

Week 1: Testing for redundant entries.

Week 2: Testing for full lemmas in entries where part of the lemma is specified by the pardef. Testing for misspelled tags and pardefs.

Week 3: Testing for incompatible tags (e.g., both masculine and feminine instead of a single combined "mf" tag, or the impossible combination of neuter and masculine where noun and masculine was intended). Testing for a part-of-speech tag missing on one side of a pair of translation equivalents (in bilingual dictionaries; e.g., English "house" being tagged as a noun, but Spanish "casa" missing the tag).

Week 4: Testing for missing gender on nouns in gendered languages (in bilingual dictionaries, where direction is from non-gendered to gendered or direction is unspecified and one language is non-gendered).

Week 5: Bundling the features together into one program. Re-organizing code and writing documentation to make sure that everything is as neat and maintainable as possible. Combining the tests from previous weeks into a single testing program so that all features can be tested at once when the code is modified in the future.

Weeks 6-12, transfer rules (a few of these are already caught by the validator, but we might as well cover them anyway, time permitting, and hopefully print more descriptive error messages):

Week 6: Testing for inappropriate uses of <equal>, <begins-with>, <ends-with>, and <let> in transfer rules (equating a tag with a non-empty string literal, etc.)

Week 7: Testing for cases where the user asks for nonexistent tags with <lit-tag v="some_tag"/> (always an error) or for a string literal with <lit v="some_string"/> that is identical to a tag (suspicious and very likely an error, unless it is within <concat>, which I'd take into account).

Week 8: Testing for undefined tags after attr-item in attribute definitions (probably due to spelling errors). Checking for calls to anything other than a defined attribute, lem, lemh, lemq, whole, or tags after part= in a clip.

Week 9: Testing for patterns that refer to non-existent categories, probably due to spelling errors. Testing for misspelled variables.

Week 10: Testing for an untagged chunk (e.g., in the rule "HACE NUM NOM" in apertium-en-es.en-es.t1x, forgetting to give the resulting chunk the tag "adverb", which seems like a conceivable mistake to me). Checking for an incorrect number of arguments in calls to a macro.

Week 11: Testing for a missing <test> after <when> and for non-boolean arguments to <test>, <and>, <not>, and <or>. Testing for a missing lemma queue after the lemma head.

Week 12: Bundling all features together into one program (note that this program would need to take a bilingual dictionary file, in addition to a transfer-rules file, as input to determine the set of valid tags). Re-organizing code and writing documentation to make sure that everything is as neat and maintainable as possible. Combining the tests from previous weeks into a single testing program so that all features can be tested at once when the code is modified in the future.

Availability

With respect to availability, my summer is currently very open. Since my school is on a quarter rather than a semester schedule, Summer of Code technically begins in the ninth (second-to-last) week of the quarter, during which I will have a full class schedule and be preparing for finals. However, school also starts later in the fall (September 30th), so I will be able to stay around after Summer of Code is technically over to make sure that the project is complete.

Coding Challenge

The following is a link to the already-completed "Repeated Tags" aspect of the lint checker (for .dix files). Compile it with make and run it with ./repeated_tags; it takes the name of a .dix file as a command-line argument: repeated_tags