Ideas for Google Summer of Code/lint for Apertium

Make a program which tests Apertium data files for suspicious or unrecommended constructs (likely to be bugs). When several people work on the same code, entries can get repeated, and beginners can make unrecommended changes. A lint tester would help people write standard code for dictionaries and transfer files.

The lint tool should at least check lttoolbox (dix) dictionary data, perhaps also transfer rules. The Apertium New Language Pair HOWTO should introduce most of the terminology and background you need (but Monodix basics and Contributing to an existing pair may also be useful).

Tasks

Make a study of "dangerous" and "unrecommended" constructs for the different kinds of Apertium data:
- Dictionary files
- Transfer files
- Tagger files
- Modes files
Make programs to detect these problems and report them to the user.

Coding challenge

Write a program which parses a .dix file and for each (surface form, lexical form) pair, lists entries/paradigms which generate this pair.
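
One possible starting point for the challenge is sketched below in Python. It is only a sketch under simplifying assumptions: it handles <i>, <p><l>/<r> and <par>, treats <l> as the surface side and <r> as the lexical side, and ignores <re>, <g>, <j/> and the direction restrictions. The function names and the SAMPLE dictionary (built from the enk/e__n pardef used in the examples) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<dictionary>
  <pardefs>
    <pardef n="enk/e__n">
      <e><p><l>a</l><r>e<s n="n"/><s n="f"/><s n="sg"/><s n="def"/></r></p></e>
      <e><p><l>e</l><r>e<s n="n"/><s n="f"/><s n="sg"/><s n="ind"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="slette"><i>slett</i><par n="enk/e__n"/></e>
  </section>
</dictionary>"""

def side_text(elem):
    """Flatten <l>, <r> or <i>: keep text, turn <b/> into a space, <s n="x"/> into <x>."""
    if elem is None:
        return ""
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "s":
            out.append("<%s>" % child.get("n"))
        elif child.tag == "b":
            out.append(" ")
        out.append(child.tail or "")
    return "".join(out)

def expand(entry, pardefs, seen=()):
    """Return the set of (surface, lexical) pairs generated by one <e>."""
    pairs = {("", "")}
    for part in entry:
        if part.tag == "i":
            text = side_text(part)
            options = {(text, text)}
        elif part.tag == "p":
            options = {(side_text(part.find("l")), side_text(part.find("r")))}
        elif part.tag == "par" and part.get("n") not in seen:
            name = part.get("n")
            options = set()
            for sub in pardefs.get(name, []):
                options |= expand(sub, pardefs, seen + (name,))
        else:
            continue  # <re> and friends are not handled in this sketch
        # An entry is the concatenation of its parts, so combine every option.
        pairs = {(s1 + s2, l1 + l2) for s1, l1 in pairs for s2, l2 in options}
    return pairs

def pairs_by_source(dix_xml):
    """Map each (surface, lexical) pair to the entries/paradigms generating it."""
    root = ET.fromstring(dix_xml)
    pardefs = {p.get("n"): p.findall("e") for p in root.iter("pardef")}
    index = defaultdict(list)
    for section in root.iter("section"):
        for entry in section.findall("e"):
            pars = ", ".join(p.get("n") for p in entry.findall("par"))
            label = "%s [%s]" % (entry.get("lm"), pars)
            for pair in expand(entry, pardefs):
                index[pair].append(label)
    return index

for (surface, lexical), sources in sorted(pairs_by_source(SAMPLE).items()):
    print(surface, lexical, sources)
```

For a real file, ET.parse(path).getroot() would take the place of ET.fromstring. Any pair whose list has more than one source is a candidate for the redundancy check described under Examples.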

Examples

Redundant entries: There may be two (or more) entries in a monolingual dictionary which generate the same lexical forms. More commonly, one entry generates a subset of the lexical forms generated by another entry. For instance, a few weeks ago we found the entries:

<e lm="soleado"><i>solead</i><par n="absolut/o__adj"/></e>

<e lm="soleado" a="prompsit"><i>solead</i><par n="abstract/o__adj"/></e>

The first one generates all the forms of the adjective for masculine/feminine and singular/plural. The second one generates the same forms and, in addition, the superlative forms. Obviously, some redundancy exists in this case, so it would be useful to detect this phenomenon and keep only the correct entry.
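
Once the lexical forms of each entry have been expanded, the check itself could be a plain set comparison. The sketch below assumes that expansion has already been done elsewhere; the function name is hypothetical and the form sets in EXAMPLE are shortened, made-up stand-ins for what the two soleado entries might generate.

```python
from itertools import combinations

def find_redundant(forms_by_entry):
    """Compare every pair of entries and report equal or subset form sets."""
    findings = []
    for (a, forms_a), (b, forms_b) in combinations(forms_by_entry.items(), 2):
        if forms_a == forms_b:
            findings.append((a, "same as", b))
        elif forms_a < forms_b:      # proper subset
            findings.append((a, "subset of", b))
        elif forms_b < forms_a:
            findings.append((b, "subset of", a))
    return findings

# Illustrative data only: not taken from a real dictionary.
EXAMPLE = {
    "soleado (absolut/o__adj)": {
        "soleado<adj><m><sg>", "soleado<adj><f><sg>",
    },
    "soleado (abstract/o__adj)": {
        "soleado<adj><m><sg>", "soleado<adj><f><sg>",
        "soleado<adj><sup><m><sg>",
    },
}

for finding in find_redundant(EXAMPLE):
    print(finding)
```

Comparing all pairs is quadratic, so a real tool would probably group entries by lemma first and only compare within each group.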

Repeated tags

A user attempting to write …<s n="n"/><s n="nt"/>… might miss the last "t" and end up with …<s n="n"/><s n="n"/>…. Repeated tags should be fairly easy to detect.
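
A minimal sketch of such a check, assuming it is enough to look at the <s> elements that are direct children of the same element (tags split between an entry and its pardef are not caught here). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def repeated_tags(xml_text):
    """Report (parent tag, symbol) for symbols used more than once in one element."""
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        counts = Counter(s.get("n") for s in elem.findall("s"))
        hits.extend((elem.tag, tag) for tag, n in counts.items() if n > 1)
    return hits

print(repeated_tags('<e><p><l>hus</l><r>hus<s n="n"/><s n="n"/></r></p></e>'))
```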

Tag order

We can assume some apertium-wide standards, like:

- the first tag after a lemma is one of prn/vblex/n/num/adj/det/…
- number never before gender
- art (det/ind) never before number/gender (exception: ind/def-determiners)
- qnt/pos/dem are always before gender/art/number

But tag order can quickly get a bit language-pair specific, so the final tool might accept an optional config file.
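
One way such a config could be represented is as tag groups plus a set of forbidden orders. The sketch below shows only the "number never before gender" rule; the group names, the group members and the function name are assumptions for illustration, and a language pair would supply its own.

```python
# Hypothetical configuration: which symbols belong to which group.
GROUPS = {
    "gender": {"m", "f", "nt", "mf"},
    "number": {"sg", "pl", "sp"},
}
# (earlier group, later group) combinations that should never occur.
NEVER_BEFORE = {("number", "gender")}

def order_violations(tags, groups=GROUPS, never_before=NEVER_BEFORE):
    """Return the (first, second) tag pairs that appear in a forbidden order."""
    kind = {tag: name for name, members in groups.items() for tag in members}
    found = []
    for i, first in enumerate(tags):
        for second in tags[i + 1:]:
            if (kind.get(first), kind.get(second)) in never_before:
                found.append((first, second))
    return found

print(order_violations(["n", "sg", "f"]))
print(order_violations(["n", "f", "sg"]))
```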

Full lemmas in entries where part of the lemma is specified by the pardef

One common error when you have a pardef that defines part of the lemma is to write that part twice (once in the pardef, once in the entry using the pardef), e.g.

<pardef n="enk/e__n">
 <e><p><l>a</l> <r>e<s n="n"/><s n="f"/><s n="sg"/><s n="def"/></r></p></e>
 <e><p><l>e</l> <r>e<s n="n"/><s n="f"/><s n="sg"/><s n="ind"/></r></p></e>
 …
</pardef>
…
<e lm="slette"><i>slette</i><par n="enk/e__n"/></e>

Here the correct entry should be

<e lm="slette"><i>slett</i><par n="enk/e__n"/></e>

(The pardef name shows that the "e" is defined in the pardef, but lint shouldn't rely on that.)
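
A heuristic that looks at the pardef contents rather than its name could be sketched as follows. It assumes simple entries of the form <i>stem</i><par …/>, and that the text at the start of each <r> in the pardef is the part of the lemma the pardef contributes. Multiword lemmas and nested pardefs are not handled, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def lemma_mismatches(dix_xml):
    """Flag entries whose lm cannot be rebuilt as stem + a pardef lemma ending."""
    root = ET.fromstring(dix_xml)
    # Text before the first <s> in each <r> is what the pardef adds to the lemma.
    endings = {}
    for pardef in root.iter("pardef"):
        endings[pardef.get("n")] = {r.text or "" for r in pardef.iter("r")}
    problems = []
    for section in root.iter("section"):
        for entry in section.findall("e"):
            stem, par, lm = entry.find("i"), entry.find("par"), entry.get("lm")
            if stem is None or par is None or lm is None:
                continue  # only the simple <i> + <par> shape is checked
            rebuilt = {(stem.text or "") + end
                       for end in endings.get(par.get("n"), ())}
            if rebuilt and lm not in rebuilt:
                problems.append((lm, sorted(rebuilt)))
    return problems
```

With the pardef above, the entry using <i>slette</i> would rebuild the lemma as "slettee", which does not match lm="slette" and so gets reported, while <i>slett</i> passes.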

Plain space in l/r/i

A plain space inside <l>, <r> or <i> should be written as <b/>.
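
A minimal sketch of this check: look at the text directly inside <l>, <r> and <i> (including the text that follows child elements such as <b/> or <s>) and report any literal space. The function name is hypothetical, and text nested deeper (for example inside <g>) is not inspected.

```python
import xml.etree.ElementTree as ET

def plain_spaces(xml_text):
    """Report (tag, text) for <l>/<r>/<i> elements containing a literal space."""
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in ("l", "r", "i"):
            # elem.text is the text before the first child; each child's tail
            # is the text that follows it inside the same parent.
            chunks = [elem.text or ""] + [child.tail or "" for child in elem]
            if any(" " in chunk for chunk in chunks):
                hits.append((elem.tag, "".join(elem.itertext())))
    return hits

print(plain_spaces('<e><i>a priori</i></e>'))
print(plain_spaces('<e><i>a<b/>priori</i></e>'))
```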

Duplicate pardefs

(This typically gets a warning on compile as well.)
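
"Duplicate" could be read in two ways, and the sketch below checks both as an assumption: pardefs that share a name, and pardefs with different names whose entries serialize identically. The function name is hypothetical, and the content comparison is naive (whitespace or attribute-order differences inside the entries would hide a duplicate).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def duplicate_pardefs(dix_xml):
    """Return (names defined more than once, groups of names with equal bodies)."""
    by_name = defaultdict(int)
    by_body = defaultdict(list)
    for pardef in ET.fromstring(dix_xml).iter("pardef"):
        name = pardef.get("n")
        by_name[name] += 1
        # Serialize each <e> so that bodies can be compared and used as dict keys.
        body = tuple(ET.tostring(e, encoding="unicode").strip()
                     for e in pardef.findall("e"))
        by_body[body].append(name)
    same_name = sorted(n for n, count in by_name.items() if count > 1)
    same_body = sorted(names for names in by_body.values()
                       if len(set(names)) > 1)
    return same_name, same_body
```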

Unused pardefs
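
A minimal sketch: collect the names defined by <pardef> and subtract the names referenced by any <par>. The function name is hypothetical. Note that a pardef referenced only from another unused pardef still counts as used here.

```python
import xml.etree.ElementTree as ET

def unused_pardefs(dix_xml):
    """Return pardef names that no <par n="..."/> refers to, in file order."""
    root = ET.fromstring(dix_xml)
    defined = [p.get("n") for p in root.iter("pardef")]
    used = {p.get("n") for p in root.iter("par")}
    return [name for name in defined if name not in used]
```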

Odd tag combinations

E.g. "adj.n" or "n.ij" within one entry – typically wrong.

Also, bidix entries that translate n into ij or adj into det (though here we might want to be able to suppress some combinations in a config file in the language pair).
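
For the bidix case, a simple sketch could compare the first tag on each side of every <p> and report the pairs that differ, unless the pair appears in an allow-list standing in for the config file. Reporting every differing pair is an assumption and would be noisy for some language pairs; the function name and the sample words in the test are illustrative only.

```python
import xml.etree.ElementTree as ET

def pos_mismatches(bidix_xml, allowed=frozenset()):
    """Report (left text, right text, (left tag, right tag)) for odd translations."""
    hits = []
    for p in ET.fromstring(bidix_xml).iter("p"):
        left, right = p.find("l"), p.find("r")
        if left is None or right is None:
            continue
        first_l, first_r = left.find("s"), right.find("s")
        if first_l is None or first_r is None:
            continue  # no tags on one side: nothing to compare
        pair = (first_l.get("n"), first_r.get("n"))
        if pair[0] != pair[1] and pair not in allowed:
            hits.append(("".join(left.itertext()),
                         "".join(right.itertext()), pair))
    return hits
```

Passing allowed={("n", "ij")} would silence that particular combination for a pair where it is known to be intentional.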

Non-existing attributes

In transfer files, one common error is referring to an attribute that does not exist, for instance in the part="" of a <clip>.
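
A sketch of this check: gather the names declared with <def-attr> and report any <clip> whose part is neither declared nor predefined. The set of predefined parts below is an assumption and should be checked against the transfer documentation; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed list of parts that are available without a <def-attr>.
BUILTIN_PARTS = {"lem", "lemh", "lemq", "whole", "tags"}

def unknown_clip_parts(transfer_xml):
    """Return the sorted part names used in <clip> that are not defined."""
    root = ET.fromstring(transfer_xml)
    known = {d.get("n") for d in root.iter("def-attr")} | BUILTIN_PARTS
    parts = {c.get("part") for c in root.iter("clip")}
    return sorted(p for p in parts if p is not None and p not in known)
```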

Frequently asked questions

none yet, ask us something! :)
