Google Summer of Code/Report 2010

From Apertium

Tokenisation with HFST (aikoniv)

apertium-pl-cs (Aha)

apertium-fr-pt (Jalopuera)

This project was mentored by Francis Tyers and Gema Ramírez Sánchez and was worked on by Sean Healy.

apertium-fr-pt has advanced a reasonable amount during the project. The transfer lexicon contains around 16,000 items, which are also reflected in the morphological analysers/generators. The pair is testvoc clean, but does not yet pass a corpus check owing to missing transfer rules -- for example, for verbal participles. Some rules have been worked on but remain incomplete.

The pair used apertium-dixtools to produce a transfer lexicon from the apertium-fr-es (French to Spanish) and apertium-es-pt (Spanish to Portuguese) pairs. This transfer lexicon was then reviewed and corrected. The transfer rules from the apertium-fr-es pair were copied and the Spanish side "translated" to Portuguese, giving an adequate basis for further work. Some extra rules were added for common patterns missing between French and Portuguese.

The GSoC weekly plan was not adhered to, and some parts were not completed. The transfer rules are not sufficiently advanced, and the system is not in a releasable state. No evaluation has been performed. However, sufficient progress has been made to merit a pass grade, as the work required to bring the system to release and evaluate it would take under a week.

apertium-fin-sme (pyry`)

Java runtime port (Kanmuri)

The core components of the runtime have been ported to Java, with all the components scheduled in the proposal completed. This includes a basic text deformatter/reformatter. Some of the "extra tasks" were also completed towards the end.

The Java runtime can accept mode files created for the C++ runtime and substitute its own components for them. Instead of using temp files and pipes to pass text between components, the runtime in what I call "pipeline" mode simply stores the intermediate results in memory and passes them between components as strings, wrapped in Readers and Writers. This way the same code can handle standard I/O, files, and in-memory strings. It also makes it easier to use Apertium as a library from other code, since that code can just pass in strings and get strings back out.
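The idea behind pipeline mode can be sketched in a few lines. This is a hypothetical illustration, not the real Java API: each stage is simply a function from string to string, and chaining them keeps every intermediate result in memory.

```python
# Hypothetical sketch of "pipeline" mode: each stage is a function from
# string to string, and intermediate results stay in memory instead of
# going through temp files or OS pipes. The stage names and bodies are
# illustrative stand-ins, not the real Apertium components.

def analyse(text):
    # stand-in for the morphological analyser
    return text.upper()

def transfer(text):
    # stand-in for the transfer stage
    return text[::-1]

def run_pipeline(stages, text):
    """Feed the output of each stage into the next, all in memory."""
    for stage in stages:
        text = stage(text)
    return text

result = run_pipeline([analyse, transfer], "abc")
```

Because every stage has the same string-in/string-out shape, the same driver works whether the text originally came from standard input, a file, or a caller using the runtime as a library.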

The Java runtime also uses bytecode class files for transfer, instead of parsing XML files on each run. These class files can be pre-compiled or compiled on demand, but the latter requires that the runtime be run inside a JDK JVM and not a JRE one.

A layer was also created to funnel all file-opening calls through, to simplify cross-platform file and pathname issues. This matters especially when Cygwin on Windows is used for the C++ runtime and the Java runtime is fed files such as mode files from it: the Cygwin environment uses paths rooted in the Cygwin installation, which the JVM, using standard Windows paths, cannot translate on its own. The "cygpath" utility (part of the Cygwin runtime) was used to translate between Cygwin paths and the underlying Windows paths, and the conversion is done behind the scenes without the rest of the runtime needing to care whether a path is Unix, Cygwin, or Windows style.
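To give a flavour of the translation involved, here is a pure-Python sketch of the common case cygpath handles: mapping a /cygdrive/-rooted path onto its Windows equivalent. The real runtime shells out to the cygpath utility, which covers many more cases; this function is an assumption for illustration only.

```python
# Illustrative sketch of the kind of translation cygpath performs:
# /cygdrive/c/Users/me  ->  C:\Users\me
# Only the common /cygdrive/<letter>/... form is handled here; the
# real cygpath utility covers mounts, relative paths, and more.

def cygwin_to_windows(path):
    prefix = "/cygdrive/"
    if path.startswith(prefix):
        rest = path[len(prefix):]
        drive, _, tail = rest.partition("/")
        return drive.upper() + ":\\" + tail.replace("/", "\\")
    return path  # already a Windows (or other) path: leave untouched
```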

Things that need to be worked on still include:

  • Getting it packageable and packaged
  • Deformatters/reformatters for other formats, especially HTML
  • Cleaning up exception handling, including (re)moving many System.exit() calls

VM for transfer (darthxaher)

The main goal of this project was to speed up Apertium's transfer system. Currently the transfer system is the main bottleneck for language pairs with complex transfer rules, because of the XML processing involved. The basic idea was:

  • Writing a compiler that generates pseudo-assembly-like code from the transfer XML files
  • Writing a virtual machine that reads instructions from the pseudo-assembly file and runs Apertium's usual transfer mechanism
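The two-step design can be illustrated with a toy example. The instruction names and rule representation below are invented for illustration and do not match the project's actual instruction set; the point is the shape of the approach: flatten the nested rule structure into linear instructions once, then execute that flat program with a small stack machine on every run instead of re-walking the XML tree.

```python
# Toy sketch of the compiler + VM split, with made-up instruction names.

def compile_rule(rule):
    """Flatten a nested (op, args...) tree into linear instructions."""
    code = []
    op = rule[0]
    if op == "concat":
        for child in rule[1:]:
            code.extend(compile_rule(child))
        code.append(("CONCAT", len(rule) - 1))
    elif op == "lit":
        code.append(("PUSH", rule[1]))
    return code

def run_vm(code):
    """Execute the flat instruction list with a simple operand stack."""
    stack = []
    for instr, arg in code:
        if instr == "PUSH":
            stack.append(arg)
        elif instr == "CONCAT":
            parts = stack[len(stack) - arg:]
            del stack[len(stack) - arg:]
            stack.append("".join(parts))
    return stack.pop()

program = compile_rule(("concat", ("lit", "la "), ("lit", "casa")))
```

Compilation cost is paid once as a pre-processing step, which is why the 10-second compile time reported below is acceptable: the VM only ever sees the flat instruction list.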

The compiler was written in Python. Most of the tags from the transfer XML files have been converted to instruction sets, although some case-related instructions, e.g. modify-case and case-of, have not been fully implemented; other than that, the compiler is in a fairly stable state. A 500 KB t1x transfer file takes about 10 seconds to compile into the pseudo-assembly format. Generally this is not a problem, as compilation is a pre-processing step and does not affect the actual runtime. Some refactoring and maintenance is in progress to fix the remaining small issues.

Only very preliminary work has been done on the virtual machine; it would need significantly more work to become full-fledged. Suffice it to say, the GSoC schedule was not maintained throughout.

Easy dictionary maintenance (AlessioJr)

Post-edition tool (unaszole)

Multiword handling (skh)

This project had two goals:

  1. A module to handle contiguous multiwords which have agreement, turning e.g. ^word1<sg>$ ^word2<sg>$ into ^word1word2<sg>$ before transfer. These may be technically possible to write as multiwords in the dictionary, but each entry would require a new paradigm, creating a lot of extra work and redundancy.
  2. A module to handle discontiguous multiwords (such as German particle verbs), turning e.g. ^word1<vblex>$ ^foo$ ^bar$ ^word2<part>$ into ^word1word2<vblex>$ ^foo$ ^bar$.
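The first transformation can be sketched directly on the Apertium stream format. This is a simplified illustration, assuming agreement means the two units carry an identical tag string; the real module's agreement logic is more involved.

```python
import re

# Minimal sketch of goal 1 on the Apertium stream format: two adjacent
# lexical units with identical tag strings are merged into one unit
# before transfer. Treating "agreement" as string equality of the tags
# is a simplifying assumption for illustration.

UNIT = re.compile(r"\^([^<$]+)(<[^$]*)\$ \^([^<$]+)(<[^$]*)\$")

def merge_agreeing(stream):
    def join(m):
        lem1, tags1, lem2, tags2 = m.groups()
        if tags1 == tags2:  # agreement: identical tag strings
            return "^" + lem1 + lem2 + tags1 + "$"
        return m.group(0)   # no agreement: leave both units alone
    return UNIT.sub(join, stream)
```

For example, `^word1<sg>$ ^word2<sg>$` becomes `^word1word2<sg>$`, while a pair with mismatched tags passes through unchanged.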

For the first problem, a tool lt-mwpp ("multiword preprocessor") was written which turns a set of lemmas and multiword templates (in an XML format) into entries in the dix format that would be too numerous to specify by hand, saving the dictionary writer some work. A template may specify, for example, that the two words must agree on case and number; the tool then outputs the Cartesian product of all possible combinations in dix format.
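The expansion step amounts to a Cartesian product over the agreed-upon features. The sketch below is an assumption about the general shape of that step; the tag values and output format are illustrative, not lt-mwpp's actual dix output.

```python
from itertools import product

# Hedged sketch of a template expansion like lt-mwpp's: a template
# saying both words agree on case and number expands into one entry
# per (case, number) combination. Entry format is illustrative only.

def expand(lemma1, lemma2, cases, numbers):
    entries = []
    for case, num in product(cases, numbers):
        tags = "<" + case + "><" + num + ">"
        entries.append("^" + lemma1 + lemma2 + tags + "$")
    return entries
```

Two cases and two numbers already yield four entries per lemma pair, which is why generating them mechanically beats writing each one by hand.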

For the second problem, a module multiword-reorder was written. This is given a list of possible discontiguous multiwords (in an XML format); on reading a possible "first part", it starts buffering words, and if it then sees the "second part" of that multiword, it outputs the joined multiword followed by the buffer; otherwise the buffer is output on reaching an end of sentence. It also allows specifying which parts of speech may appear between the first and second parts.
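The buffering scheme just described can be sketched at the word level. This is a simplified model with a hypothetical interface (plain word lists and a dict mapping first parts to second parts); the real module reads the Apertium stream and an XML multiword dictionary, and additionally filters the buffer by part of speech.

```python
# Sketch of the buffering scheme: on a registered first part, start
# buffering; on the matching second part, emit the joined multiword
# followed by the buffer; at end of sentence, flush unchanged.
# The dict-based interface here is a simplifying assumption.

def reorder(words, pairs):
    """pairs maps each first part to its expected second part."""
    out, buf, pending = [], [], None
    for w in words:
        if pending is None and w in pairs:
            pending = w              # possible multiword: start buffering
        elif pending is not None and w == pairs[pending]:
            out.append(pending + w)  # joined multiword comes first
            out.extend(buf)          # then the buffered words
            buf, pending = [], None
        elif pending is not None:
            buf.append(w)
        else:
            out.append(w)
    if pending is not None:          # end of sentence: flush unmatched
        out.append(pending)
        out.extend(buf)
    return out
```

So a German-style particle verb split as `word1 foo bar word2` comes out as `word1word2 foo bar`, matching the transformation described above.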

Some work still needs to be done. The agreement multiword preprocessor relies on the generator containing all the surface forms found by the analyser -- due to LR restrictions these may not always exist in the generator. It should also merge the dictionaries, to make things easier for the language-pair maintainer. The discontiguous multiword module, since it is meant to be used as a runtime module, should ideally compile the XML multiword dictionary into an FST; it also needs more extensive testing to ensure that formatting is handled correctly. The schedule was not followed completely, and some parts were not completed -- generation of discontiguous multiwords was not touched upon, and the specification for the first module changed a lot. However, the discontiguous multiword analyser should not require much work before a release.