Google Summer of Code/Report 2010

From Apertium
Revision as of 02:09, 21 October 2010 by Kanmuri (talk | contribs) (→‎Java runtime port (Kanmuri): First (and final?) draft of my report.)

Tokenisation with HFST (aikoniv)

apertium-pl-cs (Aha)

apertium-fr-pt (Jalopuera)

This project was mentored by Francis Tyers and Gema Ramírez Sánchez and was worked on by Sean Healy.

apertium-fr-pt has advanced a reasonable amount during the project. The transfer lexicon contains around 16,000 items, which are also reflected in the morphological analysers/generators. The pair is testvoc clean, but does not yet pass a corpus check because of missing rules, for example for verbal participles. Some rules have been worked on, but are in an incomplete state.

The pair used apertium-dixtools to produce a transfer lexicon from the apertium-fr-es (French to Spanish) and apertium-es-pt (Spanish to Portuguese) pairs. This transfer lexicon was then reviewed and fixed. The transfer rules from the apertium-fr-es pair were copied and the Spanish side "translated" to Portuguese. This gives an adequate basis for further work. Some extra rules were added for common patterns missing between French and Portuguese.

The GSoC weekly plan was not followed, and some parts were not completed. The transfer rules are not sufficiently advanced, and the system is not in a releasable state. No evaluation has been performed. However, sufficient progress has been made to earn a pass grade, as the work needed to bring the system to release and evaluate it would take under a week.

apertium-fin-sme (pyry`)

Java runtime port (Kanmuri)

The core components of the runtime have been ported over to Java, with all the components scheduled in the proposal completed. This includes a basic text de/reformatter. Even some of the "extra tasks" were completed at the end.

The Java runtime can accept mode files created for the C++ runtime and substitute its own components for the C++ ones. Instead of using temp files and pipes to pass text between components, the runtime in what I call "pipeline" mode simply stores the intermediate results in memory and passes them between components as strings, wrapped up in Readers and Writers. This way the same code can handle standard I/O, files, and in-memory strings. It also makes it easier to use Apertium as a library from other code, since that code can just pass in strings and get strings back out.
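To illustrate the "pipeline" idea, here is a minimal sketch of stages chained through in-memory strings wrapped in Readers and Writers. The `Stage` interface, the `PipelineSketch` class, and the two toy stages are hypothetical names made up for this example; they are not the runtime's actual classes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical stage interface: every component reads from a Reader
// and writes to a Writer, so it does not care where the text comes from.
interface Stage {
    void process(Reader in, Writer out) throws IOException;
}

class PipelineSketch {
    // Drains a Reader into a String.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Runs the stages in order, keeping each intermediate result in memory
    // as a String instead of using temp files or OS pipes.
    static String run(List<Stage> stages, String input) throws IOException {
        String text = input;
        for (Stage stage : stages) {
            StringWriter out = new StringWriter();
            stage.process(new StringReader(text), out);
            text = out.toString();
        }
        return text;
    }

    public static void main(String[] args) throws IOException {
        // Toy stages standing in for real components such as an analyser.
        Stage upper = (in, out) -> out.write(readAll(in).toUpperCase());
        Stage bracket = (in, out) -> out.write("[" + readAll(in) + "]");
        System.out.println(run(Arrays.asList(upper, bracket), "hello"));
    }
}
```

Because each stage only sees a Reader and a Writer, the same stage could equally be handed a file or standard input; a library caller would only use the string-in, string-out `run` method.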

The Java runtime also uses bytecode class files for transfer, instead of parsing XML files on each run. These class files can be pre-compiled or compiled on demand, but the latter requires that the runtime be run inside a JDK JVM and not a JRE one.
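The JDK requirement can be checked at run time. The sketch below assumes the standard `javax.tools` API is used for this, which the report does not state; `ToolProvider.getSystemJavaCompiler()` returns `null` when no compiler is available, as on a plain JRE. The class name and the strategy strings are hypothetical.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class CompilerCheckSketch {
    // True when the running JVM provides a Java compiler (typically a JDK).
    static boolean compilerAvailable() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    // Hypothetical decision logic: prefer a pre-compiled class file,
    // fall back to on-demand compilation only if a compiler exists.
    static String chooseStrategy(boolean classFileExists, boolean compilerAvailable) {
        if (classFileExists) {
            return "use-precompiled";
        }
        if (compilerAvailable) {
            return "compile-on-demand";
        }
        return "fail-no-compiler";
    }

    public static void main(String[] args) {
        // Pretend no pre-compiled transfer class exists yet.
        System.out.println(chooseStrategy(false, compilerAvailable()));
    }
}
```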

Also, a layer was created to funnel all file-opening calls through, to simplify cross-platform file and pathname issues. This matters especially when Cygwin on Windows is used for the C++ runtime and the Java runtime is fed things such as mode files from it. Such files contain paths rooted in the Cygwin environment, which the JVM cannot translate automatically, as it uses standard Windows paths. The "cygpath" utility (part of the Cygwin runtime) is used to translate between the Cygwin paths and the underlying Windows paths, and the conversion is done behind the scenes, without the rest of the runtime needing to care whether it has been given a Unix, Cygwin, or Windows path.
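A rough sketch of such a translation step is shown below. It shells out to `cygpath -w`, which prints the Windows form of a path. The class name, the method names, and the heuristic for deciding when a path needs translating are assumptions for illustration, not the runtime's actual code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Locale;

class CygpathSketch {
    // Assumed heuristic: on Windows, a path starting with "/" is taken
    // to be rooted in the Cygwin environment.
    static boolean looksLikeCygwinPath(String path, String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows")
                && path.startsWith("/");
    }

    // Asks the cygpath utility for the Windows form of a Cygwin path.
    static String toWindowsPath(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("cygpath", "-w", path).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            if (p.waitFor() != 0 || line == null) {
                throw new IOException("cygpath failed for " + path);
            }
            return line;
        }
    }

    // Single entry point: callers pass any path and get one the JVM can open.
    static String resolve(String path) throws IOException, InterruptedException {
        if (looksLikeCygwinPath(path, System.getProperty("os.name"))) {
            return toWindowsPath(path);
        }
        return path; // Unix paths and native Windows paths pass through unchanged.
    }
}
```

Routing every file open through a method like `resolve` keeps the platform check in one place.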

Things that need to be worked on still include:

  • Getting it packageable and packaged.
  • De/reformatters for other formats, especially HTML.
  • Cleaning up exception handling, including (re)moving many System.exit() calls.

VM for transfer (darthxaher)

Easy dictionary maintenance (AlessioJr)

Post-edition tool (unaszole)

Multiword handling (skh)