Google Summer of Code/Report 2010
Tokenisation with HFST (aikoniv)
Mentored by: Tommi Pirinen, Francis Tyers, and Kevin Unhammer.
This project aimed to allow morphological transducers usable with the HFST tools to be integrated into Apertium translation pipelines. The work primarily involved creating an "hfst-proc" tool providing transducer lookup functionality mirroring that of Apertium's "lt-proc", but which loads transducers in the formats supported by HFST. The lookup tool tokenises the input text on the fly, looking for longest-match hits, thus allowing for multi-word lookups. It is fully aware of the Apertium stream format and handles it the same way that lt-proc does.
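The longest-match behaviour can be illustrated with a toy sketch. This is not hfst-proc's code: a plain Python dict stands in for the transducer, the names `LEXICON` and `tokenise` and all entries are hypothetical, and the output only imitates the `^surface/analysis$` shape of the Apertium stream format.

```python
# Toy lexicon standing in for a transducer (hypothetical entries).
LEXICON = {
    "in": "in<pr>",
    "front": "front<n><sg>",
    "in front of": "in front of<pr>",
    "of": "of<pr>",
    "house": "house<n><sg>",
    "the": "the<det>",
}


def tokenise(text, lexicon=LEXICON):
    """Greedy left-to-right lookup that prefers the longest matching span."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(words), i, -1):
            surface = " ".join(words[i:j])
            if surface in lexicon:
                out.append("^%s/%s$" % (surface, lexicon[surface]))
                i = j
                break
        else:
            # No span starting here is known: emit the word as unknown.
            out.append("^%s/*%s$" % (words[i], words[i]))
            i += 1
    return " ".join(out)


print(tokenise("in front of the house"))
```

Because the longest span is tried first, "in front of" is returned as one multi-word unit rather than as three separate words.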
Additional work included further development on HFST tools allowing for conversion between the HFST backend formats, and initial development on a tool for dumping a list of the strings recognised by an HFST transducer.
The scope of the GSoC proposal turned out to be too conservative, as the proposed work was completed around the middle of the program, but much additional related work was accomplished over the rest of the summer.
This project was mentored by Jim O'Regan and Petr Homola and was worked on by Asia.
apertium-pl-cs has advanced a reasonable amount during the project. The transfer lexicon contains around 10,000 items, which are also reflected in the morphological analysers/generators. The pair is almost testvoc clean -- it is missing some dictionary entries, and some rules. Some rules have been worked on, but are in an incomplete state.
The pair used a number of scripts to produce morphological analysers for apertium from existing morphological analysers for Czech and Polish, along with a lot of manual work. Some transfer rules were added for different phenomena between Polish and Czech.
The GSoC week plan was not followed, and some parts were not completed. The dictionaries are not consistent, the transfer rules are not sufficiently advanced, and the system is not in a releasable state. No evaluation has been performed. However, sufficient progress has been made to merit a pass grade, as bringing the system to release and evaluating it would take a couple of weeks.
This project was mentored by Francis Tyers and Gema Ramírez Sánchez and was worked on by Sean Healy.
apertium-fr-pt has advanced a reasonable amount during the project. The transfer lexicon contains around 16,000 items, which are also reflected in the morphological analysers/generators. The pair is testvoc clean, but does not yet pass a corpus check due to missing rules -- for example for verbal participles. Some rules have been worked on, but are in an incomplete state.
The pair used apertium-dixtools to produce a transfer lexicon from the apertium-fr-es (French to Spanish) and apertium-es-pt (Portuguese to Spanish) pairs. This transfer lexicon was then reviewed and fixed. The transfer rules from the apertium-fr-es pair were copied and the Spanish side "translated" to Portuguese. This gives an adequate basis for further work. Some extra rules were added for common patterns missing between French and Portuguese.
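The idea of deriving a French-Portuguese lexicon from two pairs that share Spanish can be sketched as a join on the Spanish side. This is only a conceptual illustration with plain dicts and invented entries; it does not use or reproduce apertium-dixtools, and the function name `cross` is hypothetical.

```python
def cross(fr_es, es_pt):
    """Join two bilingual word lists on their shared Spanish side."""
    fr_pt = {}
    for fr, es_words in fr_es.items():
        for es in es_words:
            for pt in es_pt.get(es, []):
                targets = fr_pt.setdefault(fr, [])
                if pt not in targets:
                    targets.append(pt)
    return fr_pt


# Invented sample entries: French -> Spanish and Spanish -> Portuguese.
fr_es = {"chat": ["gato"], "maison": ["casa"], "livre": ["libro", "libra"]}
es_pt = {"gato": ["gato"], "casa": ["casa"], "libro": ["livro"], "libra": ["libra"]}

print(cross(fr_es, es_pt))
```

An ambiguous source word such as "livre" ends up with more than one candidate, which is one reason a crossed lexicon has to be reviewed and fixed by hand afterwards.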
The GSoC week plan was not followed, and some parts were not completed. The transfer rules are not sufficiently advanced, and the system is not in a releasable state. No evaluation has been performed. However, sufficient progress has been made to merit a pass grade, as bringing the system to release and evaluating it would take under a week.
This project was mentored by Francis Tyers and was worked on by Tihomir Rangelov.
Before the project began, some work had been done on all three dictionaries, but much of it had to be revised, especially the existing verb paradigms for Bulgarian. Much of the work on the bilingual dictionary was done by hand, except for some proper names and some nouns, which were generated semi-automatically. The monolingual dictionaries also required a lot of manual work, although more entries could be generated automatically for them. The parallel Bulgarian-Macedonian SETimes corpus was used extensively for frequency lists and for lookups for common translations or automatic generation of entries.
The current version 0.2.0 has 8,693 entries in the Macedonian monolingual dictionary, 8,467 entries in the Bulgarian monolingual dictionary and 8,743 entries in the bilingual dictionary. It also has 33 rules for syntactic transfer from Macedonian to Bulgarian and 25 transfer rules for the other direction. As part of the project a Constraint Grammar was also written for Macedonian (currently with 41 rules) and one for Bulgarian was started (currently only four rules).
Evaluation was made for the current version 0.2.0 against 60 sentences (1154 words) from different articles in the Macedonian SETimes corpus. The WER and PWER rates were 21.32% and 18.98% respectively, which is acceptable but worse than Google Translate's 10.66% and 8.51% respectively. These figures are only for translation from Macedonian to Bulgarian. Evaluation for translation in the other direction has not been performed yet. Besides, it is worth mentioning that another evaluation against a "fair" corpus might have to be made, as it appears as though Google Translate uses parts of the SETimes corpus to train its application.
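For readers unfamiliar with the metric, word error rate is the word-level edit distance between the system output and a reference translation, divided by the length of the reference. The sketch below is a generic textbook implementation, not the evaluation script used for the figures above, and it does not cover PWER; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


print(word_error_rate("a b c d", "a x c d"))
```

The reference is assumed to be non-empty; one substituted word out of four gives a rate of 0.25.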
Java runtime port (Kanmuri)
The core components of the runtime have been ported over to Java, with all the components scheduled in the proposal completed. This includes a basic text de/reformatter. Even some of the "extra tasks" were completed at the end. The code also uses UTF-8 throughout, assumes UTF-8, and if unable to use UTF-8 for whatever reason, will complain and then die.
This project built on the partially-completed Lttoolbox-java project from GSoC 2009, and the existing Java bytecode transfer code. The former involved cleaning up some of the existing code, getting the tagger into a runnable state, and squashing bugs in it. It was decided, however, that tagger training was low priority for the Java port, so the tagger training code was left unfinished.
The Java runtime can accept mode files created for the C++ runtime and run its own components in place of the C++ ones. Instead of using temp files and pipes to pass text between components, the runtime in what I call "pipeline" mode simply stores the intermediate results in memory and passes them between components as strings, wrapped up in Readers and Writers. This way the same code can handle standard I/O, files, and in-memory strings. It also makes it easier to use Apertium as a library from other code, since that code can just pass in strings and get strings back out. Finally, it eliminates the mojibake (garbled character encoding) issues that were encountered during development before the switch from pure byte streams to Readers and Writers (which handle character encoding for you).
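The in-memory "pipeline" idea can be sketched as follows. This is a Python analogy rather than the Java code: `io.StringIO` plays the role of the Readers and Writers, and the two stages are hypothetical placeholders, not Apertium components.

```python
import io


def upper_stage(reader, writer):
    """Hypothetical stage: upper-cases everything it reads."""
    writer.write(reader.read().upper())


def exclaim_stage(reader, writer):
    """Hypothetical stage: appends an exclamation mark."""
    writer.write(reader.read() + "!")


def run_pipeline(stages, text):
    """Feed text through each stage, keeping intermediate results in memory."""
    for stage in stages:
        reader = io.StringIO(text)
        writer = io.StringIO()
        stage(reader, writer)
        text = writer.getvalue()
    return text


print(run_pipeline([upper_stage, exclaim_stage], "hello"))
```

Since each stage only sees file-like objects, the same stage function could equally be handed an open file or standard input and output.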
The Java runtime also uses bytecode class files for transfer, instead of parsing XML files on each run. These class files can be pre-compiled or compiled on demand, but the latter requires that the runtime be run inside a JDK JVM and not a JRE one.
This was built upon the existing bytecode transfer code, and was expanded, with Jacob's help, to be used for all stages of transfer (pretransfer, interchunk, postchunk) in the Java runtime. Previously the transfer bytecode files had to be pre-compiled, but I added dynamic compilation code. If it cannot find the compiled transfer files in the same directory as the source XML files, or in the system temp directory, it attempts to compile them on the fly and place them in one of those directories, with the system temp directory being the fallback.
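The lookup order described above can be sketched as below. It is a Python illustration, not the Java implementation. Several details are assumptions made for the example: the compiled file is named after the XML file with a ".class" suffix, "fallback" is interpreted as "use the temp directory when the source directory is not writable", and `compile_fn` is a hypothetical callback standing in for the real compiler.

```python
import os
import tempfile


def find_or_compile(xml_path, compile_fn, temp_dir=None):
    """Return the path of the compiled file, compiling it if none exists."""
    temp_dir = temp_dir or tempfile.gettempdir()
    # Assumed naming scheme: rules.t1x -> rules.class
    base = os.path.splitext(os.path.basename(xml_path))[0] + ".class"
    source_dir = os.path.dirname(os.path.abspath(xml_path))

    # 1. Look next to the source XML first, then in the temp directory.
    for directory in (source_dir, temp_dir):
        candidate = os.path.join(directory, base)
        if os.path.exists(candidate):
            return candidate

    # 2. Nothing found: compile on the fly. Prefer the source directory
    #    and fall back to the temp directory (assumed writability check).
    target_dir = source_dir if os.access(source_dir, os.W_OK) else temp_dir
    target = os.path.join(target_dir, base)
    compile_fn(xml_path, target)
    return target
```

A second call with the same XML path finds the file produced by the first call and skips compilation.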
Also, a layer was created to funnel all file opening calls through, to simplify cross-platform file and pathname issues. This mattered especially when Cygwin on Windows was used for the C++ runtime and the Java runtime was fed things such as mode files from it. The Cygwin environment would use paths rooted in the Cygwin environment, which the JVM couldn't automatically translate, as it used standard Windows paths. The "cygpath" utility (part of the Cygwin runtime) was used to translate between the Cygwin paths and the underlying Windows paths, and the conversion is done behind the scenes, without the runtime needing to care whether it is a Unix, Cygwin, or Windows path.
For the de/reformatter, I created a standardized interface for formatters using an abstract class that can be easily extended for other formats such as HTML. However, its architecture may need to be re-examined if it is determined that binary formats that need to be unpacked and then repacked by the formatter are incompatible with the current approach, as it was built with text files in mind.
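The abstract-class approach might look roughly like the following, shown in Python rather than Java. The class and method names are hypothetical, and the plain-text subclass is a toy that only backslash-escapes a handful of characters assumed to be reserved in the stream; it is not the behaviour of the actual de/reformatter.

```python
from abc import ABC, abstractmethod


class Formatter(ABC):
    """Hypothetical base class that every format-specific subclass extends."""

    @abstractmethod
    def deformat(self, text):
        """Prepare raw document text for the translation pipeline."""

    @abstractmethod
    def reformat(self, text):
        """Undo deformat() on the pipeline output."""


class PlainTextFormatter(Formatter):
    # Characters assumed to need escaping in the stream (illustrative set).
    RESERVED = "\\^$/<>{}[]@"

    def deformat(self, text):
        return "".join("\\" + c if c in self.RESERVED else c for c in text)

    def reformat(self, text):
        out = []
        chars = iter(text)
        for c in chars:
            # A backslash means "take the next character literally".
            out.append(next(chars, "") if c == "\\" else c)
        return "".join(out)


fmt = PlainTextFormatter()
print(fmt.reformat(fmt.deformat("cost: $5 [approx]")))
```

A formatter for another format such as HTML would subclass `Formatter` and supply its own pair of methods, leaving the calling code unchanged.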
Things that need to be worked on still include:
- Getting it packageable and packaged
- De/reformatters for other formats, especially HTML
- Cleaning up exception handling, including (re)moving many System.exit() calls
- Tagger training (if there's a call for it)
VM for transfer (darthxaher)
The main goal of this project was to speed up Apertium's transfer system. Currently the transfer system becomes the main bottleneck for language pairs with complex transfer systems, because of the XML processing associated with it. The basic idea was:
- Writing a compiler that generates pseudo-assembly like code from the XML files
- Writing a virtual machine that can read instructions from the pseudo-assembly file and run Apertium's usual transfer mechanism.
The compiler was written in Python. Most of the tags from transfer XML files have been converted to instruction sets, although some case-related instructions (e.g. modify-case, case-of) have not been properly implemented. Other than that, the compiler is in a fairly stable state. A 500 KB t1x transfer file takes 10 seconds to compile into the pseudo-assembly format. Generally this is not a problem, as it is only a pre-processing step and does not affect the actual runtime. Some refactoring and maintenance is in progress to fix the remaining small issues.
Very preliminary work has been done on the virtual machine. It would need significantly more work to become a fully fledged VM. Suffice it to say, the GSoC schedule was not kept to throughout.
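The two-step idea (compile XML once, then interpret flat instructions) can be sketched with a toy example. Nothing here reflects the project's actual instruction set or the real transfer file format: the three tags, the opcode names and the functions `compile_node` and `run` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def compile_node(node, code):
    """Emit instructions for one element, children first (post-order)."""
    if node.tag == "lit":
        code.append(("push", node.get("v")))
    elif node.tag == "var":
        code.append(("load", node.get("n")))
    elif node.tag == "out":
        for child in node:
            compile_node(child, code)
        code.append(("concat", len(node)))
        code.append(("out", None))
    else:
        raise ValueError("unsupported tag: %s" % node.tag)
    return code


def run(code, variables):
    """Execute the instruction list on a simple stack machine."""
    stack, output = [], []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        elif op == "load":
            stack.append(variables[arg])
        elif op == "concat":
            # Pop the top `arg` items and push them back as one string.
            start = len(stack) - arg
            parts = stack[start:]
            del stack[start:]
            stack.append("".join(parts))
        elif op == "out":
            output.append(stack.pop())
        else:
            raise ValueError("unknown instruction: %s" % op)
    return "".join(output)


rule = ET.fromstring('<out><lit v="^"/><var n="lemma"/><lit v="&lt;n&gt;$"/></out>')
program = compile_node(rule, [])
print(program)
print(run(program, {"lemma": "house"}))
```

The XML is parsed only in `compile_node`; `run` works on the flat instruction list alone, which is the property the project relied on for speed.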
Easy dictionary maintenance (AlessioJr)
This project was mentored by Mikel L. Forcada and carried out by Alessio.
Alessio had the difficult task of creating a dictionary management program for Apertium. Apertium has very heterogeneous dictionaries, based on XML. A lot of work was done to import the dictionaries into, and export them from, a database, which was chosen for speed. Unfortunately, because of the difficulty of the task and despite all the work put in, there is not yet a releasable system.
The work merits a pass mark for the difficulty of the task and the amount of work put in.
Pre and post-editing environment (unaszole)
This project was carried out by Arnaud Vié, with Luis Villarejo, Jimmy O'Regan and Mireia Farrús acting as mentors.
While the Apertium machine translation engine is becoming more and more accurate in its translations, it still lacks one important aspect of machine translation: the ability to get feedback on the quality of the translation and on how to make it better. Therefore, the initial aim of the project was to build a post-editing environment in which users can correct the translation they obtained, with their changes logged so that the system can benefit from that feedback. However, the old web interface was fundamentally inadequate for such a task, as it took care of the whole translation process at once. Here, it was necessary to pause before actually rebuilding the document, so that the user could check for mistakes and correct them in the translation output, all within the web interface so that it could log the edits. It therefore proved necessary to build a whole new interface, in which more tools could eventually be included. Thus, the main goal of this project was to construct a pre- and post-editing environment for Apertium. We broke down the project into several sub-goals:
- Post-editing interface integrated with Apertium translation toolbox. Accomplished. Copy & paste functionality only working with Firefox.
- Spell checking on source and target languages. Integration with Aspell accomplished.
- Grammar checking on source and target languages. Integration with LanguageTool accomplished.
- Word translation using external dictionaries. Integration with several external dictionaries accomplished.
- Search & replace functionalities on source and target languages. Accomplished. Replace function works in 'case sensitive', 'case insensitive' and 'apply source case' modes.
- Ability to deal with formatted text. While you edit the text, you will notice there are some "empty" characters that you can't delete: those contain the formatting information of the original document, which you cannot edit during this translation process. Accomplished for OpenOffice formats.
- Logging system. Accomplished. All events are logged as they happen, i.e. at the very moment the user inserts or deletes text. This allows for later data mining on the edits to detect commonly modified structures in a given translation pair.
- Translation memory generation. Integration of Maligna accomplished.
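The three replace modes mentioned in the list above can be sketched as follows. This is not the interface's code. In particular, the reading of 'apply source case' as "give the replacement the casing of the text it replaces" is an assumption, and the function names are hypothetical.

```python
import re


def apply_case(source, replacement):
    """Copy the casing pattern of source onto replacement (assumed rule)."""
    if source.isupper():
        return replacement.upper()
    if source[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement


def replace(text, search, replacement, mode="case sensitive"):
    """Replace every occurrence of search according to the chosen mode."""
    if mode == "case sensitive":
        return text.replace(search, replacement)
    pattern = re.compile(re.escape(search), re.IGNORECASE)
    if mode == "case insensitive":
        # A function is used so the replacement is inserted literally.
        return pattern.sub(lambda m: replacement, text)
    if mode == "apply source case":
        return pattern.sub(lambda m: apply_case(m.group(0), replacement), text)
    raise ValueError("unknown mode: %s" % mode)


for mode in ("case sensitive", "case insensitive", "apply source case"):
    print(mode, "->", replace("Cat cat CAT", "cat", "dog", mode))
```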
Multiword handling (skh)
This project had two goals:
- A module to handle contiguous multiwords which have agreement, turning e.g. two adjacent agreeing words into
^word1word2<sg>$ before transfer. These may be technically possible to write as multiwords in the dictionary, but each entry would require a new paradigm, creating a lot of extra work and redundancy.
- A module to handle discontiguous multiwords (such as German particle verbs), turning e.g. the verb word1 followed by
^foo$ ^bar$ ^word2<part>$ into
^word1word2<vblex>$ ^foo$ ^bar$.
For the first problem, a tool ("multiword-preprocessor") was written which turns a set of lemmas and multiword templates (in an XML format) into entries in the dix format which would be too numerous to specify by hand, thus saving the dictionary writer some work. A template may specify that the two words have to agree on case and number, and the tool then outputs the Cartesian product of all possible combinations in dix format.
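The expansion step can be pictured with a small sketch. It does not reproduce the tool's XML template format or its dix output; the tag inventories and the function name `expand` are hypothetical, and the result is shown as plain tuples.

```python
from itertools import product

# Hypothetical tag inventories that a template could ask two words to agree on.
CASES = ["nom", "acc"]
NUMBERS = ["sg", "pl"]


def expand(lemma1, lemma2, cases=CASES, numbers=NUMBERS):
    """Yield one agreeing pair of analyses per case/number combination."""
    for case, number in product(cases, numbers):
        tags = "<%s><%s>" % (case, number)
        yield (lemma1 + tags, lemma2 + tags)


for pair in expand("word1", "word2"):
    print(pair)
```

Two cases and two numbers already give four entries per lemma pair, which shows how quickly the entries become too numerous to write by hand.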
For the second problem, a module was written. This is given a list of possible discontiguous multiwords (in an XML format); on reading a possible "first-part", it starts buffering words, and if it sees the "second-part" of that multiword it will output the multiword and then the buffer; otherwise the buffer is output on seeing an end-of-sentence. It also allows for specification of what types of part of speech may appear between the first and second parts.
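The buffering strategy can be sketched as below. This is an illustration only, not the module's code: tokens are simplified to (lemma, part-of-speech) tuples instead of the stream format, the multiword list is a dict rather than XML, a full stop is assumed as the end-of-sentence marker, and all names are hypothetical.

```python
def join_discontiguous(tokens, multiwords, allowed_between=None, eos="."):
    """tokens: list of (lemma, pos); multiwords: {(first, second): joined}."""
    firsts = {first for first, _ in multiwords}
    out, pending, buffer = [], None, []

    def flush():
        # Give up on the pending first part: emit it and the buffer unchanged.
        nonlocal pending, buffer
        if pending is not None:
            out.append(pending)
            out.extend(buffer)
        pending, buffer = None, []

    for token in tokens:
        lemma, pos = token
        if pending is not None:
            if (pending[0], lemma) in multiwords:
                # Second part seen: emit the joined multiword, then the buffer.
                out.append((multiwords[(pending[0], lemma)], pending[1]))
                out.extend(buffer)
                pending, buffer = None, []
                continue
            blocked = allowed_between is not None and pos not in allowed_between
            if lemma == eos or blocked:
                flush()  # then handle this token normally below
            else:
                buffer.append(token)
                continue
        if lemma in firsts:
            pending = token
        else:
            out.append(token)
    flush()
    return out


sentence = [("word1", "vblex"), ("foo", "n"), ("bar", "adv"),
            ("word2", "part"), (".", "sent")]
print(join_discontiguous(sentence, {("word1", "word2"): "word1word2"}))
```

If the second part never arrives before the end of the sentence, or a disallowed part of speech intervenes, the tokens are passed through unchanged.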
Some work still needs to be done. The agreement multiword preprocessor relies on the generator to find all the possible surface forms in the analyser -- due to LR restrictions these may not always exist in the generator. It also should merge the dictionaries, to make it easier on the language pair maintainer. The discontiguous multiword module, as it is meant to be used as a runtime module, should ideally compile the XML multiword dictionary into an FST. It also needs more extensive testing to ensure that formatting is handled correctly. The schedule was not followed completely, and some parts were not completed -- generation of discontiguous multiwords was not touched upon, and the specification for the first module changed a lot. However, the discontiguous multiword analyser should not require that much work before a release.