Google Summer of Code/Report 2010
Revision as of 00:32, 8 November 2010
Tokenisation with HFST (aikoniv)
apertium-pl-cs (Aha)
apertium-fr-pt (Jalopuera)
This project was mentored by Francis Tyers and Gema Ramírez Sánchez and was worked on by Sean Healy.
apertium-fr-pt has advanced a reasonable amount during the project. The transfer lexicon contains around 16,000 items, which are also reflected in the morphological analysers/generators. The pair is testvoc clean, but does not yet pass a corpus check due to missing rules -- for example for verbal participles. Some rules have been worked on, but are in an incomplete state.
The pair used apertium-dixtools to produce a transfer lexicon from the apertium-fr-es (French to Spanish) and apertium-es-pt (Spanish to Portuguese) pairs. This transfer lexicon was then reviewed and fixed. The transfer rules from the apertium-fr-es pair were copied and the Spanish side "translated" to Portuguese. This gives an adequate basis for further work. Some extra rules were added for common patterns missing between French and Portuguese.
The GSoC weekly plan was not followed, and some parts were not completed. The transfer rules are not sufficiently advanced, and the system is not in a releasable state. No evaluation has been performed. However, enough progress has been made to earn a pass grade, as bringing the system to release and evaluating it would take under a week of work.
apertium-mk-bg (tihomir)
apertium-fin-sme (pyry`)
Java runtime port (Kanmuri)
The core components of the runtime have been ported over to Java, with all the components scheduled in the proposal completed. This includes a basic text de/reformatter. Even some of the "extra tasks" were completed at the end. The code uses and assumes UTF-8 throughout; if it is unable to use UTF-8 for whatever reason, it will complain and then die.
This project built on the partially-completed Lttoolbox-java project from GSoC 2009, and the existing Java bytecode transfer code. The former involved cleaning up some of the existing code, getting the tagger into a runnable state, and squashing bugs in it. It was decided, however, that tagger training was low priority for the Java port, so the tagger training code was left unfinished.
The Java runtime can accept mode files created for the C++ runtime and substitute its own components for the C++ ones. Instead of using temp files and pipes to pass text between components, the runtime in what I call "pipeline" mode simply stores the intermediate results in memory and passes them between components as strings, wrapped up in Readers and Writers. This way the same code can handle standard I/O, files, and in-memory strings. This also makes it easier to use Apertium as a library for other code, since that code can just pass in strings and get strings back out. It also eliminates the mojibake (garbled character encoding) issues that were encountered during development before the switch from pure byte streams to Readers and Writers (which handle character encoding for you).
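The in-memory "pipeline" idea can be sketched roughly as follows. This is an illustrative Python analogue, not the Java runtime's code: `io.StringIO` stands in for Java's Readers and Writers, and the two stage functions and the helper names (`run_pipeline`, `translate_string`) are hypothetical placeholders for real components.

```python
import io


def upper_stage(reader, writer):
    """Hypothetical stand-in for one component: text in, text out."""
    for line in reader:
        writer.write(line.upper())


def exclaim_stage(reader, writer):
    """Second hypothetical component."""
    for line in reader:
        writer.write(line.rstrip("\n") + "!\n")


def run_pipeline(stages, reader, writer):
    """Run the stages in order, holding intermediate results in memory."""
    if not stages:
        raise ValueError("at least one stage is required")
    current = reader
    for stage in stages[:-1]:
        buffer = io.StringIO()   # in-memory replacement for a temp file or pipe
        stage(current, buffer)
        buffer.seek(0)           # rewind so the next stage reads from the start
        current = buffer
    stages[-1](current, writer)  # the last stage writes to the real destination


def translate_string(stages, text):
    """Library-style entry point: string in, string out."""
    out = io.StringIO()
    run_pipeline(stages, io.StringIO(text), out)
    return out.getvalue()
```

Because every stage only sees text-stream objects, the same `run_pipeline` call would work with `sys.stdin`/`sys.stdout` or with files opened in text mode.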
The Java runtime also uses bytecode class files for transfer, instead of parsing XML files on each run. These class files can be pre-compiled or compiled on demand, but the latter requires that the runtime be run inside a JDK JVM and not a JRE one.
This was built upon the existing bytecode transfer code, and was expanded, with Jacob's help, to be used for all stages of transfer (pretransfer, interchunk, postchunk) in the Java runtime. Previously the transfer bytecode files had to be pre-compiled, but I added dynamic compilation code which, if it cannot find the compiled transfer files in the same directory as the source XML files, or in the system temp directory, will then attempt to compile them on the fly and place them in one of the previously mentioned directories, with the system temp directory being the fallback.
Also, a layer was created to funnel all file opening calls through, to simplify cross-platform file and pathname issues. This matters especially when Cygwin on Windows is used for the C++ runtime and the Java runtime is fed things such as mode files from it. The Cygwin environment uses paths rooted in the Cygwin environment, which the JVM cannot automatically translate, as it uses standard Windows paths. The "cygpath" utility (part of the Cygwin runtime) is used to translate between the Cygwin paths and the underlying Windows paths, and the conversion is done behind the scenes without the runtime needing to care whether it is a Unix, Cygwin, or Windows path.
For the de/reformatter, I created a standardized interface for formatters using an abstract class that can be easily extended for other formats such as HTML. However, its architecture may need to be re-examined if it is determined that binary formats that need to be unpacked and then repacked by the formatter are incompatible with the current approach, as it was built with text files in mind.
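The lookup order described above (beside the source XML, then the system temp directory, then compile on demand) might look roughly like this. It is a sketch under assumptions: the function name `find_or_compile`, the `.class` naming derived from the XML file's stem, and the `compile_fn` callback are all hypothetical and do not reflect the Java runtime's actual naming or API.

```python
import tempfile
from pathlib import Path


def find_or_compile(source_xml, compile_fn, temp_dir=None):
    """Return the path of a compiled transfer file, compiling it if needed.

    compile_fn(source, target) is a hypothetical callback that writes the
    compiled file to `target` and raises OSError if it cannot.
    """
    source = Path(source_xml)
    temp = Path(temp_dir or tempfile.gettempdir())
    name = source.stem + ".class"  # assumed naming scheme, for illustration only
    candidates = [source.parent / name, temp / name]

    # 1. Reuse an already compiled file: source directory first, then temp.
    for candidate in candidates:
        if candidate.is_file():
            return candidate

    # 2. Otherwise compile on the fly, with the temp directory as the fallback.
    for target in candidates:
        try:
            compile_fn(source, target)
            return target
        except OSError:
            continue  # e.g. a read-only source directory; try the next location
    raise RuntimeError(f"could not compile {source}")
```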
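As a rough illustration of such a path-translation layer, the sketch below handles the common `/cygdrive/<letter>/...` prefix in pure Python and defers to `cygpath -w` for other absolute paths when the utility is on the PATH. The function names are hypothetical, and this is not the Java runtime's code.

```python
import re
import shutil
import subprocess

_CYGDRIVE = re.compile(r"^/cygdrive/([a-zA-Z])(/.*)?$")


def cygdrive_to_windows(path):
    """Convert /cygdrive/c/foo to C:\\foo; return None for any other path."""
    match = _CYGDRIVE.match(path)
    if match is None:
        return None
    drive = match.group(1).upper()
    rest = match.group(2) or "/"
    return drive + ":" + rest.replace("/", "\\")


def to_native_path(path):
    """Single entry point that all file-opening code would call."""
    converted = cygdrive_to_windows(path)
    if converted is not None:
        return converted
    # Paths such as /usr/share are rooted in the Cygwin installation, so only
    # cygpath itself knows where they live on the Windows side.
    if path.startswith("/") and shutil.which("cygpath"):
        result = subprocess.run(
            ["cygpath", "-w", path], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    return path  # already a native (Unix or Windows) path
```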
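A formatter interface of this kind could be sketched as below. This is a Python analogue rather than the Java class itself: the class names are hypothetical, and the set of reserved characters is an illustrative assumption about which characters need escaping in the stream format. A real deformatter does more (for example, wrapping format information so that later stages ignore it).

```python
import io
from abc import ABC, abstractmethod

# Characters assumed (for illustration) to be reserved in the stream format.
RESERVED = set("[]{}^$/\\<>@")


class Formatter(ABC):
    """Common interface: each document format supplies both directions."""

    @abstractmethod
    def deformat(self, reader, writer):
        """Turn a source document into text the pipeline can process."""

    @abstractmethod
    def reformat(self, reader, writer):
        """Turn pipeline output back into the original document format."""


class PlainTextFormatter(Formatter):
    def deformat(self, reader, writer):
        for ch in reader.read():
            if ch in RESERVED:
                writer.write("\\")  # escape reserved characters
            writer.write(ch)

    def reformat(self, reader, writer):
        text = reader.read()
        i = 0
        while i < len(text):
            if text[i] == "\\" and i + 1 < len(text):
                i += 1              # drop the escape, keep the escaped character
            writer.write(text[i])
            i += 1
```

An HTML formatter would subclass `Formatter` in the same way; a binary format would first need an unpacking step, which is the case the paragraph above flags as a possible mismatch.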
Things that need to be worked on still include:
- Getting it packageable and packaged
- De/reformatters for other formats, especially HTML
- Cleaning up exception handling, including (re)moving many System.exit() calls
- Tagger training (if there's a call for it)
VM for transfer (darthxaher)
The main goal of this project was to speed up Apertium's transfer system. Currently, for language pairs with complex transfer rules, the transfer stage becomes the main bottleneck because of the XML processing associated with it. The basic idea was:
- Writing a compiler that generates pseudo-assembly like code from the XML files
- Writing a virtual machine that can read instructions from the pseudo-assembly file and run Apertium's usual transfer mechanism.
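To make the two steps concrete, here is a toy sketch of the compile-then-interpret idea. Everything in it is hypothetical: the XML fragment is a simplified subset loosely modelled on transfer output elements, and the instruction names (`push`, `clip`, `out`) are invented for illustration and are not the project's actual instruction set.

```python
import xml.etree.ElementTree as ET


def compile_out(xml_text):
    """Compile a tiny <out> element into a list of (opcode, argument) pairs."""
    root = ET.fromstring(xml_text)
    code = []
    for child in root:
        if child.tag == "lit":
            code.append(("push", child.get("v")))          # literal string
        elif child.tag == "clip":
            code.append(("clip", int(child.get("pos"))))   # word at position
        else:
            raise ValueError(f"unsupported tag: {child.tag}")
    code.append(("out", len(root)))  # emit the top N stack items, in order
    return code


def run(code, words):
    """Minimal stack machine that executes the compiled instructions."""
    stack, output = [], []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        elif op == "clip":
            stack.append(words[arg - 1])  # positions are 1-based
        elif op == "out":
            start = len(stack) - arg
            output.append("".join(stack[start:]))
            del stack[start:]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return "".join(output)
```

The point of the split is that XML parsing happens once, at compile time; at run time the machine only walks a flat instruction list.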
The compiler was written in Python. Most of the tags from transfer XML files have been converted to instruction sets, although some case-related instructions (e.g. modify-case, case-of) have not been properly implemented. Other than that, the compiler is in a fairly stable state. A 500 KB t1x transfer file takes 10 seconds to compile into the pseudo-assembly format. This is generally not a problem, as it is only a pre-processing step and does not affect the actual runtime. Some refactoring and maintenance is in progress to fix the remaining small issues.
Only very preliminary work has been done on the virtual machine, and it needs significantly more work to become fully fledged. Suffice it to say, the GSoC deadlines were not met.
Easy dictionary maintenance (AlessioJr)
Pre and post-editing environment (unaszole)
This project was carried out by Arnaud Vié, with Luis Villarejo, Jimmy O'Regan and Mireia Farrús acting as mentors.
While the Apertium machine translation engine is becoming more and more accurate in its translations, it still lacks one important aspect of machine translation: the ability to get feedback on the quality of the translation and how to make it better. Therefore, the initial aim of the project was to build a post-editing environment in which users can correct the translation they obtained, and in which the changes they make are logged so that the system can benefit from that feedback.
However, the old web interface was fundamentally inadequate for such a task, as it took care of the whole translation process at once. Here, it was necessary to pause before actually rebuilding the document, so that the user could check for mistakes and correct them in the translation output, all within the web interface so that it could log the edits. That is why it proved necessary to build a whole new interface, in which more tools could eventually be included. Thus, the main goal of this project was to construct a pre and post-editing environment for Apertium. We broke down the project into several sub-goals:
- Post-editing interface integrated with Apertium translation toolbox. Accomplished. Copy & paste functionality works only in Firefox.
- Spell checking on source and target languages. Integration with Aspell accomplished.
- Grammar checking on source and target languages. Integration with LanguageTool accomplished.
- Word translation using external dictionaries. Integration with several external dictionaries accomplished.
- Search & replace functionalities on source and target languages. Accomplished. Replace function works in 'case sensitive', 'case insensitive' and 'apply source case' modes.
- Ability to deal with formatted text. While you edit the text, you will notice there are some "empty" characters that you can't delete: these contain the formatting information of the original document, which you cannot edit during the translation process. Accomplished for OpenOffice formats.
- Logging system. Accomplished. All events are logged as they happen, i.e. at the very moment the user inserts or deletes text. This allows for later data mining of the edits to detect commonly modified structures in a given translation pair.
- Translation memory generation. Integration of Maligna accomplished.
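The three replace modes listed above can be illustrated with a short sketch. The interpretation of 'apply source case' used here, where the replacement takes on the capitalisation pattern of the text it replaces, is an assumption, and the function names are hypothetical rather than taken from the project's code.

```python
import re


def apply_case(source, replacement):
    """Give `replacement` the capitalisation pattern of `source`."""
    if source.isupper():
        return replacement.upper()
    if source[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement


def replace(text, search, replacement, mode="case sensitive"):
    pattern = re.escape(search)  # treat the search term as literal text
    if mode == "case sensitive":
        return re.sub(pattern, lambda m: replacement, text)
    if mode == "case insensitive":
        return re.sub(pattern, lambda m: replacement, text, flags=re.IGNORECASE)
    if mode == "apply source case":
        return re.sub(
            pattern,
            lambda m: apply_case(m.group(0), replacement),
            text,
            flags=re.IGNORECASE,
        )
    raise ValueError(f"unknown mode: {mode}")
```

A function is passed as the replacement in every mode so that backslashes in the replacement text are inserted literally.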
Multiword handling (skh)
This project had two goals:
- A module to handle contiguous multiwords which have agreement, turning e.g. ^word1<sg>$ ^word2<sg>$ into ^word1word2<sg>$ before transfer. These may be technically possible to write as multiwords in the dictionary, but each entry would require a new paradigm, creating a lot of extra work and redundancy.
- A module to handle discontiguous multiwords (such as German particle verbs), turning e.g. ^word1<vblex>$ ^foo$ ^bar$ ^word2<part>$ into ^word1word2<vblex>$ ^foo$ ^bar$.
For the first problem, a tool lt-mwpp ("multiword-preprocessor") was written which turns a set of lemmas and multiword templates (in an XML format) into entries in the dix format which would be too numerous to specify by hand, thus saving the dictionary writer some work. A template may specify that the two words have to agree on case and number, and outputs the cartesian product of all possible combinations in dix-format.
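The expansion step can be pictured with a small sketch. It only illustrates the cartesian-product idea: the function name, the feature dictionary and the output shape are hypothetical, and it produces plain Python records rather than lt-mwpp's actual XML template input or dix output.

```python
from itertools import product


def expand_template(lemma1, lemma2, agreement):
    """List every agreeing combination for a two-word multiword.

    `agreement` maps a feature name to its possible values; both words
    must carry the same value for each feature.
    """
    names = list(agreement)
    entries = []
    for values in product(*(agreement[name] for name in names)):
        tags = "".join(f"<{value}>" for value in values)
        entries.append(
            {
                "parts": (f"{lemma1}{tags}", f"{lemma2}{tags}"),
                "multiword": f"{lemma1} {lemma2}{tags}",
            }
        )
    return entries
```

With two cases and two numbers this already yields four entries per multiword, which is why writing them by hand does not scale.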
For the second problem, a module multiword-reorder was written. This is given a list of possible discontiguous multiwords (in an XML format); on reading a possible "first-part", it starts buffering words, and if it sees the "second-part" of that multiword it will output the multiword and then the buffer; otherwise the buffer is output on seeing an end-of-sentence. It also allows for specification of what types of part of speech may appear between the first and second part.
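The buffering behaviour described above can be sketched as follows. This is an illustration, not the module's code: the multiword list is a plain dictionary instead of the XML format, the `<sent>` tag is assumed to mark end-of-sentence, joining the two lemmas follows the word1word2 example given earlier, and the part-of-speech restriction on intervening words is left out.

```python
import re

TOKEN = re.compile(r"\^([^<$]*)((?:<[^>]*>)*)\$")


def parse(stream):
    """Split '^lemma<tag>$ ...' into (lemma, tags) pairs."""
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(stream)]


def render(tokens):
    return " ".join(f"^{lemma}{tags}$" for lemma, tags in tokens)


def reorder(tokens, multiwords, sentence_end="<sent>"):
    """multiwords maps a first-part lemma to the set of its second-part lemmas."""
    out, buffer, pending = [], [], None
    for lemma, tags in tokens:
        if pending is None:
            if lemma in multiwords:
                pending = (lemma, tags)          # possible first part: start buffering
            else:
                out.append((lemma, tags))
        elif lemma in multiwords[pending[0]]:
            out.append((pending[0] + lemma, pending[1]))  # emit the multiword...
            out.extend(buffer)                            # ...then the buffered words
            pending, buffer = None, []
        elif sentence_end in tags:
            out.append(pending)                  # no second part: flush unchanged
            out.extend(buffer)
            out.append((lemma, tags))
            pending, buffer = None, []
        else:
            buffer.append((lemma, tags))
    if pending is not None:                      # input ended while still buffering
        out.append(pending)
        out.extend(buffer)
    return out
```

For example, `render(reorder(parse("^word1<vblex>$ ^foo$ ^bar$ ^word2<part>$"), {"word1": {"word2"}}))` gives `^word1word2<vblex>$ ^foo$ ^bar$`, matching the transformation in the project goals.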
Some work still needs to be done. The agreement multiword preprocessor relies on the generator to find all the possible surface forms in the analyser -- due to LR restrictions these may not always exist in the generator. It also should merge the dictionaries, to make it easier on the language pair maintainer. The discontiguous multiword module, as it is meant to be used as a runtime module, should ideally compile the XML multiword dictionary into an FST. It also needs more extensive testing to ensure that formatting is handled correctly. The schedule was not followed completely, and some parts were not completed -- generation of discontiguous multiwords was not touched upon, and the specification for the first module changed a lot. However, the discontiguous multiword analyser should not require that much work before a release.

