Difference between revisions of "Google Summer of Code/Report 2010"
(→Java runtime port (Kanmuri): First (and final?) draft of my report.) |
Darthxaher (talk | contribs) |
Revision as of 12:28, 22 October 2010
Tokenisation with HFST (aikoniv)
apertium-pl-cs (Aha)
apertium-fr-pt (Jalopuera)
This project was mentored by Francis Tyers and Gema Ramírez Sánchez and was worked on by Sean Healy.
apertium-fr-pt has advanced a reasonable amount during the project. The transfer lexicon contains around 16,000 items, which are also reflected in the morphological analysers/generators. The pair is testvoc clean, but does not yet pass a corpus check due to missing rules -- for example for verbal participles. Some rules have been worked on, but are in an incomplete state.
The pair used apertium-dixtools to produce a transfer lexicon from the apertium-fr-es (French to Spanish) and apertium-es-pt (Spanish to Portuguese) pairs. This transfer lexicon was then reviewed and fixed. The transfer rules from the apertium-fr-es pair were copied and the Spanish side "translated" to Portuguese. This gives an adequate basis for further work. Some extra rules were added for common patterns missing between French and Portuguese.
The GSoC week plan was not followed, and some parts were not completed. The transfer rules are not sufficiently advanced, and the system is not in a releasable state. No evaluation has been performed. However, sufficient progress has been made to earn a pass grade, as the amount of work needed to bring the system to release and evaluate it would be under a week.
apertium-fin-sme (pyry`)
Java runtime port (Kanmuri)
The core components of the runtime have been ported over to Java, with all the components scheduled in the proposal completed. This includes a basic text de/reformatter. Even some of the "extra tasks" were completed at the end.
The Java runtime can accept mode files created for the C++ runtime and run them with its own components in place of the C++ ones. Rather than using temp files and pipes to pass text between components, the runtime in what I call "pipeline" mode simply stores the intermediate results in memory and passes them between components as strings, wrapped up in Readers and Writers. This way the same code can handle standard i/o, files, and in-memory strings. This also makes it easier to use Apertium as a library for other code, since that code can just pass in strings and get strings back out.
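The "pipeline" idea can be illustrated with a minimal sketch. It is written in Python rather than Java, with `io.StringIO` standing in for Java's Readers and Writers; the stage functions and `run_pipeline` are hypothetical names for illustration, not components of the actual runtime.

```python
import io


def run_pipeline(stages, text):
    """Pass text through each stage, keeping intermediate results in memory."""
    for stage in stages:
        reader = io.StringIO(text)  # stands in for a Java Reader
        writer = io.StringIO()      # stands in for a Java Writer
        stage(reader, writer)
        text = writer.getvalue()    # output of this stage feeds the next one
    return text


# Hypothetical stages; real Apertium components would do analysis, transfer, etc.
def upper_stage(reader, writer):
    writer.write(reader.read().upper())


def exclaim_stage(reader, writer):
    writer.write(reader.read() + "!")


result = run_pipeline([upper_stage, exclaim_stage], "hello")
```

Because each stage only sees file-like objects, the same stage function could equally be handed `sys.stdin` and `sys.stdout` or a pair of open files, which is the property the paragraph above describes.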
The Java runtime also uses bytecode class files for transfer, instead of parsing XML files on each run. These class files can be pre-compiled or compiled on demand, but the latter requires that the runtime be run inside a JDK JVM and not a JRE one.
Also, a layer was created to funnel all file-opening calls through, to simplify cross-platform file and pathname issues. This matters especially when Cygwin on Windows is used for the C++ runtime and the Java runtime is fed things such as mode files from it. The Cygwin environment uses paths rooted in the Cygwin environment, which the JVM cannot automatically translate, as it uses standard Windows paths. The "cygpath" utility (part of the Cygwin runtime) is used to translate between Cygwin paths and the underlying Windows paths, and the conversion is done behind the scenes without the runtime needing to care whether it is a Unix, Cygwin, or Windows path.
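A rough sketch of such a file-opening layer follows, in Python rather than Java. The function names, the `is_cygwin_path` flag, and the injectable `run` parameter are assumptions made for illustration; how the actual runtime decides that a path is a Cygwin path is not described here. The only external tool used is `cygpath -w`, which prints the Windows form of a path.

```python
import subprocess


def to_native_path(path, is_cygwin_path, run=subprocess.run):
    """Return a path the host platform can open.

    Cygwin-rooted paths are translated with `cygpath -w`;
    all other paths are returned unchanged.
    """
    if not is_cygwin_path:
        return path
    completed = run(
        ["cygpath", "-w", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


def open_file(path, mode="r", is_cygwin_path=False, run=subprocess.run):
    """Single entry point for opening files (hypothetical helper)."""
    return open(to_native_path(path, is_cygwin_path, run), mode)
```

Routing every open call through one function like `open_file` keeps the path translation in a single place, so the rest of the code never has to know which kind of path it was given.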
Things that need to be worked on still include:
- Getting it packageable and packaged.
- De/reformatters for other formats, especially HTML
- Cleaning up exception handling, including (re)moving many System.exit() calls.
VM for transfer (darthxaher)
The main goal of this project was to speed up Apertium's transfer system. Currently, the transfer system becomes the main bottleneck for language pairs with complex transfer systems because of the XML processing associated with it. The basic idea was:
- Writing a compiler that generates pseudo-assembly like code from the XML files
- Writing a virtual machine that can read instructions from the pseudo-assembly file and run Apertium's usual transfer mechanism.
The compiler was written in Python. Most of the tags from transfer XML files have been converted to instruction sets, although some case-related instructions, e.g. modify-case and case-of, have not been properly implemented. Other than that, the compiler is in a fairly stable state. A 500 KB t1x transfer file takes 10 seconds to compile into the pseudo-assembly format. This is generally not a problem, as compilation is only a pre-processing step and does not affect the actual runtime. Some refactoring and maintenance is in progress to fix the remaining small issues.
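To give a feel for the compilation step, here is a minimal sketch that turns a small transfer XML condition into stack-style pseudo-assembly. The tags handled (`equal`, `clip`, `lit`, `lit-tag`) come from the transfer format, but the instruction mnemonics and the `compile_node` function are invented for this illustration and are not the project's actual instruction set.

```python
import xml.etree.ElementTree as ET


def compile_node(node, out):
    """Append pseudo-instructions for one XML node (operands first, then operator)."""
    if node.tag == "clip":
        # Hypothetical mnemonic: push a part of the word at the given position.
        out.append("push-clip {} {} {}".format(
            node.get("pos"), node.get("side"), node.get("part")))
    elif node.tag == "lit":
        out.append("push-str {}".format(node.get("v")))
    elif node.tag == "lit-tag":
        out.append("push-tag {}".format(node.get("v")))
    elif node.tag == "equal":
        for child in node:
            compile_node(child, out)
        out.append("cmp-eq")
    else:
        raise NotImplementedError("tag not handled in this sketch: " + node.tag)
    return out


fragment = '<equal><clip pos="1" side="sl" part="gen"/><lit-tag v="f"/></equal>'
program = compile_node(ET.fromstring(fragment), [])
# program == ["push-clip 1 sl gen", "push-tag f", "cmp-eq"]
```

Emitting operands before the operator means the resulting program can be run on a simple operand stack, without re-reading the XML at translation time.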
Only very preliminary work has been done on the virtual machine. It would need significantly more work to become a full-fledged VM. Suffice it to say, the GSoC deadlines were not kept throughout.
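For orientation, the core of such a virtual machine can be sketched as a loop over pseudo-assembly lines operating on an operand stack. The three instructions below and the dictionary used to represent a word are hypothetical simplifications chosen for this example, not the design of the VM started during the project.

```python
def run(program, word):
    """Execute pseudo-assembly lines and return the value left on the stack.

    `word` maps attribute names (e.g. "gen") to values for the matched word;
    it is a stand-in for the real transfer data structures.
    """
    stack = []
    for line in program:
        op, _, arg = line.partition(" ")
        if op == "push":
            stack.append(arg)
        elif op == "push-attr":
            stack.append(word.get(arg, ""))
        elif op == "cmp-eq":
            right = stack.pop()
            left = stack.pop()
            stack.append(left == right)
        else:
            raise ValueError("unknown instruction: " + op)
    return stack.pop()


program = ["push-attr gen", "push f", "cmp-eq"]
is_feminine = run(program, {"lem": "maison", "gen": "f"})
```

A full VM would additionally need instructions for pattern matching, control flow, variables, and output generation, which is where the remaining work mentioned above lies.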