Frankier/GSOC 2016 submission




Revision as of 19:00, 22 August 2016

This page documents the work I did related to Apertium during GSOC 2016. It covers work done directly on the project goals as well as some ancillary work.

Main work

Work on better integrating CG into apertium-tagger - incomplete - unmerged.

Work on perceptron tagger - basically complete although there are possible improvements to be made - unmerged.

These are both available here: https://github.com/frankier/apertium-core/tree/apertium-tagger-dev . Currently the perceptron tagger sits on top of the incomplete CG work. Probably the best approach is for me to drop the commits that relate only to the CG work and keep the common work when merging the perceptron work. It would be good to get the CG part working at a later stage, since at the very least it would be useful if it could be integrated into tagging.

The numerical results for the perceptron tagger are available on comparison of part-of-speech tagging systems. Currently it shows a small but definite improvement over the bigram tagger.

Supporting work

Some bug fixes and refactoring not directly related to the project have already made their way into trunk. Also, when lttoolbox needed to be changed, I just changed it directly.

I've put various bits of scratch code here: https://github.com/frankier/apertiumhmm2dot (this repository also contains the coding challenge). This might be useful to other people wanting to get started with working on the tagger in future.

Had an idea for automatically fixing out-of-sync corpora and started an "MVP" here: https://github.com/frankier/apertium-sync-corpus

Set up Jenkins here: http://swobu.frankie.robertson.name:49001 . During the project I mainly used this to run my own stuff, but it is generally useful for collaborative work. For example, it could be used to automatically run Lint and poke relevant people on IRC about it. Another application is letting people see at a glance the quality of a language pair in terms of how easy it is to adopt (rather than the quality of its output). It could also be used to help keep corpora, tagger models and morphologies in sync (through poking and possibly automatic fixing when feasible). It can be moved to another place easily by setting up this Docker image: https://hub.docker.com/r/frankierr/docker-jenkins-apertium/ and rsync'ing its workspace (which is bind mounted in Docker).
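The move described above could look roughly like the following sketch, which only builds the two command lines and does not run them. Only the image name comes from this page. The old host name, the workspace directory, the container port (8080) and the container-side workspace path (/var/jenkins_home) are assumptions for illustration and should be checked against the image before use.

```python
import shlex

IMAGE = "frankierr/docker-jenkins-apertium"  # image named on this page


def migration_commands(old_host, workspace_dir, host_port=49001,
                       container_port=8080,
                       container_workspace="/var/jenkins_home"):
    """Return the rsync and docker commands as argument lists.

    Nothing is executed. container_port and container_workspace are
    assumed defaults, not values confirmed for this image.
    """
    # Copy the bind-mounted workspace from the old machine to this one.
    rsync = ["rsync", "-a",
             f"{old_host}:{workspace_dir}/", f"{workspace_dir}/"]
    # Start the container with the copied directory bind mounted.
    docker = ["docker", "run", "-d",
              "-p", f"{host_port}:{container_port}",
              "-v", f"{workspace_dir}:{container_workspace}",
              IMAGE]
    return rsync, docker


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    for cmd in migration_commands("old.example.org", "/srv/jenkins"):
        print(shlex.join(cmd))
```

Printing the commands first rather than running them makes it easy to review the paths and ports before touching either machine.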