Frankier/GSOC 2016 submission

From Apertium
Revision as of 19:04, 22 August 2016 by Frankier (talk | contribs)

This page documents the work I did on Apertium during GSOC 2016. It covers work done directly on the project goals as well as some ancillary work.

Main work

Work on better integrating CG into apertium-tagger - incomplete - unmerged.

Work on perceptron tagger - basically complete although there are possible improvements to be made - unmerged.

These are both available here:

The CG work is (mostly) in this commit:

The perceptron work is in the subsequent commits, that is 46cf4fb15e4fb64d967a4012837c61412e1bbb64 to ae86f0700f8f33e802a320614613b47157d440df.
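For anyone wanting to inspect that range in a local clone, the following is a minimal sketch of how the commits could be listed. The helper names and the `repo_dir` argument are hypothetical, and it assumes the history between the two commits is linear.

```python
import subprocess

# Commit hashes quoted in the text above.
FIRST = "46cf4fb15e4fb64d967a4012837c61412e1bbb64"
LAST = "ae86f0700f8f33e802a320614613b47157d440df"


def range_log_command(first, last):
    """Build a `git log` command covering first..last, both ends included."""
    # `first^..last` selects commits reachable from `last` but not from the
    # first parent of `first`, so `first` itself is part of the listing.
    return ["git", "log", "--oneline", "--reverse", f"{first}^..{last}"]


def list_commits(repo_dir, first=FIRST, last=LAST):
    """Run the command inside repo_dir (hypothetical path to a local clone)."""
    result = subprocess.run(
        range_log_command(first, last),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

Calling `list_commits("path/to/clone")` would return one line per commit, oldest first, which may help when deciding which commits belong only to the CG work.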

Note that the perceptron tagger currently sits on top of the incomplete CG work. Probably the best approach is for me to delete the commits which relate only to the CG work and keep the common work when merging the perceptron work. It would be good to get the CG part working at a later stage, since integrating it into tagging, at the very least, seems like it would be useful.

The numerical results for the perceptron tagger are available on the comparison of part-of-speech tagging systems page. Currently there is a small but definite improvement over the bigram tagger.

Supporting work

Some bug fixes and refactoring not directly related to the project have already made their way into trunk. Also, when lttoolbox needed to be changed, I changed it directly.

I've put various bits of scratch code here: (this repository also contains the coding challenge). This might be useful to other people wanting to get started with working on the tagger in future.

I had an idea for fixing out-of-sync corpora automatically and started an "MVP" here:

I set up Jenkins here: . During the project I mainly used it to run my own jobs, but it is generally useful for collaborative development. For example, it could be used to run Lint automatically and poke the relevant people on IRC about the results. Another application is letting people see at a glance the quality of a language pair in terms of how easy it is to adopt (rather than the quality of its output). It could also be used to help keep corpora, tagger models and morphologies in sync (through poking and, where feasible, automatic fixing). It can be moved to another host easily by setting up this Docker image: and rsync'ing its workspace (which is bind mounted in Docker).