UD annotatrix/UD annotatrix at GSoC 2017
Commitment
The contributions were made to the master branch of the UD annotatrix repository on GitHub (last commit).
The idea
A dependency treebank is a corpus of sentences annotated with dependency structure. It can be used both for linguistic research and for training statistical parsers, which in turn serve various natural language processing tasks. Creating a good treebank requires manual annotation and/or disambiguation. There is a tool for syntactic annotation called brat, but it has a number of issues. Firstly, it does not allow the user to edit the source. Secondly, it does not allow the user to edit tokenisation. More generally, the interface lacks many features that would be useful for annotation. Finally, it requires a web server in order to be used by a team of annotators.
Before GSoC 2017 Apertium had a web-interface for visualising syntactic trees. The interface allowed the user to either enter their trees in the text area or upload a treebank from a file and switch between sentences.
The aim of this project was to create an easy-to-use, quick and interactive interface tool for Universal Dependencies annotation based on the existing Apertium project. The tool should work both online and offline and allow a user to edit the annotation in both graphical and text modes.
Main contributions
Visualisation
Originally, the tool used brat's JavaScript library for visualisation. As part of this project, I rewrote the visualisation component using the Cytoscape library. Cytoscape is a JS graph library developed primarily for biologists, but available for other purposes. This was done to add functionality that brat's visualisation library could not provide, namely easier implementation of editing and alignment settings (right-to-left, top-to-bottom).
The source code for visualisation support is located in ./standalone/lib/visualiser.js and ./standalone/lib/cy-style.js.
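To illustrate how a dependency tree maps onto a graph library such as Cytoscape, here is a minimal sketch. It is not the code from visualiser.js: the function name sentenceToElements and the token shape { id, form, head, deprel } are assumptions made for this example. Only the output format (an array of nodes and edges, each with a data object) follows Cytoscape's element JSON.

```javascript
// Hypothetical helper: turn a parsed sentence into Cytoscape "elements".
// Each token is { id, form, head, deprel }, named after CoNLL-U columns.
function sentenceToElements(tokens) {
  const elements = [];
  for (const tok of tokens) {
    // One node per token; Cytoscape expects a string "id" under "data".
    elements.push({ group: "nodes", data: { id: "t" + tok.id, label: tok.form } });
  }
  for (const tok of tokens) {
    // HEAD 0 marks the root in CoNLL-U, so the root gets no incoming arc.
    if (tok.head > 0) {
      elements.push({
        group: "edges",
        data: {
          id: "e" + tok.head + "-" + tok.id,
          source: "t" + tok.head, // arcs run from head to dependent
          target: "t" + tok.id,
          label: tok.deprel
        }
      });
    }
  }
  return elements;
}

const example = [
  { id: 1, form: "She", head: 2, deprel: "nsubj" },
  { id: 2, form: "sleeps", head: 0, deprel: "root" }
];
console.log(JSON.stringify(sentenceToElements(example), null, 2));
```

In a browser with Cytoscape loaded, the resulting array could be passed as the elements option of cytoscape({ container: ..., elements: ... }).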
Editing functionality
Currently, the interface allows the user to:
- draw dependencies between tokens
- edit dependency relations
- delete dependencies
- edit POS labels
- edit tokens
Editing POS labels, editing deprels, drawing arcs and deleting arcs are undoable and redoable.
The source code for editing support is mostly located in ./standalone/lib/gui.js.
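One common way to make edits undoable and redoable is to keep two stacks of actions. The sketch below shows that general technique only; it is not taken from gui.js, and the names EditHistory and changeDeprel are hypothetical.

```javascript
// Hypothetical history of editing actions; each action knows how to
// apply itself ("redo") and how to revert itself ("undo").
class EditHistory {
  constructor() {
    this.undoStack = [];
    this.redoStack = [];
  }
  perform(action) {
    action.redo();
    this.undoStack.push(action);
    this.redoStack = []; // a new edit invalidates anything that was undone
  }
  undo() {
    const action = this.undoStack.pop();
    if (!action) return false; // nothing to undo
    action.undo();
    this.redoStack.push(action);
    return true;
  }
  redo() {
    const action = this.redoStack.pop();
    if (!action) return false; // nothing to redo
    action.redo();
    this.undoStack.push(action);
    return true;
  }
}

// Example action: change the dependency relation of one token.
function changeDeprel(token, newDeprel) {
  const oldDeprel = token.deprel; // remembered so the edit can be reverted
  return {
    redo: () => { token.deprel = newDeprel; },
    undo: () => { token.deprel = oldDeprel; }
  };
}

const tok = { id: 1, form: "She", head: 2, deprel: "dep" };
const edits = new EditHistory();
edits.perform(changeDeprel(tok, "nsubj"));
edits.undo();
console.log(tok.deprel); // "dep"
edits.redo();
console.log(tok.deprel); // "nsubj"
```

The other undoable operations listed above (editing POS labels, drawing arcs, deleting arcs) could be expressed as further action objects with the same redo/undo pair.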
Format conversion
The interface supports the CoNLL-U and CG3 formats and can convert data between them. It also allows the user to upload or paste corpora in plain text and then convert them into CoNLL-U. The source code for conversion support is located in ./standalone/lib/coverters.js and ./standalone/lib/CG2conllu.js.
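As an illustration of the plain text to CoNLL-U direction, the following sketch produces an unannotated CoNLL-U skeleton for one sentence. It is an assumption-laden example, not the project's converter: the function names tokenise and plainTextToConllu are hypothetical, and the tokeniser is deliberately naive (it splits on whitespace and detaches punctuation).

```javascript
// Naive tokeniser: runs of letters/digits form tokens, and every other
// non-space character becomes a token of its own.
// Real corpora need language-specific tokenisation; this is only a sketch.
function tokenise(sentence) {
  return sentence.match(/[\p{L}\p{N}]+|[^\s\p{L}\p{N}]/gu) || [];
}

// Build a CoNLL-U block with the ten standard columns:
// ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
// Everything except ID and FORM is left unspecified ("_").
function plainTextToConllu(sentence) {
  const lines = ["# text = " + sentence];
  tokenise(sentence).forEach((form, i) => {
    const columns = [String(i + 1), form, "_", "_", "_", "_", "_", "_", "_", "_"];
    lines.push(columns.join("\t"));
  });
  return lines.join("\n") + "\n\n"; // each sentence ends with a blank line
}

console.log(plainTextToConllu("The boat sank."));
```

The annotator would then fill in the POS, head and relation columns, either in the text view or graphically.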
Server version
There is also a ...
The product
The web-interface is currently available on GitHub pages. A basic manual for the interface is provided on the help page. The editing functionality is briefly described in Editing functionality.
The product has some functionality not present in other tools:
- It supports visualisation of multiword tokens
- It supports right-to-left alignment (e.g., for sentences in Arabic or Hebrew)
- It supports vertical alignment, which makes editing of long sentences more convenient
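The alignment modes above can be thought of as different ways of assigning coordinates to tokens. The sketch below is a hypothetical illustration of that idea, not the project's layout code: the function layoutTokens, the mode names "ltr", "rtl" and "vertical", and the default spacing are all assumptions.

```javascript
// Hypothetical layout helper: returns an { x, y } position for every token.
// "ltr" and "rtl" place tokens on one row; "vertical" stacks them top to bottom.
function layoutTokens(forms, mode, spacing = 100) {
  return forms.map((form, i) => {
    const offset = i * spacing;
    if (mode === "vertical") {
      return { form, x: 0, y: offset };
    }
    if (mode === "rtl") {
      // The first token sits at the far right and later tokens move left.
      return { form, x: (forms.length - 1) * spacing - offset, y: 0 };
    }
    return { form, x: offset, y: 0 }; // default: left to right
  });
}

console.log(layoutTokens(["a", "b", "c"], "rtl"));
console.log(layoutTokens(["a", "b", "c"], "vertical"));
```

Positions computed this way could, for example, be handed to a graph library that accepts fixed node coordinates.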
The project's architecture and components
Dependencies
All the JS dependencies needed for the standalone version are included in the package. These are:
- jQuery
- Cytoscape
- head.js
- a JS library for parsing CoNLL-U written by Magdalena Parks.
The server version is written in Python 3 using Flask.
The project's package consists of server and standalone sub-directories.
Standalone
The standalone sub-directory contains the version of the product which can function without the server package. This directory contains the dependencies (listed above, located in directory ./ext) and the main native code of the project. The native code consists of:
- annotator.js
Server
The server directory contains additional support ...
Usability testing
To evaluate the usability of the interface, a small Polish treebank was annotated. The language data is an extract from the Polish translation of "Three Men in a Boat".
To be done
All the existing bugs and plans are listed on the issue page of the main repository. The main directions for further development could be:
- Improving the server version. Currently, the only thing the server version allows the user to do is save corpora on the server. More advanced functionality could be:
  - The ability to create user accounts and have "projects" with a number of uploaded corpora
  - Managing editing history
- Improving the GUI functionality
- More on format conversion
  - SD parse
- Work on style
  - Improving the style of the website
  - Adding user-selected style settings