UD annotatrix/UD annotatrix at GSoC 2017


Commitment

The contributions were made to the master branch of the UD annotatrix repository on GitHub (last commit).

The idea

A dependency treebank is a corpus of sentences annotated with dependency structure. It can be used both for linguistic research and for training statistical parsers, which in turn serve various natural language processing tasks. Creating a good treebank requires manual annotation and/or disambiguation. There is an existing tool for syntactic annotation called brat, but it has a number of issues. Firstly, it does not allow the user to edit the source. Secondly, it does not allow the user to edit tokenisation. Finally, it requires a web server in order to be used by a team of annotators. In short, it lacks a number of features that would be very useful for annotation.

Before GSoC 2017, Apertium had a web interface for visualising syntactic trees. The interface allowed the user either to enter trees in a text area or to upload a treebank from a file, and to switch between sentences.

The aim of this project was to create an easy-to-use, quick and interactive interface tool for Universal Dependencies annotation based on the existing Apertium project. The tool should work both online and offline and allow a user to edit the annotation in both graphical and text modes.

Main contributions

Visualisation

Originally, the tool used brat's JavaScript library for visualisation. As part of this project, I rewrote the visualisation part using the Cytoscape library. Cytoscape is a JS graph library developed primarily for biologists, but available for other purposes as well. The rewrite was done to add functionality that brat's visualisation library could not provide, namely an easier implementation of editing functionality and alignment settings (right-to-left, top-to-bottom).

The source code for visualisation support is located in ./standalone/lib/visualiser.js and ./standalone/lib/cy-style.js.

Editing functionality

Currently, the interface allows the user to:

  • draw dependencies between tokens
  • edit dependency relations
  • delete dependencies
  • edit POS labels
  • edit tokens

Editing POS labels, editing dependency relations (deprels), drawing arcs and deleting arcs can all be undone and redone.

The source code for editing support is mostly located in ./standalone/lib/gui.js.
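
The undo/redo behaviour described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from gui.js; the names Token, History and draw_arc are hypothetical, and the two-stack design is only one common way to implement such a history.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A token with the fields this sketch edits (hypothetical structure)."""
    form: str
    upos: str = "_"
    head: int = 0       # 0 means "no head assigned" in this sketch
    deprel: str = "_"


@dataclass
class History:
    """Keeps (do, undo) pairs on two stacks so that edits can be reverted."""
    undo_stack: list = field(default_factory=list)
    redo_stack: list = field(default_factory=list)

    def apply(self, do, undo):
        do()
        self.undo_stack.append((do, undo))
        self.redo_stack.clear()  # a new edit invalidates the redo history

    def undo(self):
        if self.undo_stack:
            do, undo = self.undo_stack.pop()
            undo()
            self.redo_stack.append((do, undo))

    def redo(self):
        if self.redo_stack:
            do, undo = self.redo_stack.pop()
            do()
            self.undo_stack.append((do, undo))


def draw_arc(history, token, head, deprel):
    """Attach `token` to `head` with relation `deprel`, as an undoable edit."""
    old = (token.head, token.deprel)

    def do():
        token.head, token.deprel = head, deprel

    def undo():
        token.head, token.deprel = old

    history.apply(do, undo)
```

Under these assumptions, calling draw_arc and then history.undo() restores the previous head and deprel of the token, and history.redo() applies the arc again. Deleting an arc or editing a POS label could be expressed as further (do, undo) pairs in the same way.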

Format conversion

The interface supports the CoNLL-U and CG3 formats and can convert data between them. It also allows the user to upload or paste corpora in plain text and then convert them into CoNLL-U.

The source code for conversion support is located in ./standalone/lib/converters.js and ./standalone/lib/CG2conllu.js.
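
To show what a plain text to CoNLL-U conversion involves, here is a minimal sketch in Python. It is not the code from the converter modules: the function name text_to_conllu, the one-sentence-per-line input and the naive tokenisation rule are all assumptions made for illustration. Only the output layout (comment lines, ten tab-separated columns per token, underscores for unspecified fields, a blank line after each sentence) follows the CoNLL-U format.

```python
import re


def text_to_conllu(text):
    """Convert plain text (one sentence per line) into unannotated CoNLL-U."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    blocks = []
    for sent_id, sentence in enumerate(sentences, start=1):
        # Naive tokenisation (an assumption): runs of word characters and
        # single punctuation marks become separate tokens.
        forms = re.findall(r"\w+|[^\w\s]", sentence)
        rows = [f"# sent_id = {sent_id}", f"# text = {sentence}"]
        for index, form in enumerate(forms, start=1):
            # Columns: ID, FORM, then LEMMA, UPOS, XPOS, FEATS, HEAD,
            # DEPREL, DEPS, MISC, all left unspecified as "_".
            rows.append("\t".join([str(index), form] + ["_"] * 8))
        blocks.append("\n".join(rows))
    # Every sentence, including the last one, is followed by a blank line.
    return "\n\n".join(blocks) + "\n\n" if blocks else ""
```

The result contains no annotation yet; the annotator would then fill in the POS labels, heads and dependency relations in the interface.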

Server version

The project also includes a module, written in Python 3 with Flask, which makes it possible to deploy the tool on a server. The server version supports saving user corpora on the server and then accessing the saved corpora via a unique URL.

The source code for server version support is located in ./server.
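
The idea of saving a corpus and retrieving it by a unique identifier can be sketched as follows. This is an illustration using only the Python standard library, independent of Flask and not taken from the ./server code; the function names, the file naming scheme and the use of random UUIDs are assumptions.

```python
import uuid
from pathlib import Path


def save_corpus(text, storage_dir):
    """Store a corpus under a fresh random identifier and return that identifier."""
    storage = Path(storage_dir)
    storage.mkdir(parents=True, exist_ok=True)
    corpus_id = uuid.uuid4().hex  # 32 lowercase hexadecimal characters
    (storage / f"{corpus_id}.conllu").write_text(text, encoding="utf-8")
    return corpus_id


def load_corpus(corpus_id, storage_dir):
    """Return the corpus saved under `corpus_id`."""
    # Accept only identifiers of the shape produced by save_corpus, so that
    # a crafted identifier cannot point outside the storage directory.
    if len(corpus_id) != 32 or any(c not in "0123456789abcdef" for c in corpus_id):
        raise ValueError("malformed corpus identifier")
    return (Path(storage_dir) / f"{corpus_id}.conllu").read_text(encoding="utf-8")
```

In a web application, the returned identifier would typically become the last segment of the URL given to the user, and the handler for that URL would call something like load_corpus.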

The product

The web interface is currently available on GitHub Pages. A basic manual for the interface is provided on the help page. The editing functionality is briefly described in Editing functionality.

The product has some functionality not present in other tools:

  • It supports visualisation of multiword tokens
  • It supports right-to-left alignment (e.g., for sentences in Arabic or Hebrew)
  • It supports vertical alignment, which makes editing of long sentences more convenient

The project's architecture and components

Dependencies

All the JS dependencies needed for the standalone version are included in the package and are located in ./standalone/lib/ext/.

The project's package consists of the server and standalone sub-directories.

Standalone

The standalone sub-directory contains the version of the product that can function without the server package. This directory contains the dependencies (in the ext sub-directory of ./standalone/lib/) and the project's own code. The project's own code is located in the root of ./standalone/lib/ and consists of:

  • annotator.js
  • gui.js
  • visualiser.js
  • converters.js
  • cy-style.js
  • CG2conllu.js

Server

The server version is written in Python 3 with Flask. The server directory contains additional support for deploying the web interface on a web server.

Usability testing

To evaluate the usability of the interface, a small Polish treebank was annotated. The language data is an extract from the Polish translation of "Three Men in a Boat". During the annotation, a number of bugs were found and filed on the project's issue page on GitHub.

To be done

All the existing bugs and plans are listed on the issue page of the main repository. The main directions for further development could be:

  • Improving the server version. Currently, the only thing the server version allows the user to do is save corpora on the server. More advanced functionality could be:
    • The ability to create user accounts and have "projects" with a number of uploaded corpora
    • Support for storing the editing history and enabling the user to go back to some point in it
  • Improving the GUI functionality, e.g.:
    • Adding an interface for disambiguation
  • More on format conversion
    • Support for SD parse format
  • Work on style
    • Improving style of the website
    • Adding user-selected style settings