Difference between revisions of "UD annotatrix/UD annotatrix at GSoC 2017"


Revision as of 15:21, 29 August 2017

Commitment

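The contributions were made to the master branch of the UD annotatrix repository on GitHub (https://github.com/jonorthwash/ud-annotatrix); see the last commit (https://github.com/jonorthwash/ud-annotatrix/commit/41d9d0bb19b804a82347fc4dc2e99d6a83887f91).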

The idea

A dependency treebank is a corpus of sentences annotated with dependency structure. It can be used both for linguistic research and for training statistical parsers, which in turn serve various natural language processing tasks. Creating a good treebank requires manual annotation and/or disambiguation. There is a tool for syntactic annotation called brat; however, it has a number of issues. Firstly, it does not allow the user to edit the source. Secondly, it does not allow the user to edit the tokenisation. More generally, the interface lacks many features that would be useful for annotation. Finally, it requires a web server in order to be used by a team of annotators.

Before GSoC 2017, Apertium had a web interface for visualising syntactic trees. The interface allowed the user either to enter trees in a text area or to upload a treebank from a file, and to switch between sentences.

The aim of this project was to create an easy-to-use, quick and interactive interface tool for Universal Dependencies annotation based on the existing Apertium project. The tool should work both online and offline and allow a user to edit the annotation in both graphical and text modes.

Main contributions

Visualisation

Originally, the tool used brat's JavaScript library for visualisation. As part of this project, I rewrote the visualisation component using the Cytoscape library. Cytoscape is a JS graph library developed primarily for biologists but available for other purposes. The change was made to add functionality that brat's visualisation library could not provide, namely easier implementation of editing and alignment settings (right-to-left, top-to-bottom).

The source code for visualisation support is located in ./standalone/lib/visualiser.js and ./standalone/lib/cy-style.js.
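To illustrate how a dependency tree maps onto a graph library such as Cytoscape, here is a minimal sketch that turns a list of tokens into Cytoscape-style element objects (nodes for tokens, edges for dependencies). The token shape and the function name `toCytoscapeElements` are assumptions made for this example and are not taken from visualiser.js.

```javascript
// Hypothetical token shape: { id, form, head, deprel }; head 0 marks the root.
function toCytoscapeElements(tokens) {
  // One node per token, labelled with the word form.
  const nodes = tokens.map((t) => ({
    group: "nodes",
    data: { id: `t${t.id}`, label: t.form },
  }));
  // One edge per non-root token, pointing from the head to the dependent.
  const edges = tokens
    .filter((t) => t.head !== 0)
    .map((t) => ({
      group: "edges",
      data: {
        id: `e${t.head}-${t.id}`,
        source: `t${t.head}`,
        target: `t${t.id}`,
        label: t.deprel,
      },
    }));
  return [...nodes, ...edges];
}

const sentence = [
  { id: 1, form: "Dogs", head: 2, deprel: "nsubj" },
  { id: 2, form: "bark", head: 0, deprel: "root" },
];
const elements = toCytoscapeElements(sentence);
console.log(JSON.stringify(elements, null, 2));
```

In a browser, an array like this could be passed as the `elements` option of `cytoscape({ container, elements, style })`, with a style rule such as `label: "data(label)"` to display the labels.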

Editing functionality

Currently, the interface allows the user to:

  • draw dependencies between tokens
  • edit dependency relations
  • delete dependencies
  • edit POS labels
  • edit tokens

Editing POS labels, editing deprels, drawing arcs and deleting arcs are undoable and redoable.
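One common way to make edits undoable and redoable is to keep two stacks of reversible commands. The sketch below shows that general technique only; the `History` class and the `changeDeprel` helper are hypothetical names for this example and do not describe how gui.js actually implements it.

```javascript
// A minimal undo/redo history built on two stacks of reversible commands.
class History {
  constructor() {
    this.undoStack = [];
    this.redoStack = [];
  }

  // Apply a new command and record it; a fresh edit discards the redo stack.
  execute(command) {
    command.apply();
    this.undoStack.push(command);
    this.redoStack.length = 0;
  }

  // Revert the most recent command, if any. Returns false when nothing to undo.
  undo() {
    const command = this.undoStack.pop();
    if (!command) return false;
    command.revert();
    this.redoStack.push(command);
    return true;
  }

  // Re-apply the most recently undone command, if any.
  redo() {
    const command = this.redoStack.pop();
    if (!command) return false;
    command.apply();
    this.undoStack.push(command);
    return true;
  }
}

// Hypothetical command: change the dependency relation of a token object.
function changeDeprel(token, newDeprel) {
  const oldDeprel = token.deprel;
  return {
    apply: () => { token.deprel = newDeprel; },
    revert: () => { token.deprel = oldDeprel; },
  };
}

const history = new History();
const token = { form: "Dogs", deprel: "obj" };
history.execute(changeDeprel(token, "nsubj"));
history.undo();
const afterUndo = token.deprel;
history.redo();
const afterRedo = token.deprel;
console.log(afterUndo, afterRedo);
```

Each kind of edit (relabelling a POS tag, drawing an arc, deleting an arc) would get its own command with a matching `apply`/`revert` pair.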

The source code for editing support is mostly located in ./standalone/lib/gui.js.

Format conversion

The interface works with the CoNLL-U and CG3 formats and converts data between them. It also allows ... plain text.

The source code for conversion support is located in ./standalone/lib/coverters.js and ./standalone/lib/CG2conllu.js.
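As a rough illustration of the CoNLL-U side of this, the sketch below parses one sentence (ten tab-separated columns per token line, comment lines starting with "#") into objects and serialises it back. The function names are assumptions for this example, it ignores CG3 entirely, and it is not the project's converter code.

```javascript
// The ten CoNLL-U columns, in order.
const FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"];

// Parse a single CoNLL-U sentence into comment lines and token objects.
function parseConlluSentence(text) {
  const comments = [];
  const tokens = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // blank lines separate sentences
    if (line.startsWith("#")) {
      comments.push(line);
      continue;
    }
    const cols = line.split("\t");
    if (cols.length !== FIELDS.length) {
      throw new Error(`Expected ${FIELDS.length} columns, got ${cols.length}: ${line}`);
    }
    const token = {};
    FIELDS.forEach((name, i) => { token[name] = cols[i]; });
    tokens.push(token);
  }
  return { comments, tokens };
}

// Serialise the parsed structure back to CoNLL-U text.
function serialiseConlluSentence({ comments, tokens }) {
  const rows = tokens.map((t) => FIELDS.map((name) => t[name]).join("\t"));
  return [...comments, ...rows].join("\n") + "\n";
}

const sample = [
  "# text = Dogs bark",
  ["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"].join("\t"),
  ["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"].join("\t"),
].join("\n") + "\n";

const parsed = parseConlluSentence(sample);
const roundTripped = serialiseConlluSentence(parsed);
console.log(parsed.tokens.map((t) => `${t.form}/${t.upos}`).join(" "));
```

Keeping every column as a string, including the ID, means multiword-token ranges such as "1-2" survive a round trip unchanged.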

Server version

There is also a ...

The product

The web interface is currently available on GitHub pages. A basic manual for the interface is provided on the help page. The editing functionality is briefly described in Editing functionality.

The product has some functionality not present in other tools:

  • It supports visualisation of multiword tokens
  • It supports right-to-left alignment (e.g., for sentences in Arabic or Hebrew)
  • It supports vertical alignment, which makes editing of long sentences more convenient
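The alignment options can be thought of as different ways of assigning coordinates to the tokens of a sentence. The following is a minimal sketch under that assumption; the function `tokenPositions`, the alignment names, and the spacing value are hypothetical and not taken from the project.

```javascript
// Compute a position for each of `count` tokens under a given alignment.
// "ltr": first token on the left; "rtl": first token on the right;
// "vertical": tokens stacked top to bottom.
function tokenPositions(count, alignment, spacing = 100) {
  return Array.from({ length: count }, (_, i) => {
    switch (alignment) {
      case "ltr":
        return { x: i * spacing, y: 0 };
      case "rtl":
        return { x: (count - 1 - i) * spacing, y: 0 };
      case "vertical":
        return { x: 0, y: i * spacing };
      default:
        throw new Error(`Unknown alignment: ${alignment}`);
    }
  });
}

const ltr = tokenPositions(3, "ltr");
const rtl = tokenPositions(3, "rtl");
const vertical = tokenPositions(3, "vertical");
console.log(ltr, rtl, vertical);
```

With Cytoscape, positions like these could be attached to nodes and used with the `preset` layout, so that switching alignment only changes the coordinates and not the graph itself.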

The project's architecture and components

Dependencies

All the JS dependencies needed for the standalone version are included in the package. These are:

The server version is written in Python 3 using Flask.

The project's package consists of server and standalone sub-directories.

Standalone

The standalone sub-directory contains the version of the product which can function without the server package. This directory contains the dependencies (listed above, located in directory ./ext) and the main native code of the project. The native code consists of:

  • annotator.js

Server

The server directory contains additional support ...


Usability testing

To evaluate the usability of the interface, a small Polish treebank was annotated. The language data is an extract from the Polish translation of "Three Men in a Boat".

To be done

All existing bugs and plans are listed on the issues page of the main repository. The main directions for further development could be:

  • Improving the server version. Currently, the only thing the server version allows is saving user corpora on the server. More advanced functionality could be:
    • The ability to create user accounts and have "projects" with a number of uploaded corpora
    • Managing the editing history
  • Improving the GUI functionality
  • More on format conversion
    • SD parse
  • Work on style
    • Improving style of the website
    • Adding user-selected style settings