UD annotatrix/UD annotatrix at GSoC 2017

From Apertium

Latest revision as of 15:59, 29 August 2017

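Commitment

The contributions were made to the master branch of the UD annotatrix repository on GitHub (https://github.com/jonorthwash/ud-annotatrix). The last commit is https://github.com/jonorthwash/ud-annotatrix/commit/41d9d0bb19b804a82347fc4dc2e99d6a83887f91.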

The idea

A dependency treebank is a corpus of sentences annotated with dependency structure. It can be used both for linguistic research and for training statistical parsers, which in turn serve various natural language processing tasks. Creating a good treebank requires manual annotation and/or disambiguation. There is an existing tool for syntactic annotation called brat (http://brat.nlplab.org/). However, it has a number of issues. Firstly, it does not allow the user to edit the source. Secondly, it does not allow the user to edit the tokenisation. More generally, the interface lacks many features that would be very useful for annotation. Finally, it requires a web server in order to be used by a team of annotators.

Before GSoC 2017, Apertium had a web interface for visualising syntactic trees. The interface allowed the user either to enter trees in a text area or to upload a treebank from a file and switch between sentences.

The aim of this project was to create an easy-to-use, fast and interactive tool for Universal Dependencies annotation, based on the existing Apertium project. The tool should work both online and offline and allow the user to edit the annotation in both graphical and text modes.

Main contributions

Visualisation

Originally, the tool used brat's JavaScript library for visualisation. As a part of this project, I have rewritten the visualisation part using the Cytoscape library. Cytoscape is a JS graph library primarily developed for biologists, but available for use for other purposes. This was done to add functionality which brat's visualisation library could not provide, namely easier implementation of editing functionality and alignment settings (right-to-left, top-to-bottom).

The source code for visualisation support is located in ./standalone/lib/visualiser.js and ./standalone/lib/cy-style.js.
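As an illustration of how a dependency tree maps onto a graph library, the sketch below turns a parsed sentence into the element JSON that Cytoscape accepts (nodes and edges, each with a data object). The function name toCytoscapeElements and the token shape ({id, form, head, deprel}) are assumptions made for this example and are not taken from visualiser.js.

```javascript
// Hypothetical helper: turn parsed tokens into Cytoscape element JSON.
// The token shape ({id, form, head, deprel}) is an assumption of this sketch.
function toCytoscapeElements(tokens) {
  // One node per token, labelled with the word form.
  const nodes = tokens.map((t) => ({
    group: 'nodes',
    data: { id: 'n' + t.id, label: t.form },
  }));
  // One edge per token that has a non-root head, pointing head -> dependent.
  const edges = tokens
    .filter((t) => t.head > 0)
    .map((t) => ({
      group: 'edges',
      data: {
        id: 'e' + t.id,
        source: 'n' + t.head,
        target: 'n' + t.id,
        label: t.deprel,
      },
    }));
  return nodes.concat(edges);
}

const sentence = [
  { id: 1, form: 'Dogs', head: 2, deprel: 'nsubj' },
  { id: 2, form: 'bark', head: 0, deprel: 'root' },
];
const elements = toCytoscapeElements(sentence);
```

In a browser, an array like this could be passed as the elements option of cytoscape({ container, elements }); how the actual tool lays out and styles the graph is not shown here.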

Editing functionality

Currently, the interface allows the user to:

  • draw dependencies between tokens
  • edit dependency relations
  • delete dependencies
  • edit POS labels
  • edit tokens

Editing POS labels, editing dependency relations (deprels), drawing arcs and deleting arcs can all be undone and redone.

The source code for editing support is mostly located in ./standalone/lib/gui.js.
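Undoable and redoable edits of this kind are commonly implemented with two stacks of reversible actions. The following is a minimal sketch of that general technique, not the code in gui.js; the names EditHistory and setDeprel and the token shape are hypothetical.

```javascript
// Hypothetical undo/redo history built on two stacks of reversible actions.
class EditHistory {
  constructor() {
    this.undoStack = [];
    this.redoStack = [];
  }

  // Run an action and remember it so that it can be reverted later.
  execute(action) {
    action.apply();
    this.undoStack.push(action);
    this.redoStack.length = 0; // a new edit invalidates the redo history
  }

  // Revert the most recent action; returns false if there is nothing to undo.
  undo() {
    const action = this.undoStack.pop();
    if (!action) return false;
    action.revert();
    this.redoStack.push(action);
    return true;
  }

  // Re-apply the most recently undone action; returns false if there is none.
  redo() {
    const action = this.redoStack.pop();
    if (!action) return false;
    action.apply();
    this.undoStack.push(action);
    return true;
  }
}

// Example action: change the dependency relation of one token.
function setDeprel(token, newDeprel) {
  const oldDeprel = token.deprel;
  return {
    apply: () => { token.deprel = newDeprel; },
    revert: () => { token.deprel = oldDeprel; },
  };
}

const token = { id: 1, form: 'Dogs', head: 2, deprel: 'dep' };
const edits = new EditHistory();
edits.execute(setDeprel(token, 'nsubj'));
```

Calling edits.undo() restores the previous relation and edits.redo() applies the change again. Drawing or deleting an arc could be modelled as further actions with their own apply and revert functions.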

Format conversion

The interface supports the CoNLL-U and CG3 formats and can convert data between them. It also allows the user to upload or paste corpora in plain text and then convert them into CoNLL-U.

The source code for conversion support is located in ./standalone/lib/converters.js and ./standalone/lib/CG2conllu.js.
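To give an idea of what plain-text conversion produces, here is a minimal sketch that writes one sentence as CoNLL-U, with ten tab-separated fields per token and "_" for every unknown value. The function name plainTextToConllu is hypothetical, and the sketch assumes simple whitespace tokenisation, which may differ from what converters.js actually does.

```javascript
// Hypothetical converter: one whitespace-tokenised sentence -> CoNLL-U.
// Unannotated fields are written as "_", as the CoNLL-U format prescribes.
function plainTextToConllu(sentence) {
  const text = sentence.trim();
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const lines = ['# text = ' + text];
  tokens.forEach((form, i) => {
    // Field order: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    const fields = [String(i + 1), form, '_', '_', '_', '_', '_', '_', '_', '_'];
    lines.push(fields.join('\t'));
  });
  // A blank line terminates each sentence in CoNLL-U.
  return lines.join('\n') + '\n\n';
}

const conllu = plainTextToConllu('Dogs bark loudly');
```

The annotator can then fill in the HEAD and DEPREL columns either by drawing arcs in the graphical view or by editing the text directly.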

Server version

There is also a module that makes it possible to deploy the project on a server; it is written in Python 3 using Flask. The server version supports saving user corpora on the server and then accessing the saved corpora via a unique URL.

The source code for server version support is located in ./server.

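The product

The web interface is currently available on GitHub Pages (https://maryszmary.github.io/ud-annotatrix/standalone/annotator.html). A basic manual for the interface is provided on the help page (https://maryszmary.github.io/ud-annotatrix/standalone/help.html). The editing functionality is briefly described in the "Editing functionality" section.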

The product has some functionality not present in other tools:

  • It supports visualisation of multiword tokens
  • It supports right-to-left alignment (e.g., for sentences in Arabic or Hebrew)
  • It supports vertical alignment, which makes editing of long sentences more convenient

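The project's architecture and components

Dependencies

All the JS dependencies needed for the standalone version are included in the package. These include:

  • Cytoscape
  • head.js
  • a JS library for parsing CoNLL-U (https://github.com/FrancessFractal/conllu) written by Magdalena Parks (https://github.com/FrancessFractal)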

All the dependencies are located in ./standalone/lib/ext/.

The project's package consists of the server and standalone sub-directories.

Standalone

The standalone sub-directory contains the version of the product which can function without the server package. This directory contains the dependencies (listed above, located in directory ./ext) and the main native code of the project. The native code is located in the root of ./standalone/lib/ and consists of:

  • annotator.js
  • gui.js
  • visualiser.js
  • converters.js
  • cy-style.js
  • CG2conllu.js

Server

The server version is written in Python 3 using Flask. The server directory contains additional support for deploying the web interface on a web server.

Usability testing

To evaluate the usability of the interface, a small Polish treebank was annotated. The language data (https://svn.code.sf.net/p/apertium/svn/incubator/apertium-pol-rus/texts/jkj.pol.txt) is an extract from the Polish translation of "Three Men in a Boat". During the annotation, a number of bugs were found and filed on the project's issue page on GitHub.

To be done

All the existing bugs and plans are listed on the issue page (https://github.com/jonorthwash/ud-annotatrix/issues) of the main repository. The main directions for further development could be:

  • Improving the server version. Currently, the only thing the server version allows is saving user corpora on the server. More advanced functionality could include:
    • The ability to create user accounts and have "projects" with a number of uploaded corpora
    • Support for storing the editing history and enabling the user to go back to some point in it
  • Improving the GUI functionality, e.g.:
    • Adding an interface for disambiguation
  • More on format conversion
    • Support for the SD parse format
  • Work on style
    • Improving the style of the website
    • Adding user-selected style settings