User:Gor ar/proposal 2017
GSoC 2017 Proposal: UD and Apertium Integration.
Name: Gor Arakelyan
I am a second-year student at Yerevan State University (YSU), in the Department of Informatics and Applied Mathematics.
I am interested in natural language processing, especially for low-resource languages such as my native language, Armenian. Apertium seems to be a perfect platform for that.
The most important problem for Armenian NLP (and possibly for many other languages) is the lack of a properly annotated treebank. To help linguists annotate large amounts of text quickly, an annotation tool with an easy-to-use interface is required. I believe UD Annotatrix is a very good tool to start with.
UD Annotatrix should be extended to support fast and efficient annotation. I propose three stages: a basic version, integration with external tools, and an advanced version.
Basic version
- The main interface contains a large text box and a graph visualizer (similar to the current interface). There are buttons for navigating between sentences.
- There is an import button for loading a large corpus in any of several formats.
- There are 5 tabs above the text box, one per file format:
  1. CONLL-U
  2. Stanford Dependencies
  3. CG3 (text mode)
  4. CG3 (line mode)
  5. Raw text
- CG3 (line mode) displays a list of lines of text (instead of raw text), similar to what is described [here](https://github.com/jonorthwash/ud-annotatrix/issues/10). The up and down arrow keys select a line. The "Delete" key removes the current line. The "Enter" key removes all lines for the current word except the current one.
- CG3 (text mode) has a shortcut key to automatically add numbers next to the words (#1)
- Each pair of formats has a validation procedure, which determines whether the current format can be transformed into the other without losing information. For example, CG3 supports multiple annotations of the same word, which cannot be expressed in CONLL-U. In such cases the program does not allow switching to the other format.
- CONLL-U mode supports a shortcut key to split multiword tokens. The user can select part of the word in the second column and press Ctrl+M, and the program will automatically add two new rows according to the UD format. The split should be properly displayed in the graph.
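The multiword-token split described above can be sketched as a small function over the rows of one CONLL-U sentence. This is only an illustration under stated assumptions, not the tool's implementation: the name `split_multiword` and its parameters are hypothetical, the two new rows are left blank apart from ID and FORM, the DEPS column is not updated, and empty nodes (IDs such as `3.1`) are not renumbered.

```python
def split_multiword(rows, token_id, offset):
    """Split word `token_id` at character `offset` of its FORM column."""

    def shift(value):
        # IDs and HEADs that point past the split word move up by one.
        if value.isdigit() and int(value) > token_id:
            return str(int(value) + 1)
        return value

    blank = ["_"] * 8  # LEMMA..MISC are left for the annotator to fill in
    out = []
    for row in rows:
        cols = row.split("\t")
        if row.startswith("#") or len(cols) != 10:
            out.append(row)  # comments and blank lines pass through
            continue
        if cols[0] == str(token_id):
            form = cols[1]
            if not 0 < offset < len(form):
                raise ValueError("offset must fall strictly inside the form")
            # Range row for the surface token, then one row per part.
            out.append("\t".join([f"{token_id}-{token_id + 1}", form] + blank))
            out.append("\t".join([str(token_id), form[:offset]] + blank))
            out.append("\t".join([str(token_id + 1), form[offset:]] + blank))
            continue
        cols[0] = "-".join(shift(part) for part in cols[0].split("-"))
        cols[6] = shift(cols[6])
        out.append("\t".join(cols))
    return out


# Illustrative sentence; the annotation values are made up for the example.
SENTENCE = [
    "# text = I cannot swim",
    "1\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tcannot\t_\tAUX\t_\t_\t3\taux\t_\t_",
    "3\tswim\tswim\tVERB\t_\t_\t0\troot\t_\t_",
]

for line in split_multiword(SENTENCE, token_id=2, offset=3):
    print(line)
```

In this sketch the word after the split is renumbered from 3 to 4, and any HEAD that pointed to it is updated to match, so the graph can be redrawn from the returned rows.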
Integration with external tools
- Support for a simple API to communicate with the server. In the simplest case, there is a configuration window where the user can specify the URL of a server that handles HTTP POST/GET requests to save and load. The server can be declared to handle only CONLL-U; in that case, if the user attempts to save in CG3 mode, the content must first pass CG3 -> CONLL-U validation.
- Support for external tokenisation tools. If the imported text is raw text, the tool can call another API that returns the tokenised text.
- Support for external morphological analysis tools. Given raw text, an additional button can call a URL (set in the configuration window) and obtain the results of morphological analysis in CG3 (with multiple options per word).
- Support for external dependency parsers. This can be useful when low-quality parsers already exist that can help pre-annotate the current sentence.
- Sample lightweight servers that wrap commonly used tools can be included in the main GitHub repo.
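The CG3 -> CONLL-U check that guards a save to a CONLL-U-only server could look roughly like the sketch below. It assumes the usual CG3 stream layout, where a cohort line such as `"<dogs>"` is followed by reading lines indented with a single tab, and deeper indentation marks sub-readings. The function names are hypothetical and only the "more than one reading per word" case is checked.

```python
import re

COHORT = re.compile(r'^"<(.*)>"')


def ambiguous_cohorts(cg3_text):
    """Return the word forms that still carry more than one reading."""
    cohorts = []  # each entry is [word form, number of top-level readings]
    for line in cg3_text.splitlines():
        match = COHORT.match(line)
        if match:
            cohorts.append([match.group(1), 0])
        elif cohorts and line.startswith('\t"'):
            # One leading tab: a top-level reading of the latest cohort.
            cohorts[-1][1] += 1
    return [form for form, readings in cohorts if readings > 1]


def can_convert_to_conllu(cg3_text):
    """CONLL-U holds one analysis per word, so ambiguity blocks the switch."""
    return not ambiguous_cohorts(cg3_text)


SAMPLE = (
    '"<dogs>"\n'
    '\t"dog" n pl\n'
    '\t"dog" vblex pri p3 sg\n'
    '"<bark>"\n'
    '\t"bark" vblex pri p3 pl\n'
)

print(ambiguous_cohorts(SAMPLE))
print(can_convert_to_conllu(SAMPLE))
```

Returning the offending word forms, rather than a bare yes/no, would let the interface point the annotator at the words that still need disambiguation before the save is retried.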
Advanced version
- Text boxes with rich IDE-like editors that support autocomplete.
- Each of the 3 formats has its own "language" with format-specific keywords.
- Additionally, it is possible to specify language-specific CONLL-U keywords.
- Errors in the annotations (typos in keywords, etc.) are highlighted, similar to spell checking in word processors.
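As a rough illustration of the keyword checking above, the sketch below flags unknown values in one CONLL-U column and proposes the closest known keyword. The function name `find_unknown_keywords` and its parameters are hypothetical. The example checks the UPOS column against the 17 universal part-of-speech tags; language-specific keywords could be supported by passing a larger `allowed` set.

```python
import difflib

# The 17 universal part-of-speech tags defined by Universal Dependencies.
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def find_unknown_keywords(conllu_text, column, allowed):
    """Return (line number, value, suggestion) for each unknown keyword."""
    problems = []
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments and blank lines
        value = cols[column]
        if value != "_" and value not in allowed:
            close = difflib.get_close_matches(value, sorted(allowed), n=1)
            problems.append((line_no, value, close[0] if close else None))
    return problems


SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNUON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

# Column index 3 is UPOS (columns are counted from zero).
for line_no, value, suggestion in find_unknown_keywords(SAMPLE, 3, UPOS):
    print(f"line {line_no}: unknown keyword {value!r}, did you mean {suggestion!r}?")
```

An editor could use the returned line numbers to underline the offending values and offer the suggestion as a quick fix.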
I have been doing web development for 3 years now. Recently I worked on an open source tool for corpus management (currently used for Armenian only). It involved coding in HTML, JS and Python.