User:Mary.szmary/proposal2017
Latest revision as of 22:04, 2 April 2017
Contact information
Name: Maria Sheyanova
E-mail: masha.shejanova@gmail.com
IRC: maryszmary
SourceForge: maryszmary
Phone number: +79169223114
Timezone: UTC+3
Why is it that you are interested in the Apertium project?
I participated in GSoC 2016 with Apertium, which got me involved in the project; this is one of the main reasons why I am interested in contributing to Apertium. Another reason is that, as a linguist, I find it worthwhile to develop linguistic tools, and Apertium gives me a good opportunity to do so.
Which of the published tasks are you interested in? What do you plan to do?
I am planning to work on UD-annotatrix. This will include building a user-friendly interface that enables linguists to produce syntactic annotation quickly and easily.
Reasons why Google and Apertium should sponsor it
A dependency treebank is a corpus of sentences annotated with dependency structure. It can be used both for linguistic research and for training a statistical parser, which in turn can serve various natural language processing purposes. Creating a good treebank requires manual annotation and/or disambiguation.
Currently there is an interface for syntactic annotation called brat (http://brat.nlplab.org/). However, this interface has a number of issues. Firstly, it does not allow the user to edit the source. Secondly, it does not allow editing the tokenisation. More generally, it lacks many features that could be very useful for annotation. Finally, it requires a web server in order to be used by a team of annotators.
There is also a project (https://github.com/jonorthwash/ud-annotatrix) that aims to provide a toolkit for working with dependency trees in Apertium. At the moment, it allows the user to visualise the trees. The aim of my project is to create an easy-to-use, quick and interactive interface for UD annotation based on the existing Apertium project. The tool should work both online and offline and allow the user to edit the annotation in both graphical and text modes.
Apart from serving the broad interests of the linguistic community, the treebanks created with the help of this tool can be used by the Apertium machine translation platform itself. For example, the standardised annotated corpora available through the Universal Dependencies project could be immensely useful for training Apertium's PoS taggers. Although the taggers do not need the full tree structure, treebanks are the richest source of standardised annotation that Apertium is able to make use of.
Moreover, as dependency annotation is a more general-purpose activity, many more people would be willing to get involved in it. Apertium is therefore more likely to receive the data it needs as a by-product than by trying to get people to do Apertium-specific annotation.
Another possible application is extracting CG rules from dependency annotation. No tool for this exists yet, but there is a project aiming to create one.
Finally, a functional and easy-to-use tool for dependency annotation could benefit Apertium by attracting the attention of another linguistic community, one that is also concerned with the problems of minority languages and with creating free/open-source linguistic resources.
A description of how and who it will benefit in society
The result of this work is going to be useful for linguists who deal with dependency annotation.
Previous work
Apertium has a web interface (http://jonorthwash.github.io/visualise.html), written in JavaScript and HTML, for visualising syntactic trees. The interface works with three annotation formats, namely CoNLL-U, CG-3 and SD. The current system allows a user either to enter their trees in the text area or to upload a treebank from a file, and to switch between sentences.
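To make the CoNLL-U format mentioned above more concrete, here is a minimal sketch of reading one sentence into token records. It is written in Python purely for illustration (the existing interface is written in JavaScript). The function name `parse_conllu_sentence`, the field names and the sample sentence are hypothetical; only the ten tab-separated columns come from the CoNLL-U format itself. The sketch keeps IDs as strings and gives no special treatment to multiword tokens or empty nodes.

```python
from typing import Dict, List

# The ten tab-separated CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Parse one CoNLL-U sentence into a list of token dictionaries."""
    tokens = []
    for line in block.splitlines():
        line = line.strip()
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    return tokens


# Hypothetical sample sentence, used only to exercise the parser.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])

for token in parse_conllu_sentence(SAMPLE):
    print(token["id"], token["form"], token["head"], token["deprel"])
```

A structure like this (one record per token, with `head` and `deprel` as editable fields) is one possible basis for both drawing the tree and writing it back out as text.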
Project description
The main idea of this project is to create a graphical interface for UD annotation based on the existing project. (However, if it turns out to be hard to implement an interface with this functionality on top of what already exists, it can be written from scratch using an appropriate framework.) Here is a mockup of the interface.
It should be able to perform the following actions:
- allow editing the tree in graphical mode, namely:
  - edit relations
  - change relation tags
  - change POS tags
  - edit tokenisation
- allow editing the tree in text mode, including:
  - allow viewing trees in three annotation formats
  - provide keyboard shortcuts for switching between ambiguous analyses in CG-3 view
- synchronise graphical and text modes
- allow saving the current treebank on the server and give a link for sharing, as described in issue #17 (https://github.com/jonorthwash/ud-annotatrix/issues/17)
- display the difference between two (or more?) trees in the case of ambiguous analyses, as described in issue #11 (https://github.com/jonorthwash/ud-annotatrix/issues/11)
- tokenise new text input and create a skeleton for annotation, as described in issue #1 (https://github.com/jonorthwash/ud-annotatrix/issues/1)
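The last item in the list, tokenising new input and creating a skeleton for annotation, could look roughly as follows. This is only a sketch in Python for illustration, not what issue #1 specifies: the regular-expression tokeniser is a naive stand-in, and the names `tokenise` and `conllu_skeleton` are hypothetical. The output is a CoNLL-U sentence in which only the ID and FORM columns are filled in.

```python
import re
from typing import List


def tokenise(text: str) -> List[str]:
    """Naive tokeniser: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def conllu_skeleton(text: str) -> str:
    """Build an unannotated CoNLL-U sentence for the given raw text."""
    lines = [f"# text = {text}"]
    for index, form in enumerate(tokenise(text), start=1):
        # ID and FORM are known; the other eight columns stay "_" until annotated.
        lines.append("\t".join([str(index), form] + ["_"] * 8))
    return "\n".join(lines) + "\n"


print(conllu_skeleton("Dogs bark."))
```

The annotator would then fill in the heads, relation tags and POS tags in either the graphical or the text mode, and could correct the tokenisation where the naive splitting is wrong.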
The main page consists of the following components:
- The graphical tree area. If a tree is too big, the user will be offered the following options:
  - zoom (by default)
  - horizontal scrolling
  - vertical alignment and vertical scrolling
- The text area (with options to view the tree in CoNLL-U, CG-3 and SD formats)
- Buttons for:
  - uploading a treebank
  - saving changes, undo, redo and so on (also available via keyboard shortcuts)
  - saving the treebank on the server
  - exporting the treebank (in CoNLL-U and CG-3)
- A bar below for switching between sentences. It will allow both going back and forward and choosing a sentence by number.
If there are too many functions, some buttons will be replaced with a toolbar.
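One possible way to back the undo and redo buttons listed above is a snapshot history over the text of the current sentence, which both the graphical area and the text area could read from and write to. The sketch below is in Python for illustration only and is an assumption about the design, not a description of the existing code; the class name `EditHistory` and its methods are hypothetical.

```python
from typing import List


class EditHistory:
    """Snapshot-based undo/redo over the text of the current sentence."""

    def __init__(self, initial: str) -> None:
        self._undo: List[str] = []   # states that precede the current one
        self._redo: List[str] = []   # undone states that can be restored
        self.current = initial

    def apply(self, new_state: str) -> None:
        """Record an edit coming from either the graphical or the text view."""
        if new_state == self.current:
            return
        self._undo.append(self.current)
        self.current = new_state
        self._redo.clear()           # a fresh edit invalidates the redo branch

    def undo(self) -> str:
        """Step back one edit; a no-op if there is nothing to undo."""
        if self._undo:
            self._redo.append(self.current)
            self.current = self._undo.pop()
        return self.current

    def redo(self) -> str:
        """Restore the most recently undone edit; a no-op if there is none."""
        if self._redo:
            self._undo.append(self.current)
            self.current = self._redo.pop()
        return self.current


history = EditHistory("version 0")
history.apply("version 1")
history.apply("version 2")
print(history.undo())
print(history.redo())
```

Keeping whole snapshots is simple and is probably cheap enough for single sentences; storing individual edit operations instead would be an alternative if memory became a concern.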
Work plan
Overview
post application period
- Understanding the architecture of the existing project
- Improving my knowledge of JavaScript
community bonding period
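- Closer examination and evaluation of the tools that can be used, e.g.:
  - cytoscape (https://github.com/cytoscape/cytoscape.js-edgehandles), a nice tool for making interactive graphs
  - d3js (https://d3js.org/), another nice tool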
- Thinking more about the architecture of the app
work period
- 1st month: developing the basic architecture of the interface
- 2nd month: working on increasing the usability and efficiency of the tool
- 3rd month: working on additional features, documentation
Schedule
week 1: coding the skeleton of the project
weeks 2-3: working on the graphical editor area
week 4: finishing the basic interface
Deliverable #1: an interface which allows editing trees and saving changes
week 5: working on keyboard shortcuts, optimising and increasing usability of the toolbar
week 6: developing the interface for tree comparison and disambiguation
weeks 7-8: working on the saving and sharing part of the service
27 June: midterm evaluations deadline
Deliverable #2
weeks 9-10: extending the functionality
week 11: testing, fixing bugs
week 12: cleaning up the code, last fixes, writing documentation
Project completed: a user-friendly interactive annotation interface is ready to use
List your skills and give evidence of your qualifications
I'm a 4th-year bachelor's student at the Linguistic Faculty of NRU HSE (Russia).
Programming skills: Python (including Flask, Django, SQLite, Elasticsearch, nltk), Bash, R, JavaScript (not very experienced, but ready to learn quickly).
Other computer skills: HTML, XML, CSS, JSON, deploying Python projects on an Ubuntu server
Experience: I have already worked on a couple of projects which included making a web interface. For example, I have written a web interface for a unified online dictionary of antonyms. The project is written in Flask (Python), HTML and CSS and uses an SQLite database. I also have another Flask project, for making online linguistic exercises, which also uses SQLite. Apart from these projects, I am currently working on a larger team project, which is to make a universal linguistic corpus platform. The frameworks used in this project include Elasticsearch (NoSQL db) and Django. We are also going to use JavaScript in it, so by the end of spring I will be much more experienced with the language.
Natural languages: Russian (native), Polish, English, German, basic knowledge of Indonesian.
Coding challenge: I have fixed issue #18 on the project's GitHub.
List any non-Summer-of-Code plans you have for the Summer
At the beginning of June I will be finishing a project aimed at building a universal corpus platform, but as most of the work on that project will be done earlier, I do not think that it will significantly affect my ability to work on the GSoC project. After that, I have no other non-GSoC plans.