Difference between revisions of "User:Mihirrege/GSOC 2013 Application - Interface for creating tagged corpora"

From Apertium
Jump to navigation Jump to search
Line 21: Line 21:
   
 
== Which of the published tasks are you interested in? What do you plan to do? ==
 
== Which of the published tasks are you interested in? What do you plan to do? ==
  +
There are currently three major interfaces:
  +
a) Manual disambiguator
  +
b) .prob evaluator
  +
c) .tsx file editor
  +
I have put up all the mockups together at http://imgur.com/a/4uk4q#r5Ur8jT and have also put links in separate sections.
   
  +
a) Manual disambiguator
  +
Mockup: http://i.imgur.com/r5Ur8jT.png
  +
Functions:
  +
Jump to next ambiguous lexical unit or adjacent lexical-unit using the keyboard or mouse.
  +
A quick-view bound to a key, to hide the tags and show the raw text
  +
If the .tsx file is provided, information like the coarse tags, forbid, enforce rules applicable can also be displayed.
  +
Show statistics of disambiguation
  +
Compile and apply constraint grammar rules to the buffer
  +
List the applied constraint grammar rules
  +
Train and test the tagger (a prompt will ask the part of the corpus to be used as testing data).
  +
Train the tagger and export the .prob file
  +
Save progress ( this will save the corpus and also create a project description file which will keep track of the morphological analyser, .tsx files used, so that it is easier to resume tagging)
  +
The interface will be keyboard centric, though it will be equally functional with a mouse.
  +
Default keymaps will be provided and the bindings can be changed to suit the user
  +
For example
  +
[P] - <previous-ambiguous>
  +
[N] - <next-ambiguous>
  +
[F] - <forward-word>
  +
[B] - <back-word>
  +
[1], [2],[3],[4] for choosing the correct lexical form.
  +
Evaluating the tagger
  +
Functions
  +
The trained tagger can be evaluated immediately by having an option of setting aside x% of the corpus as testing data.
  +
Else, it can be evaluated using the .prob evaluator using an unrelated corpus.
  +
Loading the corpus
  +
The available options are:
  +
Load a raw-text file, morphological analyser and .tsx file (optional)
  +
Continue on an existing project
  +
Pull a wiki-dump and use it as the corpus [ http://i.imgur.com/F9OXMs4.png ]
   
  +
b) .prob evaluator
  +
Mockup: http://i.imgur.com/fIo6rV9.png
  +
Functions
  +
Input the .prob file , the manually disambiguated corpus along with morphologically analysed corpus or the morphological analyser for the language.
  +
Evaluate the .prob file and display statistics about tagger accuracy
  +
Generate a log file, which will basically be the diff between the provided tagged corpus and the corpus disambiguated by the tagger, making it easier to frame new sentences to add to the corpus, so as to give more context to the tagger
  +
c) .tsx file editor
  +
Mockup:
  +
TSX Viewer
  +
http://i.imgur.com/pVdsIem.png
  +
Templates
  +
http://i.imgur.com/hFGIQHR.png
  +
Functions:
  +
Add new tags
  +
categories
  +
multi-categories
  +
forbid
  +
enforce
  +
prefer
  +
Templates for adding new tags
  +
Change the order of the tags (as more specific categories must be defined before more general ones) within the same parent tag. The nodes in the xml viewer can also be made draggable within the same parent node to make it easier to change the order
  +
Search within tags for faster navigation.
  +
Validate the tagger definition
  +
Editor features like syntax highlighting , auto-indentation and tag completion for manual editing in the Node Contents textview for complex in-place editing.
   
 
== Work plan ==
 
== Work plan ==

Revision as of 19:40, 1 May 2013

Name

Mihir Rege

Contact information

E-mail: mihirrege@gmail.com

Why are you interested in machine translation?

Why are you interested in the Apertium project?

Why Google and Apertium should sponsor it?

How and who it will benefit in society?

Which of the published tasks are you interested in? What do you plan to do?

There are currently three major interfaces: a) Manual disambiguator b) .prob evaluator c) .tsx file editor I have put up all the mockups together at http://imgur.com/a/4uk4q#r5Ur8jT and have also put links in separate sections.

a) Manual disambiguator Mockup: http://i.imgur.com/r5Ur8jT.png Functions: Jump to next ambiguous lexical unit or adjacent lexical-unit using the keyboard or mouse. A quick-view bound to a key, to hide the tags and show the raw text If the .tsx file is provided, information like the coarse tags, forbid, enforce rules applicable can also be displayed. Show statistics of disambiguation Compile and apply constraint grammar rules to the buffer List the applied constraint grammar rules Train and test the tagger (a prompt will ask the part of the corpus to be used as testing data). Train the tagger and export the .prob file Save progress ( this will save the corpus and also create a project description file which will keep track of the morphological analyser, .tsx files used, so that it is easier to resume tagging) The interface will be keyboard centric, though it will be equally functional with a mouse. Default keymaps will be provided and the bindings can be changed to suit the user For example [P] - <previous-ambiguous> [N] - <next-ambiguous> [F] - <forward-word> [B] - <back-word> [1], [2],[3],[4] for choosing the correct lexical form. Evaluating the tagger Functions The trained tagger can be evaluated immediately by having an option of setting aside x% of the corpus as testing data. Else, it can be evaluated using the .prob evaluator using an unrelated corpus. Loading the corpus The available options are: Load a raw-text file, morphological analyser and .tsx file (optional) Continue on an existing project Pull a wiki-dump and use it as the corpus [ http://i.imgur.com/F9OXMs4.png ]

b) .prob evaluator Mockup: http://i.imgur.com/fIo6rV9.png Functions Input the .prob file , the manually disambiguated corpus along with morphologically analysed corpus or the morphological analyser for the language. Evaluate the .prob file and display statistics about tagger accuracy Generate a log file, which will basically be the diff between the provided tagged corpus and the corpus disambiguated by the tagger, making it easier to frame new sentences to add to the corpus, so as to give more context to the tagger c) .tsx file editor Mockup: TSX Viewer http://i.imgur.com/pVdsIem.png Templates http://i.imgur.com/hFGIQHR.png Functions: Add new tags categories multi-categories forbid enforce prefer Templates for adding new tags Change the order of the tags (as more specific categories must be defined before more general ones) within the same parent tag. The nodes in the xml viewer can also be made draggable within the same parent node to make it easier to change the order Search within tags for faster navigation. Validate the tagger definition Editor features like syntax highlighting , auto-indentation and tag completion for manual editing in the Node Contents textview for complex in-place editing.

Work plan

Coding challenge

Community Bonding Period

Week Plan

Week Plan
Week 01
Week 02
Week 03
Week 04
Deliverable #1
Week 05
Week 06
Week 07
Week 08
Deliverable #2
Week 09
Week 10
Week 11
Week 12
Deliverable #3

List your skills and give evidence of your qualifications

My non-Summer-of-Code plans for the Summer