User:Tuxskar/Application for "Interface for creating tagged corpora" GSOC 2013


Contact information

Name: Oscar Ramirez Jimenez

Email: tuxskar@gmail.com

IRC: tuxskar

Github repo: http://github.com/tuxskar

Tasks and proposed ideas

Interface for creating tagged corpora

Related tasks: Interface for creating tagged corpora

  • Making an interface where you can load a raw text of, say, 30,000 words, or (optionally) create a corpus of X size for a given language from Wikipedia
  • The interface should be able to take a non-disambiguated tagged corpus and let you disambiguate it manually
  • A user-friendly interface to train a supervised tagger

Proposed idea

To create a tagged corpus there are 2 interfaces: the first one to enter a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions).

The first interface is really simple. It is the Input corpus file UI:

Input corpus file UI

How it works:

  • You can select a corpus file and load its content into the text view
  • You can modify the loaded corpus (or directly enter a new corpus from scratch)
  • Once you finish editing, you click on the apply button and the corpus tagger UI appears


Here you can see the 2 versions of the supervised corpus tagger UI:

  • First version: you have the untagged corpus and go word by word, choosing the correct word from the options on the right. Once you click the right one, the other options are disabled.
Supervised corpus tagger UI V1
  • Second version (proposed by Fran): the left text view shows the tagged corpus to be disambiguated, with each ambiguous word highlighted in colour so that you can choose the correct one (somehow, if possible, by clicking on it).

If you hover the mouse over a word, you get more information about its properties.

There are also 2 buttons to move the highlight to the next and previous words.

Supervised corpus tagger UI V2


The difference between them is that the first lets you load a corpus file and modify it in the text view using just the buttons, while the second lets you select the correct word directly in the text view.

Both designs have 2 buttons on the bottom right: one to finish the supervision and the other to launch the performance measure for .prob files. The label on the bottom left shows the status of the file, that is, how many words are still ambiguous.
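
The status label could be driven by a small helper like the one below. This is only a minimal sketch: it assumes the tagged corpus arrives in the Apertium stream format (`^surface/analysis1/analysis2$`), which this proposal does not state, it ignores escaped characters, and the names `parse_units` and `ambiguous_units` are hypothetical.

```python
import re

# Assumption: each lexical unit looks like ^surface/analysis1/analysis2$.
# Escaped characters (for example "\/") are not handled in this sketch.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def parse_units(text):
    """Return a list of (surface, [analyses]) pairs found in text."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface = match.group(1)
        analyses = [a for a in match.group(2).split("/") if a]
        units.append((surface, analyses))
    return units


def ambiguous_units(units):
    """Keep only the units that still have more than one analysis."""
    return [unit for unit in units if len(unit[1]) > 1]


sample = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
units = parse_units(sample)
remaining = ambiguous_units(units)
print("%d of %d words still ambiguous" % (len(remaining), len(units)))
```

The same list of ambiguous units could also tell the text view which words to highlight.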


Interface for measuring the performance of a .prob file

Related task:

  • Also, some way to evaluate performance of a .prob file

Proposed idea

For the Performance measure .prob file UI:

Performance Measure prob UI.png

How it works

  • First of all, we load the base tagged corpus into the left text view (if you came from the previous window, it is already there)
  • Then you choose your .prob file
  • Somehow (maybe by manual selection or automatically) some part of the tagged corpus is selected to train the .prob file
  • Once everything is set up, you click on the "measure" button to measure the performance
  • When the measurement is done, the accuracy and other information about the performance appear on the bottom left
  • After the measurement, the output generated using the .prob file is also shown, so you can see where the differences are
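
As a rough illustration of the measurement step, accuracy can be taken as the fraction of tokens where the tagger output agrees with the hand-disambiguated corpus. The sketch below assumes both are already aligned lists with one tag per token; the function names and the sample tags are made up for the example and are not part of Apertium.

```python
def tagging_accuracy(reference, predicted):
    """Fraction of tokens whose predicted tag equals the hand-checked tag."""
    if len(reference) != len(predicted):
        raise ValueError("both sequences must have the same number of tokens")
    if not reference:
        return 0.0
    correct = sum(1 for ref, pred in zip(reference, predicted) if ref == pred)
    return correct / len(reference)


def differences(reference, predicted):
    """Positions where the tagger output disagrees with the reference."""
    pairs = enumerate(zip(reference, predicted))
    return [i for i, (ref, pred) in pairs if ref != pred]


# Hypothetical data: one coarse tag per token.
reference = ["det", "n", "vblex", "pr", "n"]
predicted = ["det", "vblex", "vblex", "pr", "n"]
print(tagging_accuracy(reference, predicted))  # 0.8
print(differences(reference, predicted))       # [1]
```

The list of differing positions is what the UI would need in order to show where the two outputs diverge.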

TSX management file UI

Related task:

  • It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)

Proposed idea

For the TSX management file UI:

TSX manager UI.png

How it works:

  • First you choose a TSX file
  • It shows all the tags in it, each with its name and its item (or items)
  • The second column says whether the tag is closed or not
  • You can also add new labels and items (to add a new item you have to select the parent label)
  • Once you finish editing, you click on the save button and the TSX file is updated
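
The listing and editing steps above might look roughly like this. The sketch assumes a TSX file is XML with `def-label` elements (carrying `name` and an optional `closed` attribute) that contain `tags-item` children; the embedded sample, the helper names and the added tag pattern are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed minimal TSX content, embedded so the sketch is self-contained.
SAMPLE_TSX = """<tagger name="example">
  <tagset>
    <def-label name="PREP" closed="true">
      <tags-item tags="pr"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
  </tagset>
</tagger>
"""


def read_labels(root):
    """Return (name, closed, [items]) rows, one per def-label."""
    rows = []
    for label in root.iter("def-label"):
        items = [item.get("tags") for item in label.findall("tags-item")]
        closed = label.get("closed") == "true"
        rows.append((label.get("name"), closed, items))
    return rows


def add_item(root, label_name, tags):
    """Add a tags-item under the selected parent label; False if not found."""
    for label in root.iter("def-label"):
        if label.get("name") == label_name:
            ET.SubElement(label, "tags-item", {"tags": tags})
            return True
    return False


root = ET.fromstring(SAMPLE_TSX)
add_item(root, "NOUN", "np.*")
for name, closed, items in read_labels(root):
    print(name, closed, items)
```

Saving would then be a matter of writing the modified tree back to the chosen file.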



Constraint grammar rule manager UI

Related task:

  • Including a way to incorporate constraint grammar rules would also be nice.

Proposed idea

For the Constraint grammar rule manager UI:

Constraint grammar UI.png

How it works:

  • On the left side there is a tagged corpus to be disambiguated using some rules
  • In the centre there is a text view to enter new rules
  • On the right side there is a button to execute the rules already written in the rules text view
  • Once you click on the execute rules button, it checks the tags that match the definition and updates them, highlighting them with colours
  • On the bottom left there is a status label that shows how many ambiguous words are still in the corpus
  • Once you finish the new rules, you click on the finish button and the new rules are compiled into the grammar.bin file
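
To make the "execute rules" step more concrete, here is a toy sketch of applying a single removal rule and collecting the positions that changed (the ones the UI would highlight). It is a simplified stand-in and not real constraint grammar syntax or semantics: the rule is reduced to "drop readings containing one tag when the previous word is unambiguous and contains another tag", and every name in it is hypothetical.

```python
def apply_remove_rule(corpus, target_tag, previous_tag):
    """Apply one toy rule in place and return the indices of changed words.

    corpus is a list of (surface, [readings]); each reading is a tag string.
    """
    changed = []
    for i in range(1, len(corpus)):
        prev_readings = corpus[i - 1][1]
        surface, readings = corpus[i]
        # The context holds only if the previous word is unambiguous
        # and its single reading carries previous_tag.
        if len(prev_readings) != 1 or previous_tag not in prev_readings[0]:
            continue
        kept = [r for r in readings if target_tag not in r]
        # Never remove the last remaining reading of a word.
        if kept and len(kept) < len(readings):
            corpus[i] = (surface, kept)
            changed.append(i)
    return changed


corpus = [
    ("the", ["the<det><def><sp>"]),
    ("books", ["book<n><pl>", "book<vblex><pri><p3><sg>"]),
]
changed = apply_remove_rule(corpus, "<vblex>", "<det>")
print(changed)    # [1]
print(corpus[1])  # ('books', ['book<n><pl>'])
```

In the real interface the rules would be written in the constraint grammar formalism and run by the existing tools; the sketch only shows how the changed positions could feed the highlighting and the status label.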

For the last task, "A way to take into account automatically new multiwords / different tokenisation", I still don't know how to do it.


Timetable and schedule

During the GSOC period

There are 4 main interfaces to be implemented. Here is my estimate of the hours needed to complete each UI:

Interface                                | Coding hours | TDD hours
Tagger Corpus UI (and Input corpus UI)   | 90-100       | 30-40
TSX manager UI                           | 30-40        | 20-30
Performance Measure .prob UI             | 70-80        | 40-50
Constraint Grammar UI                    | 50-60        | 20-30

For every interface I give a minimum and a maximum estimate so that there are no surprises with the time. With the minimum estimates the total is 350h (32h/week), and in the worst case I estimate 430h (39h/week). I will work more if I need to, but I think I have estimated more hours than I will actually need to complete the work; any spare time can always go into improving the documentation.
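
The totals can be checked with a few lines of Python. The per-interface ranges are taken from the table above; the figure of 11 working weeks is my assumption (12 GSOC weeks minus the exam week), chosen because it reproduces the stated weekly averages.

```python
# (minimum, maximum) hours per interface, copied from the estimate table.
ESTIMATES = {
    "Tagger Corpus UI (and Input corpus UI)": {"coding": (90, 100), "tdd": (30, 40)},
    "TSX manager UI": {"coding": (30, 40), "tdd": (20, 30)},
    "Performance Measure .prob UI": {"coding": (70, 80), "tdd": (40, 50)},
    "Constraint Grammar UI": {"coding": (50, 60), "tdd": (20, 30)},
}

# Assumption: 12 GSOC weeks minus the first week, which is lost to exams.
WORKING_WEEKS = 11


def total_hours(bound):
    """Sum coding and TDD hours; bound 0 is the minimum, 1 the maximum."""
    return sum(
        hours["coding"][bound] + hours["tdd"][bound]
        for hours in ESTIMATES.values()
    )


best_case = total_hours(0)
worst_case = total_hours(1)
print(best_case, round(best_case / WORKING_WEEKS))    # 350 32
print(worst_case, round(worst_case / WORKING_WEEKS))  # 430 39
```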

I have a problem with the first week because I'm still in the final exam period until the 26th of June. As I have estimated, I can work less in the first period and still finish everything on time.

There are 3 periods of 4 weeks each. The workload for the first period (weeks 25 to 28 of the year) is 105h, and for each of the other two it is 140h.

TDD = Test/Debugging/Documentation

GSOC week | Week of the year | Tasks
1         | 25th week        | I'm still with final exams
2         | 26th week        | Constraint Grammar UI, coding 35/50h
3         | 27th week        | Constraint Grammar UI, coding 15/50h + TDD 20h
4         | 28th week        | TSX file manager UI, coding 30h
First Deliverable
5         | 29th week        | TSX UI, TDD 20h; Performance Measure Prob UI, coding 15/70h
6         | 30th week        | Performance Measure Prob UI, coding 35/70h
7         | 31st week        | Performance Measure Prob UI, coding 20/70h + TDD 15/40h
8         | 32nd week        | Performance Measure Prob UI, TDD 25/40h
Second Deliverable
9         | 33rd week        | Tagged Corpus UI, coding 35/90h
10        | 34th week        | Tagged Corpus UI, coding 35/90h
11        | 35th week        | Tagged Corpus UI, coding 20/90h + TDD 15/30h
12        | 36th week        | Tagged Corpus UI, TDD 15/30h + final revisions 15h
Finalization

Before gsoc

From 27 May to 17 June I'll be bonding with the Apertium community.

About the work before the gsoc:

  • I would like to have every part completely understood and the technical side ready, so that I can start working directly on the 27th of June (once I have finished all my final exams)
  • As I have never used a text view, I have to learn how to manage the colours (with tags, from what I have read) and how to select words by clicking
  • I would also like to have the Glade files for the interfaces finished before GSOC, so we have to discuss the interfaces before summer

Following this schedule I think we have enough time to finish, even a week or so early in the worst case :-)

Coding Challenge

I have finished the Coding Challenge (and I have sent the "PCRE error" to the mailing list). I'll upload some documentation about it to my GitHub repo, such as the final outputs or the final .prob file from the unsupervised trainer, to "probe" that I have done my homework :P

Bio

My name is Oscar Ramirez (tuxskar on IRC). I'm a 5th year student in Computer Science Engineering at Malaga University (Escuela Tecnica Superior de Ingeniería Informática de Málaga). I finished my bachelor's degree in Computer Science (oriented to systems) almost 2 years ago.

I have worked on a few open source projects using C, Python, Django, PyGTK, Android and Arduino, but the most relevant here is the latest one: I'm developing a lawyer's office manager in PyGTK (the code is on GitHub and also on PyPI).

This is the first year I have applied for a place in GSOC, and I'm really excited to work with Apertium because I had never wondered how to create a good translator, and now I have the opportunity to contribute to building a great tool for it :D