Difference between revisions of "User:Tuxskar/Application for "Interface for creating tagged corpora" GSOC 2013"

Revision as of 16:15, 18 April 2013

Contact information

Name: Oscar Ramirez Jimenez

Email: tuxskar@gmail.com

IRC: tuxskar

Github repo: http://github.com/tuxskar

Tasks and proposed ideas

Interface for creating tagged corpora

Related tasks: Interface for creating tagged corpora

Making an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of X size for a given language from Wikipedia
The interface should be able to take a non-disambiguated tagged corpus and let the user disambiguate it manually
And a user-friendly interface to train a supervised tagger

Proposed idea

Creating a tagged corpus involves 2 interfaces: the first one to load a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions).

The first interface is really simple. It is the Input corpus file UI:

How it works:

You can select a corpus file and load its content into the text view
You can load a corpus that is already tagged
You can load a compressed Wikipedia dump file
You can modify the current corpus (or directly enter a new corpus from scratch)
Once you finish editing, you click on the Apply button and the corpus tagger UI appears

After you insert a file and click Apply, the corpus is tagged (if it isn't already) and the supervised corpus tagger UI is shown:

How it works:

It shows the tags only for the ambiguous words
For each ambiguous word it shows the options you can choose from (using a shortcut or the mouse)
You can go to the next or previous word by clicking the buttons on the left or with a shortcut
You can see the TSX file by clicking the "check the TSX file" button, which opens the TSX file manager UI with the TSX file found in the same directory
You can check the performance of the .prob file you are working on by clicking the "Evaluate .prob file" button
The bottom left part shows the corpus tagger status: how many ambiguous words the file still has
Once you finish, you can click the Finish button and the .prob file will be updated with this disambiguated corpus
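
The status counter and the "show only the ambiguous words" behaviour described above could be sketched as follows. This is only a minimal illustration, assuming the corpus is available in the Apertium stream format (`^surface/analysis1/analysis2$`) and ignoring escaped characters; the names `parse_units` and `ambiguous_units` are hypothetical and not part of any existing tool.

```python
import re

# Matches one lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: no escaped "/" or "$" inside the unit)
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def parse_units(text):
    """Return a list of (surface_form, [analyses]) pairs found in text."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface = match.group(1)
        analyses = match.group(2).split("/")
        units.append((surface, analyses))
    return units


def ambiguous_units(units):
    """Keep only the units that still have more than one analysis."""
    return [(surface, analyses) for surface, analyses in units if len(analyses) > 1]


sample = (
    "^the/the<det><def><sp>$ "
    "^book/book<n><sg>/book<vblex><inf>$ "
    "^is/be<vbser><pri><p3><sg>$"
)
units = parse_units(sample)
pending = ambiguous_units(units)
# Text that the status label could display
print("%d of %d words still ambiguous" % (len(pending), len(units)))
```

Each time the user picks one analysis for a word, the UI would replace that unit's list with the single chosen analysis and refresh the counter.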

Interface for measuring the performance of a .prob file

Related task:

Also, some way to evaluate performance of a .prob file

Proposed idea

For the Performance measure .prob file UI:

How it works

First of all, we load the base tagged corpus into the left text view (if you came from the previous window it is already there)
Now you choose your .prob file
Somehow (maybe by manual selection or automatically) some part of the tagged corpus is selected to train the .prob file
Once everything is set up, you click on the "measure" button to measure the performance
When the measurement is done, the accuracy and the information about the performance are shown on the bottom left
After the measurement it also shows the output generated using the .prob file, so you can see where the differences are
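
The accuracy figure and the list of differences mentioned above could be computed along these lines. This is a sketch under the assumption that both the hand-tagged corpus and the tagger output have already been reduced to lists of (word, tag) pairs of equal length; `tagging_accuracy` is a hypothetical helper, not an existing Apertium function.

```python
def tagging_accuracy(reference, predicted):
    """Compare two equally long lists of (word, tag) pairs.

    Returns the fraction of matching tags and the list of
    (position, word, expected_tag, predicted_tag) mismatches.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted corpora must have the same length")
    mismatches = []
    for position, ((word, expected), (_, got)) in enumerate(zip(reference, predicted)):
        if expected != got:
            mismatches.append((position, word, expected, got))
    total = len(reference)
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


# Tiny made-up example: the tagger gets "book" wrong
reference = [("the", "det"), ("book", "n"), ("is", "vbser"), ("open", "adj")]
predicted = [("the", "det"), ("book", "vblex"), ("is", "vbser"), ("open", "adj")]
accuracy, mismatches = tagging_accuracy(reference, predicted)
print("accuracy: %.2f" % accuracy)
for position, word, expected, got in mismatches:
    print("word %d (%s): expected %s, got %s" % (position, word, expected, got))
```

The mismatch positions are what the UI would use to highlight the differences between the reference corpus and the generated output.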

TSX management file UI

Related task:

It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)

Proposed idea

For the TSX management file UI:

How it works:

First you choose a TSX file (or, if you came from the tagger UI, it is already loaded)
It shows all the categories of the file and the forbid, enforce and prefer rules
The second column says whether the category is closed or not
You can also add new labels and items by entering them in the text view on the right
Once you have added new categories or rules in the text view, you click on the "Apply" button: they are inserted into the TSX file, the tree view is updated and the text view is cleared
Once you finish editing, you click on the Save button and the TSX file is updated if there are any changes not yet applied
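
Filling the category list and its "closed" column could look roughly like this. It is a sketch that assumes the usual TSX layout, where each category is a `def-label` element with a `name` attribute, an optional `closed="true"` attribute and `tags-item` children; the sample file content and the `list_categories` helper are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened TSX content used only for this illustration
SAMPLE_TSX = """
<tagger name="example">
  <tagset>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
  </tagset>
</tagger>
"""


def list_categories(tsx_text):
    """Return (name, is_closed, [tag patterns]) for every def-label element."""
    root = ET.fromstring(tsx_text.strip())
    categories = []
    for label in root.iter("def-label"):
        patterns = [item.get("tags") for item in label.findall("tags-item")]
        is_closed = label.get("closed") == "true"
        categories.append((label.get("name"), is_closed, patterns))
    return categories


for name, is_closed, patterns in list_categories(SAMPLE_TSX):
    print(name, "closed" if is_closed else "open", patterns)
```

The returned tuples map directly onto tree view rows: the first column is the category name and the second one the closed flag.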

Constraint grammar rule manager UI

Related task:

Including a way to incorporate constraint grammar rules would also be nice.

Proposed idea

For the Constraint grammar rule manager UI:

How it works:

On the left side there is a tagged corpus to be disambiguated using some rules
In the center there is a text view to enter new rules
On the right side there is a button to execute the rules already written in the rules text view
Once you click on the execute rules button, it checks the tags that match the rule definitions and updates them, highlighting them with colours
On the bottom left there is a status label showing how many ambiguous words are still in the corpus
Once you finish the new rules, click on the Finish button and the new rules are compiled into the gramar.bin file
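
To illustrate the "execute rules, then highlight what changed" step, here is a toy sketch. It does not use real constraint grammar syntax or any existing CG tool; it only shows one hard-coded kind of rule ("keep the readings with a given tag when the previous word carries another tag") and returns the positions the UI would highlight. All names are hypothetical.

```python
def apply_select_rule(sentence, target_tag, previous_tag):
    """Toy disambiguation rule, NOT real constraint grammar syntax.

    sentence is a list of (word, [readings]); every reading is a list of tags.
    For each ambiguous word whose previous word is unambiguous and carries
    previous_tag, keep only the readings that contain target_tag.
    Returns the new sentence and the positions that changed.
    """
    result = []
    changed = []
    for position, (word, readings) in enumerate(sentence):
        kept = readings
        if position > 0 and len(readings) > 1:
            previous_readings = result[position - 1][1]
            if len(previous_readings) == 1 and previous_tag in previous_readings[0]:
                selected = [r for r in readings if target_tag in r]
                # Only apply the rule if it really removes something
                if selected and len(selected) < len(readings):
                    kept = selected
                    changed.append(position)
        result.append((word, kept))
    return result, changed


sentence = [
    ("the", [["det", "def"]]),
    ("book", [["n", "sg"], ["vblex", "inf"]]),
]
disambiguated, changed = apply_select_rule(sentence, "n", "det")
still_ambiguous = sum(1 for _, readings in disambiguated if len(readings) > 1)
print("changed positions:", changed, "still ambiguous:", still_ambiguous)
```

In the real interface the rules would come from the rules text view and be run by the constraint grammar engine; only the idea of returning the changed positions for colouring carries over.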

For the last task, "A way to take into account automatically new multiwords / different tokenisation", I still don't know how to do it.

Timetable and schedule

On the GSOC period

There are 4 main interfaces to be implemented. Here is my estimate of the hours needed to complete each UI:

Interface	Coding hours	TDD hours
Input corpus files UI	30-40	included in Tagger Corpus UI TDD
Tagger Corpus UI (and Input corpus UI)	60-70	40-50
TSX manager UI	40-50	20-30
Performance Measure .prob UI	70-80	40-50
Constraint Grammar UI	50-60	30-40

For every interface I wrote 2 estimates of the hours needed, to avoid surprises with the time. With the minimum times the total is 380h (34h/week) and in the worst case I estimate 450h (41h/week). I will work more if I need to, but I think I have estimated more hours than I'll actually need to complete the work, and I can always use any spare time to improve the documentation.

I have a problem with the first week because I'm still in the final exam period until the 26th of June. As estimated, I can work less in the first period and still finish everything on time.

There are 3 periods of 4 weeks each. The work load for the first period (weeks 25 to 28 of the year) is 105h, and the other two have 140h each.
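
As a quick cross-check of these totals, the ranges from the table above can be summed with a few lines of Python. The 11 working weeks are an assumption (12 GSoC weeks minus the exam week). Summed this way, the lower bounds give the 380h quoted above, while the upper bounds as listed give 470h, slightly more than the 450h quoted, so one of the figures may need a small adjustment.

```python
# (interface, coding hours (min, max), TDD hours (min, max)) as listed in the table
ESTIMATES = [
    ("Input corpus files UI", (30, 40), (0, 0)),  # its TDD is counted with the Tagger Corpus UI
    ("Tagger Corpus UI", (60, 70), (40, 50)),
    ("TSX manager UI", (40, 50), (20, 30)),
    ("Performance Measure .prob UI", (70, 80), (40, 50)),
    ("Constraint Grammar UI", (50, 60), (30, 40)),
]
WORKING_WEEKS = 11  # assumption: 12 GSoC weeks minus the exam week

minimum = sum(coding[0] + tdd[0] for _, coding, tdd in ESTIMATES)
maximum = sum(coding[1] + tdd[1] for _, coding, tdd in ESTIMATES)
print("minimum: %dh (%.1fh/week)" % (minimum, minimum / WORKING_WEEKS))
print("maximum: %dh (%.1fh/week)" % (maximum, maximum / WORKING_WEEKS))
```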

TDD = Test/Debugging/Documentation

gsoc week	week of the year	tasks
1	25th week	I still have final exams
2	26th week	Input File UI coding 30h, Tagger Corpus UI coding 5/60h
3	27th week	Tagger Corpus UI coding 35/60h
4	28th week	Tagger Corpus UI coding 20/60h + TDD 15/40h
First Deliverable
5	29th week	Tagger Corpus UI TDD 25/40h, TSX file manager UI coding 10/40h
6	30th week	TSX file manager UI coding 30/40h + TDD 5/20h
7	31st week	TSX file manager UI TDD 15/20h, Constraint Grammar UI coding 20/50h
8	32nd week	Constraint Grammar UI coding 30/50h + TDD 5/30h
Second Deliverable
9	33rd week	Constraint Grammar UI TDD 25/30h, Performance measure .prob UI coding 10/70h
10	34th week	Performance measure .prob UI coding 35/70h
11	35th week	Performance measure .prob UI coding 25/70h + TDD 10/40h
12	36th week	Performance measure .prob UI TDD 30/40h, Final documentation 5h
Finalization

Before gsoc

From 27 May to 17 June I'll be bonding with the Apertium community

About the work before the gsoc:

I would like to have every part completely understood, and the technical side ready, so I can start working directly on the 27th of June (once I have finished all my final exams)
As I have never used a text view, I have to learn how to manage the colours (with tags, as I have read) and how to select words by clicking
I would also like to have the Glade files for the interfaces finished before GSoC, so we have to discuss the interfaces before summer

Following this schedule, I think we have enough time to finish, even a week or so early in the worst case :-)

Coding Challenge

I have finished the Coding Challenge (and I have sent the "PCRE error" to the mailing list). I'll upload some documentation about it to my GitHub repo, such as the final outputs or the final .prob from the unsupervised trainer, to "probe" I have done my homework :P

Bio

My name is Oscar Ramirez (tuxskar on IRC). I'm a 5th-year student in Computer Science Engineering at Malaga University (Escuela Tecnica Superior de Ingeniería Informática de Málaga). I finished my bachelor's degree in Computer Science (oriented to systems) almost 2 years ago.

I have worked on a few open source projects using C, Python, Django, PyGTK, Android and Arduino, but the most important here is the last one. I'm developing a lawyer's office manager in PyGTK (here is the code on github and also on PyPi)

This is the first year I have applied for a place in GSoC, and I'm really excited to work with Apertium because I had never thought about how to create a good translator, and now I have the opportunity to contribute to building a great tool for it :D
