Difference between revisions of "User:Tuxskar/Application for "Interface for creating tagged corpora" GSOC 2013"
Revision as of 16:15, 18 April 2013
Contact information
Name: Oscar Ramirez Jimenez
Email: tuxskar@gmail.com
IRC: tuxskar
Github repo: http://github.com/tuxskar
Tasks and proposed ideas
Interface for creating tagged corpora
Related tasks: Interface for creating tagged corpora
- Making an interface where you can load a raw text of, say, 30,000 words or (optional) create a corpus of X size for a given language from Wikipedia
- The interface should be able to take a non-disambiguated tagged corpus and be able to disambiguate it manually
- And, a user-friendly interface to train a supervised tagger
Proposed idea
To create a tagged corpus there are 2 interfaces: the first one to input a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions).
The first interface is really simple; it is the Input corpus file UI:
How it works:
- You can select a corpus file and load its content into the text view
- You can load a corpus that is already tagged
- You can load a compressed Wikipedia dump file
- You can modify the current corpus (or enter a new corpus from scratch)
- Once you finish editing, you click on the Apply button and the corpus tagger UI appears
After you load a file and click Apply, the corpus is tagged (if it isn't already) and the supervised corpus tagger UI is shown:
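The loading step could be sketched roughly as follows. This is only a minimal sketch: the function name `read_words` and the 30,000-word default are assumptions for illustration, and it does not strip the XML or wiki markup that a real Wikipedia dump contains.

```python
import bz2
from itertools import islice


def read_words(path, limit=30000):
    """Return up to `limit` whitespace-separated words from a plain or .bz2 file.

    Hypothetical helper: the real UI would also need to remove wiki markup.
    """
    # Pick the opener from the file extension; both accept text mode.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # Lazily split every line into words and stop once `limit` is reached,
        # so a large dump is never read completely into memory.
        words = (word for line in handle for word in line.split())
        return list(islice(words, limit))
```

Reading lazily means the text view only ever receives the part of the corpus that will actually be tagged.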
How it works:
- It shows the tags only for the ambiguous words
- For each ambiguous word it shows the multiple options that you can select (with a shortcut or the mouse)
- You can go to the next and previous word by clicking on the buttons on the left or with a shortcut as well
- You can view the TSX file by clicking on the button "check the TSX file", which opens the TSX file manager UI with the TSX file found in the same directory
- You can check the performance of the .prob file that you are working on just by clicking on the button "Evaluate .prob file"
- At the bottom left you have the corpus tagger status, showing how many ambiguous words the file still has
- Once you finish you can click on the finish button and the .prob file will be updated with this disambiguated corpus
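The status counter described above could be computed along these lines. It is a minimal sketch that assumes the corpus is in the Apertium stream format (`^surface/analysis1/analysis2$`) and contains no escaped `/`, `^` or `$` characters; the function names are hypothetical.

```python
import re

# One lexical unit: everything between "^" and the next "$".
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def parse_units(text):
    """Return a list of (surface form, list of analyses) pairs."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


def count_ambiguous(text):
    """Count the lexical units that still have more than one analysis."""
    return sum(1 for _, analyses in parse_units(text) if len(analyses) > 1)


sample = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
print(count_ambiguous(sample))  # prints 1
```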
Interface for measuring the performance of a .prob file
Related task:
- Also, some way to evaluate performance of a .prob file
Proposed idea
For the Performance measure .prob file UI:
How it works
- First of all, we load the reference tagged corpus into the left text view (if you came from the previous window it is already there)
- Now you choose your .prob file
- Somehow (maybe by manual selection or automatically) a part of the tagged corpus is selected to train the .prob file
- Once everything is set up, you click on the "measure" button to measure the performance
- When the measurement is done, the accuracy and other information about the performance appear at the bottom left
- After the measurement it also shows the output generated using the .prob file, so you can see where the differences are
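The measurement itself could be as simple as the sketch below, assuming the reference corpus and the tagger output have already been aligned into two equally long lists of chosen analyses. The function name `evaluate` and the sample data are made up for illustration.

```python
def evaluate(reference, predicted):
    """Return (accuracy, differences) for two aligned lists of analyses.

    `differences` holds (position, expected, obtained) triples so the UI
    can highlight where the tagger output diverges from the reference.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    differences = [
        (index, expected, obtained)
        for index, (expected, obtained) in enumerate(zip(reference, predicted))
        if expected != obtained
    ]
    accuracy = 1 - len(differences) / len(reference) if reference else 0.0
    return accuracy, differences


reference = ["the<det>", "book<n>", "be<vbser>", "old<adj>"]
predicted = ["the<det>", "book<vblex>", "be<vbser>", "old<adj>"]
accuracy, differences = evaluate(reference, predicted)
print(accuracy)     # prints 0.75
print(differences)  # prints [(1, 'book<n>', 'book<vblex>')]
```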
TSX management file UI
Related task:
- It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)
Proposed idea
For the TSX management file UI:
How it works:
- First you choose a TSX file (or if you came from the tagger UI it is already inserted)
- It shows all the categories of the file and the forbid, enforce and prefer rules
- The second column says whether the category is closed or not
- You can also add new labels and items by entering them in the text view on the right
- Once you have added new categories or rules in the text view, you click on the "Apply" button; they are inserted into the TSX file, the tree view is updated and the text view is cleared
- Once you finish editing, you click on the save button and the TSX file is updated if there are changes that have not been applied yet
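Filling the tree view could start from something like the following sketch. It assumes the usual TSX layout (`def-label` elements with `name` and `closed` attributes inside `tagset`, and `label-sequence` elements inside `forbid`); the sample file content and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up TSX fragment used only to exercise the functions below.
SAMPLE_TSX = """<tagger name="sample">
  <tagset>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="DET"/>
    </label-sequence>
  </forbid>
</tagger>
"""


def list_categories(root):
    """Return (name, is_closed, tag patterns) for every category."""
    return [
        (
            label.get("name"),
            label.get("closed") == "true",
            [item.get("tags") for item in label.findall("tags-item")],
        )
        for label in root.iter("def-label")
    ]


def list_forbidden(root):
    """Return each forbid rule as a list of label names."""
    return [
        [item.get("label") for item in sequence.findall("label-item")]
        for sequence in root.findall("./forbid/label-sequence")
    ]


root = ET.fromstring(SAMPLE_TSX)
print(list_categories(root))
print(list_forbidden(root))
```

Each tuple from `list_categories` maps directly to one row of the tree view, with the boolean feeding the "closed" column.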
Constraint grammar rule manager UI
Related task:
- Including a way to incorporate constraint grammar rules would also be nice.
Proposed idea
For the Constraint grammar rule manager UI:
How it works:
- On the left side there is a tagged corpus to be disambiguated using some rules
- In the center there is a text view to enter new rules
- On the right side there is a button to execute the rules already written in the rules text view
- Once you click on the execute rules button, it checks which tags match the rule definitions and updates them, highlighting them with colours
- At the bottom left there is a status label that shows how many ambiguous words are still in the corpus
- Once you finish the new rules, click on the finish button and the new rules are compiled into the gramar.bin file
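To illustrate what executing a rule and highlighting the result means, here is a toy sketch of a single hard-coded rule type. It is not the constraint grammar engine and does not parse real rule files; it only mimics the effect of a rule such as `SELECT (n) IF (-1 (det));`, and the function name and data layout are assumptions.

```python
def select_if_previous(cohorts, target_tag, previous_tag):
    """Keep only readings with `target_tag` when the previous word has `previous_tag`.

    `cohorts` is a list of (surface form, list of readings) pairs.
    Returns the updated cohorts and the positions that changed, which
    the UI could use to decide what to highlight.
    """
    result = []
    changed = []
    for index, (surface, readings) in enumerate(cohorts):
        kept = readings
        if index > 0 and len(readings) > 1:
            previous = result[index - 1][1]
            # The context matches only if every reading of the previous word has the tag.
            context_ok = bool(previous) and all(
                f"<{previous_tag}>" in reading for reading in previous
            )
            selected = [r for r in readings if f"<{target_tag}>" in r]
            if context_ok and selected and len(selected) < len(readings):
                kept = selected
                changed.append(index)
        result.append((surface, kept))
    return result, changed


cohorts = [
    ("the", ["the<det><def><sp>"]),
    ("books", ["book<n><pl>", "book<vblex><pri><p3><sg>"]),
]
updated, changed = select_if_previous(cohorts, "n", "det")
print(updated[1])  # prints ('books', ['book<n><pl>'])
print(changed)     # prints [1]
```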
For the last task, "A way to take into account automatically new multiwords / different tokenisation", I still don't know how to do it.
Timetable and schedule
On the GSOC period
There are 4 main interfaces to be implemented; here is my estimate of the hours needed to complete each UI:
Interface | Coding hours | TDD hours |
---|---|---|
Input corpus files UI | 30-40 | with TC_ui TDD |
Tagger Corpus UI (and Input corpus UI) | 60-70 | 40-50 |
TSX manager UI | 40-50 | 20-30 |
Performance Measure .prob UI | 70-80 | 40-50 |
Constraint Grammar UI | 50-60 | 30-40 |
For every interface I wrote 2 estimates of the hours so that there are no surprises with the time. With the minimum times the total is 380h (34h/week) and in the worst case I estimate 450h (41h/week). I will work more if I need to, but I think I have estimated more hours than I'll actually need to complete the work; in any case, I can always use the spare time to improve the documentation.
I have a problem with the first week because I'm still in the final exam period until the 26th of June, so, as I have estimated, I can work less in the first period and still finish everything on time.
There are 3 periods of 4 weeks each; the first period, corresponding to weeks 25 to 28 of the year, has a workload of 105h, and the others have 140h each.
TDD = Test/Debugging/Documentation
gsoc week | week of the year | tasks |
---|---|---|
1 | 25th week | I still have final exams |
2 | 26th week | Input File UI coding 30h, Tagger Corpus UI coding 5/60h |
3 | 27th week | Tagger Corpus UI coding 35/60h |
4 | 28th week | Tagger Corpus UI coding 20/60h + TDD 15/40h |
First Deliverable | ||
5 | 29th week | Tagger Corpus UI TDD 25/40h, TSX file manager UI coding 10/40h |
6 | 30th week | TSX file manager UI coding 30/40h + TDD 5/20h |
7 | 31st week | TSX file manager UI TDD 15/20h, Constraint Grammar UI coding 20/50h |
8 | 32nd week | Constraint Grammar UI coding 30/50h + TDD 5/30h |
Second Deliverable | ||
9 | 33rd week | Constraint Grammar UI TDD 25/30h, Performance measure .prob UI coding 10/70h |
10 | 34th week | Performance measure .prob UI coding 35/70h |
11 | 35th week | Performance measure .prob UI coding 25/70h + TDD 10/40h |
12 | 36th week | Performance measure .prob UI TDD 30/40h, Final documentation 5h |
Finalization |
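As a quick check of the numbers, the hours in the two tables can be added up as in the sketch below. The figures are copied from the tables; the variable and function names are only illustrative.

```python
# Lower bounds from the estimation table (coding hours and TDD hours).
CODING_MIN = [30, 60, 40, 70, 50]
TDD_MIN = [40, 20, 40, 30]

# Hours assigned in the weekly schedule, keyed by gsoc week.
SCHEDULE = {
    1: [],          # final exams
    2: [30, 5],
    3: [35],
    4: [20, 15],
    5: [25, 10],
    6: [30, 5],
    7: [15, 20],
    8: [30, 5],
    9: [25, 10],
    10: [35],
    11: [25, 10],
    12: [30, 5],    # includes 5h of final documentation
}
PERIODS = {"first": range(1, 5), "second": range(5, 9), "third": range(9, 13)}


def hours_per_period(schedule, periods):
    """Sum the scheduled hours of the weeks that belong to each period."""
    return {
        name: sum(sum(schedule[week]) for week in weeks)
        for name, weeks in periods.items()
    }


minimum_total = sum(CODING_MIN) + sum(TDD_MIN)
print(minimum_total)                          # prints 380
print(hours_per_period(SCHEDULE, PERIODS))    # prints {'first': 105, 'second': 140, 'third': 140}
```

The schedule adds up to 385h, that is, the 380h minimum estimate plus the 5h of final documentation in the last week.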
Before gsoc
From 27 May to 17 June I'll be bonding with the Apertium community
About the work before the gsoc:
- I would like to have every part completely understood, and the technical part ready, so that I can start working directly on the 27th of June (once I have finished all my final exams)
- As I have never used a text view, I have to learn how to manage the colours (with tags, as I have read) and how to select words by clicking
- I would also like to have the interfaces' glade files finished before gsoc, so we have to discuss the interfaces before the summer
Following this schedule I think we have enough time to finish it, even a week or so early in the worst case :-)
Coding Challenge
I have finished the Coding Challenge (and I have sent the "PCRE error" to the mailing list). I'll upload some documentation about it to my github repo, such as the final outputs or the final .prob file from the unsupervised trainer, to "probe" that I have done my homework :P
Bio
My name is Oscar Ramirez (tuxskar on IRC). I'm a 5th-year student in Computer Science Engineering at Malaga University (Escuela Técnica Superior de Ingeniería Informática de Málaga). I finished my bachelor's degree in Computer Science (oriented to systems) almost 2 years ago.
I have worked on a few open source projects using C, Python, Django, PyGTK, Android and Arduino, but the most important one here is the latest: I'm developing a lawyer's office manager in PyGTK (the code is on github and also on PyPi).
This is the first year I have applied for a place in the GSOC and I'm really excited to work with apertium, because I had never wondered how to create a good translator and now I have the opportunity to contribute to building a great tool for it :D