Difference between revisions of "User:Tuxskar/Application for "Interface for creating tagged corpora" GSOC 2013"

Latest revision as of 19:10, 1 May 2013

Contact information

Name: Oscar Ramirez Jimenez

Email: tuxskar@gmail.com

IRC: tuxskar

Github repo: http://github.com/tuxskar

Tasks and proposed ideas

Interface for creating tagged corpora

Related tasks: Interface for creating tagged corpora

Making an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of X size for a given language from Wikipedia
The interface should be able to take a non-disambiguated tagged corpus and let the user disambiguate it manually
And a user-friendly interface to train a supervised tagger

Proposed idea

Creating a tagged corpus takes 2 interfaces: the first one to load a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions)

The first interface is really simple. It is the Input corpus file UI:

How it works:

You can select a corpus file and load its content into the text view
Load a corpus that is already tagged
Load a Wikipedia dump file stored somewhere (the UI downloads the file, decompresses it, tags it and inserts it into the text view)
Modify the current corpus (or enter a new corpus from scratch)
Once you finish editing, you click on the apply button and the corpus tagger UI appears
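
The file-loading step could be sketched as follows. This is only a minimal sketch: it assumes the dump has already been downloaded and is bzip2-compressed, and `load_corpus_text` is a hypothetical helper name, not part of Apertium. Extracting article text from the dump and tagging it are left out.

```python
import bz2
from pathlib import Path


def load_corpus_text(path):
    """Return the text of a corpus file, decompressing it if it ends in .bz2."""
    path = Path(path)
    if path.suffix == ".bz2":
        # bz2.open in text mode decompresses transparently while reading
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            return handle.read()
    # Plain text corpora are read as they are
    return path.read_text(encoding="utf-8")
```

The returned string is what would be placed into the text view for editing.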

After you load a file and click apply, the corpus is tagged (if it isn't yet) and the supervised corpus tagger UI is shown:

How it works:

It shows the tags only for the ambiguous words
For each ambiguous word it shows the options that you can select from (using a shortcut or the mouse)
You can go to the next and previous word by clicking on the buttons on the left or with a shortcut as well
You can see the TSX file by clicking on the button "check the TSX file", which opens the TSX file manager UI with the TSX file you have in the same directory
You can check the performance of the .prob file that you are working on by clicking on the button "Evaluate .prob file"
In the bottom left part you have the corpus tagger status, showing how many ambiguous words the file still has
Once you finish, you can click on the finish button and the .prob file will be updated with this disambiguated corpus
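
The data behind this window could be modelled roughly as below. The sketch assumes the corpus is in the Apertium stream format, where each word looks like `^surface/analysis1/analysis2$`; it ignores escaped characters and multiwords, and the function names are hypothetical.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^^$]+)\$")


def parse_units(tagged_text):
    """Return a list of (surface form, [analyses]) pairs found in the text."""
    return [(m.group(1), m.group(2).split("/"))
            for m in LEXICAL_UNIT.finditer(tagged_text)]


def count_ambiguous(units):
    """Number of words that still have more than one analysis."""
    return sum(1 for _, analyses in units if len(analyses) > 1)


def choose(units, index, option):
    """Keep only the selected analysis for the word at `index`."""
    surface, analyses = units[index]
    units[index] = (surface, [analyses[option]])


units = parse_units(
    "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
)
print(count_ambiguous(units))  # 1: only "books" is ambiguous
choose(units, 1, 0)            # the user picks the noun reading
print(count_ambiguous(units))  # 0
```

`count_ambiguous` is the value the status label would display, and `choose` is what a shortcut or mouse click on an option would trigger.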

Interface for measuring the performance of a .prob file

Related task:

Also, some way to evaluate performance of a .prob file

Proposed idea

For the Performance measure .prob file UI:

How it works

First of all, the tagged base corpus is loaded into the left text view (if you came from the previous window, it is already there)
Now you choose your .prob file
Somehow (maybe by manual selection or automatically) some part of the tagged corpus is selected to train the .prob file
Once everything is set up, you click on the "measure" button to measure the performance
When the measurement is done, you get the accuracy and the information about the performance in the bottom left
After the measurement it also shows the output generated using the .prob file, so you can see where the differences are
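
The proposal does not fix how accuracy is computed. One plausible reading, sketched here under that assumption, is the fraction of words whose tag in the tagger output matches the hand-tagged corpus. Both function names are hypothetical.

```python
def tagging_accuracy(reference, predicted):
    """Fraction of positions where the predicted tag equals the reference tag."""
    if len(reference) != len(predicted):
        raise ValueError("both sequences must cover the same words")
    if not reference:
        return 0.0
    hits = sum(1 for ref, out in zip(reference, predicted) if ref == out)
    return hits / len(reference)


def differences(reference, predicted):
    """Positions where the tagger output disagrees with the hand-tagged corpus."""
    return [i for i, (ref, out) in enumerate(zip(reference, predicted))
            if ref != out]
```

The list returned by `differences` is what the output view could highlight so that the user sees where the .prob file goes wrong.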

TSX file management UI

Related task:

It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)

Proposed idea

For the TSX file management UI:

How it works:

First you choose a TSX file (or, if you came from the tagger UI, it is already loaded)
It shows all the categories of the file and the forbid, enforce and prefer rules
The second column says whether the category is closed or not
You can also add new labels and items by entering them in the text view on the right
Once you have added new categories or rules in the text view, you click on the "Apply" button; they are inserted into the TSX file, the tree view is updated and the text view is cleared
Once you finish editing, you click on the save button and the TSX file is updated if there are changes that have not been applied yet
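
Filling the tree view could start from something like the sketch below. The element and attribute names in the sample (`def-label`, `closed`, `forbid`, `label-sequence`, `label-item`) are assumptions about the TSX layout and should be checked against a real TSX file; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny TSX-like sample; its layout is an assumption, not a real Apertium file
SAMPLE_TSX = """<tagger name="sample">
  <tagset>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="DET"/>
    </label-sequence>
  </forbid>
</tagger>"""


def list_categories(root):
    """Return (category name, is closed) pairs for every def-label element."""
    return [(label.get("name"), label.get("closed") == "true")
            for label in root.iter("def-label")]


def list_forbidden_sequences(root):
    """Return each forbidden label sequence as a list of label names."""
    return [[item.get("label") for item in seq.iter("label-item")]
            for seq in root.findall("forbid/label-sequence")]


root = ET.fromstring(SAMPLE_TSX)
print(list_categories(root))
print(list_forbidden_sequences(root))
```

The first list maps onto the two tree view columns (category and closed or not), and the second onto the forbid rules.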

Constraint grammar rule manager UI

Related task:

Including a way to incorporate constraint grammar rules would also be nice.

Proposed idea

For the Constraint grammar rule manager UI:

How it works:

On the left side there is a tagged corpus to be disambiguated using some rules
In the center there is a text view to enter new rules
On the right side there is a button to execute the rules already written in the rules text view
Once you click on the execute rules button, it checks the tags that match the rule definitions and updates them, highlighting them with colours
In the bottom left there is a status label that shows how many ambiguous words are still in the corpus
Once you finish the new rules, click on the finish button and the new rules are compiled into the gramar.bin file
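
The execute-and-highlight step might behave roughly like this toy example. It is neither Constraint Grammar syntax nor the real CG compiler: `apply_remove_rule` and its parameters are hypothetical, and the single rule shape (drop readings with one tag when the previous word is unambiguously tagged with another) is only an illustration.

```python
def apply_remove_rule(units, target_tag, previous_tag):
    """Drop analyses containing `target_tag` when the previous word is
    unambiguous and carries `previous_tag`.

    `units` is a list of (surface form, [analyses]) pairs and is modified
    in place. Returns the indices of the words that changed.
    """
    changed = []
    for i in range(1, len(units)):
        previous_analyses = units[i - 1][1]
        surface, analyses = units[i]
        # Only ambiguous words after an unambiguous word are candidates
        if len(analyses) < 2 or len(previous_analyses) != 1:
            continue
        if previous_tag not in previous_analyses[0]:
            continue
        kept = [a for a in analyses if target_tag not in a]
        # Never remove the last remaining analysis of a word
        if kept and len(kept) < len(analyses):
            units[i] = (surface, kept)
            changed.append(i)
    return changed


units = [
    ("the", ["the<det><def><sp>"]),
    ("books", ["book<n><pl>", "book<vblex><pri><p3><sg>"]),
]
print(apply_remove_rule(units, "<vblex>", "<det>"))  # [1]
```

The returned indices are the words the UI would highlight with colours after the rules run.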

For the last task, "A way to take into account automatically new multiwords / different tokenisation", I still don't know how to do it

Timetable and schedule

During the GSOC period

There are 4 main interfaces to be implemented; here is my estimate of the hours needed to complete each UI:

Interface	Coding hours	TDD hours
Input corpus files UI	30-40	included in the Tagger Corpus UI TDD
Tagger Corpus UI (and Input corpus UI)	60-70	40-50
TSX manager UI	40-50	20-30
Performance Measure .prob UI	70-80	40-50
Constraint Grammar UI	50-60	30-40

For every interface I wrote 2 estimated amounts of hours so there are no surprises with the time. With the minimum times the total is 380h (34h/week), and in the worst case I estimate 450h (41h/week). I will work more if I need to, but I think I have estimated more hours than I'll actually need to complete the work; in any case I can always use spare time to improve the documentation.

I have a problem with the first week because I'm still in the final exam period until the 26th of June, so, as I have estimated, I can work less in the first period and still finish everything on time

There are 3 periods of 4 weeks each: the workload for the first period (the 25th to 28th weeks of the year) is 105h, and the other two have 140h each

TDD = Test/Debugging/Documentation

gsoc week	week of the year	tasks
1	25th week	I still have final exams
2	26th week	Input File UI coding 30h, Tagger Corpus UI coding 5/60h
3	27th week	Tagger Corpus UI coding 35/60h
4	28th week	Tagger Corpus UI coding 20/60h + TDD 15/40h
First Deliverable
5	29th week	Tagger Corpus UI TDD 25/40h, TSX file manager UI coding 10/40h
6	30th week	TSX file manager UI coding 30/40h + TDD 5/20h
7	31st week	TSX file manager UI TDD 15/20h, Constraint Grammar UI coding 20/50h
8	32nd week	Constraint Grammar UI coding 30/50h + TDD 5/30h
Second Deliverable
9	33rd week	Constraint Grammar UI TDD 25/30h, Performance measure .prob UI coding 10/70h
10	34th week	Performance measure .prob UI coding 35/70h
11	35th week	Performance measure .prob UI coding 25/70h + TDD 10/40h
12	36th week	Performance measure .prob UI TDD 30/40h, Final documentation 5h
Finalization
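
As a quick cross-check of the schedule, the weekly hours can be added up per period. The numbers are copied from the table above; `period_totals` is just a hypothetical helper for this check.

```python
# Hours per GSOC week, copied from the schedule (week 1 is the exam week)
WEEKLY_HOURS = {
    1: 0,
    2: 30 + 5,
    3: 35,
    4: 20 + 15,
    5: 25 + 10,
    6: 30 + 5,
    7: 15 + 20,
    8: 30 + 5,
    9: 25 + 10,
    10: 35,
    11: 25 + 10,
    12: 30 + 5,
}


def period_totals(weekly_hours, weeks_per_period=4):
    """Sum the weekly hours into consecutive periods of equal length."""
    weeks = sorted(weekly_hours)
    return [sum(weekly_hours[w] for w in weeks[i:i + weeks_per_period])
            for i in range(0, len(weeks), weeks_per_period)]


print(period_totals(WEEKLY_HOURS))  # [105, 140, 140]
print(sum(WEEKLY_HOURS.values()))   # 385
```

The 385h total is the 380h minimum estimate plus the 5h of final documentation in week 12.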

Before GSOC

From 27 May to 17 June I'll be bonding with the Apertium community

About the work before GSOC:

I would like to have every part completely understood, and the technical part ready, so I can start working directly on the 27th of June (once I have finished all my final exams)
As I have never used a text view, I have to learn how to manage the colours (with tags, as I have read) and how to select words by clicking
I would also like to have the interfaces' glade files finished before GSOC, so we have to discuss the interfaces before summer

Following this schedule I think we have enough time to finish, even a week or so early in the worst case :-)

Why are you interested in machine translation and why in Apertium?

I didn't realize how machine translation could be done until I found Apertium. I thought that translators such as Google Translate were built using really huge dictionaries, sorted in a smart way to make searching them really fast. But once I found Apertium, I couldn't pass up the opportunity to contribute to an international project where I can learn as much as possible about a new field

Why should Google and Apertium sponsor it?

I think the part of the project I'd like to work on is one of the most important ones for getting new people to participate and collaborate in the project. Once the interfaces are finished, ordinary users will be able to contribute, resulting in more accurate translations

How and whom will it benefit in society?

Having friendlier user interfaces for training and tagging corpora will improve the quality of the translations that use them. From another angle, once the Wikipedia dump files get translated, society will benefit a lot from having more Wikipedia entries translated into people's mother tongues

Coding Challenge

I have finished the Coding Challenge (and I have sent the "PCRE error" to the mailing list). The code and the readme, with all the instructions and explanations, are in my GitHub repo:

code-challenge: https://github.com/tuxskar/apertium-code-challenge

Bio

My name is Oscar Ramirez (tuxskar on IRC). I'm a 5th-year student in Computer Science Engineering at Malaga University (Escuela Tecnica Superior de Ingeniería Informática de Málaga). I finished my bachelor's degree in Computer Science (oriented to systems) almost 2 years ago

I have worked on a few open source projects using C, Python, Django, PyGTK, Android and Arduino, but the most important one here is the latest: I'm developing a lawyer's office manager in PyGTK (the code is on GitHub and also on PyPI)

This is my first year applying for a place in GSOC and I'm really excited to work with Apertium, because I had never wondered how to build a good translator and now I have the opportunity to contribute to building a great tool for it :D

This is the first year I have applied for gsoc and I'm really excited to work on it

@@ Line 25: / Line 25: @@
 * You are able to select a corpus file and introduce the content into the text view
 * Introduce a corpus already tagged
-* Introduce a compress dump wikipedia file
+* Introduce a dump wikipedia file stored somewhere (it downloads the file, decompress tags it and inserts it on the textview)
 * Modify (or directly enter a new corpus from scratch) the actual corpus
 * Once you finish your edition you click on the apply button and appears the corpus tagger UI
@@ Line 106: / Line 106: @@
 | Input corpus files UI
 | 30-40
+| with TC_ui TDD
-| 10-20
 |-
 | Tagger Corpus UI (and Input corpus UI)
-| 70-80
+| 60-70
-| 20-30
+| 40-50
 |-
 | TSX manager UI
 | 40-50
-| 30-40
+| 20-30
 |-
 | Performance Measure .prob UI
@@ Line 125: / Line 125: @@
 |}
-For the every interface I wrote 2 estimated amount of hours to don't have any surprise with the time, with the minimum time the total amount of hours is 350h (32h/week) and in the worst case I estimate 430h (39h/week), I will work more if I need but I think I have estimate more hours that what I'll actually use to complete the work, but always I can improve the documentation.
+For the every interface I wrote 2 estimated amount of hours to don't have any surprise with the time, with the minimum time the total amount of hours is 380h (34h/week) and in the worst case I estimate 450h (41h/week), I will work more if I need but I think I have estimate more hours that what I'll actually use to complete the work, but always I can improve the documentation.
-I have a problem with the first week because I'm still on the final exam period until 26th of June, so as I have stimate I can work less on the first period and finish all on time
+I have a problem with the first week because I'm still on the final exam period until 26th of June, so as I have estimate I can work less on the first period and finish all on time
 There are 3 periods, every period with 4 weeks, the work load correspond to 25th to 28th years week has 105h, and the others has 140h
-TDD = Test/Debuging/Documentation
+TDD = Test/Debugging/Documentation
 {| class="wikitable" border="1"
@@ Line 153: / Line 153: @@
 | 4
 | 28th week
-| Tagger Corpus UI coding 20/60h + TDD 15/30h
+| Tagger Corpus UI coding 20/60h + TDD 15/40h
 |-
 | '''First Deliverable'''
@@ Line 159: / Line 159: @@
 | 5
 | 29th week
-| Tagger Corpus UI TDD 20/40h, TSX file manager UI coding 15/40h
+| Tagger Corpus UI TDD 25/40h, TSX file manager UI coding 10/40h
 |-
 | 6
 | 30th week
-| TSX file manager UI coding 25/40h + TDD 10/20h
+| TSX file manager UI coding 30/40h + TDD 5/20h
 |-
 | 7
 | 31st week
-| TSX file manager UI TDD 10/20h, Constraint Grammar UI coding 25/50h
+| TSX file manager UI TDD 15/20h, Constraint Grammar UI coding 20/50h
 |-
 | 8
 | 32nd week
-| Constraint Grammar UI coding 25/50h + TDD 10/20
+| Constraint Grammar UI coding 30/50h + TDD 5/30
 |-
 | '''Second Deliverable'''
@@ Line 177: / Line 177: @@
 | 9
 | 33rd week
-| Constraint Grammar UI TDD 20/30h, Performance measure .prob UI coding 15/70h
+| Constraint Grammar UI TDD 25/30h, Performance measure .prob UI coding 10/70h
 |-
 | 10
@@ Line 185: / Line 185: @@
 | 11
 | 35th week
-| Performance measure .prob UI coding 20/70h + TDD 15/40h
+| Performance measure .prob UI coding 25/70h + TDD 10/40h
 |-
 | 12
 | 36th week
-| Performance measure .prob UI TDD 25/40h, Final documentation 15h
+| Performance measure .prob UI TDD 30/40h, Final documentation 5h
 |-
 | '''Finalitation'''
@@ Line 203: / Line 203: @@
 Following this schedule I think we have enaugh time to finish it even a week or so early in the worst case :-)
+== Why are you interested in machine translation and why in Apertium? ==
+I didn't realize how could be made a machine translation until I found Apertium, I thought that translators as google translate would be made using really huge dictionaries, sorted in a smart way to get them searching really fast. But once I found Apertium I can't let go the oportunity of contribute in a international project where I can learn as more as possible from a new field
+== Why Google and Apertium should sponsor it? ==
+I think the part of the project I'd like to work on would be one of the most important ones in order to get new people participating and colaborating on the project. Once the interfaces will be finished, usual users will be able to contribute getting more accuracy translations
+== How and who it will benefit in society? ==
+Having more friendly user interfaces to training and tagging corpora, will improve the translations quality using them, and from other approach getting the dump wikipedia files translated, the society will get a lot of benefit getting more wikipedia entries translated into their mother tongue
 == Coding Challenge ==
-I have finished the Coding Challenge (and I have send the "PCRE error" to the mail list) I'll upload some documentation about in my github repo like finals outputs or the final .prob from the trainer unsupervised to "probe" I have done my homeworks :P
+I have finished the Coding Challenge (and I have send the "PCRE error" to the mail list) I have the code and the readme with all the instructions and explanations in my github repo
+[https://github.com/tuxskar/apertium-code-challenge code-challenge]
 == Bio ==
@@ Line 216: / Line 230: @@
 This is the first year I have applied for gsoc and I'm really excited to work on it
+[[Category:GSoC_2013_Student_proposals|Tuxskar]]
