User:Tuxskar/Application for "Interface for creating tagged corpora" GSOC 2013
Latest revision as of 19:10, 1 May 2013
Contents
- 1 Contact information
- 2 Tasks and proposed ideas
- 3 Timetable and schedule
- 4 Why are you interested in machine translation and why in Apertium?
- 5 Why Google and Apertium should sponsor it?
- 6 How and who it will benefit in society?
- 7 Coding Challenge
- 8 Bio
Contact information
Name: Oscar Ramirez Jimenez
Email: tuxskar@gmail.com
IRC: tuxskar
Github repo: http://github.com/tuxskar
Tasks and proposed ideas
Interface for creating tagged corpora
Related tasks: Interface for creating tagged corpora
- Making an interface where you can load a raw text of, say, 30,000 words or (optional) create a corpus of X size for a given language from Wikipedia
- The interface should be able to take a non-disambiguated tagged corpus and be able to disambiguate it manually
- And, a user-friendly interface to train a supervised tagger
Proposed idea
To create a tagged corpus there are 2 interfaces: the first one to introduce a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions).
The first interface is really simple; it is the Input corpus file UI:
How it works:
- You can select a corpus file and load its content into the text view
- You can load a corpus that is already tagged
- You can load a Wikipedia dump file stored somewhere (it downloads the file, decompresses it, tags it and inserts it into the text view)
- You can modify the current corpus (or directly enter a new corpus from scratch)
- Once you finish editing, you click on the apply button and the corpus tagger UI appears
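The file-loading step could look roughly like the sketch below. It only covers reading a local file, plain or bzip2-compressed (the format Wikipedia dumps are distributed in); downloading and tagging are left out. The function name `read_corpus` is hypothetical and not part of any existing Apertium tool.

```python
import bz2
from pathlib import Path


def read_corpus(path):
    """Return the text stored at `path`, decompressing it if it ends in .bz2.

    Hypothetical helper for the Input corpus file UI; the result would be
    placed in the text view by the caller.
    """
    path = Path(path)
    if path.suffix == ".bz2":
        # Open the compressed file in text mode so we get str, not bytes.
        with bz2.open(path, mode="rt", encoding="utf-8") as handle:
            return handle.read()
    return path.read_text(encoding="utf-8")
```

A real dump is large, so the final interface would probably read it in chunks rather than in one call; this sketch ignores that for brevity.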
After you insert a file and click apply, the corpus is tagged (if it isn't already) and the supervised corpus tagger UI is shown:
How it works:
- It shows the tags only for the ambiguous words
- For each ambiguous word it shows the multiple options that you can select (with a shortcut or the mouse)
- You can go to the next and previous word by clicking on the left buttons or with a shortcut as well
- You can see the TSX file by clicking on the button "check the TSX file", which opens the TSX file manager UI with the TSX file you have in the same directory
- You can check the performance of the .prob file that you are working on just by clicking on the button "Evaluate .prob file"
- On the bottom left you have the corpus tagger status showing how many ambiguous words the file still has
- Once you finish you can click on the finish button and the .prob file will be updated with this disambiguated corpus
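To find the ambiguous words and fill the status label, the tagger UI needs to split the tagged corpus into words and their candidate analyses. Here is a minimal sketch, assuming the corpus is in the Apertium stream format (`^surface/analysis1/analysis2$`) and ignoring escaped characters; the function names and the sample sentence are made up for illustration.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \$ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]+)((?:/[^/$^]+)+)\$")


def parse_units(stream):
    """Return a list of (surface form, [analyses]) pairs found in `stream`."""
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface = match.group(1)
        analyses = match.group(2).lstrip("/").split("/")
        units.append((surface, analyses))
    return units


def ambiguous_units(units):
    """Keep only the units that still have more than one analysis."""
    return [unit for unit in units if len(unit[1]) > 1]


# Illustrative sample, not taken from a real corpus.
sample = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
units = parse_units(sample)
pending = ambiguous_units(units)
print(f"{len(pending)} ambiguous word(s) left out of {len(units)}")
```

The length of `pending` is what the status label would display, and each entry's list of analyses is what the UI would offer as options for that word.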
Interface for measuring the performance of a .prob file
Related task:
- Also, some way to evaluate performance of a .prob file
Proposed idea
For the Performance measure .prob file UI:
How it works
- First of all we load the tagged base corpus into the left text view (if you came from the previous window you already have it)
- Now you choose your .prob file
- Somehow (maybe by manual selection or automatically) some part of the tagged corpus is selected to train the .prob file
- Once everything is set up you click on the "measure" button to measure the performance
- When the measurement is done, on the bottom left you get the accuracy and the information about the performance
- After the measurement it also shows the output generated using the .prob file so you can see where the differences are
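The accuracy shown after clicking "measure" could be computed along these lines. This is only a sketch: it assumes the hand-tagged reference and the output produced with the .prob file have already been reduced to two aligned lists with one tag per word, and `tagging_accuracy` is a hypothetical name.

```python
def tagging_accuracy(reference, predicted):
    """Return (accuracy, mismatches) for two aligned lists of tags.

    `mismatches` holds (position, expected tag, produced tag) tuples, which
    the UI could use to show where the differences are.
    """
    if len(reference) != len(predicted):
        raise ValueError("both tag sequences must have the same length")
    if not reference:
        raise ValueError("cannot measure accuracy on an empty corpus")
    mismatches = [
        (index, expected, actual)
        for index, (expected, actual) in enumerate(zip(reference, predicted))
        if expected != actual
    ]
    accuracy = (len(reference) - len(mismatches)) / len(reference)
    return accuracy, mismatches


# Illustrative data: one wrong tag out of five words.
reference = ["det", "n", "vblex", "det", "n"]
predicted = ["det", "vblex", "vblex", "det", "n"]
accuracy, mismatches = tagging_accuracy(reference, predicted)
print(f"accuracy: {accuracy:.0%}, differences at positions {[m[0] for m in mismatches]}")
```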
TSX management file UI
Related task:
- It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)
Proposed idea
For the TSX management file UI:
How it works:
- First you choose a TSX file (or if you came from the tagger UI it is already loaded)
- It shows all the categories of the file and the forbid, enforce and prefer rules
- The second column says whether the category is closed or not
- You can also add new labels and items by entering them in the text view on the right
- Once you add new categories or rules in the text view you click on the "Apply" button; they are inserted into the TSX file, the tree view is updated and the text view is cleared
- Once you finish editing you click on the save button and the TSX file is updated if there are changes not yet applied
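Filling the tree view means reading the categories and rules out of the TSX file, which is XML. The sketch below uses Python's standard `xml.etree.ElementTree`; the embedded sample is a heavily simplified file written for this example, and the element names (`def-label`, `tags-item`, `forbid`, `label-sequence`, `label-item`) reflect my understanding of the TSX format and should be checked against a real file. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified sample written for this sketch, not a real Apertium TSX file.
SAMPLE_TSX = """
<tagger name="example">
  <tagset>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="DET"/>
    </label-sequence>
  </forbid>
</tagger>
""".strip()


def list_categories(root):
    """Return (name, is_closed, [tag patterns]) for every category."""
    categories = []
    for label in root.iter("def-label"):
        closed = label.get("closed", "false") == "true"
        patterns = [item.get("tags") for item in label.iter("tags-item")]
        categories.append((label.get("name"), closed, patterns))
    return categories


def list_forbidden_sequences(root):
    """Return each forbid rule as a list of category names."""
    return [
        [item.get("label") for item in sequence.iter("label-item")]
        for sequence in root.findall("./forbid/label-sequence")
    ]


root = ET.fromstring(SAMPLE_TSX)
for name, closed, patterns in list_categories(root):
    print(name, "closed" if closed else "open", patterns)
print("forbidden:", list_forbidden_sequences(root))
```

The `is_closed` value is what the second column of the tree view would display; enforce and prefer rules would be read in the same way.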
Constraint grammar rule manager UI
Related task:
- Including a way to incorporate constraint grammar rules would also be nice.
Proposed idea
For the Constraint grammar rule manager UI:
How it works:
- On the left side there is a tagged corpus to be disambiguated using some rules
- In the center there is a text view to enter new rules
- On the right side there is a button to execute the rules already written in the rules text view
- Once you click on the execute rules button it checks the tags that match the rule definitions and updates them, highlighting them with colours
- On the bottom left there is a status label that shows how many ambiguous words are still in the corpus
- Once you finish the new rules, click on the finish button and the new rules are compiled into the gramar.bin file
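To illustrate what "execute rules" does to the corpus, here is a toy stand-in written in plain Python. It is not the real constraint grammar engine and does not parse CG syntax: it hard-codes one kind of rule, roughly what a CG rule such as `REMOVE (vblex) IF (-1C (det));` expresses, and returns the positions that changed so the UI could highlight them. The data layout and the function name are assumptions made for this sketch.

```python
def apply_remove_rule(cohorts, target_tag, previous_tag):
    """Drop readings containing `target_tag` after an unambiguous `previous_tag`.

    `cohorts` is a list of (word, readings); each reading is a list of tags.
    Returns (updated cohorts, indexes of the words that were changed).
    The last remaining reading of a word is never removed.
    """
    updated = []
    changed = []
    for index, (word, readings) in enumerate(cohorts):
        kept = readings
        if index > 0 and len(readings) > 1:
            previous_readings = cohorts[index - 1][1]
            previous_is_sure = (
                len(previous_readings) == 1
                and previous_tag in previous_readings[0]
            )
            if previous_is_sure:
                filtered = [r for r in readings if target_tag not in r]
                if filtered and len(filtered) < len(readings):
                    kept = filtered
                    changed.append(index)
        updated.append((word, kept))
    return updated, changed


# Illustrative data: "books" is ambiguous between noun and verb.
cohorts = [
    ("the", [["det", "def"]]),
    ("books", [["n", "pl"], ["vblex", "pri"]]),
]
updated, changed = apply_remove_rule(cohorts, "vblex", "det")
still_ambiguous = sum(1 for _, readings in updated if len(readings) > 1)
print(updated[1], "changed:", changed, "ambiguous left:", still_ambiguous)
```

In the actual interface the rules typed by the user would be handed to the constraint grammar tools instead; this sketch only shows the kind of before/after data the UI has to display.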
For the last task, "A way to take into account automatically new multiwords / different tokenisation", I still don't know how to do it.
Timetable and schedule
During the GSoC period
There are 4 main interfaces to be implemented; here is my estimate of the hours needed to complete each UI:
Interface | Coding hours | TDD hours |
---|---|---|
Input corpus files UI | 30-40 | with TC_ui TDD |
Tagger Corpus UI (and Input corpus UI) | 60-70 | 40-50 |
TSX manager UI | 40-50 | 20-30 |
Performance Measure .prob UI | 70-80 | 40-50 |
Constraint Grammar UI | 50-60 | 30-40 |
For every interface I wrote 2 estimates of the number of hours so that I don't get any surprises with the time. With the minimum times the total is 380h (about 34h/week) and in the worst case I estimate 450h (about 41h/week). I will work more if I need to, but I think I have estimated more hours than I'll actually need to complete the work; in any case, I can always use spare time to improve the documentation.
I have a problem with the first week because I'm still in the final exam period until the 26th of June, so, as estimated, I can work less in the first period and still finish everything on time.
There are 3 periods of 4 weeks each; the workload for the first period (25th to 28th week of the year) is 105h, and each of the other two is 140h.
TDD = Test/Debugging/Documentation
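The totals can be checked by summing the estimate table. In the sketch below the Input corpus files UI has no TDD hours of its own because they are counted with the Tagger Corpus UI, and the figure of 11 working weeks is an assumption (12 GSoC weeks minus the exam week). The lower ends add up to the 380h quoted above; the upper ends add up to 470h, slightly more than the 450h worst case, so that figure is best read as approximate.

```python
# (coding hours, TDD hours) as (minimum, maximum) ranges, copied from the table.
ESTIMATES = {
    "Input corpus files UI": ((30, 40), (0, 0)),  # TDD counted with Tagger Corpus UI
    "Tagger Corpus UI": ((60, 70), (40, 50)),
    "TSX manager UI": ((40, 50), (20, 30)),
    "Performance Measure .prob UI": ((70, 80), (40, 50)),
    "Constraint Grammar UI": ((50, 60), (30, 40)),
}

# Assumption: 12 GSoC weeks minus the first one, which falls in the exam period.
WORKING_WEEKS = 11


def total_hours(bound):
    """Sum coding + TDD hours; bound 0 is the minimum, bound 1 the maximum."""
    return sum(coding[bound] + tdd[bound] for coding, tdd in ESTIMATES.values())


for label, bound in (("minimum", 0), ("maximum", 1)):
    hours = total_hours(bound)
    print(f"{label}: {hours}h, {hours / WORKING_WEEKS:.1f}h/week")
```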
gsoc week | week of the year | tasks |
---|---|---|
1 | 25th week | I'm still with final exams |
2 | 26th week | Input File UI coding 30h, Tagger Corpus UI coding 5/60h |
3 | 27th week | Tagger Corpus UI coding 35/60h |
4 | 28th week | Tagger Corpus UI coding 20/60h + TDD 15/40h |
First Deliverable | ||
5 | 29th week | Tagger Corpus UI TDD 25/40h, TSX file manager UI coding 10/40h |
6 | 30th week | TSX file manager UI coding 30/40h + TDD 5/20h |
7 | 31st week | TSX file manager UI TDD 15/20h, Constraint Grammar UI coding 20/50h |
8 | 32nd week | Constraint Grammar UI coding 30/50h + TDD 5/30h |
Second Deliverable | ||
9 | 33rd week | Constraint Grammar UI TDD 25/30h, Performance measure .prob UI coding 10/70h |
10 | 34th week | Performance measure .prob UI coding 35/70h |
11 | 35th week | Performance measure .prob UI coding 25/70h + TDD 10/40h |
12 | 36th week | Performance measure .prob UI TDD 30/40h, Final documentation 5h |
Finalization |
Before GSoC
From 27 May to 17 June I'll be bonding with the Apertium community
About the work before GSoC:
- I would like to have every part completely understood, and the technical part ready, so I can start working directly on the 27th of June (once I have finished all my final exams)
- As I have never used a text view, I have to learn how to manage the colours (with tags, as I have read) and how to select words by clicking
- I would also like to have the interfaces' glade files finished before GSoC, so we have to discuss the interfaces before summer
Following this schedule I think we have enough time to finish it, even a week or so early in the worst case :-)
Why are you interested in machine translation and why in Apertium?
I didn't realize how machine translation could be done until I found Apertium. I thought that translators such as Google Translate were built using really huge dictionaries, sorted in a smart way to make searching them really fast. But once I found Apertium I couldn't let go of the opportunity to contribute to an international project where I can learn as much as possible about a new field.
Why Google and Apertium should sponsor it?
I think the part of the project I'd like to work on is one of the most important ones for getting new people to participate and collaborate in the project. Once the interfaces are finished, ordinary users will be able to contribute and help make translations more accurate.
How and who it will benefit in society?
Having friendlier user interfaces for training and tagging corpora will improve the quality of the translations that use them. From another angle, by getting the Wikipedia dump files translated, society will benefit a lot from having more Wikipedia entries translated into people's mother tongues.
Coding Challenge
I have finished the Coding Challenge (and I have sent the "PCRE error" to the mailing list). The code and the readme, with all the instructions and explanations, are in my GitHub repo: https://github.com/tuxskar/apertium-code-challenge
Bio
My name is Oscar Ramirez (tuxskar on IRC). I'm a 5th-year student in Computer Science Engineering at Malaga University (Escuela Tecnica Superior de Ingeniería Informática de Málaga). I finished my bachelor's degree in Computer Science (oriented to systems) almost 2 years ago.
I have worked on a few open source projects using C, Python, Django, PyGTK, Android and Arduino, but the most important one here is the most recent: I'm developing a lawyer's office manager in PyGTK (the code is on GitHub and also on PyPI).
This is the first year I have applied for a place in GSoC and I'm really excited to work with Apertium, because I had never wondered how to create a good translator and now I have the opportunity to contribute to building a great tool for it :D