User:Tuxskar/Application for "Interface for creating tagged corpora" GSOC 2013


Contact information

Name: Oscar Ramirez Jimenez

Email: tuxskar@gmail.com

IRC: tuxskar

Github repo: http://github.com/tuxskar

Tasks and proposed ideas

Interface for creating tagged corpora

Related tasks: Interface for creating tagged corpora

  • Making an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of X size for a given language from Wikipedia
  • The interface should be able to take a non-disambiguated tagged corpus and let the user disambiguate it manually
  • And a user-friendly interface to train a supervised tagger

Proposed idea

Creating a tagged corpus involves 2 interfaces: the first one to load a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions).

The first interface is really simple. It is the Input corpus file UI:

Input corpus file UI

How it works:

  • You can select a corpus file and load its content into the text view
  • You can load a corpus that is already tagged
  • You can load a Wikipedia dump file stored somewhere (the interface downloads the file, decompresses it, tags it and inserts it into the text view)
  • You can modify the current corpus (or directly enter a new corpus from scratch)
  • Once you finish editing, you click on the Apply button and the corpus tagger UI appears

After you load a file and click Apply, the interface tags the corpus (if it is not tagged yet) and shows the supervised corpus tagger UI:

Supervised corpus tagger UI V3

How it works:

  • It shows the tags only for the ambiguous words
  • For each ambiguous word it shows the possible options, which you can select with a shortcut or with the mouse
  • You can go to the next or previous word by clicking on the buttons on the left, or with a shortcut as well
  • You can see the TSX file by clicking on the button "check the TSX file", which opens the TSX file manager UI with the TSX file you have in the same directory
  • You can check the performance of the .prob file that you are working on by clicking on the button "Evaluate .prob file"
  • In the bottom left part you have the corpus tagger status, showing how many ambiguous words the file still has
  • Once you finish, you can click on the Finish button and the .prob file will be updated with this disambiguated corpus
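
As a minimal sketch of how the tagger status could count the words that are still ambiguous, the following assumes the corpus is in the Apertium stream format, where each word looks like ^surface/analysis1/analysis2$. The function names and the sample sentence are hypothetical, and escaped characters are not handled.

```python
import re

# Matches one lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification (assumption): escaped characters such as \/ or \$ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def parse_units(text):
    """Return a list of (surface_form, [analyses]) pairs found in text."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


def ambiguous_units(text):
    """Return only the units that still have more than one analysis."""
    return [(surface, analyses)
            for surface, analyses in parse_units(text)
            if len(analyses) > 1]


# Hypothetical sample: "the" has one analysis, "books" has two.
sample = "^the/the<det><def><sp>$ ^books/book<n><pl>/book<vblex><pri><p3><sg>$"
remaining = ambiguous_units(sample)
print(len(remaining), "ambiguous word(s) left")
```

The same list of ambiguous units could drive the next/previous buttons, since each entry carries the options to offer for that word.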

Interface for measuring the performance of a .prob file

Related task:

  • Also, some way to evaluate performance of a .prob file

Proposed idea

For the performance measure .prob file UI:

Performance Measure prob UI.png

How it works:

  • First of all, you load the reference tagged corpus into the left text view (if you came from the previous window it is already there)
  • Then you choose your .prob file
  • Somehow (maybe by manual selection, or automatically) a part of the tagged corpus is selected to train the .prob file
  • Once everything is set up, you click on the "measure" button to measure the performance
  • When the measurement is done, the accuracy and other performance information appear in the bottom left
  • After the measurement, the interface also shows the output generated with the .prob file, so you can see where the differences are
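
The accuracy figure described above could be computed roughly as follows. This is only a sketch under the assumption that both the reference corpus and the tagger output have already been reduced to lists of (word, tag) pairs of the same length; the function name and the sample data are hypothetical.

```python
def tagging_accuracy(reference, predicted):
    """Compare two equally long lists of (word, tag) pairs.

    Returns (accuracy, mismatches), where each mismatch is
    (position, word, expected_tag, predicted_tag).
    """
    if len(reference) != len(predicted):
        raise ValueError("corpora must contain the same number of tokens")
    mismatches = []
    for position, (ref, pred) in enumerate(zip(reference, predicted)):
        word, expected_tag = ref
        predicted_tag = pred[1]
        if expected_tag != predicted_tag:
            mismatches.append((position, word, expected_tag, predicted_tag))
    total = len(reference)
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


# Hypothetical hand-tagged reference and tagger output.
reference = [("the", "det"), ("books", "n"), ("are", "vbser"), ("old", "adj")]
predicted = [("the", "det"), ("books", "vblex"), ("are", "vbser"), ("old", "adj")]

accuracy, mismatches = tagging_accuracy(reference, predicted)
print(accuracy, mismatches)
```

The list of mismatches is what the right-hand view could highlight to show where the tagger output differs from the reference.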

TSX management file UI

Related task:

  • It should also have a user-friendly system for improving the TSX file (refine coarse tags, write rules)

Proposed idea

For the TSX management file UI:

TSX manager UI.png

How it works:

  • First you choose a TSX file (if you came from the tagger UI it is already loaded)
  • It shows all the categories in the file and the forbid, enforce and prefer rules
  • The second column says whether or not the category is closed
  • You can also add new labels and items by typing them into the text view on the right
  • Once you have added new categories or rules in the text view, you click on the "Apply" button; they are inserted into the TSX file, the tree view is updated and the text view is cleared
  • Once you finish editing, you click on the Save button and the TSX file is updated if there are changes that have not been applied yet
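
To fill the tree view, the interface would need to read the categories and rules from the TSX file. The sketch below uses Python's standard xml.etree.ElementTree and assumes the usual TSX element names (def-label with a closed attribute, tags-item, forbid, label-sequence, label-item). The sample file content and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced TSX content used only for this example.
SAMPLE_TSX = """<tagger name="example">
  <tagset>
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
  </tagset>
  <forbid>
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="DET"/>
    </label-sequence>
  </forbid>
</tagger>
"""


def list_categories(tsx_text):
    """Return (name, is_closed, [tag patterns]) for every def-label."""
    root = ET.fromstring(tsx_text)
    categories = []
    for label in root.iter("def-label"):
        patterns = [item.get("tags") for item in label.iter("tags-item")]
        is_closed = label.get("closed") == "true"
        categories.append((label.get("name"), is_closed, patterns))
    return categories


def list_forbid_rules(tsx_text):
    """Return each forbidden label sequence as a list of label names."""
    root = ET.fromstring(tsx_text)
    return [
        [item.get("label") for item in sequence.findall("label-item")]
        for sequence in root.findall("./forbid/label-sequence")
    ]


categories = list_categories(SAMPLE_TSX)
forbidden = list_forbid_rules(SAMPLE_TSX)
print(categories)
print(forbidden)
```

Each tuple maps onto one row of the tree view: the category name in the first column and the closed flag in the second.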

Constraint grammar rule manager UI

Related task:

  • Including a way to incorporate constraint grammar rules would also be nice.

Proposed idea

For the Constraint grammar rule manager UI:

Constraint grammar UI.png

How it works:

  • On the left side there is a tagged corpus to be disambiguated using some rules
  • In the centre there is a text view to enter new rules
  • On the right side there is a button to execute the rules already written in the rules text view
  • Once you click on the execute rules button, the interface finds the tags that match the rule definitions and updates them, highlighting them with colours
  • In the bottom left there is a status label showing how many ambiguous words are still in the corpus
  • Once you finish the new rules, you click on the Finish button and the new rules are compiled into the grammar.bin file

For the last task, "A way to take into account automatically new multiwords / different tokenisation", I still don't know how to do it.


Timetable and schedule

During the GSoC period

There are 4 main interfaces to implement. Here is my estimate of the hours needed to complete each UI:

Interface                                 Coding hours   TDD hours
Input corpus files UI                     30-40          (done with the Tagger Corpus UI TDD)
Tagger Corpus UI (and Input corpus UI)    60-70          40-50
TSX manager UI                            40-50          20-30
Performance Measure .prob UI              70-80          40-50
Constraint Grammar UI                     50-60          30-40

For every interface I give two estimates (a minimum and a maximum) to avoid surprises with the time. With the minimum estimates the total is 380h (34h/week), and in the worst case I estimate 450h (41h/week). I will work more if needed, but I think I have estimated more hours than I will actually need to complete the work; any spare time can always go into improving the documentation.

I have a problem with the first week because I'm still in the final exam period until the 26th of June. Since my estimates are generous, I can work less during the first period and still finish everything on time.

The schedule is split into 3 periods of 4 weeks each. The first period (weeks 25 to 28 of the year) has a workload of 105h, and the other two have 140h each.

TDD = Test/Debugging/Documentation

GSoC week   Week of the year   Tasks
1           25th               I'm still in final exams
2           26th               Input File UI coding 30h, Tagger Corpus UI coding 5/60h
3           27th               Tagger Corpus UI coding 35/60h
4           28th               Tagger Corpus UI coding 20/60h + TDD 15/40h
                               First deliverable
5           29th               Tagger Corpus UI TDD 25/40h, TSX file manager UI coding 10/40h
6           30th               TSX file manager UI coding 30/40h + TDD 5/20h
7           31st               TSX file manager UI TDD 15/20h, Constraint Grammar UI coding 20/50h
8           32nd               Constraint Grammar UI coding 30/50h + TDD 5/30h
                               Second deliverable
9           33rd               Constraint Grammar UI TDD 25/30h, Performance measure .prob UI coding 10/70h
10          34th               Performance measure .prob UI coding 35/70h
11          35th               Performance measure .prob UI coding 25/70h + TDD 10/40h
12          36th               Performance measure .prob UI TDD 30/40h, final documentation 5h
                               Finalization
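
As a quick check of the workload per period, the weekly hours from the timetable can be added up. The small script below is only an illustration with hypothetical names; the numbers are copied from the weekly tasks above.

```python
# Hours planned per week of the year, copied from the timetable above.
WEEKLY_HOURS = {
    25: 0,        # final exams, no project work
    26: 30 + 5,   # Input File UI coding + Tagger Corpus UI coding
    27: 35,       # Tagger Corpus UI coding
    28: 20 + 15,  # Tagger Corpus UI coding + TDD
    29: 25 + 10,  # Tagger Corpus UI TDD + TSX file manager UI coding
    30: 30 + 5,   # TSX file manager UI coding + TDD
    31: 15 + 20,  # TSX file manager UI TDD + Constraint Grammar UI coding
    32: 30 + 5,   # Constraint Grammar UI coding + TDD
    33: 25 + 10,  # Constraint Grammar UI TDD + Performance measure UI coding
    34: 35,       # Performance measure UI coding
    35: 25 + 10,  # Performance measure UI coding + TDD
    36: 30 + 5,   # Performance measure UI TDD + final documentation
}


def period_totals(weekly_hours, weeks_per_period=4):
    """Sum the planned hours over consecutive periods of weeks."""
    weeks = sorted(weekly_hours)
    return [
        sum(weekly_hours[week] for week in weeks[start:start + weeks_per_period])
        for start in range(0, len(weeks), weeks_per_period)
    ]


totals = period_totals(WEEKLY_HOURS)
print(totals, sum(totals))  # [105, 140, 140] 385
```

The overall 385h corresponds to the 380h minimum estimate plus the 5h of final documentation in the last week.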

Before GSoC

From 27 May to 17 June I'll be bonding with the Apertium community.

About the work before GSoC:

  • I would like to have every part completely understood, and the technical side ready, so that I can start working directly on the 27th of June (once I have finished all my final exams)
  • Since I have never used a text view, I have to learn how to manage the colours (with tags, from what I have read) and how to select words by clicking
  • I would also like to have the Glade files for the interfaces finished before GSoC, so we have to discuss the interfaces before the summer

Following this schedule I think we have enough time to finish, even a week or so early in the worst case :-)

Why are you interested in machine translation and why in Apertium?

I didn't realize how machine translation could be done until I found Apertium. I thought that translators such as Google Translate were built using really huge dictionaries, sorted in a smart way to make searching them really fast. But once I found Apertium, I couldn't let go of the opportunity to contribute to an international project where I can learn as much as possible about a new field.

Why should Google and Apertium sponsor it?

I think the part of the project I'd like to work on would be one of the most important ones for getting new people to participate and collaborate on the project. Once the interfaces are finished, ordinary users will be able to contribute, leading to more accurate translations.

How and whom will it benefit in society?

Friendlier user interfaces for training and tagging corpora will improve the quality of the translations that rely on them. From another angle, by getting Wikipedia dump files translated, society will benefit greatly from having more Wikipedia entries translated into people's mother tongues.

Coding Challenge

I have finished the Coding Challenge (and I have sent the "PCRE error" to the mailing list). The code and a README with all the instructions and explanations are in my GitHub repo:

code-challenge

Bio

My name is Oscar Ramirez (tuxskar on IRC). I'm a 5th year student in Computer Science Engineering at Malaga University (Escuela Tecnica Superior de Ingeniería Informática de Málaga). I finished my bachelor's degree in Computer Science (oriented to systems) almost 2 years ago.

I have worked on a few open source projects using C, Python, Django, PyGTK, Android and Arduino, but the most relevant here is my latest one: I'm developing a lawyer's office manager in PyGTK (the code is on GitHub and also on PyPI).

This is the first year I have applied for a place in GSoC, and I'm really excited to work with Apertium, because I had never wondered how to create a good translator and now I have the opportunity to help build a great tool for it :D