User:AlexMetalhead/Application GSoC 2014
!!! Work in progress !!!
Contact details
Name: Alexandru-Marian Florescu
e-mail: acdc152@gmail.com
IRC: AlexFlower
I also try to stay on the IRC channel as much as possible, so you can find me there most of the time.
Interest in machine translation
Given my passion for computers and programming, it makes sense that I’d be interested in machine translation as well. I find it fascinating that computers, although they follow a rather different process than humans do, can still produce fairly accurate translations. I also find that working on this is a great way of training my brain to think more freely and more generally. I have worked on MT before, and I found it to be a very rewarding experience (although MT usually requires a lot of work).
Interest in Apertium
Apertium is, in my opinion, the best open-source machine translator available at the moment. Given my interest in machine translation, it is natural that I am interested in Apertium as well. I have worked here before, as a GCI participant, and I believe I learned a lot and made new friends, so I hope I can repeat the experience this year as a GSoC participant.
Concerning the project, I’m very interested in the complex multiwords compiler. I believe machine translation needs to step forward, and this is one of the ways to do it. I'm confident I can improve the way we currently deal with multiwords, and thus make an impact on the scope of Apertium and the quality of its translations.
Complex multiwords compiler
Reasons for Google and Apertium to sponsor it
Apertium is an MT system that prioritizes translation accuracy, but such precision can't be reached unless more work is invested in the translation process itself. One of the current issues is that the compiler does not yet handle the different types of multiwords consistently. Enhancing the multiwords compiler will make Apertium’s impact much bigger: it will make it much easier to adopt a new language pair, and it will greatly improve translation quality for most language pairs.
Who will benefit from this, and how
First of all, basically everyone who uses Apertium for its main purpose will benefit from this. Also, since this is an open-source project, the resulting code can become good research material for students studying formal languages or MT, given how limited the options are at the moment.
Work plan
Now: reading code, trying new tricks to make the project more efficient, and getting to know the mentor better.
Community bonding phase: Try to discover any exceptional cases for what I'm trying to build and take them into account; get more familiar with the project and the community.
- Week 1 : Adapt the compiler for the first pass through entries (LR or RL)
- Week 2 : Finish work on the first pass
- Week 3 : Implement the second pass through the entries (as in-memory-only analysis-as-generation)
- Week 4 : Continue week 3
Deliverable #1: Adapted compiler for the first pass, almost able to do the second pass as well.
- Week 5 : Continue week 4
- Week 6 : Finish work on the second pass algorithm
- Week 7 : Create a dedicated section in the dictionary for the newly introduced multiwords
- Week 8 : Add the generated templates (from the work done in weeks 2-3) to the dictionary
Deliverable #2: Working enhanced compiler, able to deal with any kind of multiwords
- Week 9 : Make complete tests of the entire project. Make sure it creates no conflicts.
- Week 10: Buffer time for unexpected issues or errors that might come up, or for additional features we might want to add if time allows.
- Week 11: Buffer time for unexpected issues or errors that might come up, or for additional features we might want to add if time allows.
- Week 12: Clean up the code, write documentation and prepare for integration
Deliverable #3: Working, enhanced compiler that is ready to integrate and able to deal with any kind of multiwords
Technical aspects
Multiwords can be split into 4 categories, and we have to deal with each category separately.
- Uninflected
- Which the compiler already deals with just fine. Not much work to be done here.
- Final inflection
- The compiler handles these well too.
- Inner inflection
- Similar to final inflection; already taken care of in lttoolbox.
- Multiple inflection ('force majeure'/'forces majeures')
- Now this is where things get interesting. This is what the project will mostly focus on.
Split the multiwords
This basically means that we'll split the multiword into tokens and deal with them separately. Doing otherwise would make the process of recognising these multiwords much too long. The compiler needs to visit all analyses, which comes down to adding a pair of state queues to the current tokenisation scheme.
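To illustrate what "visiting all analyses" of a split multiword could look like, here is a minimal sketch in Python. It is not the lttoolbox implementation and does not model the state queues. The table TOY_ANALYSES, its tag strings, and the function names are hypothetical and exist only for this example.

```python
from itertools import product

# Hypothetical toy analyser: surface form -> list of lexical forms.
# The entries and tags are illustrative, not taken from a real dictionary.
TOY_ANALYSES = {
    "forces": ["force<n><f><pl>", "forcer<vblex><pri><p2><sg>"],
    "majeures": ["majeur<adj><f><pl>"],
}


def tokenise(multiword):
    """Split a multiword into its tokens (whitespace only, for simplicity)."""
    return multiword.split()


def all_analyses(multiword, analyses=TOY_ANALYSES):
    """Return every combination of per-token analyses for the multiword.

    An unknown token has no analyses, so the whole multiword yields [].
    """
    per_token = [analyses.get(token, []) for token in tokenise(multiword)]
    return [list(combo) for combo in product(*per_token)]


if __name__ == "__main__":
    for reading in all_analyses("forces majeures"):
        print(" + ".join(reading))
```

Each token contributes its own list of analyses, and the multiword's readings are the combinations of those lists; a later step would keep only the combinations that match a multiword entry.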
Generating multiwords requires generating them from a template, using the forms from the dictionary. This means we'll have the individual entries in the dictionary, as well as LR forms.
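A template-based generation step might look roughly like the following sketch. The template format (a list of lemma and tag-pattern slots sharing a {num} placeholder), the table TOY_FORMS, and the function name generate are assumptions made for illustration; the actual dictionary format would differ.

```python
# Hypothetical generation table: (lemma, tags) -> surface form.
TOY_FORMS = {
    ("force", "<n><f><sg>"): "force",
    ("force", "<n><f><pl>"): "forces",
    ("majeur", "<adj><f><sg>"): "majeure",
    ("majeur", "<adj><f><pl>"): "majeures",
}

# Hypothetical template for 'force majeure': both slots inflect for number,
# so they share the same {num} placeholder.
TEMPLATE = [("force", "<n><f><{num}>"), ("majeur", "<adj><f><{num}>")]


def generate(template, num, forms=TOY_FORMS):
    """Fill every slot of the template with the form for the given number."""
    words = []
    for lemma, tag_pattern in template:
        tags = tag_pattern.format(num=num)
        words.append(forms[(lemma, tags)])  # KeyError if a form is missing
    return " ".join(words)


if __name__ == "__main__":
    print(generate(TEMPLATE, "sg"))  # force majeure
    print(generate(TEMPLATE, "pl"))  # forces majeures
```

Because the inflected forms come from the individual entries, the multiword itself only needs to store the template, not every inflected combination.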
There is more than one way to get the forms, but the most reliable one seems to be a double compilation of entries: once LR or RL, and once as in-memory-only analysis-as-generation.
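The idea of a double compilation can be sketched as follows, under the simplifying assumption that a compiled dictionary is just a mapping (the real compiler builds transducers). The names ENTRIES, compile_lr and invert_in_memory are hypothetical.

```python
from collections import defaultdict

# Hypothetical dictionary entries as (surface form, lexical form) pairs.
ENTRIES = [
    ("force", "force<n><f><sg>"),
    ("forces", "force<n><f><pl>"),
    ("majeure", "majeur<adj><f><sg>"),
    ("majeures", "majeur<adj><f><pl>"),
]


def compile_lr(entries):
    """First pass: build the analysis direction (surface -> lexical forms)."""
    analyser = defaultdict(list)
    for surface, lexical in entries:
        analyser[surface].append(lexical)
    return dict(analyser)


def invert_in_memory(analyser):
    """Second pass: reuse the analyser as a generator (lexical -> surface).

    Nothing is written to disk; the first surface form seen for a lexical
    form wins, which is a simplification.
    """
    generator = {}
    for surface, lexicals in analyser.items():
        for lexical in lexicals:
            generator.setdefault(lexical, surface)
    return generator


if __name__ == "__main__":
    analyser = compile_lr(ENTRIES)
    generator = invert_in_memory(analyser)
    print(analyser["forces"])            # ['force<n><f><pl>']
    print(generator["force<n><f><pl>"])  # forces
```

The second mapping exists only in memory during compilation and is what would supply the inflected forms needed to expand a multiword template.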
Skills and qualifications
I am a Computer Science student at the University of Bucharest. I have participated in Google Code-In 3 years in a row, winning the Grand Prize once. I have contributed code to various open-source organisations such as GNOME, KDE and Sahana, as well as Apertium. I am also employed as a software developer at a local company. I have programmed in C, C++, C#, Java, Python, PHP and JavaScript. I am also fairly experienced with XML, having worked on tasks involving it both within Apertium and outside of it on several occasions.
Although I am already familiar with them, I think it's worth mentioning that I am studying both C++ and Formal Languages this year, so it's safe to say that I will have no real issue implementing my ideas.
Having worked here before as a student participating in GCI, I strongly believe I learned a lot, and I feel that GSoC is the next step to take. It's like a GCI task, just much bigger and more interesting.
This is the coding challenge I have done. It's a simple C++ program that generates bigrams from an input string: [1]
Other plans for the summer
My time at the moment is divided between school and my part-time job (20 hrs/week). By the end of June, school will be over for the summer. If I am accepted, I will pause or drop my job for the coding period so that I can be 100% dedicated to achieving my goals. This means that I will be available for work around 40 hrs/week, maybe even more if I encounter unexpected issues.