!!! Work in progress !!!

Contact details

Name: Alexandru-Marian Florescu
e-mail: acdc152@gmail.com
IRC: AlexFlower
I also try to stay on the IRC channel as much as possible, so you can find me there most of the time.

Interest in machine translation

Given my passion for computers and programming, it makes sense that I’d be interested in machine translation as well. I find it fascinating that computers, although they use a somewhat different process, can still produce fairly accurate translations. I also find that working on this is a great way of training my brain to think more freely and more generally. I have worked on MT before, and I found it a very rewarding experience (although MT usually requires a lot of work).

Interest in Apertium

Apertium is, in my opinion, the best open-source machine translation system available at the moment. Given my interest in machine translation, it is natural that I am interested in Apertium as well. I have worked here before as a GCI participant, and I believe I learned a lot and made new friends, so I hope to repeat the experience this year as a GSoC participant.

As for the project, I’m very interested in the complex multiwords compiler. I believe machine translation needs to step forward, and this is one of the ways to do it. I'm confident I can improve the way we currently deal with multiwords, and thus make an impact on the scope of Apertium and the quality of its translations.

Complex multiwords compiler

Reasons for Google and Apertium to sponsor it

Apertium is an MT system that prioritizes the accuracy of the translation, but such precision can't be reached unless more work is invested in the actual translation process. One of the current issues is that the compiler does not yet handle the different types of multiwords consistently. Enhancing the multiwords compiler will make Apertium’s impact much bigger: it will make adopting a new language pair much easier, and it will improve translation quality for most language pairs.

Who and how will benefit from this

First of all, everyone who uses Apertium for its main purpose will benefit from this. Also, since this is an open-source project, the resulting code can become good research material for students studying formal languages or MT, given that the options are rather limited at the moment.

Work plan

Now: reading code, trying new tricks to make the project more efficient, and getting to know the mentor better.
Community bonding phase: Try to discover any possible exception cases to what I'm trying to build and take them into account; get more familiar with the project and the community.

Week 1 : Adapt the compiler for the first pass through entries (LR or RL)
Week 2 : Finish work on the first pass
Week 3 : Implement the second pass through the entries (as in-memory-only analysis-as-generation)
Week 4 : Continue week 3

Deliverable #1: Compiler adapted for the first pass, and almost able to do the second pass as well.

Week 5 : Continue week 4
Week 6 : Finish work on the second pass algorithm
Week 7 : Create a dedicated section in the dictionary for the newly introduced multiwords
Week 8 : Add the generated templates (from the work done in weeks 2-3) to the dictionary

Deliverable #2: Working enhanced compiler, able to deal with any kind of multiwords

Week 9 : Test the entire project thoroughly and make sure it creates no conflicts.
Week 10: Buffer time for unexpected issues or errors that might come up, or for additional features we might want to add if time allows.
Week 11: Buffer time for unexpected issues or errors that might come up, or for additional features we might want to add if time allows.
Week 12: Clean up the code, write documentation and prepare for integration

Deliverable #3: Working enhanced compiler, ready to integrate and able to deal with any kind of multiwords

Technical aspects

Multiwords can be split into 4 categories, and each category has to be dealt with separately.

Uninflected
- The compiler already deals with these just fine. Not much work to be done here.
Final inflection
- The compiler handles these well too.
Inner inflection
- Similar to final inflection, already taken care of in lttoolbox.
Multiple inflection ('force majeure'/'forces majeures')
- This is where things get interesting. This is what the project will mostly focus on.
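
To make the four categories concrete, here is a minimal Python sketch that classifies a multiword by which of its tokens change form. The names `MultiwordKind` and `classify` are hypothetical and are not part of lttoolbox; the only input assumed is one boolean per token.

```python
from enum import Enum


class MultiwordKind(Enum):
    UNINFLECTED = "uninflected"
    FINAL_INFLECTION = "final inflection"
    INNER_INFLECTION = "inner inflection"
    MULTIPLE_INFLECTION = "multiple inflection"


def classify(inflects):
    """Classify a multiword from per-token flags.

    inflects: list of booleans, one per token, True if that token
    changes form when the multiword is inflected (an assumed input).
    """
    count = sum(inflects)
    if count == 0:
        return MultiwordKind.UNINFLECTED
    if count > 1:
        return MultiwordKind.MULTIPLE_INFLECTION
    # Exactly one token inflects: is it the last one or an earlier one?
    if inflects[-1]:
        return MultiwordKind.FINAL_INFLECTION
    return MultiwordKind.INNER_INFLECTION
```

For 'force majeure'/'forces majeures' both tokens change, so `classify([True, True])` gives `MultiwordKind.MULTIPLE_INFLECTION`, the category this project targets.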

Split the multiwords
This means that we'll split the multiword into tokens and deal with them separately. Otherwise, the process of recognising these multiwords would take much too long. In essence, the compiler needs to visit all analyses, which comes down to adding a pair of state queues to the current tokenisation scheme.
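
The idea of a pair of state queues can be sketched as follows. This is only an illustration in Python under my own assumptions, not the lttoolbox implementation: multiwords are stored in a token-level trie, and the helper names `build_trie` and `find_multiwords` are hypothetical. One queue holds the states alive before the current token, the other collects the states still alive after it, so every possible analysis is followed in parallel.

```python
from collections import deque

END = None  # key that marks a complete multiword in a trie node (assumption)


def build_trie(multiwords):
    """Build a token-level trie; each node is a dict keyed by token."""
    root = {}
    for multiword in multiwords:
        node = root
        for token in multiword.split():
            node = node.setdefault(token, {})
        node[END] = multiword
    return root


def find_multiwords(tokens, root):
    """Return (start_index, multiword) for every multiword found in tokens."""
    matches = []
    current = deque()  # states alive before reading the current token
    for i, token in enumerate(tokens):
        upcoming = deque()  # states still alive after reading it
        current.append((i, root))  # a multiword may also start here
        while current:
            start, node = current.popleft()
            child = node.get(token)
            if child is None:
                continue  # this analysis dies on the current token
            if END in child:
                matches.append((start, child[END]))
            upcoming.append((start, child))
        current = upcoming
    return matches
```

Because finished states stay in the queue, overlapping candidates such as a shorter and a longer multiword starting at the same token are both reported.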

Generating multiwords requires building them from a template, using the forms from the dictionary. This means we'll have the individual entries in the dictionary, as well as LR forms.
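
A rough Python sketch of template-based generation, assuming a template is a list of (lemma, inflects) pairs and the dictionary forms are available as a mapping from (lemma, tag) to surface form. The `FORMS` table, the tag names `sg`/`pl` and the function `expand_template` are illustrative assumptions, not the actual dictionary format.

```python
# Forms of the individual entries, as they might be read from the dictionary.
# The layout (lemma, tag) -> surface form is an assumption for illustration.
FORMS = {
    ("force", "sg"): "force",
    ("force", "pl"): "forces",
    ("majeure", "sg"): "majeure",
    ("majeure", "pl"): "majeures",
}


def expand_template(template, forms, tags):
    """Generate one multiword form per tag.

    template: list of (lemma, inflects) pairs; tokens with inflects=False
    are copied as they are, the others are looked up in forms.
    """
    result = {}
    for tag in tags:
        words = []
        for lemma, inflects in template:
            words.append(forms[(lemma, tag)] if inflects else lemma)
        result[tag] = " ".join(words)
    return result
```

With the template `[("force", True), ("majeure", True)]` and the tags `["sg", "pl"]`, this yields 'force majeure' and 'forces majeures', the multiple-inflection example used in this proposal.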

There is more than one way to get the forms, but the most reliable one seems to be a double compilation of entries: once as LR or RL, and once as in-memory-only analysis-as-generation.
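
One possible reading of the double compilation, sketched in Python. The entry layout (surface, lemma, tag) and both function names are assumptions made for this illustration; the real compiler works on transducers rather than plain dictionaries.

```python
def compile_lr(entries):
    """First pass (LR): map each surface form to its analyses."""
    analyser = {}
    for surface, lemma, tag in entries:
        analyser.setdefault(surface, []).append((lemma, tag))
    return analyser


def compile_analysis_as_generation(entries):
    """Second pass, kept in memory only: the same entries read in the
    opposite direction, so an analysis can be turned back into a form."""
    return {(lemma, tag): surface for surface, lemma, tag in entries}


# Hypothetical entries for the components of 'force majeure'.
ENTRIES = [
    ("force", "force", "sg"),
    ("forces", "force", "pl"),
    ("majeure", "majeure", "sg"),
    ("majeures", "majeure", "pl"),
]
```

The first structure answers "what is this form?", while the second answers "which form realises this analysis?", which is the lookup needed to fill in a multiword template.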

Skills and qualifications

I am a Computer Science student at the University of Bucharest. I have participated in Google Code-In 3 years in a row, winning the Grand Prize once. I have programmed for several open-source organisations, such as GNOME, KDE and Sahana, as well as Apertium. I am also employed as a software developer at a local company. I have programmed in C, C++, C#, Java, Python, PHP and JavaScript. I am also fairly experienced with XML, having worked on tasks involving it on several occasions, both within Apertium and outside it.

Although I am already familiar with them, I think it's worth mentioning that I am studying both C++ and Formal Languages this year, so I expect no real issues implementing my ideas.

Having worked here before as a student participating in GCI, I strongly believe I learned a lot, and I feel that GSoC is the next step to take. It's like a GCI task, just much bigger and more interesting.

This is the coding challenge I have done. It's a simple C++ program that generates bigrams from an input string: [1]
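
The linked program is not reproduced here. Purely as an illustration of the task, and assuming word-level bigrams split on whitespace (the original may work differently), a Python sketch could look like this; `word_bigrams` is a hypothetical name.

```python
def word_bigrams(text):
    """Return the list of adjacent word pairs in text.

    Assumption: words are separated by whitespace; punctuation is not
    treated specially.
    """
    words = text.split()
    return list(zip(words, words[1:]))
```

For the input "the force majeure clause" this gives the pairs (the, force), (force, majeure) and (majeure, clause); an input with fewer than two words gives an empty list.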

Other plans for the summer

My time at the moment is divided between school and my part-time job (20 hrs/week). By the end of June, school will be over for the summer. Also, if I am accepted, I will pause or drop my job for the coding period, so I can be 100% dedicated to achieving my goals. This means that I will be available for work around 40 hrs/week, maybe even more if I encounter unexpected issues.

User:AlexMetalhead/Application GSoC 2014
