User:Quirille/GSOC proposal 2014

Contact information

Name: Krylov Kirill


IRC: quirille

Other contact information can be provided to the mentor.

Why is it you are interested in machine translation?

I am very interested in both linguistics and computer science, which are the main constituents of machine translation. At school I took in-depth courses in English and Russian for 10 years. They were among my favorite subjects, and I explored many linguistic questions (not only concerning Russian and English). Although at university I mostly study programming and computer science, I have kept up my passion for linguistics. I find the fields of natural language processing and machine translation very attractive and promising, and I want to specialize in them.

Why is it that you are interested in the Apertium project?

The Apertium project would give me the opportunity to work in the field of machine translation. In addition, Apertium is open source, which I find a very interesting approach to software development. Apertium also has many exciting tasks waiting to be implemented.

Which of the published tasks are you interested in? What do you plan to do?


Prototype recursive transfer implementations

Reasons why Google and Apertium should sponsor it

Currently, Apertium has problems with distantly related languages that require long-distance constituent reordering, because it can only do finite-state chunking. This task will enhance Apertium's transfer module by adding support for recursive transfer rules.

A description of how and who it will benefit in society

This task will give language pair developers the ability to write more advanced and compact transfer rules for the transfer stage. It may also benefit end users of Apertium, if the new rules improve translation quality for the language pairs that use them.

Work plan

Existing progress

Community bonding period (till May 18):

  • Review the literature on parsing in transfer-based MT (LALR(1) parsing)

Work Period (May 19 - August 18)

Week 1:

  • Develop a transfer rule format in XML that can express transfer operations as a grammar over patterns of lexical units (lemma/tag sequences), suitable for bottom-up parsing.
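
As an illustration only, a rule file in such a format might look like the sketch below. The element and attribute names (`rules`, `rule`, `lhs`, `order`, `item`, `match`) are hypothetical and are not part of any existing Apertium format; the actual format is what Week 1 is meant to design. The second rule refers to its own left-hand side, which is the kind of recursion that finite-state chunking cannot express.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: every element and attribute name here is an
# assumption made for illustration, not an existing Apertium format.
RULES_XML = """
<rules>
  <rule lhs="NP" order="1">
    <item match="n"/>
  </rule>
  <rule lhs="NP" order="2 1">
    <item match="adj"/>
    <item match="NP"/>
  </rule>
</rules>
"""


def load_rules(xml_text):
    """Return a list of (lhs, pattern, order) tuples read from the XML text."""
    rules = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        # The pattern is the sequence of categories the rule matches.
        pattern = tuple(item.get("match") for item in rule.findall("item"))
        # The order lists 1-based pattern positions in output order.
        order = tuple(int(i) for i in rule.get("order").split())
        rules.append((rule.get("lhs"), pattern, order))
    return rules


for lhs, pattern, order in load_rules(RULES_XML):
    print(lhs, "->", " ".join(pattern), "| output order:", order)
```

Keeping the output order as a list of positions is one possible design choice; it keeps reordering rules compact, which is the property the proposal wants to demonstrate.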

Week 2, 3, 4:

  • Write a Python prototype of an interpreter for the format, which applies the rules to the output of the lexical transfer module (biltrans).
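
A minimal sketch of what such an interpreter could look like is given below, assuming a simplified biltrans-style stream in which each lexical unit is written as `^source/target$`. It uses a naive greedy shift-reduce loop with no lookahead, which is much simpler than an LALR(1) parser. The rule set, the category names and the example sentence are invented for illustration and are not taken from any real language pair.

```python
import re

# One lexical unit in simplified biltrans-style output: ^source/target$
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")

# Hypothetical rules: (left-hand side, pattern, output order).
# Order entries are 1-based positions in the pattern.
RULES = [
    ("NP", ("n",), (1,)),
    ("NP", ("adj", "NP"), (2, 1)),            # recursive: noun phrase first
    ("S", ("NP", "vblex", "NP"), (1, 3, 2)),  # move the verb to the end
]


def read_units(stream):
    """Yield (category, [target unit]); the category is the first source tag.

    Text between lexical units (blanks) is ignored in this sketch.
    """
    for source, target in LU.findall(stream):
        tags = TAG.findall(source)
        yield (tags[0] if tags else "unknown", [target])


def reduce_once(stack, rules):
    """Apply the first rule whose pattern matches the top of the stack."""
    for lhs, pattern, order in rules:
        n = len(pattern)
        if len(stack) >= n and tuple(cat for cat, _ in stack[-n:]) == pattern:
            children = stack[-n:]
            del stack[-n:]
            # Concatenate the children's output in the order the rule gives.
            output = [lu for i in order for lu in children[i - 1][1]]
            stack.append((lhs, output))
            return True
    return False


def transfer(stream, rules=RULES):
    """Shift each lexical unit, then reduce for as long as any rule matches."""
    stack = []
    for unit in read_units(stream):
        stack.append(unit)
        while reduce_once(stack, rules):
            pass
    return " ".join("^%s$" % lu for _, out in stack for lu in out)


if __name__ == "__main__":
    example = ("^cat<n><sg>/gato<n><m><sg>$ ^see<vblex><pres>/ver<vblex><pres>$ "
               "^big<adj>/grande<adj>$ ^house<n><sg>/casa<n><f><sg>$")
    print(transfer(example))
```

Because the `adj NP` rule is recursive, any number of adjectives before a noun is reordered by the same rule, and the sentence-level rule then moves the whole object phrase, however long it is.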

Week 5:

  • Write documentation

Deliverable #1: new transfer rule format and interpreter prototype.

Week 6, 7 (Week 6 - Midterm: June 23 - June 27):

  • Write a number of transfer rules in this format for translating between a language pair.
  • Reimplement an existing language pair in trunk using the new format. This will involve rewriting the existing rules to be compatible with the new format.

Week 8:

  • Integrate the new rules into the existing pair.

Week 9:

  • Compare with existing rules.
  • Evaluate the improvement in translation quality compared with the existing transfer rules.
  • Show that shorter rule files can be written using the new formalism with the same or better results.
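
The proposal does not fix an evaluation metric. As one possible choice, the sketch below computes word error rate (WER), that is, the word-level edit distance between a system translation and a reference translation, divided by the reference length. The function name and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the big house"
    print(word_error_rate(reference, "the house big"))
    print(word_error_rate(reference, "the big house"))
```

Running the same function over the output of the old and the new transfer rules on the same test corpus would give two comparable figures, where lower is better.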

Deliverable #2: reimplemented transfer rules for one language pair.

Week 10, 11, 12:

  • Implement a final version of the parser in C++.

Week 13:

  • Write documentation

Deliverable #3: final interpreter implementation.

Project completion (August 11 (Suggested 'pencils down' date) - August 18):

  • Tidying up

Final evaluation (August 18 (Firm 'pencils down' date) - August 22)

Current progress is documented here:

List your skills and give evidence of your qualifications

I am in the 5th (final) year of the spetsialist (специалист, a Russian degree between Bachelor's and Master's) degree in Computer Science and Engineering at the Institute of Management and Information Technologies of the Saint Petersburg State Polytechnical University (Russia).

Programming skills: C, C++, C# and .NET, Matlab, Python, Linux shell script, PHP.

At the institute I took courses in Machine Learning and Automata Theory, and this knowledge will help me complete the project. I have also done some NLP-related work during my studies. For my Machine Learning course paper, I wrote a text attribution program in Matlab based on the bag-of-words approach and machine learning algorithms (using the libraries randomforest-matlab by Abhishek Jaiantilal and libsvm). For my Machine Vision course paper, I wrote a C# program for image classification based on the bag-of-words model and the SVM algorithm (using EmguCV, a C# wrapper for OpenCV).

I also worked as a tester at Mallenom Systems, a company attached to our institute, on 2 projects: the traffic simulation system Road Manager and the software suite Automated rolling stock car identification system (ARSCIS). This job gave me teamwork skills and experience with git, and helped me to look at the programmers’ job “from the other side of the barricade”.

Last year I worked with Apertium and participated in the development of a Ukrainian-Russian machine translation system. There I gained an understanding of how Apertium works, in particular its modules: morphological analyser, morphological disambiguation with CG, POS tagger, lexical selection, lexical transfer, structural transfer, and morphological generator.

List any non-Summer-of-Code plans you have for the Summer

I have no non-GSoC plans for the summer and can contribute 30 to 40 hours a week.

