User:N0nick/Application

Revision as of 20:05, 4 June 2011 by N0nick (talk | contribs)

Apertium Summer of Code application: 
New Maltese-Hebrew language pair

About Me

Name: Sagie Maoz

E-mail address: sagie@maoz.info

Other contact information:

  • Phone: +972 (52) 834-3339
  • IRC handle (freenode): n0nick
  • Jabber: sagiem@gmail.com
  • Alternative e-mail address: n0nick@php.net

Why is it you are interested in machine translation?

I see translation in general as a crucial interface between cultures, bringing knowledge and insights from one part of the world to another. Even in today’s highly connected online world, the spread of information is still limited by the language barrier.
Machine translation has the power to make obtaining knowledge simpler and more accessible. Living in a bilingual country and working in yet another language (English), I use MT solutions on a daily basis. Also, being a programmer and a computer science student, I find the theory behind the technology fascinating.

Why is it that you are interested in the Apertium project?

I’m a big believer in open-source software and I have been contributing to such projects for years. I’m particularly excited about the spirit of openness and freedom behind open source.
This is especially true for an MT project such as Apertium, which allows people worldwide to use the translation features and learn about the process and about new languages.
In my very short experience working and developing with the Apertium team, I found a warm and friendly community and a very interesting development workflow.

Which of the published tasks are you interested in? What do you plan to do?

Title

Apertium-mt-he
New Maltese-Hebrew language pair, providing unidirectional translation of Maltese → Hebrew.

Reasons why Google and Apertium should sponsor it

Apertium currently has no language pair for either Hebrew or Maltese.
While Google Translate does support Maltese-Hebrew translation, its quality is often far from satisfactory.
I tested a sample of Maltese proverbs[1] and found consistent failures to translate grammatical features such as tense, number and negation.

For example:

Ġurdien xiħ ma jiekolx ġobon
mouse. old. not. eat. cheese
“An old mouse does not eat cheese”

was translated into:

עכבר רגיל לאכול גבינה
axbar. ragil. le’exol. gvina
mouse. used. to-eat. cheese

instead of:

עכבר זקן לא אוכל גבינה
axbar. zaken. lo. oxel. gvina
mouse. old. not. eat. cheese

(Although the proverb’s actual meaning didn’t transfer, the literal translation preserves the metaphor).
My opinion is that a rule-based MT system based on Apertium could produce far better results using insights about the syntax and grammar of Maltese and Hebrew.

Furthermore, Apertium has no release-quality language pair dealing with any Semitic language.
Working on such a pair would most likely point to new insights and understandings about the project’s architecture and its ability to support other Semitic languages. Thus, the work would result in easier extension to support this language family, which is fairly important for translation projects.

How and who it will benefit in society

Maltese and Hebrew share a very similar history[2]; in particular, both were artificially revived towards the end of the 19th century.
It seems very likely that a translation project such as the one proposed would yield interesting insights about the evolution of both languages.
In addition, as previously stated, I believe that an important advantage of such a project is extending Apertium to work well with Semitic languages. This would allow, in the long run, developing pairs for languages with very large speaker populations (Arabic, Hebrew and Tigrigna alone account for approx. 232 million speakers[3]).

Work and research done, resources

New language pair prototype

I have spent several hours experimenting with Apertium and have completed the New Language Pair HOWTO.
I have a working prototype for Maltese-Hebrew translation of sentences of the form “I see a gramophone” (with plural forms for either the subject or the object). It is committed to the Incubator under “apertium-mt-he”[4].

Morphological Analysers and Corpora

I have looked into obtaining access to a non-probabilistic morphological analyser for Hebrew.
I learned that hspell[5], a popular open-source Hebrew spell-checker (GPL), has a good morphological analyser feature. Apertium mentor Francis Tyers and I experimented with using it along with Apertium tools[6] to generate an extensive Hebrew dictionary. The results were of good quality and can certainly be a basis for the language pair work.
Alternatively, I also found a dedicated analyser product[7], done at the Technion CS school. It is licensed under GPL, but surprisingly not available for download. I’ve been in touch with related faculty members to get access to it in time for the project.
Both projects have published large Hebrew corpora gathered from various sources.

As for Maltese, a notable effort is the MaltiLex project[8][9][10] led by Mr. M. Rosner of the University of Malta. Judging by published articles, it might be a good source for work done on text archives and a morphological analyser. I will try reaching out to the project team to ask for access.
I should, however, assume that such resources will not be available (other than some helpful published articles) and plan to work without a Maltese morphological analyser.

Parallel Corpora

A broad and well-formed source of parallel text in both languages would be needed for understanding syntax and morphology translation patterns and for automatic testing. During the community bonding period I will look for a good source of parallel corpora.
Mentor Francis Tyers suggested checking bible translations and made a sample of an aligned chapter text[11]. Both the Hebrew and the Maltese texts are available fully and are well-formatted (I would have to do some minor manipulations to the Maltese texts).
The results presented in the sample alignment are promising. I found many similarities between the languages with regard to word order, part-of-speech forms, morpheme behavior and even the sound of words.
Further research and work on such corpora will be done in the community bonding period and in Week 5.

Grammar

I am a native speaker of Hebrew and have access to the Hebrew language library at my university, with enough grammar resources in case I require such reference.

I have already obtained (digitally or otherwise) the following Maltese grammar books:

  • J. Aquilina (1999), Teach Yourself Maltese Complete Course. [15]
  • J. Vella (2004), Learn Maltese: Why Not? [16]
  • A. Borg (1997), Maltese (Descriptive Grammars). [17]

I have confirmed other resources are also available in the university library.

Computational Linguistics resources

  • D. Dannélls, J. Camilleri, Verb Morphology of Hebrew and Maltese - Towards an Open Source Type Theoretical Resource Grammar in GF [18].
  • R. Hoberman, M. Aronoff (2003), The verbal morphology of Maltese: From Semitic to Romance. [19]
  • S. Wintner, S. Yona (2007), A finite-state morphological grammar of Hebrew. [20]
  • A. Itai, S. Wintner, S. Yona (2006), A Computational Lexicon of Contemporary Hebrew. [21]
  • S. Wintner, S. Yona (2004), A finite-state based morphological analyzer for Hebrew. [22]
  • S. Wintner, S. Yona (2003), Resources for Processing Hebrew. [23]
  • D. Jurafsky, J. Martin (2009), Speech and Language Processing. [24]

Work plan

I have listed below an estimated work plan, based on plans I found of other language pairs.
Due to scheduled exams (see “Non-Summer of Code plans” below), the first few weeks would probably need to be spaced out more, depending on how flexible the deliverable dates can be.

Community Bonding Period

  • Set up working environment (mostly done, but tweaks and preparations probably needed).
  • Become familiar with the various Apertium tools needed for development.
  • Gather proper grammar resources for Hebrew and Maltese.
  • Study Maltese grammar rules thoroughly.
  • Become familiar with using a morphological analyser for Hebrew (and Maltese, if one is available).
  • Get monolingual and multilingual aligned corpora for analysis.
  • Prepare a list of words sorted by frequency of occurrence for Maltese.
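
A frequency list like the one in the last item can be built with a few lines of Python. The following is only a minimal sketch: the tokenizer pattern and the function name `frequency_list` are assumptions made for illustration, and the sample text is a toy input built from the proverb quoted earlier, not a real corpus.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by a hyphen or
# apostrophe (e.g. article-noun forms), with an optional trailing apostrophe.
# Known limitation: a closing single quote right after a word is absorbed.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*['’]?")


def frequency_list(text):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    counts = Counter(token.lower() for token in TOKEN_RE.findall(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy input for illustration only.
sample = "Ġurdien xiħ ma jiekolx ġobon. Ġurdien ma jiekolx."
for word, count in frequency_list(sample):
    print(word, count)
```

Sorting by descending count keeps the most common words at the top, which is the order in which they would be added to the monolingual dictionaries.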

Week 1

  • Write test scripts.
  • Add missing closed-class words.

Week 2

  • Work on Hebrew monodix, adding open-class words according to frequency list.

Week 3

  • Work on Maltese monodix, adding open-class words according to frequency list.

Week 4

  • Complete work on monolingual dictionaries, adding missing words and handling exceptions.

Deliverable 1: Desirable coverage for both dictionaries.
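
One way to put a number on dictionary coverage is naive coverage: the share of corpus tokens that receive at least one analysis. The sketch below assumes the corpus has already been run through a morphological analyser that emits the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading asterisk. The function name and the sample analyses are hypothetical, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT_RE = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Fraction of lexical units whose analysis is not marked unknown ('*')."""
    units = LEXICAL_UNIT_RE.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output (made up for this sketch): two known words
# and one unknown word.
analysed = "^ġurdien/ġurdien<n><m><sg>$ ^xiħ/*xiħ$ ^ġobon/ġobon<n><m><sg>$"
print(f"Naive coverage: {naive_coverage(analysed):.2%}")
```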

Week 5

  • Generate translational data from available Maltese dictionary.
  • Research getting translational data using parallel corpora.
  • Add basic transfer rules for tests.

Week 6

  • Work on bilingual dictionary.

Week 7

  • Work on bilingual dictionary.

Week 8

  • Prepare a list of frequent word sequences.
  • Bring dictionaries to a consistent state (successful vocabulary tests).

Deliverable 2: Bilingual dictionary stable.

Week 9

  • Add multi-words with translations according to list.
  • Generate sample tagged training corpora.
  • Study word order rules of both languages.

Week 10

  • Work on tag definition files.
  • Carry out supervised tagger training for both languages.
  • Work on transfer rules.

Week 11

  • Work on transfer rules.
  • Carry out thorough regression tests.
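
A regression test of this kind can be as simple as a list of source sentences paired with their expected translations. The following is a minimal sketch under stated assumptions: `run_regression` and `stub_translate` are hypothetical names, and the stub stands in for the real translator so that the example runs without an installed language pair.

```python
def run_regression(cases, translate):
    """Return (source, expected, actual) for every case whose output differs."""
    failures = []
    for source, expected in cases:
        actual = translate(source).strip()
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


# Expected output taken from the proverb example earlier in this application.
CASES = [
    ("Ġurdien xiħ ma jiekolx ġobon", "עכבר זקן לא אוכל גבינה"),
]


def stub_translate(text):
    # Hypothetical stand-in for the real translator. A real run might instead
    # pipe the text through the `apertium` command with this pair's mode
    # (the mode name is an assumption and is not used here).
    known = {"Ġurdien xiħ ma jiekolx ġobon": "עכבר זקן לא אוכל גבינה"}
    return known.get(text, text)


failures = run_regression(CASES, stub_translate)
print(f"{len(CASES) - len(failures)} passed, {len(failures)} failed")
```

Passing the translator in as a function keeps the test list independent of how the pair is invoked, so the same cases can be reused as the pipeline changes.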

Week 12

  • Manually check dictionaries to spot possible errors.
  • Clean up and evaluate the results.

Project completed.

Skills and qualifications

I am currently in my first year of a Bachelor of Science degree in Computer Science and Linguistics at Tel-Aviv University, Israel, participating in a pilot CL program.
By the time the Summer of Code program starts, I will be done with an advanced CL seminar, taught by Dr. Roni Katzir, head of the CL program. The seminar is based on Daniel Jurafsky’s book Speech and Language Processing[12].
I have consulted Dr. Katzir and according to him, the seminar would provide enough background information for the project.

I programmed professionally for 5 years before attending TAU[13].
Most of my work has been the development of web-based software using open-source technologies and languages such as PHP, Ruby on Rails, Python and Django, and various JavaScript frameworks.
I am also familiar with C/C++ and Java, and with basic XML manipulation and handling.

As mentioned, I have previously contributed to open-source projects such as PHP, Mozilla and WordPress. I have also published some open-source projects of my own, in particular plug-ins for WordPress and jQuery.
I have also contributed numerous translations to Hebrew for open-source projects, among others documentation for PHP, The Open Source Definition for OSI, and texts for WordPress and PHPMailer.

Non-Summer of Code plans

Summer of Code is my main plan for the summer. Unfortunately, however, the semester at TAU ends on June 10th, and I have three final exams scheduled:

  • on June 19th
  • on June 29th
  • on July 5th

I will have to focus on these for a few days each. I am, however, very flexible with my time and will work hard nights and weekends to make up for that time. An additional week of work could also be agreed upon.

I am also currently employed in a part-time student job at a company which is aware of my application and will make any required accommodations to fit the schedule.
I am certain I will have more than 35 free hours per week to develop for Apertium.




Grazzi.
תודה.



Footnotes

  1. Wikiquote, Maltese proverbs. [1]
  2. N. Berdichevsky (1995), Maltese and Hebrew - Two Cases of Cultural Survival. [2]
  3. SIL International (2009), 2009 Ethnologue. [3]
  4. [4]
  5. [5]
  6. Speling tools [6], Paradigm chopper [7]
  7. [8]
  8. M. Rosner, J. Caruana, R. Fabri, Maltilex: A Computational Lexicon for Maltese. [9]
  9. M. Rosner, J. Caruana, R. Fabri, Linguistics and Computational Aspects of Maltilex. [10]
  10. [11]
  11. Aligned text: first lines of Genesis 1. [12]
  12. D. Jurafsky (2009), Speech and Language Processing. [13]
  13. My Curriculum Vitae at LinkedIn, [14]