Difference between revisions of "Talk:Apertium-quality"

Revision as of 17:10, 1 May 2011


Week 1 — 25th April

  • Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
  • Emailed Francis a written proof of setuptools adequately meeting expectations and requirements.

Week 2 — 2nd May

  • Converted LaTeX source to MediaWiki format, and placed it below this section for annotation.


Interest in Machine Translation

I have had a strong interest in languages since high school, especially in the grammatical structures that distinguish them, and a keen interest in etymology since the age of 12. One major reason I enjoy the concept of machine translation is that it is conceptually a zero-cost translator: once implemented, it can translate an unlimited number of documents, possibly near-perfectly, without paying a translator a cent.

This interests me the most for countries such as the Philippines, where access to education can be limited. Having free translation software available would make it much easier to gain access to material that has never been translated into a region's native tongue.

Aren't expectations about MT a bit too optimistic here? Fortunately, it does not affect the quality of the proposal --Mlforcada 17:10, 1 May 2011 (UTC)

Interest in Apertium

Having seen the software get very close to a near-perfect translation, I do not doubt its ability to meet its stated goal.

Tasks of Interest

In order of preference:

  1. Quality control framework
  2. Adopt a language pair (my tgl-ceb dictionary)

Quality Control Framework

I would implement this in Python, the language I am most comfortable with. Given the platform precedence of Python over PHP and its vast array of modules available for linguistic purposes, I think it is the best choice for implementing a QA framework.

I propose the creation of a Python module entitled: 'Apertium Quality Assurance', with specific submodules covering each objective of regression testing, corpus generation, coverage testing, dictionary statistics and average ambiguity. This module will allow a programmer to use specific portions of the code for their own projects, while also allowing the dictionary developer access to these modules through command line front-ends not unlike the current Apertium tools.

The module can be run standalone or installed into the Python library directory, so it can easily be used from other applications or through simple standalone frontends.

Core ApertiumQA module

The core module will be responsible for the cross-module functionality, such as logging statistics and generation of graphs.

Graphing will allow any member of the Apertium team to gauge the development and success rate of a given dictionary by looking at a few statistics presented visually, giving clear evidence of progress.

Statistics will be stored in either an SQLite database or an XML file. Examples of statistics that will be stored include:

  1. Date
  2. SVN revision
  3. Regression test error rate
  4. Coverage level
  5. Test success rate

Further statistics will be added as development of the library continues and it becomes clearer which figures are statistically valuable. The frontend for the module will generate an HTML file with all graphics, legends and required data in an easy-to-parse format.
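As a minimal sketch of the SQLite option, the statistics listed above could be stored one row per run. The table name, column names and function names here are assumptions made for illustration, not part of the proposal.

```python
import sqlite3


def open_stats_db(path=":memory:"):
    """Open (or create) a statistics database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS stats (
               date TEXT NOT NULL,
               svn_revision INTEGER NOT NULL,
               regression_error_rate REAL,
               coverage REAL,
               test_success_rate REAL
           )"""
    )
    return conn


def record_stats(conn, date, svn_revision, regression_error_rate,
                 coverage, test_success_rate):
    """Store the statistics gathered for one run."""
    conn.execute(
        "INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
        (date, svn_revision, regression_error_rate, coverage,
         test_success_rate),
    )
    conn.commit()


def coverage_history(conn):
    """Return (revision, coverage) pairs in revision order, ready for graphing."""
    cursor = conn.execute(
        "SELECT svn_revision, coverage FROM stats ORDER BY svn_revision"
    )
    return cursor.fetchall()
```

A graphing frontend could then plot the pairs returned by coverage_history directly against revision number.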

Regression Testing

This submodule will use YAML, JSON, XML or CSV configuration to test for regressions and check specific coverage situations.

An example of English->French in YAML:

Pronoun check:
    I eat: je mange
    you eat: [tu manges, vous mangez]
    he eats: il mange
    she eats: elle mange
    one eats: on mange
    we eat: [nous mangeons, on mange]
    they eat: [ils mangent, elles mangent]

As you can see, this syntax lets non-programmers define tests easily, without preventing programmers and others from using a syntax they are more comfortable with, such as JSON, XML or CSV.

The brackets allow multiple correct responses for a given test item.

The code would simply run through the tests as required, reporting passes and failures depending on the settings selected (not unlike HfstTester.py). It will also allow automatic reversal of the tests, so that a French->English test can be run from the same configuration without rewriting the tests in reverse.
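The test loop and the automatic reversal could be sketched roughly as follows. This assumes the configuration has already been loaded into a dictionary (for instance from JSON), and that translate is a hypothetical callable wrapping whatever translation pipeline is under test. All function names are illustrative only.

```python
def normalise(tests):
    """Make every expected value a list, so single answers and
    bracketed alternatives are handled the same way."""
    return {
        source: (expected if isinstance(expected, list) else [expected])
        for source, expected in tests.items()
    }


def reverse_tests(tests):
    """Swap direction: each accepted output becomes an input, and the
    inputs that produced it become its accepted outputs."""
    reversed_tests = {}
    for source, accepted in normalise(tests).items():
        for target in accepted:
            reversed_tests.setdefault(target, []).append(source)
    return reversed_tests


def run_tests(tests, translate):
    """Return a list of (input, accepted outputs, actual output) failures."""
    failures = []
    for source, accepted in normalise(tests).items():
        output = translate(source)
        if output not in accepted:
            failures.append((source, accepted, output))
    return failures


# Usage with a stand-in translator (a plain dictionary lookup).
pronoun_check = {
    "I eat": "je mange",
    "one eats": "on mange",
    "we eat": ["nous mangeons", "on mange"],
}
fake_translate = {"I eat": "je mange", "one eats": "on mange",
                  "we eat": "nous mangeon"}.get
for source, accepted, output in run_tests(pronoun_check, fake_translate):
    print("FAIL: %r gave %r, expected one of %r" % (source, output, accepted))
```

Note that reversal can make a test item ambiguous: "on mange" ends up accepting both "one eats" and "we eat", which is why the reversed form always stores a list.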

Corpus Generator

This submodule will implement basic functionality for generating corpora from any given text, keeping only lines that meet given heuristic criteria. Examples of such user-configurable criteria include: sentence length, acceptable number of punctuation symbols, acceptable number of numerals, excessive proper nouns, excessive English or other foreign-language terms per sentence, etc.

An example output line would be:

1. The quick brown fox jumps over the lazy dog.

As you can see, this is not unlike the corpus that can be found in en-eo.
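A minimal sketch of such a filter, covering only the length, punctuation and numeral criteria, might look like this. The parameter names and default thresholds are assumptions chosen for illustration, and the numbered output simply mirrors the example line above.

```python
import string


def acceptable(sentence, min_words=3, max_words=40, max_punct=5, max_digits=4):
    """Apply simple heuristics to decide whether a sentence is kept."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False
    punct = sum(1 for ch in sentence if ch in string.punctuation)
    digits = sum(1 for ch in sentence if ch.isdigit())
    return punct <= max_punct and digits <= max_digits


def generate_corpus(sentences, **criteria):
    """Filter the sentences and number the ones that are kept."""
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        if acceptable(sentence, **criteria):
            kept.append(sentence)
    return ["%d. %s" % (i, s) for i, s in enumerate(kept, start=1)]


candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "Hi.",
    "In 1999, 2000, 2001 (and 2002) prices rose 12345%.",
]
for line in generate_corpus(candidates):
    print(line)
```

Criteria such as excessive proper nouns or foreign-language terms would need language-specific resources and are left out of this sketch.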

A specific subclass of this module will be created for generating Wikipedia-based corpora, because MediaWiki markup differs significantly from plain text and the dumps require parsing XML in considerable depth. It will make use of a modified version of esperantowiki-xml2txt.py for parsing the MediaWiki markup.

Corpus Testing

This submodule will be not unlike testcorpus_en-eo.sh, which can be found in apertium-eo-en, except that it will be reimplemented in Python and will be easily user-configurable.

Coverage Testing

This submodule will be not unlike corpus-stat-en-eo.sh: it will give a count of tokenised words, a count of unknown words and a list of unknown words, and calculate the coverage.
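The calculation itself could be sketched as below, assuming coverage is defined as the fraction of tokens that the dictionary recognises. The is_known predicate is a hypothetical stand-in for a real dictionary lookup, and the report keys are illustrative names.

```python
from collections import Counter


def coverage_report(tokens, is_known):
    """Summarise how many tokens are recognised by the dictionary."""
    unknown = Counter(token for token in tokens if not is_known(token))
    total = len(tokens)
    unknown_total = sum(unknown.values())
    coverage = (total - unknown_total) / total if total else 0.0
    return {
        "tokens": total,
        "unknown": unknown_total,
        # Most frequent unknown words first: the most useful to add next.
        "unknown_words": unknown.most_common(),
        "coverage": coverage,
    }


# Usage with a toy word list standing in for the dictionary.
vocabulary = {"the", "cat", "sat", "on"}
report = coverage_report(
    ["the", "cat", "sat", "on", "the", "blorp", "blorp"],
    vocabulary.__contains__,
)
print("coverage: %.1f%%" % (100 * report["coverage"]))
```

Sorting unknown words by frequency is a design choice of this sketch: it points the dictionary developer at the entries that would raise coverage the most.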

Average Ambiguity

This submodule will use Apertium tools to find ambiguity in output, compute the average, and then output a list sorted from highest to lowest ambiguity.
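One way this could work is sketched below, assuming the analyser output is in the Apertium stream format, where each lexical unit looks like ^surface/analysis1/analysis2$ and unknown words are marked with an asterisk. The function names are illustrative, and the simple regular expression ignores escaped characters, so this is a simplification rather than a full parser.

```python
import re

# Matches ^surface/analyses$ ; escaped characters are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def ambiguity_counts(analysed_text):
    """Return (surface form, number of analyses) for each known token."""
    counts = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if analyses.startswith("*"):
            continue  # unknown word: no analyses to count
        counts.append((surface, len(analyses.split("/"))))
    return counts


def average_ambiguity(counts):
    """Mean number of analyses per known token."""
    if not counts:
        return 0.0
    return sum(n for _, n in counts) / len(counts)


def most_ambiguous(counts):
    """Distinct surface forms, sorted from highest to lowest ambiguity."""
    return sorted(dict(counts).items(), key=lambda item: item[1], reverse=True)


sample = ("^the/the<det><def><sp>$ "
          "^saw/saw<n><sg>/see<vblex><past>$ "
          "^blorp/*blorp$")
counts = ambiguity_counts(sample)
print(average_ambiguity(counts))
print(most_ambiguous(counts))
```

Unknown words are skipped here so that they do not drag the average down; whether they should count is a decision the real submodule would need to make.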

Possible Timetable

  • Week 1 -- 4: implement core methods of all modules listed above
  • Deliverable 1: an alpha/beta-quality QA library and frontend tools
  • Week 5 -- 8: ensure all stubs and TODOs are completed, all goals met
  • Deliverable 2: an RC-quality QA library and frontend tools
  • Week 9 -- 12: real world testing, extra features beyond core specification
  • Deliverable 3: a production-quality QA library and frontend tools

Important Dates:

  • June 10: Autumn semester ends
  • June 11: Exam session commences
  • July 1: Exam session ends
  • August 1: Spring semester begins

The beginning of GSoC overlaps with the final weeks of my Autumn semester. However, this will be no issue, as I currently attend University only twice a week and have plenty of free time to spend working on GSoC. I will have two examinations during the exam period: one Java exam and one Mathematics exam. During the week leading up to the Mathematics examination, I may be a little difficult to contact and may spend very little time on GSoC. I will make up for this by working much harder in the later weeks.

As for commercial work, I am a sysadmin and I lecture high school teachers on how best to use their equipment, so I may be working a few hours a week, although this should not clash with GSoC.

As you can see, my 'summer' ends approximately 3 weeks before the completion of GSoC. At this point I cannot say what my University timetable will look like, but I can assure you that even if I have no time during the week to work on GSoC, I will make full use of my weekends to complete whichever parts of the project remain incomplete. As my proposed timetable shows, it is unlikely that much work will be left for the final 4 weeks.


I have been studying and using Python constantly for at least three years, dabbling with open source projects all the way through. I have submitted patches to the GemRB project, and I worked with debian-installer in order to create one of the first working installers for the EeePC 701, which lacks a CD-ROM drive. I have also previously worked with Apertium, for example on the creation of apertium-verbconj and on my adoption of the Tagalog-Cebuano language pair.

In order to conjugate Tagalog verbs correctly, we had to use HFST, as Apertium does not support infixation well. As it turns out, twolc is one of the most painful syntaxes I have ever experienced in my life, so I attempted to implement infixing using nothing but lexc. I completed this task through the liberal use of flag diacritics.

For testing my HFST dictionary, I implemented HfstTester.py, which is being used by divvun.no; Sjur Moshagen is constantly in contact with me, requesting new features and recommending changes to my code. They have made their own mirror of my code, and there is even a whole page explaining its use and listing a feature wishlist. The skills I learnt implementing this and working with Sjur Moshagen can easily be transferred to a Quality Assurance framework for Apertium.

I have implemented a test application that creates a semi-working corpus from a Wikipedia dump. First, you download the Wikipedia dump you wish to use. You then simply run the script with the first parameter being the dictionary and the second one being your output file. An optional third parameter limits the number of lines of output. The application makes use of NLTK to split the text into sentences, and uses a very rudimentary MediaWiki syntax stripper that needs much more work to be considered anything more than test code. The output quality for languages such as Norwegian or English is very good compared with Wikipedias such as the Tagalog one, where the general article quality is much lower both in content and in (ab)use of syntax.

As I study ICT Engineering and International Studies (BEng DipEngPrac BArts) at the University of Technology, Sydney, I have a broad range of subjects at my disposal. A unit called 'Introduction to Digital Systems' went into great detail about the mathematics behind finite state systems, and we were required to learn PIC assembler. I have found these two skills invaluable in giving me a firm grounding in how a language pair works, and they gave me a clear idea of how I would implement verb conjugation for Tagalog using lexc.

In order of strength, programming languages I can use are: Python, Java, Vala, C, C++ and a smattering of interpreted languages like Perl and PHP.


I look forward to correspondence with the larger Apertium team and hope that this year I may make the most of Google Summer of Code in assisting Apertium with their goals.
