Difference between revisions of "Talk:Apertium-quality"

From Apertium
 
(28 intermediate revisions by 5 users not shown)
Line 1: Line 1:
  +
= Menu =
  +
==== Getting Started ====
  +
* [[Quality_control_framework/Installation|Installation]]
  +
* [[Quality_control_framework/Usage|Usage]]
  +
  +
==== Technical Documentation ====
  +
* [[Quality_control_framework/Proposal|Proposal]]
  +
* [[Quality_control_framework/XML_Schema|XML Schema]]
  +
 
= Notes =
 
  +
== Community Bonding Period ==
 
=== Week 1 — 25th April ===
 
 
* Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
 
Line 6: Line 16:
 
=== Week 2 — 2nd May ===
 
 
* Converted LaTeX source to Wikimedia format, and placed below this section for annotation.
 
  +
* Completed example regtest.py
  +
* Added Installation and Usage pages, uploaded initial files.
   
  +
=== Week 3 — 9th May ===
= Proposal =
 
  +
* Fixed a Python regression-related bug in regtest.py
==Interest in Machine Translation==
 
  +
* Fixed a personal regression in setup.py
Since I was a high school student, I had a strong interest in languages, especially the grammatical structures that separated the languages, and I have had a keen interest in etymology from the age of 12. One major reason I enjoy the concept of machine translation is that it is conceptually a zero-cost translator. Once implemented, you can translate an unlimited number of documents without having to pay a translator a cent to get a possibly near perfect translation.
 
  +
* Plan to add autogen.sh for config
 
  +
* Consider using virtualenv for rootless installations
This interests me the most for countries such as the Philippines, where access to education can be limited, so having free translation software available would make it much easier to gain access to material that has never been translated into one region's native tongue.
 
  +
* Fixed installation instructions
 
  +
* SVN and git now synched
: Aren't expectations about MT a bit too optimistic here? Fortunately, it does not affect the quality of the proposal --[[User:Mlforcada|Mlforcada]] 17:10, 1 May 2011 (UTC)
 
 
==Interest in Apertium==
 
Having seen the ability of the software to get very close to a near perfect translation, I do not doubt the ability of the software to meet its given goal.
 
 
: ''Non sequitur'': quality control may be excellent independently of actual translation quality --[[User:Mlforcada|Mlforcada]] 17:12, 1 May 2011 (UTC)
 
 
==Tasks of Interest==
 
In order of preference:
 
 
#Quality control framework
 
::: This is the one I will finally be mentoring --[[User:Mlforcada|Mlforcada]] 13:01, 6 May 2011 (UTC)
 
#Adopt a language pair (my tgl-ceb dictionary)
 
 
===Quality Control Framework===
 
The language I would use to implement this is Python, as I am most comfortable using this language, and due to the platform precedence of Python over PHP and its vast array of modules available for linguistic purposes, I think it would be the best choice for the implementation of a QA framework.
 
 
I propose the creation of a Python module entitled: 'Apertium Quality Assurance', with specific submodules covering each objective of regression testing, corpus generation, coverage testing, dictionary statistics and average ambiguity. This module will allow a programmer to use specific portions of the code for their own projects, while also allowing the dictionary developer access to these modules through command line front-ends not unlike the current Apertium tools.
 
 
The module will be able to be run standalone, or installed into the Python library directory, and will therefore be easily used with other applications or with simple standalone frontends.
 
 
====Core ApertiumQA module====
 
The core module will be responsible for the cross-module functionality, such as logging statistics and generation of graphs.
 
 
The use of graphing will allow any member of the Apertium team to gauge the development and success rate of a given dictionary by simply having a look at a few statistics in a nice, visual manner, giving clear evidence of development.
 
 
Statistics will be stored in either an sqlite database or an XML file. Examples of statistics that will be stored include:
 
 
#Date
 
#SVN revision
 
#Regression test error rate
 
#Coverage level
 
#Test success rate
 
 
Further statistics will be added as development of the library continues and as other statistically valuable measures become apparent. The frontend for the module will generate an HTML file with all graphics, legends and required data in an easy-to-parse format.
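
The statistics listed above could be stored in sqlite roughly as follows. This is only a minimal sketch: the table name, column names and helper functions are hypothetical, and nothing here is a committed schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stats (
    date TEXT NOT NULL,
    svn_revision INTEGER NOT NULL,
    regression_error_rate REAL,
    coverage REAL,
    test_success_rate REAL
)
"""


def open_stats(path=":memory:"):
    """Open (or create) the statistics database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_stats(conn, date, svn_revision, regression_error_rate,
                 coverage, test_success_rate):
    """Store one row of statistics for a given SVN revision."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
            (date, svn_revision, regression_error_rate,
             coverage, test_success_rate),
        )


def coverage_history(conn):
    """Return (svn_revision, coverage) pairs ordered by revision, for graphing."""
    query = "SELECT svn_revision, coverage FROM stats ORDER BY svn_revision"
    return conn.execute(query).fetchall()
```

A query such as <code>coverage_history</code> is the kind of series a graphing frontend would plot; an XML backend would need an equivalent accessor.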
 
 
====Regression Testing====
 
This submodule will use YAML, JSON, XML or CSV configuration to test for regressions and check specific coverage situations.
 
 
An example of English->French in YAML:
 
<pre>
 
Pronoun check:
  I eat: je mange
  you eat: [tu manges, vous mangez]
  he eats: il mange
  she eats: elle mange
  one eats: on mange
  we eat: [nous mangeons, on mange]
  they eat: [ils mangent, elles mangent]
 
</pre>
 
 
As you may see, this syntax lets non-programmers define tests easily, without preventing programmers and others from using a format they are more comfortable with, such as JSON, XML or CSV.
 
 
The brackets allow one to have multiple correct responses for a given test item.
 
 
The code would simply run through the tests as required, giving output as to whether there were any failures or passes, depending on settings selected (not unlike <code>HfstTester.py</code>). It will also allow the automatic reversal of the tests, allowing one to run a French->English test on the same configuration without having to rewrite the test in reverse.
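
The test run and the automatic reversal described above might look like the sketch below. It assumes the configuration has already been loaded into a plain dictionary (for example by a YAML parser), and that translation is available as a callable; <code>run_tests</code>, <code>reverse_tests</code> and the <code>translate</code> parameter are hypothetical names, not part of any existing tool.

```python
def as_list(value):
    """Normalise a single expected answer or a list of answers to a list."""
    return value if isinstance(value, list) else [value]


def run_tests(tests, translate):
    """Return {source: passed} for each test item.

    `tests` maps a source phrase to one expected translation or a list of
    acceptable translations. `translate` is any callable taking a source
    string and returning the system's output (an assumption of this sketch).
    """
    results = {}
    for source, expected in tests.items():
        results[source] = translate(source) in as_list(expected)
    return results


def reverse_tests(tests):
    """Build the reverse-direction tests from the same configuration.

    Every acceptable target becomes a source, and every original source
    that could produce it becomes an acceptable answer.
    """
    reversed_tests = {}
    for source, expected in tests.items():
        for target in as_list(expected):
            answers = reversed_tests.setdefault(target, [])
            if source not in answers:
                answers.append(source)
    return reversed_tests
```

With the pronoun example, reversing the tests makes "on mange" accept both "one eats" and "we eat", since both source phrases list it as a valid translation.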
 
 
====Corpus Generator====
 
This submodule will implement basic functionality for generating corpora from any given text, keeping only the lines that meet given heuristic criteria. Examples of such user-configurable criteria include: length of sentence, acceptable limit of punctuation symbols, acceptable limit of numerals, excessive proper nouns, excessive English or other foreign-language terms per sentence, etc.
 
 
An example output line would be:<br />
 
<pre>1. The quick brown fox jumps over the lazy dog.</pre>
 
 
As you can see, this is not unlike the corpus that can be found in en-eo.
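
A minimal sketch of the heuristic filtering described in this section is given below. It covers only three of the listed criteria (sentence length, punctuation count and numeral count); the function names, parameter names and default thresholds are all hypothetical.

```python
import string


def acceptable(sentence, min_words=3, max_words=40, max_punct=5, max_digits=4):
    """Decide whether a sentence passes the (illustrative) heuristics."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False
    punct = sum(1 for ch in sentence if ch in string.punctuation)
    digits = sum(1 for ch in sentence if ch.isdigit())
    return punct <= max_punct and digits <= max_digits


def generate_corpus(lines, **criteria):
    """Filter lines and number the survivors, one sentence per output line."""
    stripped = (line.strip() for line in lines)
    kept = [line for line in stripped if line and acceptable(line, **criteria)]
    return ["%d. %s" % (number, line) for number, line in enumerate(kept, start=1)]
```

Checks for proper nouns or foreign-language terms would need a dictionary or tagger and are left out of this sketch.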
 
 
A specific subclass of this module will be created for generating Wikipedia-based corpora due to the significant differences between plain text and Wikimedia markup, and the fact it requires parsing XML in rather great depth. It will make use of a modified version of <code>esperantowiki-xml2txt.py</code> for parsing the Wikimedia markup.
 
 
====Corpus Testing====
 
This submodule will be not unlike the <code>testcorpus_en-eo.sh</code> script that can be found in apertium-eo-en, except it will be reimplemented in Python and will be easily user-configurable.
 
 
====Coverage Testing====
 
This submodule will be not unlike <code>corpus-stat-en-eo.sh</code> in that it will give a count of tokenised words, a count of unknown words, and a list of unknown words, and give a calculation of the coverage.
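
The coverage calculation could be sketched as follows, assuming the input is morphological analyser output in the Apertium stream format, where each lexical unit looks like <code>^surface/analysis$</code> and an unknown word is marked with a leading asterisk (<code>^word/*word$</code>). The function name is hypothetical and the regular expression ignores escaped characters and superblanks.

```python
import re

# Simplified lexical-unit pattern: ^surface/analyses$
# (assumption: no escaped '/', '^' or '$' inside the unit).
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def coverage(analysed):
    """Return (token count, list of unknown words, coverage ratio)."""
    units = LEXICAL_UNIT.findall(analysed)
    unknown = [surface for surface, analyses in units
               if analyses.startswith("*")]
    total = len(units)
    known = total - len(unknown)
    ratio = known / total if total else 0.0
    return total, unknown, ratio
```

For instance, an input with four lexical units of which two are marked unknown gives a coverage ratio of 0.5.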
 
 
====Average Ambiguity====
 
This submodule will use Apertium tools to find ambiguity in output, get the average, and then output a sorted descending list of highest to lowest ambiguity.
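
One possible reading of this calculation is sketched below. It assumes analyser output in the Apertium stream format (<code>^surface/analysis1/analysis2$</code>), counts ambiguity as the number of analyses per distinct surface form, and skips unknown words; those choices and the function name are assumptions of the sketch rather than a fixed design.

```python
import re

# Simplified lexical-unit pattern: ^surface/analysis1/analysis2...$
# (assumption: no escaped '/', '^' or '$' inside the unit).
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def ambiguity(analysed):
    """Return (mean ambiguity, [(surface, analyses)] sorted high to low)."""
    counts = {}
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        if analyses.startswith("*"):
            continue  # unknown word: no analyses to count
        number = len(analyses.split("/"))
        counts[surface] = max(counts.get(surface, 0), number)
    mean = sum(counts.values()) / len(counts) if counts else 0.0
    # Highest ambiguity first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return mean, ranked
```

Averaging per running token instead of per distinct surface form would weight frequent words more heavily; either definition could be exposed as an option.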
 
   
  +
== Coding Period ==
==Possible Timetable==
 
  +
=== Week 1 — 23rd May ===
* Week 1 -- 4: implement core methods of all listed modules above
 
  +
* Completed autogen.sh
* '''Deliverable 1''': an alpha/beta-quality QA library and frontend tools
 
* Week 5 -- 8: ensure all stubs and TODOs are completed, all goals met
 
* '''Deliverable 2''': an RC-quality QA library and frontend tools
 
* Week 9 -- 12: real world testing, extra features beyond core specification
 
* '''Deliverable 3''': a production-quality QA library and frontend tools
 
   
  +
== Todo ==
Important Dates:
 
  +
===Tests and stats===
   
  +
====Monolingual corpus====
* June 10: Autumn semester ends<br/>
 
* June 11: Exam session commences<br/>
 
* July 1: Exam session ends<br/>
 
* August 1: Spring semester begins<br/>
 
   
  +
* dicts: Coverage
The beginning of GSoC overlaps with the final weeks of my Autumn semester. This will not be an issue, as I currently attend University only twice a week and have plenty of free time to spend working on GSoC. I will have two examinations during the exam period: a Java exam and a Mathematics exam. During the week leading up to the Mathematics examination, I may be a little difficult to contact and may spend very little time on GSoC. I will make up for this by working much harder in the later weeks.
 
  +
* rules: Rule counting (CG, apertium-transfer)
  +
* rules: number of rules
  +
* dicts: number of entries (sl mono, sl-tl, tl mono) -- lttoolbox/hfst
  +
* dicts: (monolingual) mean ambiguity
  +
* system: translation speed (per module?)
  +
* dicts: (bilingual) mean fertility -- e.g. number of translations per SL/TL word
  +
* rules: for disambiguation, if there is cg + apertium tagger, how much work does CG do and how much does apertium-tagger do? (count LU input to CG, LU output from CG and LU output from apertium-tagger)
   
  +
====Tests====
As for commercial work, I am a sysadmin and lecturer to high school teachers on how to best use their equipment, so a few hours a week I may be working, although this should not clash with GSoC whatsoever.
 
   
  +
* dictionary tests (e.g. hfst-tester)
As you can see, my 'summer' ends approximately 3 weeks before the completion of GSoC. At this point I cannot say what my University timetable will look like, but I can assure you that even if I have no time during the week to work on GSoC, I will make full use of my weekends to complete whichever parts of the project are incomplete. As my proposed timetable shows, it is unlikely that much work will remain in the final 4 weeks.
 
  +
* regression tests
  +
* pending tests
  +
* testvoc
  +
* testvoc+bidixvoc (some language pairs have bilingual dictionaries with more than one translation for a given SL word; at the moment testvoc will only ever test the default translation, whereas testvoc+bidixvoc will test them all)
  +
* generation test
  +
* corpus test
   
  +
====Parallel corpus====
==Qualifications==
 
I have been studying and using Python constantly for at least three years, dabbling with open source projects all the way through. I have submitted patches to the GemRB project, and I worked with debian-installer in order to create one of the first working installers for the EeePC 701 which lacked a CD-ROM drive. I have also previously worked with Apertium, such as with the creation of apertium-verbconj, or my adoption of the Tagalog-Cebuano language pair.
 
   
  +
* WER, PER, BLEU against reference
In order to conjugate Tagalog verbs correctly, we had to use HFST, as Apertium does not support infixation well. As it turns out, twolc is one of the most painful syntaxes I have ever experienced in my life, so I attempted to implement infixing using nothing but lexc. I completed this task through the liberal use of flag diacritics.
 
   
  +
====Graphs====
For testing out my HFST dictionary, I implemented <code>HfstTester.py</code>, which is being used by divvun.no. Sjur Moshagen is constantly in contact with me, requesting new features and recommending changes to my code. They have made their own mirror of my code, and there is even a whole page explaining its use and feature wishlist. The skills I learnt implementing this and working with Sjur Moshagen can easily be transferred to a Quality Assurance framework for Apertium.
 
   
  +
* coverage over time
I have implemented a test application that creates a semi-working corpus from a Wikipedia dump. Firstly, you download the Wikipedia dump you wish to use. You then simply run the script with the first parameter being the dictionary and the second one being your output file. An optional third parameter limits the number of lines of output. The application makes use of NLTK to parse for sentences, and uses a very rudimentary Wikimedia syntax stripper that needs much more work to be considered anything more than test code. The output quality for languages such as Norwegian or English is very good, as compared with Wikipedias such as the Tagalog one, where the general article quality is much lower in both content and in (ab)use of syntax.
 
  +
* number of rules over time
  +
* mean ambiguity over time
  +
* number of dict entries over time
  +
* translation speed over time
  +
* WER/PER/BLEU over time
  +
* percentage of regression tests passed over time
   
  +
== Feature Requests ==
As I study ICT Engineering and International Studies (BEng DipEngPrac BArts) at University of Technology, Sydney, I have a broad range of subjects at my disposal. A unit called 'Introduction to Digital Systems' went into great detail about the mathematics behind finite state systems and we were required to learn PIC assembler. I have found these two skills to be invaluable for having a firm grounding in how a language pair works and allowed me to have a clear idea of how I would implement verb conjugation for Tagalog using lexc.
 
  +
* Cache the wiki Regression test web page so that we can test when the wiki is offline or when stuck in airports with expensive wifi
   
  +
== Extensions ==
In order of strength, programming languages I can use are: Python, Java, Vala, C, C++ and a smattering of interpreted languages like Perl and PHP.
 
  +
=== Sanity Tests ===
  +
Simply allow the use of a sanity_tests directory in a dictionary directory, and if found, run any scripts found in there, storing their name and return value in the quality-stats.xml. This allows the scripts to be written in any language, provided they return a non-zero value on error.
   
  +
Possible tests:
==Conclusion==
 
I look forward to correspondence with the larger Apertium team and hope that this year I may make the most of Google Summer of Code in assisting Apertium with their goals.
 
   
  +
* Superblank order test
==Sources==
 
* http://divvun.no/doc/tools/HfstTester.html
 
* http://www.bbqsrc.net/#hfsttester
 
* https://github.com/bbqsrc
 

Latest revision as of 18:20, 21 August 2011

Menu

Getting Started

Technical Documentation

Notes

Community Bonding Period

Week 1 — 25th April

  • Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
  • Emailed Francis a written proof of setuptools adequately meeting expectations and requirements.

Week 2 — 2nd May

  • Converted LaTeX source to Wikimedia format, and placed below this section for annotation.
  • Completed example regtest.py
  • Added Installation and Usage pages, uploaded initial files.

Week 3 — 9th May

  • Fixed a Python regression-related bug in regtest.py
  • Fixed a personal regression in setup.py
  • Plan to add autogen.sh for config
  • Consider using virtualenv for rootless installations
  • Fixed installation instructions
  • SVN and git now synched

Coding Period

Week 1 — 23rd May

  • Completed autogen.sh

Todo

Tests and stats

Monolingual corpus

  • dicts: Coverage
  • rules: Rule counting (CG, apertium-transfer)
  • rules: number of rules
  • dicts: number of entries (sl mono, sl-tl, tl mono) -- lttoolbox/hfst
  • dicts: (monolingual) mean ambiguity
  • system: translation speed (per module?)
  • dicts: (bilingual) mean fertility -- e.g. number of translations per SL/TL word
  • rules: for disambiguation, if there is cg + apertium tagger, how much work does CG do and how much does apertium-tagger do? (count LU input to CG, LU output from CG and LU output from apertium-tagger)

Tests

  • dictionary tests (e.g. hfst-tester)
  • regression tests
  • pending tests
  • testvoc
  • testvoc+bidixvoc (some language pairs have bilingual dictionaries with more than one translation for a given SL word; at the moment testvoc will only ever test the default translation, whereas testvoc+bidixvoc will test them all)
  • generation test
  • corpus test

Parallel corpus

  • WER, PER, BLEU against reference

Graphs

  • coverage over time
  • number of rules over time
  • mean ambiguity over time
  • number of dict entries over time
  • translation speed over time
  • WER/PER/BLEU over time
  • percentage of regression tests passed over time

Feature Requests

  • Cache the wiki Regression test web page so that we can test when the wiki is offline or when stuck in airports with expensive wifi

Extensions

Sanity Tests

Simply allow the use of a sanity_tests directory in a dictionary directory, and if found, run any scripts found in there, storing their name and return value in the quality-stats.xml. This allows the scripts to be written in any language, provided they return a non-zero value on error.

Possible tests:

  • Superblank order test
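
The sanity test runner described above could be sketched along these lines. The function names and the XML element and attribute names are hypothetical, since the layout of quality-stats.xml is not specified here; the sketch only shows the idea of running every executable in sanity_tests and recording its name and return value.

```python
import os
import subprocess
import xml.etree.ElementTree as ET


def run_sanity_tests(dict_dir):
    """Run every executable in <dict_dir>/sanity_tests; return {name: return code}."""
    test_dir = os.path.abspath(os.path.join(dict_dir, "sanity_tests"))
    results = {}
    if not os.path.isdir(test_dir):
        return results  # no sanity tests for this dictionary
    for name in sorted(os.listdir(test_dir)):
        path = os.path.join(test_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            # Any language works: only the exit status is inspected.
            results[name] = subprocess.call([path], cwd=dict_dir)
    return results


def results_to_xml(results):
    """Build a (hypothetical) XML fragment holding script names and return values."""
    root = ET.Element("sanity-tests")
    for name, code in sorted(results.items()):
        ET.SubElement(root, "test", name=name, returncode=str(code))
    return root
```

A non-zero <code>returncode</code> attribute then marks a failed script, matching the convention that scripts return non-zero values on error.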