Difference between revisions of "Talk:Apertium-quality"

From Apertium
 
(20 intermediate revisions by 4 users not shown)
Line 1: Line 1:
  +
= Menu =
  +
==== Getting Started ====
  +
* [[Quality_control_framework/Installation|Installation]]
  +
* [[Quality_control_framework/Usage|Usage]]
  +
  +
==== Technical Documentation ====
  +
* [[Quality_control_framework/Proposal|Proposal]]
  +
* [[Quality_control_framework/XML_Schema|XML Schema]]
  +
 
= Notes =
 
= Notes =
  +
== Community Bonding Period ==
 
=== Week 1 — 25th April ===
 
=== Week 1 — 25th April ===
 
* Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
 
* Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
Line 6: Line 16:
 
=== Week 2 — 2nd May ===
 
=== Week 2 — 2nd May ===
 
* Converted LaTeX source to MediaWiki format, and placed below this section for annotation.
 
* Converted LaTeX source to MediaWiki format, and placed below this section for annotation.
  +
* Completed example regtest.py
  +
* Added Installation and Usage pages, uploaded initial files.
   
  +
=== Week 3 — 9th May ===
= Proposal =
 
  +
* Fixed a Python regression-related bug in regtest.py
==Interest in Machine Translation==
 
  +
* Fixed a personal regression in setup.py
 
  +
* Plan to add autogen.sh for config
::: I have put comments like these all over the text --[[User:Mlforcada|Mlforcada]] 13:26, 6 May 2011 (UTC)
 
  +
* Consider using virtualenv for rootless installations
 
  +
* Fixed installation instructions
Since I was a high school student, I had a strong interest in languages, especially the grammatical structures that separated the languages, and I have had a keen interest in etymology from the age of 12. One major reason I enjoy the concept of machine translation is that it is conceptually a zero-cost translator. Once implemented, you can translate an unlimited number of documents without having to pay a translator a cent to get a possibly near perfect translation.
 
  +
* SVN and git now synched
 
This interests me the most for countries such as the Philippines, where access to education can be limited, so having free translation software available would make it much easier to gain access to material that has never been translated into one region's native tongue.
 
 
: Aren't expectations about MT a bit too optimistic here? Fortunately, it does not affect the quality of the proposal --[[User:Mlforcada|Mlforcada]] 17:10, 1 May 2011 (UTC)
 
 
==Interest in Apertium==
 
Having seen the ability of the software to get very close to a near perfect translation, I do not doubt the ability of the software to meet its given goal.
 
 
: ''Non sequitur'': quality control may be excellent independently of actual translation quality --[[User:Mlforcada|Mlforcada]] 17:12, 1 May 2011 (UTC)
 
 
==Tasks of Interest==
 
In order of preference:
 
 
#Quality control framework
 
::: This is the one I will finally be mentoring --[[User:Mlforcada|Mlforcada]] 13:01, 6 May 2011 (UTC)
 
#Adopt a language pair (my tgl-ceb dictionary)
 
 
===Quality Control Framework===
 
The language I would use to implement this is Python, as I am most comfortable using this language, and due to the platform precedence of Python over PHP and its vast array of modules available for linguistic purposes, I think it would be the best choice for the implementation of a QA framework.
 
 
::: Please explain the concept of ''platform precedence'' in this context --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Python is more often than not installed by default than PHP on most Linux distributions. --[[User:Gekz|Gekz]] 08:33, 7 May 2011 (UTC)
 
 
::: Are these Python linguistic resources free software? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: There are almost no proprietary Python modules, so yes. --[[User:Gekz|Gekz]] 08:33, 7 May 2011 (UTC)
 
 
I propose the creation of a Python module entitled: 'Apertium Quality Assurance', with specific submodules covering each objective of regression testing, corpus generation, coverage testing, dictionary statistics and average ambiguity. This module will allow a programmer to use specific portions of the code for their own projects, while also allowing the dictionary developer access to these modules through command line front-ends not unlike the current Apertium tools.
 
 
::: Wouldn't it be nice if the proposal could be more specific about these? I think drafting manpages would help --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
The module will be able to be run standalone, or installed into the Python library directory, and will therefore be easily used with other applications or with simple standalone frontends.
 
 
====Core ApertiumQA module====
 
The core module will be responsible for the cross-module functionality, such as logging statistics and generation of graphs.
 
 
::: The meaning of cross-module here is unclear. Does it mean "across ApertiumQA modules"? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
The use of graphing will allow any member of the Apertium team to gauge the development and success rate of a given dictionary by simply having a look at a few statistics in a nice, visual manner, giving clear evidence of development.
 
 
::: An idea on which graphic formats are aimed at would help --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Well laid-out tabled data, with useful charts such as success ratio in regression tests per date. --[[User:Gekz|Gekz]] 08:35, 7 May 2011 (UTC)
 
 
Statistics will be stored in either an sqlite database or an XML file. Examples of statistics that will be stored include:
 
 
::: Isn't sqlite an overkill here? I believe an XML file with style files to generate reports would be better --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Sqlite was an option, not a requirement. It can be ignored. --[[User:Gekz|Gekz]] 08:35, 7 May 2011 (UTC)
 
 
#Date
 
#SVN revision
 
#Regression test error rate
 
 
::: In addition to giving the SVN revision number, the revision number or version of the regression test should also be given --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Agreed. --[[User:Gekz|Gekz]] 08:35, 7 May 2011 (UTC)
 
 
#Coverage level
 
 
::: Do you refer to "naïve coverage" here? Could this finally be interfaced with the results of the project about [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Detect_hidden_unknown_words "hidden unknown words"]? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Yes to both. --[[User:Gekz|Gekz]] 08:35, 7 May 2011 (UTC)
 
 
#Test success rate
 
 
Further statistics will be added as development of the library continues and as more statistically valuable measures become apparent. The frontend for the module will generate an HTML file with all graphics, legends and required data in an easy-to-parse format.
 
 
::: An XHTML file? --[[User:Mlforcada|Mlforcada]]
 
 
:::: Either/or. --[[User:Gekz|Gekz]] 08:35, 7 May 2011 (UTC)
 
 
====Regression Testing====
 
This submodule will use YAML, JSON, XML or CSV configuration to test for regressions and check specific coverage situations.
 
 
An example of English->French in YAML:
 
::: The "wikifiability" of regression testing is always a good idea. YAML uses indents à la Python does it? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
::::Any test format should be able to be kept in sync with the tests on the Wiki. The reason being that often people who don't use SVN/Apertium/Linux edit the tests. So I ask someone "Hey, go and fix the translations on this page" and they do, without having to install anything. The Wiki should ideally be the "highest priority" source. Meaning that if you have multiple conflicting copies, the Wiki is what should be gone by. - [[User:Francis Tyers|Francis Tyers]] 09:18, 7 May 2011 (UTC)
 
 
<pre>
Pronoun check:
  I eat: je mange
  you eat: [tu manges, vous mangez]
  he eats: il mange
  she eats: elle mange
  one eats: on mange
  we eat: [nous mangeons, on mange]
  they eat: [ils mangent, elles mangent]
</pre>
 
 
As you may see, this syntax lets non-programmers define tests easily, without preventing programmers and others from using a syntax they are more comfortable with, such as JSON, XML or CSV.
 
 
The brackets allow one to have multiple correct responses for a given test item.
 
 
The code would simply run through the tests as required, reporting whether there were any failures or passes, depending on the settings selected (not unlike <code>HfstTester.py</code>). It will also allow the automatic reversal of the tests, so that one can run a French->English test on the same configuration without having to rewrite the test in reverse.
 
 
::: Give a reference to HfstTester.py for convenience here --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: [https://victorio.uit.no/langtech/trunk/gt/script/HfstTester.py] --[[User:Gekz|Gekz]] 08:46, 7 May 2011 (UTC)
 
 
::: Please consider that regression tests may not be reversible in general -[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: This is true, and can easily be enforced by a configuration rule. It will be made clearer when I release the planned spec. --[[User:Gekz|Gekz]] 08:46, 7 May 2011 (UTC)
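
A minimal sketch of how such a test table could be evaluated, including the bracketed multiple answers and the opt-in reversal discussed above. Everything here is an assumption for illustration: <code>run_tests</code>, <code>reverse_tests</code> and <code>fake_translate</code> are hypothetical names, the table is a plain Python dictionary rather than a parsed YAML file, and the translator is a stand-in lookup, not a call into Apertium.

```python
from typing import Callable, Dict, List, Union

Expected = Union[str, List[str]]


def as_list(expected: Expected) -> List[str]:
    """Normalise a single answer or a bracketed list to a list."""
    return [expected] if isinstance(expected, str) else list(expected)


def run_tests(tests: Dict[str, Expected],
              translate: Callable[[str], str]) -> Dict[str, bool]:
    """Return a pass/fail flag for every source phrase."""
    return {source: translate(source) in as_list(expected)
            for source, expected in tests.items()}


def reverse_tests(tests: Dict[str, Expected]) -> Dict[str, List[str]]:
    """Swap direction: every accepted target becomes a source phrase."""
    reversed_tests: Dict[str, List[str]] = {}
    for source, expected in tests.items():
        for target in as_list(expected):
            reversed_tests.setdefault(target, []).append(source)
    return reversed_tests


# Hypothetical test table mirroring part of the pronoun check.
PRONOUN_CHECK: Dict[str, Expected] = {
    "I eat": "je mange",
    "you eat": ["tu manges", "vous mangez"],
    "one eats": "on mange",
    "we eat": ["nous mangeons", "on mange"],
}

# Stand-in translator (assumption); a real run would call the pipeline.
FAKE_EN_FR = {
    "I eat": "je mange",
    "you eat": "vous mangez",
    "one eats": "on mange",
    "we eat": "nous mangent",  # deliberately wrong, to show a failure
}


def fake_translate(text: str) -> str:
    return FAKE_EN_FR.get(text, "")


if __name__ == "__main__":
    for phrase, passed in run_tests(PRONOUN_CHECK, fake_translate).items():
        print("PASS" if passed else "FAIL", phrase)
```

In the reversed table, "on mange" maps back to both "one eats" and "we eat", which illustrates why reversal probably has to be enabled per test group rather than assumed.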
 
 
====Corpus Generator====
 
 
::: Do you mean ''monolingual'' corpus generator?? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: At this stage, yes. As has been pointed out, it would be best to be named ''Corpus Extractor'' as it is simply making more concise and useful corpora based on user-defined heuristics. --[[User:Gekz|Gekz]] 08:46, 7 May 2011 (UTC)
 
 
This submodule will implement basic functionality for generating corpora from any given text, keeping only the lines that meet given heuristic criteria. Examples of such user-configurable criteria include: length of sentence, acceptable limit of punctuation symbols, acceptable limit of numerals, excessive proper nouns, excessive English or other foreign-language terms per sentence, etc.
 
 
::: Explain what do you mean by "acceptable limit of punctuation symbols" or "numerals" --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: User-defined heuristics control the "acceptable limit" of any input. Basically a regular expression plus a limit, and if you pass that limit, the sentence is disregarded as "not meeting the criteria" --[[User:Gekz|Gekz]] 08:46, 7 May 2011 (UTC)
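
The "regular expression plus a limit" idea can be sketched roughly as follows. The function names, the word-count bounds and the two example patterns are assumptions chosen for illustration, not part of any agreed specification.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# A heuristic is a (pattern, limit) pair: a sentence is rejected when the
# pattern matches more than `limit` times.
Heuristic = Tuple[str, int]

# Hypothetical defaults, for illustration only.
EXAMPLE_HEURISTICS: List[Heuristic] = [
    (r"[^\w\s]", 4),  # at most 4 punctuation symbols
    (r"\d", 2),       # at most 2 digits
]


def meets_criteria(sentence: str, heuristics: List[Heuristic]) -> bool:
    """True when no pattern exceeds its limit in the sentence."""
    return all(len(re.findall(pattern, sentence)) <= limit
               for pattern, limit in heuristics)


def extract(lines: Iterable[str],
            heuristics: List[Heuristic],
            min_words: int = 3,
            max_words: int = 40) -> Iterator[str]:
    """Yield the sentences that pass the length bounds and every heuristic."""
    for line in lines:
        sentence = line.strip()
        words = len(sentence.split())
        if min_words <= words <= max_words and meets_criteria(sentence, heuristics):
            yield sentence


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "Call 555-0199 now!!!",
        "Hi.",
    ]
    for number, sentence in enumerate(extract(sample, EXAMPLE_HEURISTICS), 1):
        print(f"{number}. {sentence}")
```

Only the first sample sentence survives: the second exceeds the digit limit and the third is too short.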
 
 
::: A crazy idea: [http://bitextor.sf.net Bitextor]-mediated bilingual input for semi-automatic generation of regression testing? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Possible. I will look into it if I find the time. --[[User:Gekz|Gekz]] 08:46, 7 May 2011 (UTC)
 
 
An example output line would be:<br />
 
<pre>1. The quick brown fox jumps over the lazy dog.</pre>
 
 
As you can see, this is not unlike the corpus that can be found in en-eo.
 
 
A specific subclass of this module will be created for generating Wikipedia-based corpora, due to the significant differences between plain text and MediaWiki markup, and the fact that it requires parsing XML in rather great depth. It will make use of a modified version of <code>esperantowiki-xml2txt.py</code> for parsing the MediaWiki markup.
 
 
====Corpus Testing====
 
This submodule will be similar to the <code>testcorpus_en-eo.sh</code> script found in apertium-eo-en, except that it will be reimplemented in Python and will be easily user-configurable.
 
 
::: Explain this in a standalone way --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
====Coverage Testing====
 
This submodule will be not unlike <code>corpus-stat-en-eo.sh</code> in that it will give a count of tokenised words, a count of unknown words, and a list of unknown words, and give a calculation of the coverage.
 
 
::: I assume you talk here about naïve coverage --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
:::: Again, yes. --[[User:Gekz|Gekz]] 08:46, 7 May 2011 (UTC)
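
A rough sketch of the naïve coverage calculation, assuming the input is text that has already been run through a morphological analyser and is in the Apertium stream format, where each lexical unit looks like <code>^surface/analysis1/analysis2$</code> and unknown words are marked as <code>^surface/*surface$</code>. The function name and the simplified regular expression (which ignores escaped characters and superblanks) are assumptions.

```python
import re
from collections import Counter
from typing import Tuple

# Simplified lexical-unit pattern: ^surface/analyses$ (escapes ignored).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> Tuple[int, Counter, float]:
    """Return (token count, unknown-word counts, coverage ratio)."""
    total = 0
    unknown: Counter = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        total += 1
        if analyses.startswith("*"):  # analyser could not analyse this word
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return total, unknown, coverage


if __name__ == "__main__":
    sample = ("^the/the<det><def><sp>$ ^cat/cat<n><sg>$ "
              "^zorps/*zorps$ ^sleep/sleep<n><sg>/sleep<vblex><inf>$")
    total, unknown, coverage = naive_coverage(sample)
    print(f"tokens: {total}, unknown: {sum(unknown.values())}, "
          f"coverage: {coverage:.2%}")
    for word, count in unknown.most_common():
        print(count, word)
```

This counts a word as covered whenever the analyser returns any analysis at all, which is what makes the measure "naïve".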
 
 
====Average Ambiguity====
 
 
::: The average is not enough. More detailed statistics are easy to gather and useful --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
 
 
This submodule will use Apertium tools to find ambiguity in output, get the average, and then output a sorted descending list of highest to lowest ambiguity.
 
 
::: A list of what? --[[User:Mlforcada|Mlforcada]] 13:17, 6 May 2011 (UTC)
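
One possible reading of this, sketched under assumptions: ambiguity is the number of analyses the analyser returns per known surface form, the "list" is the surface forms ranked by that number, and a histogram of ambiguity classes is returned alongside the mean. The input is assumed to be analyser output in the Apertium stream format (<code>^surface/analysis1/analysis2$</code>); <code>ambiguity_report</code> is a hypothetical name.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Simplified lexical-unit pattern: ^surface/analyses$ (escapes ignored).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def ambiguity_report(analysed: str) -> Tuple[float, List[Tuple[str, int]], Counter]:
    """Return (mean analyses per known token, ranked forms, histogram)."""
    per_form: Dict[str, int] = {}
    counts: List[int] = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        if analyses.startswith("*"):  # unknown word: no analyses to count
            continue
        n = len(analyses.split("/"))
        counts.append(n)
        per_form[surface] = n
    mean = sum(counts) / len(counts) if counts else 0.0
    # Most ambiguous forms first; ties are broken alphabetically.
    ranked = sorted(per_form.items(), key=lambda item: (-item[1], item[0]))
    histogram = Counter(counts)  # how many tokens have 1, 2, 3... analyses
    return mean, ranked, histogram


if __name__ == "__main__":
    sample = ("^the/the<det><def><sp>$ "
              "^saw/saw<n><sg>/see<vblex><past>$ "
              "^can/can<n><sg>/can<vaux><pres>/can<vblex><inf>$")
    mean, ranked, histogram = ambiguity_report(sample)
    print(f"mean ambiguity: {mean:.2f}")
    for form, n in ranked:
        print(n, form)
```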
 
 
==Possible Timetable==
 
 
::: It would be very useful to put absolute dates here, that is, regular month and day dates --[[User:Mlforcada|Mlforcada]] 13:20, 6 May 2011 (UTC)
 
 
* Week 1 -- 4: implement core methods of all listed modules above
 
* '''Deliverable 1''': an alpha/beta-quality QA library and frontend tools
 
* Week 5 -- 8: ensure all stubs and TODOs are completed, all goals met
 
* '''Deliverable 2''': an RC-quality QA library and frontend tools
 
 
::: What is RC? --[[User:Mlforcada|Mlforcada]]
 
 
* Week 9 -- 12: real world testing, extra features beyond core specification
 
* '''Deliverable 3''': a production-quality QA library and frontend tools
 
 
Important Dates:
 
 
* June 10: Autumn semester ends<br/>
 
* June 11: Exam session commences<br/>
 
* July 1: Exam session ends<br/>
 
* August 1: Spring semester begins<br/>
 
 
The beginning of GSoC overlaps with the final weeks of my Autumn semester. This will not be an issue, as I currently attend University only twice a week and have plenty of free time to spend working on GSoC. I will have two examinations during the exam period: a Java exam and a Mathematics exam. During the week leading up to the Mathematics examination, I may be a little difficult to contact and may spend very little time on GSoC. I will make up for this by working much harder in the later weeks.
 
 
As for commercial work, I am a sysadmin and lecturer to high school teachers on how to best use their equipment, so a few hours a week I may be working, although this should not clash with GSoC whatsoever.
 
 
::: What kind of "sys" do you "admin"? --[[User:Mlforcada|Mlforcada]] 13:20, 6 May 2011 (UTC)
 
 
As you can see, my 'summer' ends approximately 3 weeks before the completion of GSoC. At this point I cannot guarantee what my University timetable will look like, but I can assure you that even if I have no time during the week to work on GSoC, I will make full use of my weekends to complete whichever parts of the project are incomplete. As my proposed timetable shows, it is unlikely that much work will remain in the final 4 weeks.
 
 
==Qualifications==
 
I have been studying and using Python constantly for at least three years, dabbling with open source projects all the way through. I have submitted patches to the GemRB project, and I worked with debian-installer in order to create one of the first working installers for the EeePC 701 which lacked a CD-ROM drive. I have also previously worked with Apertium, such as with the creation of apertium-verbconj, or my adoption of the Tagalog-Cebuano language pair.
 
   
  +
== Coding Period ==
::: It would be nice to name the other projects, so that we do cross-propaganda --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
  +
=== Week 1 &mdash; 23rd May ===
::: Hey! I own one of them EeePC 701s! I'm interested! --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
  +
* Completed autogen.sh
   
  +
== Todo ==
In order to conjugate Tagalog verbs correctly, it was required that we use HFST as Apertium does not well support infixation. As it turns out, twolc is one of the most painful syntaxes I have ever experienced in my life, so I attempted to implement infixing using nothing but lexc. I completed this task through the liberal use of flag diacritics.
 
  +
===Tests and stats===
   
  +
====Monolingual corpus====
::: References to twolc needed --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
::: "twolc is" or "twolc has" --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
::: "liberal use of flag diacritics": "liberal" but "systematic"? --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
   
  +
* dicts: Coverage
For testing out my HFST dictionary, I implemented <code>HfstTester.py</code>, which is being used by divvun.no; Sjur Moshagen is constantly in contact with me, requesting new features and recommending changes to my code. They have made their own mirror of my code, and there is even a whole page explaining its use and feature wishlist. The skills I learnt implementing this and working with Sjur Moshagen can easily be transferred to a Quality Assurance framework for Apertium.
 
  +
* rules: Rule counting (CG, apertium-transfer)
  +
* rules: number of rules
  +
* dicts: number of entries (sl mono, sl-tl, tl mono) -- lttoolbox/hfst
  +
* dicts: (monolingual) mean ambiguity
  +
* system: translation speed (per module?)
  +
* dicts: (bilingual) mean fertility -- e.g. number of translations per SL/TL word
  +
* rules: for disambiguation, if there is cg + apertium tagger, how much work does CG do and how much does apertium-tagger do? (count LU input to CG, LU output from CG and LU output from apertium-tagger)
   
  +
====Tests====
::: Add inline web links to your stuff (instead of a list at the end) --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
   
  +
* dictionary tests (e.g. hfst-tester)
I have implemented a test application that creates a semi-working corpus from a wikipedia dump. Firstly, you download the wikipedia dump you wish to use. You then simply run the script with the first parameter being the dictionary and the second one being your output file.
 
  +
* regression tests
  +
* pending tests
  +
* testvoc
  +
* testvoc+bidixvoc (some language pairs have bilingual dictionaries with more than one translation for a given SL word; at the moment testvoc will only ever test the default translation, whereas testvoc+bidixvoc will test them all)
  +
* generation test
  +
* corpus test
   
  +
====Parallel corpus====
::: I am not sure I understand this. See my comment above about manpages. --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
   
  +
* WER, PER, BLEU against reference
An optional third parameter limits the number of lines of output. The application makes use of NLTK to parse for sentences, and uses a very rudimentary wikimedia syntax stripper that needs much more work to be considered anything other than test code. The output quality for languages such as Norwegian or English is very good compared with Wikipedias such as the Tagalog one, where the general article quality is much lower in both content and (ab)use of syntax.
 
   
  +
====Graphs====
::: "wikimedia syntax stripper": can't you get plain text wikimedia dumps? --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
::: "to parse for sentences": what kind of parsing? --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
   
  +
* coverage over time
As I study ICT Engineering and International Studies (BEng DipEngPrac BArts) at University of Technology, Sydney, I have a broad range of subjects at my disposal. A unit called 'Introduction to Digital Systems' went into great detail about the mathematics behind finite state systems and we were required to learn PIC assembler. I have found these two skills to be invaluable for having a firm grounding in how a language pair works and allowed me to have a clear idea of how I would implement verb conjugation for Tagalog using lexc.
 
  +
* number of rules over time
  +
* mean ambiguity over time
  +
* number of dict entries over time
  +
* translation speed over time
  +
* WER/PER/BLEU over time
  +
* percentage of regression tests passed over time
   
  +
== Feature Requests ==
::: What is PIC assembler? URL? --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
  +
* Cache the wiki Regression test web page so that we can test when the wiki is offline or when stuck in airports with expensive wifi
::: lexc? hfst? shouldn't we improve lttoolbox to provide all that functionality? --[[User:Mlforcada|Mlforcada]] 13:25, 6 May 2011 (UTC)
 
   
  +
== Extensions ==
In order of strength, programming languages I can use are: Python, Java, Vala, C, C++ and a smattering of interpreted languages like Perl and PHP.
 
  +
=== Sanity Tests ===
  +
Simply allow the use of a sanity_tests directory in a dictionary directory, and if found, run any scripts found in there, storing their name and return value in the quality-stats.xml. This allows the scripts to be written in any language, provided they return a non-zero value on error.
   
  +
Possible tests:
==Conclusion==
 
I look forward to correspondence with the larger Apertium team and hope that this year I may make the most of Google Summer of Code in assisting Apertium with their goals.
 
   
  +
* Superblank order test
==Sources==
 
* http://divvun.no/doc/tools/HfstTester.html
 
* http://www.bbqsrc.net/#hfsttester
 
* https://github.com/bbqsrc
 

Latest revision as of 18:20, 21 August 2011

Menu

Getting Started

Technical Documentation

Notes

Community Bonding Period

Week 1 — 25th April

  • Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
  • Emailed Francis a written proof of setuptools adequately meeting expectations and requirements.

Week 2 — 2nd May

  • Converted LaTeX source to MediaWiki format, and placed below this section for annotation.
  • Completed example regtest.py
  • Added Installation and Usage pages, uploaded initial files.

Week 3 — 9th May

  • Fixed a Python regression-related bug in regtest.py
  • Fixed a personal regression in setup.py
  • Plan to add autogen.sh for config
  • Consider using virtualenv for rootless installations
  • Fixed installation instructions
  • SVN and git now synched

Coding Period

Week 1 — 23rd May

  • Completed autogen.sh

Todo

Tests and stats

Monolingual corpus

  • dicts: Coverage
  • rules: Rule counting (CG, apertium-transfer)
  • rules: number of rules
  • dicts: number of entries (sl mono, sl-tl, tl mono) -- lttoolbox/hfst
  • dicts: (monolingual) mean ambiguity
  • system: translation speed (per module?)
  • dicts: (bilingual) mean fertility -- e.g. number of translations per SL/TL word
  • rules: for disambiguation, if there is cg + apertium tagger, how much work does CG do and how much does apertium-tagger do? (count LU input to CG, LU output from CG and LU output from apertium-tagger)

Tests

  • dictionary tests (e.g. hfst-tester)
  • regression tests
  • pending tests
  • testvoc
  • testvoc+bidixvoc (some language pairs have bilingual dictionaries with more than one translation for a given SL word; at the moment testvoc will only ever test the default translation, whereas testvoc+bidixvoc will test them all)
  • generation test
  • corpus test

Parallel corpus

  • WER, PER, BLEU against reference

Graphs

  • coverage over time
  • number of rules over time
  • mean ambiguity over time
  • number of dict entries over time
  • translation speed over time
  • WER/PER/BLEU over time
  • percentage of regression tests passed over time

Feature Requests

  • Cache the wiki Regression test web page so that we can test when the wiki is offline or when stuck in airports with expensive wifi

Extensions

Sanity Tests

Simply allow the use of a sanity_tests directory in a dictionary directory, and if found, run any scripts found in there, storing their name and return value in the quality-stats.xml. This allows the scripts to be written in any language, provided they return a non-zero value on error.

Possible tests:

  • Superblank order test
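
A minimal sketch of the sanity-test runner described above, assuming a POSIX system where the scripts are marked executable. The function name and the XML element and attribute names (sanity-tests, test, name, returncode) are placeholders, since the layout of quality-stats.xml is not given here; this version also overwrites the file, whereas a real implementation would presumably merge into the existing statistics.

```python
import os
import subprocess
import xml.etree.ElementTree as ET
from typing import Dict


def run_sanity_tests(dict_dir: str,
                     stats_name: str = "quality-stats.xml") -> Dict[str, int]:
    """Run every executable in <dict_dir>/sanity_tests and record exit codes."""
    dict_dir = os.path.abspath(dict_dir)
    tests_dir = os.path.join(dict_dir, "sanity_tests")
    results: Dict[str, int] = {}
    if not os.path.isdir(tests_dir):
        return results  # nothing to do when the directory is absent

    for name in sorted(os.listdir(tests_dir)):
        path = os.path.join(tests_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            # A non-zero return value means the script reported an error.
            results[name] = subprocess.call([path], cwd=dict_dir)

    # Placeholder XML layout (assumption, not the real schema).
    root = ET.Element("sanity-tests")
    for name, code in results.items():
        ET.SubElement(root, "test", name=name, returncode=str(code))
    ET.ElementTree(root).write(os.path.join(dict_dir, stats_name),
                               encoding="utf-8", xml_declaration=True)
    return results


if __name__ == "__main__":
    for script, code in run_sanity_tests(".").items():
        print("ok  " if code == 0 else "FAIL", script, code)
```

Because only the exit status is inspected, a script can be written in shell, Python, Perl or anything else that can be executed directly.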