Difference between revisions of "Talk:Apertium-quality"

From Apertium

Revision as of 17:18, 9 May 2011

Notes

Week 1 — 25th April

  • Must demonstrate that setuptools can allow a prefix-based installation for non-root users before end of bonding period
  • Emailed Francis a written proof of setuptools adequately meeting expectations and requirements.

Week 2 — 2nd May

  • Converted LaTeX source to Wikimedia format, and placed below this section for annotation.
  • Completed example regtest.py
  • Added Installation and Usage pages, uploaded initial files.

Week 3 — 9th May

  • Fixed a Python regression-related bug in regtest.py
  • Fixed a personal regression in setup.py
  • Plan to add autogen.sh for config
  • Consider using virtualenv for rootless installations
  • Fixed installation instructions
  • SVN and git now synched

Proposal

Interest in Machine Translation

I have put comments like these all over the text --Mlforcada 13:26, 6 May 2011 (UTC)

Since high school I have had a strong interest in languages, especially the grammatical structures that distinguish them, and I have had a keen interest in etymology since the age of 12. One major reason I enjoy the concept of machine translation is that it is conceptually a zero-cost translator. Once implemented, it can translate an unlimited number of documents, possibly with near perfect results, without a cent being paid to a translator.

This interests me the most for countries such as the Philippines, where access to education can be limited. Freely available translation software would make it much easier to gain access to material that has never been translated into a region's native tongue.

Aren't expectations about MT a bit too optimistic here? Fortunately, it does not affect the quality of the proposal --Mlforcada 17:10, 1 May 2011 (UTC)

Interest in Apertium

Having seen the ability of the software to get very close to a near perfect translation, I do not doubt the ability of the software to meet its given goal.

Non sequitur: quality control may be excellent independently of actual translation quality --Mlforcada 17:12, 1 May 2011 (UTC)

Tasks of Interest

In order of preference:

  1. Quality control framework
This is the one I will finally be mentoring --Mlforcada 13:01, 6 May 2011 (UTC)
  2. Adopt a language pair (my tgl-ceb dictionary)

Quality Control Framework

I would implement this in Python, the language I am most comfortable with. Given the platform precedence of Python over PHP and the vast array of Python modules available for linguistic purposes, I think it is the best choice for implementing a QA framework.

Please explain the concept of platform precedence in this context --Mlforcada 13:17, 6 May 2011 (UTC)
Python is installed by default more often than PHP on most Linux distributions. --Gekz 08:33, 7 May 2011 (UTC)
Are these Python linguistic resources free software? --Mlforcada 13:17, 6 May 2011 (UTC)
There are almost no proprietary Python modules, so yes. --Gekz 08:33, 7 May 2011 (UTC)

I propose the creation of a Python module entitled: 'Apertium Quality Assurance', with specific submodules covering each objective of regression testing, corpus generation, coverage testing, dictionary statistics and average ambiguity. This module will allow a programmer to use specific portions of the code for their own projects, while also allowing the dictionary developer access to these modules through command line front-ends not unlike the current Apertium tools.

Wouldn't it be nice if the proposal could be more specific about these? I think drafting manpages would help --Mlforcada 13:17, 6 May 2011 (UTC)

It will be possible to run the module standalone or to install it into the Python library directory, so it can easily be used with other applications or with simple standalone frontends.

Core ApertiumQA module

The core module will be responsible for the cross-module functionality, such as logging statistics and generation of graphs.

The meaning of cross-module here is unclear. Does it mean "across ApertiumQA modules"? --Mlforcada 13:17, 6 May 2011 (UTC)

The use of graphing will allow any member of the Apertium team to gauge the development and success rate of a given dictionary by simply looking at a few statistics presented visually, giving clear evidence of development.

An idea on which graphic formats are aimed at would help --Mlforcada 13:17, 6 May 2011 (UTC)
Well laid-out tabled data, with useful charts such as success ratio in regression tests per date. --Gekz 08:35, 7 May 2011 (UTC)

Statistics will be stored in either an sqlite database or an XML file. Examples of statistics that will be stored include:

Isn't sqlite an overkill here? I believe an XML file with style files to generate reports would be better --Mlforcada 13:17, 6 May 2011 (UTC)
Sqlite was an option, not a requirement. It can be ignored. --Gekz 08:35, 7 May 2011 (UTC)
  1. Date
  2. SVN revision
  3. Regression test error rate
In addition to giving the SVN revision number, the revision number or version of the regression test should also be given --Mlforcada 13:17, 6 May 2011 (UTC)
Agreed. --Gekz 08:35, 7 May 2011 (UTC)
  4. Coverage level
Do you refer to "naïve coverage" here? Could this finally be interfaced with the results of the project about "hidden unknown words"? --Mlforcada 13:17, 6 May 2011 (UTC)
Yes to both. --Gekz 08:35, 7 May 2011 (UTC)
  5. Test success rate
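
As a rough illustration of the XML option discussed above, the following sketch stores one record per run using Python's standard `xml.etree.ElementTree`. The element names, the `make_record` helper and the sample values are assumptions made for this example only; the proposal does not fix a schema.

```python
import xml.etree.ElementTree as ET


def make_record(date, svn_revision, test_revision,
                error_rate, coverage, success_rate):
    """Build one <record> element holding the statistics listed above.

    Element names are hypothetical; no schema has been agreed on.
    """
    record = ET.Element("record", {"date": date})
    fields = [
        ("svn_revision", svn_revision),
        ("test_revision", test_revision),   # suggested in the comments above
        ("regression_error_rate", error_rate),
        ("coverage", coverage),
        ("test_success_rate", success_rate),
    ]
    for name, value in fields:
        ET.SubElement(record, name).text = str(value)
    return record


# Placeholder values, not real measurements.
root = ET.Element("statistics")
root.append(make_record("2011-05-09", 12345, 3, 0.12, 0.87, 0.88))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A file in this shape could later be turned into a report with a style file, as suggested in the comments above.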

Further statistics will be added as development of the library continues and it becomes clearer which figures are statistically valuable. The frontend for the module will generate an HTML file with all graphics, legends and required data in an easy to parse format.

An XHTML file? --Mlforcada
Either/or. --Gekz 08:35, 7 May 2011 (UTC)

Regression Testing

This submodule will use YAML, JSON, XML or CSV configuration to test for regressions and check specific coverage situations.

An example of English->French in YAML:

The "wikifiability" of regression testing is always a good idea. YAML uses indents à la Python does it? --Mlforcada 13:17, 6 May 2011 (UTC)
Any test format should be able to be kept in sync with the tests on the Wiki. The reason being that often people who don't use SVN/Apertium/Linux edit the tests. So I ask someone "Hey, go and fix the translations on this page" and they do, without having to install anything. The Wiki should ideally be the "highest priority" source. Meaning that if you have multiple conflicting copies, the Wiki is what should be gone by. - Francis Tyers 09:18, 7 May 2011 (UTC)
Pronoun check:
    I eat: je mange
    you eat: [tu manges, vous mangez]
    he eats: il mange
    she eats: elle mange
    one eats: on mange
    we eat:  [nous mangeons, on mange]
    they eat: [ils mangent, elles mangent]

As you may see, this syntax lets non-programmers define tests easily, without preventing programmers and others from using a format they are more comfortable with, such as JSON, XML or CSV.

The brackets allow one to give multiple correct responses for a given test item.

The code would simply run through the tests as required, reporting whether there were any failures or passes, depending on the settings selected (not unlike HfstTester.py). It will also allow the automatic reversal of the tests, so that one can run a French->English test on the same configuration without having to rewrite the test in reverse.

Give a reference to HfstTester.py for convenience here --Mlforcada 13:17, 6 May 2011 (UTC)
[1] --Gekz 08:46, 7 May 2011 (UTC)
Please consider that regression tests may not be reversible in general --Mlforcada 13:17, 6 May 2011 (UTC)
This is true, and can easily be enforced by a configuration rule. It will be made clearer when I release the planned spec. --Gekz 08:46, 7 May 2011 (UTC)
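
A minimal sketch of how the test run and the automatic reversal might work, using a plain Python dictionary in place of a parsed YAML file. The function names and the `FAKE_EN_FR` stand-in translator are assumptions for illustration; a real run would call the Apertium pipeline instead, and reversal would be switched off for tests that are not reversible.

```python
def normalise(tests):
    """Map every source phrase to a list of accepted translations."""
    return {src: [exp] if isinstance(exp, str) else list(exp)
            for src, exp in tests.items()}


def reverse(tests):
    """Swap direction: every accepted translation becomes a source phrase."""
    reversed_tests = {}
    for src, targets in normalise(tests).items():
        for tgt in targets:
            reversed_tests.setdefault(tgt, []).append(src)
    return reversed_tests


def run(tests, translate):
    """Return (number of passes, list of (source, expected, got) failures)."""
    passes, failures = 0, []
    for src, expected in normalise(tests).items():
        got = translate(src)
        if got in expected:
            passes += 1
        else:
            failures.append((src, expected, got))
    return passes, failures


# The "Pronoun check" section from the YAML example above.
PRONOUN_CHECK = {
    "I eat": "je mange",
    "you eat": ["tu manges", "vous mangez"],
    "he eats": "il mange",
    "she eats": "elle mange",
    "one eats": "on mange",
    "we eat": ["nous mangeons", "on mange"],
    "they eat": ["ils mangent", "elles mangent"],
}

# Hypothetical stand-in for a translator, with one deliberate error.
FAKE_EN_FR = {
    "I eat": "je mange", "you eat": "tu manges", "he eats": "il mange",
    "she eats": "elle mange", "one eats": "on mange",
    "we eat": "nous mangeons", "they eat": "ils mange",
}

passes, failures = run(PRONOUN_CHECK, FAKE_EN_FR.get)
print(passes, failures)
print(reverse(PRONOUN_CHECK)["on mange"])
```

Note that the reversed test for "on mange" accepts both "one eats" and "we eat", which shows why reversal has to collect every source phrase rather than simply swapping keys and values.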

Corpus Generator

Do you mean monolingual corpus generator?? --Mlforcada 13:17, 6 May 2011 (UTC)
At this stage, yes. As has been pointed out, it would be best to be named Corpus Extractor as it is simply making more concise and useful corpora based on user-defined heuristics. --Gekz 08:46, 7 May 2011 (UTC)

This submodule will implement basic functionality for generating corpora from any given text, keeping only the lines that meet given heuristic criteria. Examples of such user-configurable criteria include: sentence length, an acceptable limit of punctuation symbols, an acceptable limit of numerals, excessive proper nouns, excessive English or other foreign-language terms per sentence, etc.

Explain what do you mean by "acceptable limit of punctuation symbols" or "numerals" --Mlforcada 13:17, 6 May 2011 (UTC)
User-defined heuristics control the "acceptable limit" of any input. Basically a regular expression plus a limit, and if you pass that limit, the sentence is disregarded as "not meeting the criteria" --Gekz 08:46, 7 May 2011 (UTC)
A crazy idea: Bitextor-mediated bilingual input for semi-automatic generation of regression testing? --Mlforcada 13:17, 6 May 2011 (UTC)
Possible. I will look into it if I find the time. --Gekz 08:46, 7 May 2011 (UTC)

An example output line would be:

1. The quick brown fox jumps over the lazy dog.

As you can see, this is not unlike the corpus that can be found in en-eo.
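
The "regular expression plus a limit" idea described in the comments above could be sketched as follows. The rule set, the limits and the function names are assumptions chosen for this example, not part of any agreed specification.

```python
import re

# Each rule is (pattern, limit): a sentence is rejected when the pattern
# matches more than `limit` times. Values here are illustrative only.
RULES = [
    (re.compile(r"[^\w\s]"), 4),  # punctuation symbols
    (re.compile(r"\d"), 3),       # digits
]
MIN_WORDS, MAX_WORDS = 3, 40      # sentence length bounds, in words


def accept(sentence, rules=RULES):
    """Return True when the sentence meets every heuristic criterion."""
    n_words = len(sentence.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        return False
    return all(len(pattern.findall(sentence)) <= limit
               for pattern, limit in rules)


def extract(lines):
    """Keep only the lines that pass all rules."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if accept(line)]


corpus = extract([
    "The quick brown fox jumps over the lazy dog.",
    "Call 555-1234 now, please!",
    "Hi.",
])
print(corpus)
```

Keeping the rules as data rather than code is one way to make them user-configurable, since they could then be loaded from a configuration file.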

A specific subclass of this module will be created for generating Wikipedia-based corpora, because Wikimedia markup differs significantly from plain text and requires parsing XML in considerable depth. It will make use of a modified version of esperantowiki-xml2txt.py for parsing the Wikimedia markup.

Corpus Testing

This submodule will resemble testcorpus_en-eo.sh, which can be found in apertium-eo-en, except that it will be reimplemented in Python and will be easily user-configurable.

Explain this in a standalone way --Mlforcada 13:17, 6 May 2011 (UTC)

Coverage Testing

This submodule will resemble corpus-stat-en-eo.sh in that it will give a count of tokenised words, a count of unknown words and a list of unknown words, and will calculate the coverage.

I assume you talk here about naïve coverage --Mlforcada 13:17, 6 May 2011 (UTC)
Again, yes. --Gekz 08:46, 7 May 2011 (UTC)
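
A sketch of a naïve coverage calculation, assuming the input is morphological analyser output in the Apertium stream format, where each word appears as `^surface/analysis$` and an unknown word has an analysis beginning with `*`. The tags in the sample string are illustrative, and escaped characters inside words are ignored in this simplified version.

```python
import re
from collections import Counter

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed):
    """Return (token count, Counter of unknown words, coverage ratio)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        total += 1
        if analyses.startswith("*"):  # assumed marker for unknown words
            unknown[surface] += 1
    known = total - sum(unknown.values())
    ratio = known / total if total else 0.0
    return total, unknown, ratio


sample = ("^the/the<det><def><sp>$ ^blorf/*blorf$ "
          "^eats/eat<vblex><pri><p3><sg>$ ^blorf/*blorf$")
total, unknown, ratio = coverage(sample)
print(total, unknown.most_common(), ratio)
```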

Average Ambiguity

The average is not enough. More detailed statistics are easy to gather and useful --Mlforcada 13:17, 6 May 2011 (UTC)

This submodule will use Apertium tools to find ambiguity in the output, compute the average, and then output a list sorted from highest to lowest ambiguity.

A list of what? --Mlforcada 13:17, 6 May 2011 (UTC)
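
One possible reading of the ambiguity statistics, sketched under the same assumption about the analyser's stream format (`^surface/analysis1/analysis2$`, unknown words marked with `*`). Here the list is taken to be a list of surface forms ranked by their number of analyses, and a histogram is returned alongside the average, in line with the comment above that the average alone is not enough. This interpretation and all names are assumptions.

```python
import re
from collections import Counter

LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def ambiguity(analysed):
    """Return (average analyses per form, ranked forms, histogram).

    Counts are per distinct surface form; unknown words are skipped.
    """
    counts = {}
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        if analyses.startswith("*"):
            continue
        counts[surface] = len(analyses.split("/"))
    # Highest ambiguity first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    average = sum(counts.values()) / len(counts) if counts else 0.0
    histogram = Counter(counts.values())
    return average, ranked, histogram


# Illustrative analyses, not taken from a real dictionary.
sample = ("^saw/saw<n><sg>/see<vblex><past>$ ^the/the<det><def><sp>$ "
          "^book/book<n><sg>/book<vblex><inf>/book<vblex><pres>$")
average, ranked, histogram = ambiguity(sample)
print(average, ranked, dict(histogram))
```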

Possible Timetable

It would be very useful to put absolute dates here, that is, regular month and day dates --Mlforcada 13:20, 6 May 2011 (UTC)
  • Week 1 -- 4: implement core methods of all modules listed above
  • Deliverable 1: an alpha/beta-quality QA library and frontend tools
  • Week 5 -- 8: ensure all stubs and TODOs are completed, all goals met
  • Deliverable 2: an RC-quality QA library and frontend tools
What is RC? --Mlforcada
  • Week 9 -- 12: real world testing, extra features beyond core specification
  • Deliverable 3: a production-quality QA library and frontend tools

Important Dates:

  • June 10: Autumn semester ends
  • June 11: Exam session commences
  • July 1: Exam session ends
  • August 1: Spring semester begins

The beginning of GSoC overlaps with the final weeks of my Autumn semester. This will be no issue, as I currently only attend University twice a week and have plenty of free time to spend working on GSoC. I will have two examinations during the exam period, one in Java and one in Mathematics. During the week leading up to the Mathematics examination, I may be a little difficult to contact and may spend very little time on GSoC. I will make up for this by working much harder in the later weeks.

As for commercial work, I am a sysadmin and lecturer to high school teachers on how to best use their equipment, so a few hours a week I may be working, although this should not clash with GSoC whatsoever.

What kind of "sys" do you "admin"? --Mlforcada 13:20, 6 May 2011 (UTC)

As you can see, my 'summer' ends approximately 3 weeks before the completion of GSoC. At this point I cannot say what my University timetable will look like, but I can assure you that even if I have no time during the week to work on GSoC, I will make full use of my weekends to complete whichever parts of the project are incomplete. As my proposed timetable shows, it is unlikely that much work will remain in the final 4 weeks.

Qualifications

I have been studying and using Python constantly for at least three years, dabbling with open source projects all the way through. I have submitted patches to the GemRB project, and I worked with debian-installer in order to create one of the first working installers for the EeePC 701 which lacked a CD-ROM drive. I have also previously worked with Apertium, such as with the creation of apertium-verbconj, or my adoption of the Tagalog-Cebuano language pair.

It would be nice to name the other projects, so that we do cross-propaganda --Mlforcada 13:25, 6 May 2011 (UTC)
Hey! I own one of them EeePC 701s! I'm interested! --Mlforcada 13:25, 6 May 2011 (UTC)

In order to conjugate Tagalog verbs correctly, it was required that we use HFST as Apertium does not well support infixation. As it turns out, twolc is one of the most painful syntaxes I have ever experienced in my life, so I attempted to implement infixing using nothing but lexc. I completed this task through the liberal use of flag diacritics.

References to twolc needed --Mlforcada 13:25, 6 May 2011 (UTC)
"twolc is" or "twolc has" --Mlforcada 13:25, 6 May 2011 (UTC)
"liberal use of flag diacritics": "liberal" but "systematic"? --Mlforcada 13:25, 6 May 2011 (UTC)

For testing out my HFST dictionary, I implemented HfstTester.py, which is being used by divvun.no. Sjur Moshagen is constantly in contact with me, requesting new features and recommending changes to my code. They have made their own mirror of my code, and there is even a whole page explaining its use and feature wishlist. The skills I learnt implementing this and working with Sjur Moshagen can easily be transferred to a Quality Assurance framework for Apertium.

Add inline web links to your stuff (instead of a list at the end) --Mlforcada 13:25, 6 May 2011 (UTC)

I have implemented a test application that creates a semi-working corpus from a Wikipedia dump. First, you download the Wikipedia dump you wish to use. You then simply run the script with the first parameter being the dictionary and the second one being your output file.

I am not sure I understand this. See my comment above about manpages. --Mlforcada 13:25, 6 May 2011 (UTC)

An optional third parameter limits the number of lines of output. The application makes use of NLTK to parse for sentences, and uses a very rudimentary Wikimedia syntax stripper that needs much more work to be considered anything other than test code. The output quality for languages such as Norwegian or English is very good compared with Wikipedias such as Tagalog, where the general article quality is much lower in both content and in (ab)use of syntax.

"wikimedia syntax stripper": can't you get plain text wikimedia dumps? --Mlforcada 13:25, 6 May 2011 (UTC)
"to parse for sentences": what kind of parsing? --Mlforcada 13:25, 6 May 2011 (UTC)

As I study ICT Engineering and International Studies (BEng DipEngPrac BArts) at University of Technology, Sydney, I have a broad range of subjects at my disposal. A unit called 'Introduction to Digital Systems' went into great detail about the mathematics behind finite state systems, and we were required to learn PIC assembler. These two skills have been invaluable in giving me a firm grounding in how a language pair works, and gave me a clear idea of how I would implement verb conjugation for Tagalog using lexc.

What is PIC assembler? URL? --Mlforcada 13:25, 6 May 2011 (UTC)
lexc? hfst? shouldn't we improve lttoolbox to provide all that functionality? --Mlforcada 13:25, 6 May 2011 (UTC)

In order of strength, programming languages I can use are: Python, Java, Vala, C, C++ and a smattering of interpreted languages like Perl and PHP.

Conclusion

I look forward to correspondence with the larger Apertium team and hope that this year I may make the most of Google Summer of Code in assisting Apertium with their goals.

Sources