Difference between revisions of "User:Ilnar.salimzyan/GSoC2014"

 
(19 intermediate revisions by 2 users not shown)
Line 6: Line 6:


This page is used to organize thoughts and document the development process. If you are only interested in the workplan and stats, refer to the 'Workplan' and 'Current state' sections of the [[Tatar and Russian]] page.

== Post-application period ==


<pre>
<pre>
* work on the 'James and Mary' translation
* <s>work on the 'James and Mary' translation
** get rid of the debugging symbols
** get the baseline WER
** get the baseline WER</s>
* get permission to use one of the modern government-funded Tatar-Russian
dictionaries under a free license and digitize it or fall back to one of
Line 28: Line 26:
=== Literature review ===


(Ideas to try out and notes)
* <s>[[Chunking]]</s> and <s>[[Chunking: A full example]]</s>
** Probably I should stick to the SN and SV convention?


==== <s>[[Chunking]]</s>, <s>[[Chunking: A full example]]</s>, <s>[[N-Stage transfer]]</s> ====
* sme-nob paper, eus-eng paper, eng-kaz paper
* Probably I should stick to the SN and SV convention?
** Ideas to try out:
*: if possible, yes :) --[[User:Unhammer|unhammer]] ([[User talk:Unhammer|talk]]) 09:17, 26 May 2014 (CEST)
*** use macros, but try to avoid variables. "Adapt" macros as seen in some of the hbs pairs can help with that.


==== <s>sme-nob paper</s>, eus-eng paper, eng-kaz paper, <s>"Good Applications for Crummy Machine Translation"</s> ====

* Use macros, they make transfer files shorter and therefore more comprehensible.
'''Other thoughts:'''
* It's possible to use twol to forbid some analyses (hopefully we won't have to use that, but good to know that it's possible).
* acceptance tests for an Apertium MT system are: regression tests on the wiki, corpus test (WER and number of [*@#] errors) and testvoc. Unit testing an Apertium MT system is testing its modules (modes). Figure out how to unit test each module.
** one should be able to run one's tests without an internet connection. Keeping a copy of the 'regression tests' html page in /dev solves this problem, but it doesn't allow us to add new tests while offline. One way to deal with that is to keep a local copy of the regression tests in wiki format, so that if you add new tests while flying over the Atlantic, you can copy-paste them to the wiki page of the pair later.

== Community-bonding period ==


<pre>
<pre>
'''Deliverables 0:'''
# testvoc script(s) which doesn't take forever to run (consider footnote #5 in the proposal)
# <s>testvoc script(s) which doesn't take forever to run (consider footnote #5 in the proposal)</s>
# a way to test the apertium-rus generator
# <s>a way to test the apertium-rus transducer</s>
# a digital dictionary under a free license or ocr'd public domain dictionary
# parallel corpus in /corpus (=development corpus) is expanded with texts which represent domains
the system could potentially be applied to (500 sentences?)
# tat-rus-t1x.test, tat-rus-t2x.test, tat-rus-t3x.test and tat-rus-transfer.test which will run all three
# <s>t1x.test, t2x.test, t3x.test and transfer.test which will run all three</s>
# multiword pending tests on the wiki which kind of cover the core of the desired functionality (at least
# multiword pending tests on the wiki which kind of cover the core of the desired functionality
52 "sentence models" listed in the "Tatar Syntax" book)
(at least 52 "sentence models" listed in the "Tatar Syntax" book)
# "workplan" and "current state" tables on [[Tatar and Russian]] page (on week end, 17-18 May)
# <s>"workplan" and "current state" tables on [[Tatar and Russian]] page which will track progress
on things I've promised to do in the proposal (on week end, 17-18 May)</s>
</pre>
</pre>


=== Testvoc ===


Ok, the usual testvoc (see apertium-tat-rus/testvoc/standard) works and so far doesn't take too much time to run. We've also set up a prefixing system in apertium-tat/tests/morphotactics which, for one word per pardef, provides a text file with the full paradigm of that word.
Ok, the usual testvoc (see apertium-tat-rus/testvoc/standard) works and so far doesn't take too much time to run.

We've also set up a prefixing system in apertium-tat/tests/morphotactics which, for one word per pardef, provides a text file with the full paradigm of that word. The lite testvoc (apertium-tat-rus/testvoc/lite), which takes those text files from apertium-tat, extracts the LUs, runs them through inconsistency.sh and generates a testvoc-summary file, <s>where each line represents stats about each text file,</s> works as well and runs pretty fast. It uses the standard testvoc's <code>inconsistency.sh</code> and <code>inconsistency-summary.sh</code> scripts and thus generates stats for each category (e.g. nouns, adjectives etc.), not for each lexicon name in apertium-tat/tests/morphotactics (e.g. N1, N-COMPOUND-PX etc.).

The corpus testvoc script is in apertium-tat-rus/testvoc/corpus.

=== A way to test the Russian generator ===

Have a look at the apertium-rus/tests/rus.test (run by './qa.sh rus' command in apertium-rus/ directory).

That test alone greatly reduces the fear of modifying apertium-rus, since it provides "some "invariant" that lets us know when we've changed the behavior of the system. The key thing is that correct behavior is defined by what the set of classes did yesterday, not by any external standard of correctness". <ref>Michael Feathers (2002). Working effectively with legacy code.</ref>

=== Unit tests for transfer ===

See apertium-tat-rus/tests/t1x.test ('./qa.sh t1x' typed in apertium-tat-rus/ will run it).

== Other thoughts ==

* Acceptance tests for an Apertium MT system are: regression tests on the wiki, corpus test (WER and number of [*@#] errors) and testvoc. Unit testing an Apertium MT system is testing its modules (modes). Figure out how to unit test each module.
** One should be able to run one's tests without an internet connection. Keeping a copy of the 'regression tests' html page in /dev solves this problem, but it doesn't allow us to add new tests while offline. One way to deal with that is to keep a local copy of the regression tests in wiki format, so that if you add new tests while flying over the Atlantic, you can copy-paste them to the wiki page of the pair later.


* Try to avoid variables (i.e. use attributes where possible, not variables).
The apertium-tat-rus/testvoc/lite, which is supposed to take those text files from apertium-tat, extract LUs, run them through inconsistency.sh and generate a testvoc-summary file where each line represents stats about each text file, doesn't work yet. For the time being, I can do that manually, simply grepping for "[@#]" errors, and automate it along the way.
** E.g. in a numeral_noun agreement rule, you want the numeral's gender and case to be the same as the noun's gender and case. You can of course assign the noun's attribute values to variables and use these variables for the numeral. But the problem is that numerals can have different numbers of tags (e.g. digits only receive the <num> tag).
*** <s>"Adapt" macros as seen in r53019 of tat-rus can help with that</s> (they give numerals some default tags (a different number of them, based on what kind of numeral it is): <pre>бер<num>:один<num><m><an><sg><nom>; биш<num>:пять<num><mfn><pl><nom>; 1<num>:1<num></pre> so that you can "let numeral's attribute be like noun's attribute" later in a rule).
*** Use [[Placeholder attributes]] (CD, AD etc.) instead: <pre>бер<num>:один<num><GD><AD><ND><CD>; биш<num>:пять<num><mfn><pl><CD>; 1<num>:1<num></pre> They accomplish the same thing (one numeral_noun rule will work in all three cases), but in a much simpler way.


== References ==
If time permits, would be good to set it up sooner rather than later (and also the corpus testvoc in the same apertium-tat-rus/testvoc directory).


<references/>
=== Russian generator ===

Latest revision as of 13:08, 11 June 2014

Apertium-tat-rus – machine translation system from Tatar to Russian

This page is used to organize thoughts and document the development process. If you are only interested in the workplan and stats, refer to the 'Workplan' and 'Current state' sections of the Tatar and Russian page.

* <s>work on the 'James and Mary' translation
    ** get rid of the debugging symbols
    ** get the baseline WER</s>
* get permission to use one of the modern government-funded Tatar-Russian
  dictionaries under a free license and digitize it or fall back to one of
  the dictionaries in the public domain and scan that
* read documentation on chunking-based transfer and papers describing other
  Apertium pairs for distant languages

'James and Mary' translation

The story is in corpus/corpus.tat.txt (first 50 lines). There are no [*@#] errors as of r52944. WER is 71.84%, PER is 55.26%.
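For orientation, here is a minimal sketch of how the two figures can be computed for one sentence pair. It is not the evaluation script used for the numbers above; the function names are made up for illustration, and PER is computed with one common definition (word matches regardless of position), which an actual evaluation tool may define slightly differently.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


reference = "a b c d".split()
hypothesis = "a c b d".split()
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER for the same pair; a large gap between the two (as above) suggests that many of the errors are reordering errors.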

Bilingual dictionary

At least we can look up stems from apertium-tat. I am working on getting something bigger than that.

Literature review

(Ideas to try out and notes)

Chunking, Chunking: A full example, N-Stage transfer

  • Probably I should stick to the SN and SV convention?
    if possible, yes :) --unhammer (talk) 09:17, 26 May 2014 (CEST)

sme-nob paper, eus-eng paper, eng-kaz paper, "Good Applications for Crummy Machine Translation"

  • Use macros, they make transfer files shorter and therefore more comprehensible.
  • It's possible to use twol to forbid some analyses (hopefully we won't have to use that, but good to know that it's possible).
'''Deliverables 0:'''
# <s>testvoc script(s) which doesn't take forever to run (consider footnote #5 in the proposal)</s>
# <s>a way to test the apertium-rus transducer</s>
# a digital dictionary under a free license or ocr'd public domain dictionary
# parallel corpus in /corpus (=development corpus) is expanded with texts which represent domains 
  the system could potentially be applied to (500 sentences?)
# <s>t1x.test, t2x.test, t3x.test and transfer.test which will run all three</s>
# multiword pending tests on the wiki which kind of cover the core of the desired functionality
  (at least 52 "sentence models" listed in the "Tatar Syntax" book)
# <s>"workplan" and "current state" tables on [[Tatar and Russian]] page which will track progress
  on things I've promised to do in the proposal (on week end, 17-18 May)</s>

Testvoc

Ok, the usual testvoc (see apertium-tat-rus/testvoc/standard) works and so far doesn't take too much time to run.

We've also set up a prefixing system in apertium-tat/tests/morphotactics which, for one word per pardef, provides a text file with the full paradigm of that word. The lite testvoc (apertium-tat-rus/testvoc/lite), which takes those text files from apertium-tat, extracts the LUs, runs them through inconsistency.sh and generates a testvoc-summary file, works as well and runs pretty fast. It uses the standard testvoc's inconsistency.sh and inconsistency-summary.sh scripts and thus generates stats for each category (e.g. nouns, adjectives etc.), not for each lexicon name in apertium-tat/tests/morphotactics (e.g. N1, N-COMPOUND-PX etc.).

The corpus testvoc script is in apertium-tat-rus/testvoc/corpus.
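As a rough illustration of what counting [*@#] errors in translated text amounts to (this is not the script in testvoc/corpus; the function name and the regular expression are assumptions made for the sketch): in Apertium output, a word prefixed with * is unknown to the analyser, @ marks a lemma missing from the bilingual dictionary, and # marks a form the generator could not produce.

```python
import re
from collections import Counter

# A marker counts only when it starts a word: not preceded by a word
# character and directly followed by one (so "a@b" is not counted).
MARKER = re.compile(r"(?<!\w)([*@#])(?=\w)")


def count_markers(text):
    """Count *, @ and # error markers in translated text."""
    counts = Counter({"*": 0, "@": 0, "#": 0})
    for match in MARKER.finditer(text):
        counts[match.group(1)] += 1
    return counts


sample = "Я *китап #читать и @мәктәп"
print(count_markers(sample))
```

Keeping the three counts separate is useful because each marker points to a different module: the analyser, the bilingual dictionary, or the generator.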

A way to test the Russian generator

Have a look at the apertium-rus/tests/rus.test (run by './qa.sh rus' command in apertium-rus/ directory).

That test alone greatly reduces the fear of modifying apertium-rus, since it provides "some "invariant" that lets us know when we've changed the behavior of the system. The key thing is that correct behavior is defined by what the set of classes did yesterday, not by any external standard of correctness". [1]
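The idea in the quote can be sketched as follows. This is only an illustration of the "what it did yesterday" principle, not the contents of rus.test or qa.sh; the function names, the toy generator and the tags in the example are all hypothetical.

```python
def changed_behaviour(generate, recorded):
    """Return {input: (recorded_output, current_output)} for every input
    whose current output differs from the output recorded earlier."""
    changes = {}
    for source, expected in recorded.items():
        actual = generate(source)
        if actual != expected:
            changes[source] = (expected, actual)
    return changes


def toy_generate(analysis):
    """Toy stand-in for the generator; a real test would call the transducer."""
    table = {
        "дом<n><m><nn><sg><nom>": "дом",
        "дом<n><m><nn><pl><nom>": "дома",
    }
    return table.get(analysis, "#" + analysis)


# Outputs recorded "yesterday"; the second one no longer matches.
recorded = {
    "дом<n><m><nn><sg><nom>": "дом",
    "дом<n><m><nn><pl><nom>": "домы",
}
print(changed_behaviour(toy_generate, recorded))
```

An empty result means nothing changed; a non-empty one does not say whether the change is good or bad, only that it needs to be looked at.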

Unit tests for transfer

See apertium-tat-rus/tests/t1x.test ('./qa.sh t1x' typed in apertium-tat-rus/ will run it).

Other thoughts

  • Acceptance tests for an Apertium MT system are: regression tests on the wiki, corpus test (WER and number of [*@#] errors) and testvoc. Unit testing an Apertium MT system is testing its modules (modes). Figure out how to unit test each module.
    • One should be able to run one's tests without an internet connection. Keeping a copy of the 'regression tests' html page in /dev solves this problem, but it doesn't allow us to add new tests while offline. One way to deal with that is to keep a local copy of the regression tests in wiki format, so that if you add new tests while flying over the Atlantic, you can copy-paste them to the wiki page of the pair later.
  • Try to avoid variables (i.e. use attributes where possible, not variables).
    • E.g. in a numeral_noun agreement rule, you want the numeral's gender and case to be the same as the noun's gender and case. You can of course assign the noun's attribute values to variables and use these variables for the numeral. But the problem is that numerals can have different numbers of tags (e.g. digits only receive the <num> tag).
      • "Adapt" macros as seen in r53019 of tat-rus can help with that (they give numerals some default tags (a different number of them, based on what kind of numeral it is):
        бер<num>:один<num><m><an><sg><nom>;  биш<num>:пять<num><mfn><pl><nom>;  1<num>:1<num>
        so that you can "let numeral's attribute be like noun's attribute" later in a rule).
      • Use Placeholder attributes (CD, AD etc.) instead:
        бер<num>:один<num><GD><AD><ND><CD>;  биш<num>:пять<num><mfn><pl><CD>;  1<num>:1<num>
        They accomplish the same thing (one numeral_noun rule will work in all three cases), but in a much simpler way.
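Why one rule covers all three numerals can be shown with a small simulation. This is not transfer-rule code: the function, the placeholder-to-value table and the noun tags are assumptions made for illustration, and real Russian numeral agreement is more involved than copying tags.

```python
import re

# Hypothetical table: which real tag values each placeholder stands for.
PLACEHOLDERS = {
    "GD": {"m", "f", "nt", "mfn"},                        # gender
    "AD": {"an", "nn"},                                   # animacy
    "ND": {"sg", "pl"},                                   # number
    "CD": {"nom", "gen", "dat", "acc", "ins", "prp"},     # case
}

TAG = re.compile(r"<([^>]+)>")


def agree(numeral, noun):
    """Replace each placeholder tag on the numeral with the matching tag
    of the noun; tags that are not placeholders are left untouched."""
    noun_tags = TAG.findall(noun)

    def fill(match):
        values = PLACEHOLDERS.get(match.group(1))
        if values is None:
            return match.group(0)       # an ordinary tag such as <num>
        for tag in noun_tags:
            if tag in values:
                return "<" + tag + ">"
        return match.group(0)           # the noun has no such attribute

    return TAG.sub(fill, numeral)


noun = "книга<n><f><nn><sg><ins>"
for numeral in ("один<num><GD><AD><ND><CD>", "пять<num><mfn><pl><CD>", "1<num>"):
    print(agree(numeral, noun))
```

The numeral with four placeholders takes all four values from the noun, the one with only <CD> takes just the case, and the digit passes through unchanged, so no per-numeral branching is needed.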

References

  1. Michael Feathers (2002). Working effectively with legacy code.