User:Ilnar.salimzyan/GSoC2014

From Apertium

Revision as of 23:25, 14 May 2014

Apertium-tat-rus – a machine translation system from Tatar to Russian

This page is used to organize thoughts and document the development process. If you are only interested in the workplan and stats, refer to the 'Workplan' and 'Current state' sections of the [[Tatar and Russian]] page.

== Post-application period ==

* work on the 'James and Mary' translation
** <s>get rid of the debugging symbols</s>
** get the baseline WER
* get permission to use one of the modern government-funded Tatar-Russian dictionaries under a free license and digitize it, or fall back to one of the dictionaries in the public domain and scan that
* read documentation on chunking-based transfer and papers describing other Apertium pairs for distant languages
** <s>[[Chunking]]</s>, <s>[[Chunking: A full example]]</s>, sme-nob paper, eus-eng paper, eng-kaz paper.

=== 'James and Mary' translation ===

The story is in corpus/corpus.tat.txt (first 50 lines). There are no [*@#] errors as of r52944. WER is 71.84%, PER is 55.26%.
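A minimal sketch of how the [*@#] error check could be done with plain grep. The output file name and its contents are made up here, since the actual translation command is not part of these notes:

```shell
# Hypothetical example: count lines of translator output that still carry
# unknown-word (*), transfer (@) or generation (#) marks. The file below
# stands in for the real translated story.
printf 'один @два #три *дүрт\nбиш алты\n' > /tmp/james-mary.rus.txt

# Number of lines containing at least one [*@#] error mark:
grep -c '[*@#]' /tmp/james-mary.rus.txt   # prints: 1
```

A clean corpus run should make this count reach 0, which is exactly the "no [*@#] errors" state reported above.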

=== Bilingual dictionary ===

At least we can look up stems from apertium-tat; I am working on getting something bigger than that.

=== Literature review ===

* <s>[[Chunking]]</s> and <s>[[Chunking: A full example]]</s>
** Probably I should stick to the SN and SV convention?
* sme-nob paper, eus-eng paper, eng-kaz paper
** Ideas to try out:
*** use macros, but try to avoid variables. "Adapt" macros, as seen in some of the hbs pairs, can help with that.


'''Other thoughts:'''

* acceptance tests for an Apertium MT system are: regression tests on the wiki, the corpus test (WER and number of [*@#] errors) and testvoc. Unit testing an Apertium MT system means testing its modules (modes). Figure out how to unit test each module.
** one should be able to run the tests without an internet connection. Keeping a copy of the 'regression tests' HTML page in /dev solves this problem, but it doesn't allow us to add new tests while offline. One way to deal with that is to keep a local copy of the regression tests in wiki format, so that if you add new tests while flying over the Atlantic, you can copy-paste them to the wiki page of the pair later.
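A sketch of what running such a local, wiki-format test file could look like. Everything here is hypothetical: the `{{test|tat|source|expected}}` line format is assumed, and the translator is stubbed with a tiny lookup function, since no Apertium pipeline is assumed to be installed:

```shell
# Hypothetical offline regression-test runner over a local wiki-format file.
cat > /tmp/regression.wiki <<'EOF'
{{test|tat|Мин киттем.|Я ушёл.}}
{{test|tat|Ул килде.|Он пришёл.}}
EOF

translate() {   # stand-in for the real pipeline, e.g.: apertium -d . tat-rus
  case "$1" in
    "Мин киттем.") echo "Я ушёл." ;;
    *)             echo "Он пришёл." ;;
  esac
}

pass=0; total=0
while IFS='|' read -r _ _ src expected; do
  expected=${expected%\}\}}          # strip the trailing }} of the template
  total=$((total + 1))
  [ "$(translate "$src")" = "$expected" ] && pass=$((pass + 1))
done < /tmp/regression.wiki
echo "$pass/$total tests passed"     # prints: 2/2 tests passed
```

New tests added offline this way stay in the same format as the wiki page, so copy-pasting them back later is trivial.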

== Community-bonding period ==

'''Deliverables 0:'''
# testvoc script(s) which don't take forever to run (consider footnote #5 in the proposal)
# a way to test the apertium-rus generator
# a digital dictionary under a free license, or an OCR'd public-domain dictionary
# the parallel corpus in /corpus (= the development corpus) expanded with texts which represent domains the system could potentially be applied to (500 sentences?)
# tat-rus-t1x.test, tat-rus-t2x.test, tat-rus-t3x.test and tat-rus-transfer.test, which will run all three
# multiword pending tests on the wiki which roughly cover the core of the desired functionality (at least the 52 "sentence models" listed in the "Tatar Syntax" book)
# "Workplan" and "Current state" tables on the [[Tatar and Russian]] page (on the weekend, 17-18 May)

=== Testvoc ===

OK, the usual testvoc (see apertium-tat-rus/testvoc/standard) works and so far doesn't take too much time to run. We've also set up a prefixing system in apertium-tat/tests/morphotactics which, for one word per pardef, provides a text file with the full paradigm of that word.

The apertium-tat-rus/testvoc/lite setup, which is supposed to take those text files from apertium-tat, extract the LUs, run them through inconsistency.sh and generate a testvoc-summary file where each line gives stats for one text file, doesn't work yet. For the time being, I can do that manually and simply grep for "[@#]" errors.

If time permits, it would be good to set it up, and then to use the same inconsistency.sh for all *three* types of testvoc: standard, lite and corpus.
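The testvoc-summary idea could be sketched roughly as below. This is not the real lite testvoc: the paradigm files are made up on the spot, and the step that would pipe each file through inconsistency.sh is replaced by grepping the files directly, so only the per-file summary logic is shown:

```shell
# Hypothetical sketch: one stats line per paradigm file, as a stand-in for
# the planned testvoc-summary. File names and contents are invented.
mkdir -p /tmp/morphotactics
printf 'алма\nалманы\n#алмага\n'  > /tmp/morphotactics/n-alma.txt
printf 'бар\nбарды\nбарачак\n'    > /tmp/morphotactics/v-bar.txt

for f in /tmp/morphotactics/*.txt; do
  total=$(wc -l < "$f")
  # grep -c exits non-zero when the count is 0, hence the || true guard
  errs=$(grep -c '[@#]' "$f" || true)
  printf '%s: %s forms, %s with [@#] errors\n' "$(basename "$f")" "$total" "$errs"
done > /tmp/testvoc-summary
cat /tmp/testvoc-summary
```

A summary in this shape makes it easy to spot which pardefs still leak [@#] errors, without rerunning the full standard testvoc.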

=== Russian generator ===