Difference between revisions of "Apertium-kaz-tat/paper"

From Apertium
<s>We're submitting a paper on [[apertium-kaz-tat]] to [http://www.mtsummit2013.info/impdates.asp MT Summit 2013]. DEADLINE: APRIL 22.</s>
Our paper was accepted to [http://www.mtsummit2013.info/impdates.asp MT Summit 2013]. You can read it [http://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-kaz-tat/paper/ here].
   
We're revising our paper that was accepted to [http://www.mtsummit2013.info/impdates.asp MT Summit 2013]. DEADLINE: JULY 1st.
 
 
== TODO (Camera-ready) ==
 
=== Ilnar ===
 
* 0.1.0 stable release
 
** Go through FIXMEs and CHECKs with JNW to see how many we can clear up
 
** Discuss дағы{{tag|cnjcoo}}/{{tag|postadv}} thing with JNW
 
** Did we ever implement the {{tag|adj}}+е{{tag|cop}} stuff we discussed? Both -kaz and -tat need it.
 
* '''tat→kaz eval'''
 
** Get казашки (Kazakh women) to post-edit Kazakh stuff
 
 
=== JNW ===
 
* '''take out comparative phonology stuff?'''
 
* fix some of the last few bits of Tatar phonology?
 
* add latin-alphabet transcriptions? ☹
 
** Problem is, a transcription would hide many of the "phonological"/orthographic problems we've had to deal with —[[User:Firespeaker|Firespeaker]] 18:15, 1 June 2013 (UTC)
 
** Only one reviewer complained about this. If someone's really interested in the paper and doesn't know Cyrillic, they can work around it—figuring out Cyrillic to the extent needed to figure out what we're talking about really isn't that hard. It'll take up much more space than it's worth, I think. —[[User:Firespeaker|Firespeaker]] 18:15, 1 June 2013 (UTC)
 
* Explain that the irregular A-stem verbs in Tatar are a lexical property of the verb
 
** ''do the two verb classes in Tatar differ for some other reason than the last vowel in the verb stem (that is, is class membership a lexical property or a phonological property and if the latter, why is this a lexicon distinction and not a morphophonological issue?)''
 
* fix all references to revision numbers after we have a stable release that we're testing on
 
* Minor stuff
 
** 2nd -> write 2^(nd) or second
 
** table 1 -> Table 1
 
** Fig. 1 -> Figure 1
 
** based on context rules the -> based on context rules, the
 
** structural transfer module which -> module, which
 
** Both "CG" and "Constraint Grammar" forms are used; pick one and use it consistently
 
** Table 4 does not appear in running text.
 
** ''The contents of Sections 2 and 3 do not correspond to the description given in Section 1.''
 
 
=== to sort ===
 
* BLEU score
 
** ''We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions.'' —FMT
 
** ''BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid.'' —FMT
 
 
* Lemma-by-lemma analysis
 
** ''We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form?'' —FMT
 
** In some cases of ambiguity this could happen. —JNW
 
 
* justification/motivation for the system
 
** The gain (if any) when using the current system followed by post-editing compared to translating from scratch
 
*** ''Could we do a small time-based experiment? E.g. time taken to translate 2000 words with/without the system?'' —FMT
 
** For a marginalised language community to be able to communicate externally without going through the majority language.
 
** An MT system gives a good opportunity to work on other language resources, e.g. morph. analyser, disambiguator, etc. --- these resources can be used for other things.
 
** The effort-to-reward payoff is high: within a few months, we can get an effective system. This would not be the case for unrelated languages.
 
** Why going through Russian isn't "better": (''This is retarded, we should probably put in a comment on why this is stupid.'' —FMT)
 
*** Translate documents/sources/etc. efficiently without finding/paying someone to translate it to Russian first
 
** More text on Kazakh Wikipedia (and generally more "human knowledge" available, maybe):
 
*** kk.wiki 200,000 articles, tt.wiki <50k
 
*** kk:Астана vs. tt:Астана.
 
* Provide the effort, in person-months, devoted to building the current system (plus the effort required to increase the coverage to 95%?)
 
** ''We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5?'' —FMT (''We should exclude the HFST bug in this.'')
 
* ''You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.''
 
 
== TODO (Original submission) ==
 
Ideal benchmarks:
 
* document rules in the rlx with example sentences
 
* more like 100-150 (currently ~40) disambiguation rules in -kaz
 
 
=== Ilnar ===
 
* Development corpus (lots and lots of text)
 
** <s>Work on increasing coverage (via lexc) and trimmed coverage (via dix) to 90%</s>
 
** Work on making sure testvoc passes
 
*** i.e., corpus testvoc
 
** add rules: [[Apertium-kaz-tat/Ideas_for_Disambiguation_Rules|disambiguation]] (CG), lexical selection, and transfer.
 
* <s>Test corpus (about 10 pages; don't base rules on this text!)</s>
 
** <s>Make a gold standard translation/correct some tests for [[Evaluation|error-rate testing]]</s> [http://pastebin.com/7tzGCgMX Done.]
 
* Paper
 
** <s>Add affiliation to paper</s>
 
** Help JNW come up with some more contrastive stuff <s>(see below / <tt>FIXME: Ilnar</tt>s in paper)</s>
 
*** Tatar equivalent of барайын деп жатырмын "I'm planning on going" ?
 
*** What was it that you noticed with -ғалы/-гелі (and its correspondent in Tatar)?
 
** Find some exemplary bidix entries for figure 2.
 
** New example for table 3
 
*** maybe Kazakh equivalent of original sentence: "Ауа райы бүгін өте/әбден жақсы, жылы."
 
*** maybe "Ол енді ол дыбысты анығырақ ести бастады" (some good ambiguity). Unfortunately, current output is "Ул иңне ул тавышны аныграк ишетә башлады". Could we fix this?
 
 
=== Fran ===
 
* Delegate out error-rate testing tasks
 
* <s>new version of Table 2</s>
 
 
=== JNW ===
 
* Work on last few issues in -tat twol
 
* <s>Write up background</s>
 
* Contrastive analysis of Kazakh and Tatar
 
** <s>phonological differences (a generalised summary, 2 or 3 small specific examples)</s>
 
** <s>orthographical differences (a generalised summary, 1 or 2 small specific examples)</s>
 
** <s>lexical and morphological differences (2 or 3 specific examples)</s>
 
** <s>morphotactic differences (2 or 3 specific examples)</s>
 
** syntactic differences (2 or 3 specific examples)
 
* Coverage stuff
 
** divide corpora into 10 pieces and run coverage for each to get stddev
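
The coverage-stddev item above could be scripted roughly as follows. This is only a sketch under stated assumptions: it takes analyser output in the Apertium stream format (<tt>^surface/analysis$</tt>, with unknown words marked by a leading <tt>*</tt> in the analysis), counts a token as covered if it received any analysis, and splits the corpus into contiguous pieces. The function names <tt>known_flags</tt>, <tt>split_into_pieces</tt> and <tt>coverage_stats</tt> are hypothetical and not part of any Apertium tool.

```python
import re
import statistics

# Assumption: input is analyser output in Apertium stream format, e.g.
# "^бар/бар<v><iv>$ ^фуу/*фуу$", where an unknown word has "*" as the
# first character of its analysis.
LU_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def known_flags(analysed_text):
    """Return one boolean per lexical unit: True if the word was analysed."""
    return [not m.group(2).startswith("*") for m in LU_RE.finditer(analysed_text)]


def split_into_pieces(items, n_pieces=10):
    """Split a list into n_pieces contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def coverage_stats(flags, n_pieces=10):
    """Return (per-piece coverage, mean, sample standard deviation)."""
    # Drop empty pieces, which only occur when there are fewer tokens than pieces.
    pieces = [p for p in split_into_pieces(flags, n_pieces) if p]
    per_piece = [sum(p) / len(p) for p in pieces]
    mean = statistics.mean(per_piece)
    stdev = statistics.stdev(per_piece) if len(per_piece) > 1 else 0.0
    return per_piece, mean, stdev
```

Contiguous pieces keep each chunk a coherent stretch of text; shuffling tokens before splitting would give a smaller, less informative spread.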
 
 
=== Over-all ===
 
<s>Abstract</s> <s>1 2 3 3.1 3.2 3.3</s> <s>3.4</s> <s>4 4.1 4.2</s> 4.3 4.4 4.5 5 5.1 6 Acknowledgements <s>References</s>
 
   
 
[[Category:Kazakh and Tatar|*]]
 

Latest revision as of 12:59, 16 March 2014
