Difference between revisions of "Apertium-kaz-tat/paper"

Revision as of 22:17, 27 June 2013

We're submitting a paper on apertium-kaz-tat to MT Summit 2013. DEADLINE: APRIL 22.

We're revising our paper that was accepted to MT Summit 2013. DEADLINE: JULY 1st.

TODO (Camera-ready)

Ilnar

  • 0.1.0 stable release
    • Go through FIXMEs and CHECKs with JNW to see how many we can clear up
    • Discuss дағы<cnjcoo>/<postadv> thing with JNW
    • Did we ever implement the <adj><cop> stuff we discussed? Both -kaz and -tat need it.
  • tat→kaz eval
    • Get Kazakh women to post-edit Kazakh stuff
  • You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.
  • Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
    • We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ? —FMT (We should exclude the HFST bug in this.)
  • calculate gain in person hours using system over translating by hand?

Algorithm for checking dictionaries (as part of the testvocing)

  • Go through entries in bidix
    • Get rid of duplications, FIXME's, alternatively spelled variants (handling them in lexc instead)
  • Look up bidix stems in lexc's and make sure every one of them is in there (and we don't lose any bit of coverage)
  • Try to get rid of FIXME's for stems in lexc's
  • Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)
  • Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
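
The look-up step in the list above (making sure every bidix stem is also in the lexc) could be sketched roughly as follows. This is only an illustration, not the script actually used: the function names are made up, the regular expression only covers simple one-line lexc entries of the form "upper:lower CONT ;", and it collects upper sides from every lexicon rather than from stem lexicons only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern for a simple lexc entry: "upper[:lower] CONT ;".
# "%x" is treated as an escaped character, as in lexc.
LEXC_ENTRY = re.compile(
    r"^\s*((?:%.|[^\s:;!%])+)(?::(?:%.|[^\s;!%])*)?\s+[^\s;]+\s*;"
)


def lemma_of(node):
    """Join the text of an <l>/<r> element, turning <b/> into a space."""
    parts = [node.text or ""]
    for child in node:
        if child.tag == "b":
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def bidix_stems(bidix_xml, side="l"):
    """Return the set of lemmas on one side ('l' or 'r') of a bidix."""
    root = ET.fromstring(bidix_xml)
    return {lemma_of(n) for e in root.iter("e") for n in e.iter(side)} - {""}


def lexc_stems(lexc_text):
    """Return the upper sides of all simple entries in a lexc file."""
    stems = set()
    for line in lexc_text.splitlines():
        match = LEXC_ENTRY.match(line)
        if match:
            # Drop the lexc escape character: "су% алу" -> "су алу"
            stems.add(re.sub(r"%(.)", r"\1", match.group(1)))
    return stems


def missing_from_lexc(bidix_xml, lexc_text, side="l"):
    """Bidix stems on the given side that have no entry in the lexc."""
    return bidix_stems(bidix_xml, side) - lexc_stems(lexc_text)
```

Calling missing_from_lexc once with side="l" against the one lexc and once with side="r" against the other would give the stems to add before running testvoc; which side corresponds to which language depends on the direction of the bidix.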

Fran

  • Help calculate BLEU score and explain why it's stupid.

JNW

  • new style file
  • take out comparative phonology stuff?
  • fix some of the last few bits of Tatar phonology?
  • Explain that the irregular A-stem verbs in Tatar are a lexical property of the verb
    • do the two verb classes in Tatar differ for some other reason than the last vowel in the verb stem (that is, is class membership a lexical property or a phonological property and if the latter, why is this a lexicon distinction and not a morphophonological issue?)
  • fix all references to revision numbers after we have a stable release that we're testing on
  • Minor stuff
    • 2nd -> write 2^(nd) or second
    • table 1 -> Table 1
    • Fig. 1 -> Figure 1
    • based on context rules the -> based on context rules, the
    • structural transfer module which -> module, which
    • both "CG" and "Constraint Grammar" are used; pick one form and use it consistently
    • Table 4 does not appear in running text.
    • The contents of Sections 2 and 3 do not correspond to the description given in Section 1.
  • justification/motivation for the system
    • gain in person hours using system over translating by hand
    • For a marginalised language community to be able to communicate externally without going through the majority language.
    • An MT system gives a good opportunity to work on other language resources, e.g. morph. analyser, disambiguator, etc. --- these resources can be used for other things.
    • The time-effort : reward pay off is high. With a few months, we can get an effective system. This would not be the case for unrelated languages.
    • Why going through Russian isn't "better": (This is retarded, we should probably put in a comment on why this is stupid. —FMT)
      • Translate documents/sources/etc. efficiently without finding/paying someone to translate it to Russian first
    • more text on Wikipedia in Kazakh (and generally more "human knowledge" available, maybe):
      • kk.wiki 200,000 articles, tt.wiki <50k
      • kk:Астана vs. tt:Астана.

incorporated above

  • justification stuff
    • The gain (if any) when using the current system followed by post-editing compared to translating from scratch
      • Could we do a small time-based experiment ? E.g. time taken to translate 2000 words with/without the system ? —FMT
  • BLEU score
    • We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions. —FMT
    • BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid. —FMT

unincorporated

  • add latin-alphabet transcriptions? ☹
    • Problem is, a transcription would hide many of the "phonological"/orthographic problems we've had to deal with —Firespeaker 18:15, 1 June 2013 (UTC)
    • Only one reviewer complained about this. If someone's really interested in the paper and doesn't know Cyrillic, they can work around it—figuring out Cyrillic to the extent needed to figure out what we're talking about really isn't that hard. It'll take up much more space than it's worth, I think. —Firespeaker 18:15, 1 June 2013 (UTC)
  • Lemma-by-lemma analysis
    • We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ? —FMT
    • In some cases of ambiguity this could happen. —JNW

TODO (Original submission)

Ideal benchmarks:

  • document rules in the rlx with example sentences
  • more like 100-150 (currently ~40) disambiguation rules in -kaz

Ilnar

  • Development corpus (lots and lots of text)
    • Work on increasing coverage (via lexc) and trimmed coverage (via dix) to 90%
    • Work on making sure testvoc passes
      • i.e., corpus testvoc
    • add rules — disambiguation (CG), lexical selection, and transfer.
  • Test corpus (about 10 pages; don't base rules on this text!)
  • Paper
    • Add affiliation to paper
    • Help JNW come up with some more contrastive stuff (see below / FIXME: Ilnars in paper)
      • Tatar equivalent of барайын деп жатырмын "I'm planning on going" ?
      • What was it that you noticed with -ғалы/-гелі (and its correspondent in Tatar)?
    • Find some exemplary bidix entries for figure 2.
    • New example for table 3
      • maybe Kazakh equivalent of original sentence: "Ауа райы бүгін өте/әбден жақсы, жылы."
      • maybe "Ол енді ол дыбысты анығырақ ести бастады" (some good ambiguity). Unfortunately, current output is "Ул иңне ул тавышны аныграк ишетә башлады". Could we fix this?

Fran

  • Delegate out error-rate testing tasks
  • new version of Table 2

JNW

  • Work on last few issues in -tat twol
  • Write up background
  • Contrastive analysis of Kazakh and Tatar
    • phonological differences (a generalised summary, 2 or 3 small specific examples)
    • orthographical differences (a generalised summary, 1 or 2 small specific examples)
    • lexical and morphological differences (2 or 3 specific examples)
    • morphotactic differences (2 or 3 specific examples)
    • syntactic differences (2 or 3 specific examples)
  • Coverage stuff
    • divide corpora into 10 pieces and run coverage for each to get stddev
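
The coverage item above (split the corpus into 10 pieces, compute coverage on each, report the standard deviation) might look something like this. It is a sketch under assumptions: the corpus is already tokenised into a list, and is_known is a hypothetical predicate standing in for "the analyser returns at least one analysis for this token".

```python
import statistics


def split_into_pieces(tokens, n=10):
    """Split a token list into n contiguous pieces of near-equal size."""
    size, extra = divmod(len(tokens), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `extra` pieces get one additional token each.
        end = start + size + (1 if i < extra else 0)
        pieces.append(tokens[start:end])
        start = end
    return pieces


def coverage(tokens, is_known):
    """Fraction of tokens for which is_known(token) is true."""
    return sum(1 for t in tokens if is_known(t)) / len(tokens)


def coverage_stats(tokens, is_known, n=10):
    """Mean and sample standard deviation of coverage over n pieces."""
    scores = [
        coverage(piece, is_known)
        for piece in split_into_pieces(tokens, n)
        if piece  # skip empty pieces when the corpus has fewer than n tokens
    ]
    # statistics.stdev needs at least two scores.
    return statistics.mean(scores), statistics.stdev(scores)
```

Contiguous pieces are used here rather than a random shuffle, so the spread reflects variation between parts of the corpus; whether that is the intended reading of "10 pieces" is an assumption.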

Over-all

Abstract 1 2 3 3.1 3.2 3.3 3.4 4 4.1 4.2 4.3 4.4 4.5 5 5.1 6 Acknowledgements References