Difference between revisions of "Apertium-kaz-tat/paper"

Revision as of 06:13, 6 June 2013

~~We're submitting a paper on apertium-kaz-tat to MT Summit 2013. DEADLINE: APRIL 22.~~

We're revising our paper that was accepted to MT Summit 2013. DEADLINE: JULY 1st.

TODO (Camera-ready)

Ilnar

0.1.0 stable release
- Go through FIXMEs and CHECKs with JNW to see how many we can clear up
- Discuss дағы<cnjcoo>/<postadv> thing with JNW
- Did we ever implement the <adj>+е<cop> stuff we discussed? Both -kaz and -tat need it.
tat→kaz eval
- Get Kazakh women (казашки) to post-edit the Kazakh output
Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
- We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ? —FMT (We should exclude the HFST bug in this.)

JNW

take out comparative phonology stuff?
fix some of the last few bits of Tatar phonology?
add latin-alphabet transcriptions? ☹
- Problem is, a transcription would hide many of the "phonological"/orthographic problems we've had to deal with —Firespeaker 18:15, 1 June 2013 (UTC)
- Only one reviewer complained about this. If someone's really interested in the paper and doesn't know Cyrillic, they can work around it—figuring out Cyrillic to the extent needed to figure out what we're talking about really isn't that hard. It'll take up much more space than it's worth, I think. —Firespeaker 18:15, 1 June 2013 (UTC)
Explain that the irregular A-stem verbs in Tatar are a lexical property of the verb
- do the two verb classes in Tatar differ for some other reason than the last vowel in the verb stem (that is, is class membership a lexical property or a phonological property and if the latter, why is this a lexicon distinction and not a morphophonological issue?)
fix all references to revision numbers after we have a stable release that we're testing on
Minor stuff
- 2nd -> write 2^(nd) or second
- table 1 -> Table 1
- Fig. 1 -> Figure 1
- based on context rules the -> based on context rules, the
- structural transfer module which -> module, which
- both "CG" and "Constraint Grammar" forms are used; pick one and use it consistently
- Table 4 does not appear in running text.
- The contents of Sections 2 and 3 do not correspond to the description given in Section 1.
justification/motivation for the system
- The gain (if any) when using the current system followed by post-editing compared to translating from scratch
  - Could we do a small time-based experiment ? E.g. time taken to translate 2000 words with/without the system ? —FMT
- For a marginalised language community to be able to communicate externally without going through the majority language.
- An MT system gives a good opportunity to work on other language resources, e.g. morph. analyser, disambiguator, etc. --- these resources can be used for other things.
- The reward-to-effort payoff is high: within a few months we can get an effective system. This would not be the case for unrelated languages.
- Why going through Russian isn't "better": (This is retarded, we should probably put in a comment on why this is stupid. —FMT)
  - Translate documents/sources/etc. efficiently without finding/paying someone to translate it to Russian first
- more text on Wikipedia in Kazakh (and maybe more "human knowledge" available in general):
  - kk.wiki 200,000 articles, tt.wiki <50k
  - kk:Астана vs. tt:Астана.

to sort

BLEU score
- We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions. —FMT
- BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid. —FMT

Lemma-by-lemma analysis
- We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ? —FMT
- In some cases of ambiguity this could happen. —JNW
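
A lemma-to-lemma comparison could be sketched roughly as below. This is only an illustration, not something we have implemented: it assumes both the system output and the post-edited reference are available as disambiguated lexical units in Apertium stream format (`^lemma<tags>$`, no surface form), that the two sides have the same number of units, and that position-by-position comparison is good enough. The function names and the ASCII lemmas in the usage note are hypothetical.

```python
import re

# One lexical unit in Apertium stream format, e.g. ^lemma<n><acc>$ .
# Assumption: units are already disambiguated, so the text before the
# first tag is the lemma (there is no "surface/" prefix).
LEXICAL_UNIT = re.compile(r"\^([^<$/^]+)[^$^]*\$")


def lemmas(stream):
    """Return the lemmas of all lexical units found in `stream`."""
    return [m.group(1) for m in LEXICAL_UNIT.finditer(stream)]


def lemma_accuracy(hypothesis, reference):
    """Fraction of positions where hypothesis and reference lemmas agree.

    Simplification: compares position by position, so it only makes
    sense when both sides contain the same number of lexical units.
    """
    hyp, ref = lemmas(hypothesis), lemmas(reference)
    if len(hyp) != len(ref):
        raise ValueError("hypothesis and reference differ in length")
    if not ref:
        return 0.0
    matches = sum(1 for h, r in zip(hyp, ref) if h == r)
    return matches / len(ref)
```

For example, `lemma_accuracy("^ul<prn><dem>$ ^tavysh<n><acc>$", "^ul<det><dem>$ ^tawis<n><acc>$")` counts the first unit as correct even though its tags differ, and the second as wrong. Sentences with insertions or deletions would need a proper alignment step first.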

You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.

TODO (Original submission)

Ideal benchmarks:

document rules in the rlx with example sentences
more like 100-150 (currently ~40) disambiguation rules in -kaz

Ilnar

Development corpus (lots and lots of text)
- ~~Work on increasing coverage (via lexc) and trimmed coverage (via dix) to 90%~~
- Work on making sure testvoc passes
  - i.e., corpus testvoc
- add rules — disambiguation (CG), lexical selection, and transfer.
~~Test corpus (about 10 pages; don't base rules on this text!)~~
- ~~Make a gold standard translation/correct some tests for error-rate testing~~ Done.
Paper
- ~~Add affiliation to paper~~
- Help JNW come up with some more contrastive stuff ~~(see below / FIXME: Ilnars in paper)~~
  - Tatar equivalent of барайын деп жатырмын "I'm planning on going" ?
  - What was it that you noticed with -ғалы/-гелі (and its correspondent in Tatar)?
- Find some exemplary bidix entries for figure 2.
- New example for table 3
  - maybe Kazakh equivalent of original sentence: "Ауа райы бүгін өте/әбден жақсы, жылы."
  - maybe "Ол енді ол дыбысты анығырақ ести бастады" (some good ambiguity). Unfortunately, current output is "Ул иңне ул тавышны аныграк ишетә башлады". Could we fix this?

Fran

Delegate out error-rate testing tasks
~~new version of Table 2~~

JNW

Work on last few issues in -tat twol
~~Write up background~~
Contrastive analysis of Kazakh and Tatar
- ~~phonological differences (a generalised summary, 2 or 3 small specific examples)~~
- ~~orthographical differences (a generalised summary, 1 or 2 small specific examples)~~
- ~~lexical and morphological differences (2 or 3 specific examples)~~
- ~~morphotactic differences (2 or 3 specific examples)~~
- syntactic differences (2 or 3 specific examples)
Coverage stuff
- divide corpora into 10 pieces and run coverage for each to get stddev
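
The split-and-measure step above could look something like the following. It is a minimal sketch under assumptions: the corpus is already tokenised into a list, and `is_known` is a hypothetical stand-in for whatever check decides that the analyser covered a token (it is not an Apertium API). Coverage is taken as known tokens divided by all tokens in each piece.

```python
import statistics


def split_into_pieces(tokens, n=10):
    """Split `tokens` into n contiguous pieces of nearly equal size."""
    size, remainder = divmod(len(tokens), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `remainder` pieces each take one extra token.
        end = start + size + (1 if i < remainder else 0)
        pieces.append(tokens[start:end])
        start = end
    return pieces


def coverage(tokens, is_known):
    """Fraction of tokens for which `is_known` returns True."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_known(t)) / len(tokens)


def coverage_stats(tokens, is_known, n=10):
    """Return (mean, sample standard deviation, per-piece coverages)."""
    scores = [coverage(p, is_known) for p in split_into_pieces(tokens, n) if p]
    mean = statistics.mean(scores)
    stddev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stddev, scores
```

`statistics.stdev` gives the sample standard deviation (dividing by the number of pieces minus one); `statistics.pstdev` could be swapped in if the population form is preferred.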

Over-all

~~Abstract~~ ~~1 2 3 3.1 3.2 3.3~~ ~~3.4~~ ~~4 4.1 4.2~~ 4.3 4.4 4.5 5 5.1 6 Acknowledgements ~~References~~

@@ Line 11: / Line 11: @@
 * '''tat→kaz eval'''
 ** Get казашки to post-edit Kazakh stuff
+* Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
+** ''We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ?'' —FMT  (''We should exclude the HFST bug in this.'')
 === JNW ===
@@ Line 30: / Line 32: @@
 ** Table 4 does not appear in running text.
 ** ''The contents of Sections 2 and 3 do not correspond to the description given in Section 1.''
-=== to sort ===
-* BLEU score
-** ''We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions.'' —FMT
-** ''BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid.'' —FMT
-* Lemma-by-lemma analysis
-** ''We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ?'' —FMT
-** In some cases of ambiguity this could happen. —JNW
 * justification/motivation for the system
 ** The gain (if any) when using the current system followed by post-editing compared to translating from scratch
@@ Line 51: / Line 43: @@
 *** kk.wiki 200,000 articles, tt.wiki <50k
 *** kk:Астана vs. tt:Астана.
-* Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
+=== to sort ===
-** ''We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ?'' —FMT  (''We should exclude the HFST bug in this.'')
+* BLEU score
+** ''We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions.'' —FMT
+** ''BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid.'' —FMT
+* Lemma-by-lemma analysis
+** ''We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ?'' —FMT
+** In some cases of ambiguity this could happen. —JNW
 * ''You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.''
