Difference between revisions of "Apertium-kaz-tat/paper"
Firespeaker (talk | contribs) m (→to sort) |
Firespeaker (talk | contribs) (→TODO (Camera-ready): some rearranging) |
Revision as of 06:13, 6 June 2013
We're submitting a paper on apertium-kaz-tat to MT Summit 2013. DEADLINE: APRIL 22.
We're revising our paper that was accepted to MT Summit 2013. DEADLINE: JULY 1st.
TODO (Camera-ready)
Ilnar
- 0.1.0 stable release
- Go through FIXMEs and CHECKs with JNW to see how many we can clear up
- Discuss дағы <cnjcoo>/<postadv> thing with JNW
- Did we ever implement the <adj>+е<cop> stuff we discussed? Both -kaz and -tat need it.
- tat→kaz eval
- Get казашки to post-edit Kazakh stuff
- Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
- We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ? —FMT (We should exclude the HFST bug in this.)
JNW
- take out comparative phonology stuff?
- fix some of the last few bits of Tatar phonology?
- add latin-alphabet transcriptions? ☹
- Problem is, a transcription would hide many of the "phonological"/orthographic problems we've had to deal with —Firespeaker 18:15, 1 June 2013 (UTC)
- Only one reviewer complained about this. If someone's really interested in the paper and doesn't know Cyrillic, they can work around it—figuring out Cyrillic to the extent needed to figure out what we're talking about really isn't that hard. It'll take up much more space than it's worth, I think. —Firespeaker 18:15, 1 June 2013 (UTC)
- Explain that the irregular A-stem verbs in Tatar are a lexical property of the verb
- do the two verb classes in Tatar differ for some other reason than the last vowel in the verb stem (that is, is class membership a lexical property or a phonological property and if the latter, why is this a lexicon distinction and not a morphophonological issue?)
- fix all references to revision numbers after we have a stable release that we're testing on
- Minor stuff
- 2nd -> write 2^(nd) or second
- table 1 -> Table 1
- Fig. 1 -> Figure 1
- based on context rules the -> based on context rules, the
- structural transfer module which -> module, which
- both "CG" and "Constraint Grammar" forms are used; pick one and use it consistently
- Table 4 does not appear in running text.
- The contents of Sections 2 and 3 do not correspond to the description given in Section 1.
- justification/motivation for the system
- The gain (if any) when using the current system followed by post-editing compared to translating from scratch
- Could we do a small time-based experiment ? E.g. time taken to translate 2000 words with/without the system ? —FMT
- For a marginalised language community to be able to communicate externally without going through the majority language.
- An MT system gives a good opportunity to work on other language resources, e.g. morph. analyser, disambiguator, etc. --- these resources can be used for other things.
- The time-effort : reward pay off is high. With a few months, we can get an effective system. This would not be the case for unrelated languages.
- Why going through Russian isn't "better": (This is retarded, we should probably put in a comment on why this is stupid. —FMT)
- Translate documents/sources/etc. efficiently without finding/paying someone to translate it to Russian first
- more text on Wikipedia in Kazakh (and generally more "human knowledge" available, maybe):
- kk.wiki 200,000 articles, tt.wiki <50k
- kk:Астана vs. tt:Астана.
to sort
- BLEU score
- We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions. —FMT
- BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid. —FMT
- Lemma-by-lemma analysis
- We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ? —FMT
- In some cases of ambiguity this could happen. —JNW
- You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.
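A rough sketch of the lemma-by-lemma idea discussed above: compute a word error rate against the post-edited reference twice, once on surface forms and once on lemmas, and compare. Everything here is an assumption for illustration: tokens are given as (surface, lemma) pairs, the error rate is plain word-level edit distance divided by reference length, and the placeholder tokens are not real Kazakh or Tatar data.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def surface_and_lemma_error(ref_pairs, hyp_pairs):
    """Return (surface error rate, lemma error rate) for (surface, lemma) pairs."""
    surface = error_rate([s for s, _ in ref_pairs], [s for s, _ in hyp_pairs])
    lemma = error_rate([l for _, l in ref_pairs], [l for _, l in hyp_pairs])
    return surface, lemma


# Hypothetical placeholder tokens: the second word has the right lemma
# but the wrong surface form.
reference = [("a1", "A"), ("b1", "B"), ("c1", "C")]
mt_output = [("a1", "A"), ("b2", "B"), ("c1", "C")]
surface_wer, lemma_wer = surface_and_lemma_error(reference, mt_output)
```

A gap between the two numbers would point at morphological generation errors rather than lexical ones. The opposite case asked about above (wrong lemma, right surface form) would show up as a lemma error rate higher than the surface one.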
TODO (Original submission)
Ideal benchmarks:
- document rules in the rlx with example sentences
- more like 100-150 (currently ~40) disambiguation rules in -kaz
Ilnar
- Development corpus (lots and lots of text)
- Work on increasing coverage (via lexc) and trimmed coverage (via dix) to 90%
- Work on making sure testvoc passes
- i.e., corpus testvoc
- add rules: disambiguation (CG), lexical selection, and transfer.
- Test corpus (about 10 pages; don't base rules on this text!)
- Make a gold standard translation/correct some tests for error-rate testing. Done.
- Paper
- Add affiliation to paper
- Help JNW come up with some more contrastive stuff (see below / FIXME: Ilnars in paper)
- Tatar equivalent of барайын деп жатырмын "I'm planning on going"?
- What was it that you noticed with -ғалы/-гелі (and its correspondent in Tatar)?
- Find some exemplary bidix entries for figure 2.
- New example for table 3
- maybe Kazakh equivalent of original sentence: "Ауа райы бүгін өте/әбден жақсы, жылы."
- maybe "Ол енді ол дыбысты анығырақ ести бастады" (some good ambiguity). Unfortunately, current output is "Ул иңне ул тавышны аныграк ишетә башлады". Could we fix this?
Fran
- Delegate out error-rate testing tasks
- new version of Table 2
JNW
- Work on last few issues in -tat twol
- Write up background
- Contrastive analysis of Kazakh and Tatar
- phonological differences (a generalised summary, 2 or 3 small specific examples)
- orthographical differences (a generalised summary, 1 or 2 small specific examples)
- lexical and morphological differences (2 or 3 specific examples)
- morphotactic differences (2 or 3 specific examples)
- syntactic differences (2 or 3 specific examples)
- Coverage stuff
- divide corpora into 10 pieces and run coverage for each to get stddev
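A minimal sketch of the split-and-stddev idea, assuming coverage means the fraction of tokens that get at least one analysis. The input is a list of booleans (True = token was analysed); how those flags are derived from the analyser output is left out, and the function names are made up for this note.

```python
import statistics


def chunk(items, n_pieces=10):
    """Split items into n_pieces contiguous pieces of nearly equal size."""
    size, extra = divmod(len(items), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def coverage(known_flags):
    """Fraction of tokens flagged as known (analysed)."""
    return sum(known_flags) / len(known_flags) if known_flags else 0.0


def coverage_stats(known_flags, n_pieces=10):
    """Mean and sample standard deviation of per-piece coverage."""
    pieces = [p for p in chunk(known_flags, n_pieces) if p]
    scores = [coverage(p) for p in pieces]
    # statistics.stdev needs at least two pieces
    return statistics.mean(scores), statistics.stdev(scores)


# Toy input: 100 tokens, the last 10 unknown.
flags = [True] * 90 + [False] * 10
mean_cov, stddev_cov = coverage_stats(flags)
```

This uses contiguous pieces, so topic shifts within the corpus show up in the deviation. Shuffling sentences before splitting would give a different (usually smaller) figure, so whichever we pick should be stated in the paper.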
Over-all
Abstract 1 2 3 3.1 3.2 3.3 3.4 4 4.1 4.2 4.3 4.4 4.5 5 5.1 6 Acknowledgements References