Difference between revisions of "Apertium-kaz-tat/paper"
Firespeaker (talk | contribs) m (→JNW) |
m (→Ilnar) |
Revision as of 17:00, 1 July 2013
We're submitting a paper on apertium-kaz-tat to MT Summit 2013. DEADLINE: APRIL 22.
We're revising our paper that was accepted to MT Summit 2013. DEADLINE: JULY 1st.
TODO (Camera-ready)
Ilnar
- 0.1.0 stable release
- Go through FIXMEs and CHECKs with JNW to see how many we can clear up
- Discuss дағы <cnjcoo>/<postadv> thing with JNW
- Did we ever implement the <adj>+е<cop> stuff we discussed? Both -kaz and -tat need it.
- Implemented as of r45395. Some disambiguation might be needed though (<adj.subst>+cop at the end of a sentence?) --selimcan 10:23, 29 June 2013 (UTC)
- tat→kaz eval
- Get казашки to post-edit Kazakh stuff
- You mention that the output of the morphological analyser is ambiguous for Kazakh. Is it similar for Tatar? Since the MT system works in both directions this information should be provided.
- Provide the effort in terms of person months devoted to build the current system (+ The effort required to increase the coverage to 95%?)
- We should give this information. The amount of time should be calculated in full-time months. So it's probably around 4-5 ? —FMT (We should exclude the HFST bug in this.)
- calculate gain in person hours using system over translating by hand?
Algorithm for checking dictionaries (as part of the testvocing)
- Go through entries in bidix
- Get rid of duplicates, FIXMEs, and alternatively spelled variants (handle those in lexc instead)
- Look up the bidix stems in the lexc files and make sure every one of them is there (so we don't lose any coverage)
- Try to get rid of FIXMEs for stems in the lexc files
- Have a quick look at continuation lexicons and make sure they follow standards (some of the lexicon names in tat.lexc in particular)
- Run testvoc (either for each different class separately -- commenting out all other root lexicons -- or without modifying anything if it doesn't take forever)
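The duplicate and missing-stem checks in the steps above could be sketched roughly as follows. This is only an illustration, not the project's actual tooling: the function names, the sample entries, and the continuation lexicon name `N1` are hypothetical, and the lexc parsing assumes one `lemma:stem Continuation ;` entry per line with no escaped spaces. A lemma that occurs twice in the bidix may be a legitimate pair of translations, so the duplicate list is a list of entries to look at by hand.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def side_lemma(elem):
    """Return the lemma text of an <l> or <r> element, treating <b/> as a space."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "b":
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def bidix_lemmas(bidix_xml, side="l"):
    """Collect the lemmas on one side ('l' or 'r') of every bidix entry."""
    root = ET.fromstring(bidix_xml)
    return [side_lemma(e) for e in root.iter(side)]


def lexc_lemmas(lexc_text):
    """Collect lemmas from simple 'lemma:stem Continuation ;' lexc lines."""
    lemmas = set()
    for line in lexc_text.splitlines():
        line = line.split("!", 1)[0].strip()  # drop '!' comments
        if not line or line.startswith("LEXICON") or not line.endswith(";"):
            continue
        first = line.split()[0]
        lemmas.add(first.split(":", 1)[0])
    return lemmas


def check(bidix_xml, lexc_text, side="l"):
    """Return (lemmas occurring more than once in bidix, lemmas absent from lexc)."""
    lemmas = bidix_lemmas(bidix_xml, side)
    counts = Counter(lemmas)
    duplicates = sorted(l for l, n in counts.items() if n > 1)
    missing = sorted(set(lemmas) - lexc_lemmas(lexc_text))
    return duplicates, missing


# Hypothetical sample data, for illustration only.
BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>алма<s n="n"/></l><r>алма<s n="n"/></r></p></e>
<e><p><l>кітап<s n="n"/></l><r>китап<s n="n"/></r></p></e>
<e><p><l>алма<s n="n"/></l><r>алма<s n="n"/></r></p></e>
<e><p><l>қала<s n="n"/></l><r>шәһәр<s n="n"/></r></p></e>
</section></dictionary>"""

LEXC = """LEXICON Nouns
алма:алма N1 ; ! "apple"
кітап:кітап N1 ; ! "book"
"""

if __name__ == "__main__":
    duplicates, missing = check(BIDIX, LEXC)
    print("duplicate bidix lemmas:", duplicates)
    print("bidix lemmas missing from lexc:", missing)
```

Running the same check with `side="r"` against the other language's lexc file would cover the opposite direction.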
Fran
- Help calculate BLEU score and explain why it's stupid.
JNW
- new style file
- fix some of the last few bits of Tatar phonology?
- Explain that the irregular A-stem verbs in Tatar are a lexical property of the verb
- Do the two verb classes in Tatar differ for some other reason than the last vowel in the verb stem? (That is, is class membership a lexical property or a phonological property, and if the latter, why is this a lexicon distinction and not a morphophonological issue?)
- fix all references to revision numbers after we have a stable release that we're testing on
- Minor stuff
- 2nd -> write 2^(nd) or second
- table 1 -> Table 1
- Fig. 1 -> Figure 1
- based on context rules the -> based on context rules, the
- structural transfer module which -> module, which
- CG and Constraint Grammar forms used
- Table 4 does not appear in running text.
- The contents of Sections 2 and 3 do not correspond to the description given in Section 1.
- justification/motivation for the system
- gain in person hours using system over translating by hand
- For a marginalised language community to be able to communicate externally without going through the majority language.
- An MT system gives a good opportunity to work on other language resources, e.g. morph. analyser, disambiguator, etc.; these resources can be used for other things.
- The time-effort : reward pay off is high. With a few months, we can get an effective system. This would not be the case for unrelated languages.
- Why going through Russian isn't "better": (This is absurd; we should probably put in a comment on why. --FMT)
- Translate documents/sources/etc. efficiently without finding/paying someone to translate them to Russian first
- more text on wikipedia in Kazakh (and generally more "human knowledge" available, maybe):
- kk.wiki 200,000 articles, tt.wiki <50k
- kk:Астана vs. tt:Астана.
incorporated above
- justification stuff
- The gain (if any) when using the current system followed by post-editing compared to translating from scratch
- Could we do a small time-based experiment? E.g. time taken to translate 2000 words with/without the system? --FMT
- BLEU score
- We can give the BLEU score with a footnote saying that it is not comparable with anything because our translations are posteditions. --FMT
- BLEU is stupid, but we can give the numbers, see previous. In the presentation I'm going to have a slide about why BLEU is stupid. --FMT
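For the time-based experiment suggested above, the gain could be reported as hours saved and as a speed-up factor. The following is a minimal sketch; the function name is made up, and the figures in the usage lines are placeholders, not measurements.

```python
def postediting_gain(words, hours_scratch, hours_postedit):
    """Compare translating from scratch with MT followed by post-editing.

    words          -- size of the text in words (e.g. 2000)
    hours_scratch  -- time taken to translate it by hand
    hours_postedit -- time taken to post-edit the MT output of the same text
    """
    if min(words, hours_scratch, hours_postedit) <= 0:
        raise ValueError("all arguments must be positive")
    return {
        "words_per_hour_scratch": words / hours_scratch,
        "words_per_hour_postedit": words / hours_postedit,
        "hours_saved": hours_scratch - hours_postedit,
        "speedup": hours_scratch / hours_postedit,
    }


if __name__ == "__main__":
    # Placeholder timings, NOT results from any experiment.
    for key, value in postediting_gain(2000, 8.0, 5.0).items():
        print(f"{key}: {value:.2f}")
```

A negative `hours_saved` (or a speed-up below 1) would mean post-editing was slower than translating by hand, which is why the reviewer's "if any" matters.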
unincorporated
- add latin-alphabet transcriptions? ☹
- Problem is, a transcription would hide many of the "phonological"/orthographic problems we've had to deal with —Firespeaker 18:15, 1 June 2013 (UTC)
- Only one reviewer complained about this. If someone's really interested in the paper and doesn't know Cyrillic, they can work around it—figuring out Cyrillic to the extent needed to figure out what we're talking about really isn't that hard. It'll take up much more space than it's worth, I think. —Firespeaker 18:15, 1 June 2013 (UTC)
- Lemma-by-lemma analysis
- We could evaluate the system on a lemma-to-lemma basis too, but could you think of examples where we might get the wrong lemma but the right surface form ? —FMT
- In some cases of ambiguity this could happen. —JNW
- take out comparative phonology stuff?
TODO (Original submission)
Ideal benchmarks:
- document rules in the rlx with example sentences
- more like 100-150 (currently ~40) disambiguation rules in -kaz
Ilnar
- Development corpus (lots and lots of text)
- Work on increasing coverage (via lexc) and trimmed coverage (via dix) to 90%
- Work on making sure testvoc passes
- i.e., corpus testvoc
- add rules: disambiguation (CG), lexical selection, and transfer.
- Test corpus (about 10 pages; don't base rules on this text!)
- Make a gold standard translation/correct some tests for error-rate testing
- Done.
- Paper
- Add affiliation to paper
- Help JNW come up with some more contrastive stuff (see below / FIXME: Ilnars in paper)
- Tatar equivalent of барайын деп жатырмын "I'm planning on going"?
- What was it that you noticed with -ғалы/-гелі (and its correspondent in Tatar)?
- Find some exemplary bidix entries for figure 2.
- New example for table 3
- maybe Kazakh equivalent of original sentence: "Ауа райы бүгін өте/әбден жақсы, жылы."
- maybe "Ол енді ол дыбысты анығырақ ести бастады" (some good ambiguity). Unfortunately, current output is "Ул иңне ул тавышны аныграк ишетә башлады". Could we fix this?
Fran
- Delegate out error-rate testing tasks
new version of Table 2
JNW
- Work on last few issues in -tat twol
- Write up background
- Contrastive analysis of Kazakh and Tatar
- phonological differences (a generalised summary, 2 or 3 small specific examples)
- orthographical differences (a generalised summary, 1 or 2 small specific examples)
- lexical and morphological differences (2 or 3 specific examples)
- morphotactic differences (2 or 3 specific examples)
- syntactic differences (2 or 3 specific examples)
- Coverage stuff
- divide corpora into 10 pieces and run coverage for each to get stddev
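The per-piece coverage calculation could look something like the sketch below. It assumes the corpus has already been run through the analyser and is in Apertium stream format, where an unknown token appears as `^form/*form$`; the function names and the sample tokens are illustrative, and escaped `^` or `$` characters inside tokens are not handled.

```python
import re
import statistics

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
TOKEN = re.compile(r"\^(.*?)\$")


def coverage(tokens):
    """Fraction of tokens that received at least one analysis."""
    known = sum(1 for t in tokens if "/*" not in t)
    return known / len(tokens)


def chunked_coverage(analysed_text, pieces=10):
    """Split the token stream into `pieces` chunks; return (mean, stddev) of coverage."""
    tokens = TOKEN.findall(analysed_text)
    size = len(tokens) // pieces
    if size == 0:
        raise ValueError("not enough tokens for the requested number of pieces")
    chunks = [tokens[i * size:(i + 1) * size] for i in range(pieces)]
    chunks[-1].extend(tokens[pieces * size:])  # leftover tokens go to the last chunk
    scores = [coverage(chunk) for chunk in chunks]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Hypothetical analysed tokens, for illustration only.
    known = "^үй/үй<n><nom>$"
    unknown = "^foo/*foo$"
    sample = " ".join([known] * 8 + [unknown] * 2 + [known] * 10)
    mean, stddev = chunked_coverage(sample, pieces=2)
    print(f"coverage: {mean:.3f} ± {stddev:.3f}")
```

`statistics.stdev` is the sample standard deviation; `statistics.pstdev` could be substituted if the ten pieces are treated as the whole population.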
Over-all
Abstract 1 2 3 3.1 3.2 3.3 3.4 4 4.1 4.2 4.3 4.4 4.5 5 5.1 6 Acknowledgements References