Ideas for Google Summer of Code/Automatic diacritic restoration
Latest revision as of 19:50, 24 March 2020
Kevin Scannell has written charlifter, a Perl implementation of various statistical restoration algorithms, which has been trained for more than 100 languages on web-crawled data. Details are in his paper, listed in the references below. An online demo and a Firefox extension are also available.
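To make the task concrete, here is a minimal Python sketch of the simplest kind of statistical restoration: a word-level lookup that maps each unaccented token to its most frequent accented form in a training corpus. It is not charlifter's algorithm; the function names (`strip_diacritics`, `train`, `restore`) are hypothetical, and it assumes every diacritic decomposes into a base letter plus combining marks under Unicode normalization (letters such as "ø" or "đ" would need an explicit mapping).

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


def train(corpus_words):
    """Count how often each accented form occurs for every stripped key."""
    table = defaultdict(Counter)
    for word in corpus_words:
        word = unicodedata.normalize("NFC", word)
        table[strip_diacritics(word)][word] += 1
    return table


def restore(text, table):
    """Replace each whitespace-separated token by its most frequent accented form."""
    out = []
    for token in text.split():
        candidates = table.get(token)
        # Unknown tokens are passed through unchanged.
        out.append(candidates.most_common(1)[0][0] if candidates else token)
    return " ".join(out)


corpus = "el café está cerca y el café es más caro".split()
table = train(corpus)
print(restore("el cafe esta mas caro", table))
```

A lookup like this cannot handle unseen words or forms that are ambiguous in context, which is why the papers listed below also consider character-level and context-sensitive models.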
Porting the algorithm to C++ should be straightforward. The subtler problem is tuning the smoothing of the statistical models separately for each language.
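The per-language tuning problem can be illustrated with a small Python sketch. It is an assumption-laden toy, not the model used in charlifter: a character-level predictor scores each candidate accented letter by its count in a fixed window of surrounding unaccented characters, smoothed toward the context-free frequency with a strength `alpha`, and `alpha` is chosen by accuracy on held-out text. All names (`CharModel`, `tune_alpha`, the window `width`) are hypothetical, and the code assumes each accented letter has a single precomposed NFC code point so that stripped and accented text align one-to-one.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_char(ch):
    """Base form of one character: 'é' -> 'e' (assumes a one-to-one mapping)."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", base) or ch


def contexts(stripped, width):
    """Yield (index, window) where window is the stripped text around index."""
    padded = " " * width + stripped + " " * width
    for i in range(len(stripped)):
        yield i, padded[i:i + 2 * width + 1]


class CharModel:
    def __init__(self, width=2):
        self.width = width
        self.by_context = defaultdict(Counter)  # window -> accented char counts
        self.by_base = defaultdict(Counter)     # base char -> accented char counts

    def train(self, text):
        text = unicodedata.normalize("NFC", text)
        stripped = "".join(strip_char(c) for c in text)
        for i, ctx in contexts(stripped, self.width):
            self.by_context[ctx][text[i]] += 1
            self.by_base[stripped[i]][text[i]] += 1

    def restore(self, stripped, alpha):
        out = []
        for i, ctx in contexts(stripped, self.width):
            prior = self.by_base.get(stripped[i])
            if not prior:
                out.append(stripped[i])  # character never seen in training
                continue
            seen = self.by_context.get(ctx, Counter())
            total = sum(prior.values())
            # Smoothed score: context count plus alpha times the context-free
            # probability. The shared denominator is omitted for the argmax.
            out.append(max(prior, key=lambda t: seen[t] + alpha * prior[t] / total))
        return "".join(out)


def accuracy(model, heldout, alpha):
    """Fraction of characters restored correctly on accented held-out text."""
    gold = unicodedata.normalize("NFC", heldout)
    stripped = "".join(strip_char(c) for c in gold)
    guess = model.restore(stripped, alpha)
    return sum(g == p for g, p in zip(gold, guess)) / max(len(gold), 1)


def tune_alpha(model, heldout, grid=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Pick the smoothing strength with the best held-out accuracy."""
    return max(grid, key=lambda a: accuracy(model, heldout, a))
```

A small `alpha` trusts sparse context counts, while a large `alpha` falls back to the most frequent form of each letter. The best value depends on how much training data a language has and how often its diacritics occur, so it would be tuned once per language on held-out text.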
- References
- M. Simard (1998) "Automatic Insertion of Accents in French Texts". Proceedings of EMNLP-3. Granada, Spain.
- R. F. Mihalcea (2002) "Diacritics Restoration: Learning from Letters versus Learning from Words". Lecture Notes in Computer Science 2276/2002. pp. 96–113.
- G. De Pauw; P. W. Wagacha; G. M. de Schryver (2007) "Automatic diacritic restoration for resource-scarce languages". Proceedings of Text, Speech and Dialogue, Tenth International Conference. pp. 170–179.
- P. W. Wagacha; G. De Pauw; P. W. Githinji (2006) "A grapheme-based approach to accent restoration in Gĩkũyũ". Proceedings of the Fifth International Conference on Language Resources and Evaluation.
- D. Yarowsky (1994) "A Comparison Of Corpus-Based Techniques For Restoring Accents In Spanish And French Text". Proceedings, 2nd annual workshop on very large corpora. pp. 19–32.
- K. Scannell (2010) "Statistical Unicodification of African Languages". Submitted for publication. http://borel.slu.edu/pub/lre.pdf