https://wiki.apertium.org/w/api.php?action=feedcontributions&user=Ksnmi&feedformat=atomApertium - User contributions [en]2024-03-29T07:55:35ZUser contributionsMediaWiki 1.34.1https://wiki.apertium.org/w/index.php?title=Publications&diff=50968Publications2014-11-17T14:37:10Z<p>Ksnmi: /* 2014 */</p>
<hr />
<div>{{TOCD}}<br />
<br />
This is a non-comprehensive list of publications involving Apertium, ordered by date. Please feel free to add your paper about Apertium.<br />
<br />
You can also look for Apertium inside [http://mt-archive.info//systems-1.htm The MT-Archive.info systems page], which is updated independently by John Hutchins.<br />
<br />
==2014==<br />
<br />
* Washington, Jonathan N., Ilnar Salimzyanov, and Francis M. Tyers. (2014) "Designing finite-state morphological transducers for Kypchak languages". Proceedings of [http://www.indiana.edu/~mrphfest/ MorphologyFest: Symposium on Morphological Complexity]<br />
<br />
* Nemeskey, D. M., Tyers, F. M. and Hulden, M. (2014) "Why Implementation Matters: Evaluation of an Open-source Constraint Grammar Parser". Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014) (to appear)<br />
<br />
* Peradin, H., Petkovski, F. and Tyers, F. M. (2014) "Shallow-transfer rule-based machine translation for the Western group of South Slavic". Proceedings of the 9th Workshop on Speech and Language Technology for Minority Languages (SALTMIL2014) organised with LREC2014 <br />
<br />
* Washington, J. N., Salimzyanov, I., and Tyers, F. M. (2014) "[http://www.lrec-conf.org/proceedings/lrec2014/pdf/1207_Paper.pdf Finite-state morphological transducers for three Kypchak languages]". Proceedings of the 9th Conference on Language Resources and Evaluation, LREC2014 <br />
<br />
* Marting, M. and Unhammer, K. B., "[http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-SALTMIL%20Proceedings.pdf#page=24 FST Trimming: Ending Dictionary Redundancy in Apertium]", Proceedings of the 9th Conference on Language Resources and Evaluation, LREC2014<br />
<br />
* Minocha, A., & Tyers, F. M. (2014). Subsegmental language detection in Celtic language text. CLTW 2014, 76. Proceedings of the 1st Celtic Language Technology Workshop, COLING 2014.<br />
<br />
==2013==<br />
<br />
* Salimzyanov, Ilnar, Jonathan Washington, and Francis Tyers (2013). A free/open-source Kazakh-Tatar machine translation system. [http://www.mtsummit2013.info/main_proceedings.asp MT Summit XIV].<br />
* Jim O'Regan and Mikel L. Forcada (2013) "[http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/4867/2882 Peeking through the language barrier: the development of a free/open-source gisting system for Basque to English based on apertium.org]". Procesamiento del Lenguaje Natural 51, 15-22.<br />
<br />
==2012==<br />
<br />
* Antoni Oliver González (2012): "[http://linguamatica.com/index.php/linguamatica/article/view/136 WN-Toolkit: un toolkit per a la creació de WordNets a partir de diccionaris bilingües]". Linguamática V4N2. ([http://lpg.uoc.edu/wn-toolkit/ WN-Toolkit] includes a tool for reading Apertium dictionaries).<br />
* Hernani Marques (2012): "[https://www.ccczh.ch/images/8/84/CL_120606--apertium%2Bfst4web.pdf Integration von Finite-State Transducer-Technologien in Apertium zur Maschinellen Übersetzung morphologisch komplexer Sprachen]". Seminararbeit bei Anne Göhring, Magdalena Jitca und Prof. Dr. Michael Hess. Institut für Computerlinguistik der Universität Zürich.<br />
* Hrvoje Peradin and Francis Tyers (2012): "[http://www.molto-project.eu/sites/default/files/FreeRBMT-2012.pdf#61 A rule-based machine translation system from Serbo-Croatian to Macedonian]". Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012).<br />
* Juan Pablo Martínez Cortés, Jim O'Regan, and Francis Tyers (2012): "[http://www.lrec-conf.org/proceedings/lrec2012/summaries/326.html Free/Open Source Shallow-Transfer Based Machine Translation for Spanish and Aragonese]". LREC 2012.<br />
* Trond Trosterud and Kevin Brubeck Unhammer (2012): "[http://www.molto-project.eu/sites/default/files/FreeRBMT-2012.pdf#19 Evaluating North Sámi to Norwegian assimilation RBMT]". Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012).<br />
* Tyers, Francis, Ilnar Salimzyanov, and Jonathan Washington (2012): "A proto-type Bashkir-Tatar machine translation system". [http://multisaund.eu/program.php LREC 2012].<br />
* V. M. Sánchez-Cartagena, M. Esplà-Gomis, F. Sánchez-Martínez, J. A. Pérez-Ortiz (2012): "[http://www.molto-project.eu/sites/default/files/FreeRBMT-2012.pdf#33 Choosing the correct paradigm for unknown words in rule-based machine translation systems]. Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012).<br />
* V. M. Sánchez-Cartagena, F. Sánchez-Martínez, J. A. Pérez-Ortiz (2012): "[http://www.molto-project.eu/sites/default/files/FreeRBMT-2012.pdf#47 An Open-Source Toolkit for Integrating Shallow-Transfer Rules into Phrase-Based Statistical Machine Translation]". Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012).<br />
* Washington, Jonathan, Mirlan Ipasov, and Francis Tyers (2012): "[http://www.lrec-conf.org/proceedings/lrec2012/summaries/1077.html A finite-state morphological transducer for Kyrgyz]". LREC 2012.<br />
<br />
==2011==<br />
<br />
* Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O’Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez and Francis M. Tyers: "[http://www.springerlink.com/content/h134p1j73377071k/?MUD=MP Apertium: a free/open-source platform for rule-based machine translation]". In ''Machine Translation: Volume 25, Issue 2 (2011), p. 127-144''<br />
<br />
* Jacob Nordfalk: "Maŝintradukado - kiel ĝi funkcias, kion ĝi kapablas". In: Wandel, Amri (red.). ''Internacia Kongresa Universitato. 64-a Sesio''. Rotterdam: Universala Esperanto-Asocio, 2011, p. 121-137.<br />
<br />
* Jacob Nordfalk & Hèctor Alòs i Font: "Apertium kaj Esperanto: Maŝintradukado al kaj el Esperanto per malfermitkoda platformo". In: Novoská, Katarina; Baláž, Peter (red.). ''Modernaj teknologioj por Esperanto''. Partizánske (SK): E@I, 2011, p. 117-125.<br />
<br />
* Jacob Nordfalk & Hèctor Alòs i Font "[http://www.teleskopo.com/2011.htm Apertium kaj Esperanto - Enkonduko al peregula maŝintradukado al kaj el Esperanto per malfermkoda platformo]". ''Teleskopo'' (2011) 3: 5-19.<br />
<br />
* Martha Dís Brandt, Hrafn Loftsson, Hlynur Sigurþórsson, and Francis M.Tyers: "[http://www.mt-archive.info/EAMT-2011-Brandt.pdf Apertium-IceNLP: a rule-based Icelandic to English machine translation system]. EAMT 2011: Proceedings of the 15th Conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.217-224.<br />
<br />
* Xavier Ivars-Ribes & Victor M.Sánchez-Cartagena: [http://www.mt-archive.info/FreeRBMT-2011-Ivar-Ribes.pdf A widely used machine translation service and its migration to a free/open-source solution: the case of Softcatalà]. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain, January 20-21, 2011, ed. F.Sánchez-Martínez and J.A.Pérez-Ortiz; pp.61-68.<br />
<br />
* Pim Otte & Francis M.Tyers: [http://www.mt-archive.info/EAMT-2011-Otte.pdf Rapid rule-based machine translation between Dutch and Afrikaans]. EAMT 2011: proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.153-160.<br />
<br />
* Joanna Ruth & Jimmy O’Regan: [http://www.mt-archive.info/FreeRBMT-2011-Ruth.pdf Shallow-transfer rule-based machine translation from Czech to Polish]. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain, January 20-21, 2011, ed. F.Sánchez-Martínez and J.A.Pérez-Ortiz; pp.69-76.<br />
<br />
* Antonio Toral & Andy Way: [http://www.mt-archive.info/FreeRBMT-2011-Toral-1.pdf Automatic acquisition of named entities for rule-based machine translation]. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain, January 20-21, 2011, ed. F.Sánchez-Martínez and J.A.Pérez-Ortiz; pp.37-43. <br />
<br />
* Antonio Toral, Mireia Ginestí-Rosell, & Francis Tyers: [http://www.mt-archive.info/FreeRBMT-2011-Toral-2.pdf An Italian to Catalan RBMT system reusing data from existing language pairs]. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain, January 20-21, 2011, ed. F.Sánchez-Martínez and J.A.Pérez-Ortiz; pp.77-81.<br />
<br />
* Arnaud Vié, Luis Villarejo Muñoz, Mireia Farrús Cabeceran, & Jimmy O’Regan: [http://www.mt-archive.info/FreeRBMT-2011-Vie.pdf Apertium advanced web interface: a first step towards interactivity and language tools convergence]. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain, January 20-21, 2011, ed. F.Sánchez-Martínez and J.A.Pérez-Ortiz; pp.45-51<br />
<br />
* Antonio Toral, Federico Gaspari, Sudip Kumar Naskar, & Andy Way: [http://www.mt-archive.info/EAMT-2011-Toral-1.pdf Comparative evaluation of research vs. online MT systems]. EAMT 2011: proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.13-20 (Uses apertium-de-en and apertium-en-nl data to assist in bitext alignment)<br />
<br />
* Heidi Depraetere, Joachim Van den Bogaert, & Joeri Van de Walle: [http://www.mt-archive.info/EAMT-2011-Depraetere.pdf Bologna translation service: online translation of course syllabi and study programmes in English]. EAMT 2011: proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.29-34 (mention the use of Apertium for syllabi translation)<br />
<br />
* Tomáš Hudík & Achim Ruopp: [http://www.mt-archive.info/EAMT-2011-Hudik.pdf The integration of Moses into localisation industry]. EAMT 2011: proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.47-53 (cites Apertium alongside Moses as "the most developed" MT systems).<br />
<br />
* Felipe Sánchez-Martínez: [http://www.mt-archive.info/EAMT-2011-Sanchez-Martinez.pdf Choosing the best machine translation system to translate a sentence by using only source-language information]. EAMT 2011: proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.97-104. (one of the systems is Apertium; unfortunately, it is seldom selected as the best system)<br />
<br />
* Sarah Ebling, Andy Way, Martin Volk, & Sundip Kumar Naskar: [http://www.mt-archive.info/EAMT-2011-Ebling.pdf Combining semantic and syntactic generalisation in example-based machine translation]. EAMT 2011: proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium; eds. Mikel L.Forcada, Heidi Depraetere, Vincent Vandeghinste; pp.209-216 (cites Apertium as a source for "marker words")<br />
<br />
* Aish Raj Dahal: [http://ltrc.iiit.ac.in/icon2011/technical_schedule.html Development of a Nepali Morphological Analyzer]. ICON 2011: proceedings of the 9th International Conference on Natural Language Processing, 15-19 Dec 2011, Chennai, India;Student Paper Track; (cites Apertium as the underlying technology behind Nepali Morphological Analyzer)<br />
<br />
==2010==<br />
<br />
* Jacob Nordfalk: "[http://blad.dkuug.dk/arkiv/DKUUG160.pdf Open source maskinoversættelse med Apertium]". DKUUG-nyt nr 160, p. 4-8.<br />
<br />
* Linda Wiechetek, Francis M. Tyers and Thomas Omma (2010) "[http://xixona.dlsi.ua.es/~fran/publications/icetal2010.pdf Shooting at flies in the dark: Rule-based lexical selection for a minority language pair]". ''Lecture Notes in Artificial Intelligence'' vol. 6233/2010, pp. 418--429 <br />
<br />
* Francis M. Tyers (2010) "[http://xixona.dlsi.ua.es/~fran/publications/eamt2010.pdf Rule-based Breton to French machine translation]". 'Proceedings of the 14th Annual Conference of the European Association of Machine Translation, EAMT10', pp. 174--181<br />
<br />
* Sergio Penkale, Rejwanul Haque, Sandipan Dandapat, Pratyush Banerjee, Ankit K. Srivastava, Jinhua Du, Pavel Pecina, Sudip Kumar Naskar, Mikel L. Forcada, Andy Way, "[http://www.dlsi.ua.es/~mlf/docum/dcu-wmt2010.pdf MaTrEx: the DCU MT system for WMT 2010]", in ''Proceedings of WMT 2010: The ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR'' (to be presented) <br />
<br />
* Aline Villavicencio, Carlos Ramisch, André Machado, Maria José ́Finatto, Helena de Medeiros Caseli (2010) "Identificação de Expressões Multipalavra em Domínios Específicos". ''Linguamatica'' 2(1) pp. 15--34<br />
<br />
* Sánchez-Cartagena, V. M. and Pérez-Ortiz, J. A. (2010) "[http://www.dlsi.ua.es/~japerez/pub/pdf/mtmarathon2010-scalemt.pdf ScaleMT: a free/open-source framework for building scalable machine translation web services]". The Prague Bulletin of Mathematical Linguistics 93, pp. 97--106. <br />
<br />
* Tyers, F. M. and Sánchez-Martínez, F. and Ortiz-Rojas, S. and Forcada, M. L. (2010) "[http://xixona.dlsi.ua.es/~fran/publications/mtm2010.pdf Free/open-source resources in the Apertium platform for machine translation research and development]". The Prague Bulletin of Mathematical Linguistics No. 93, pp. 67--76<br />
<br />
* François Masselot, Petra Ribiczey, & Gema Ramírez-Sánchez (2010) "[http://www.mt-archive.info/EAMT-2010-Masselot.pdf Using the Apertium Spanish-Brazilian Portuguese machine translation system for localisation]". EAMT 2010: Proceedings of the 14th Annual conference of the European Association for Machine Translation, 27-28 May 2010, Saint-Raphaël, France. Proceedings ed.Viggo Hansen and François Yvon; 8pp. [PDF, 577KB]; [http://www.mt-archive.info/EAMT-2010-Masselot-ppt.pdf presentation: 23 slides] [PDF, 569KB]<br />
<br />
* Septina Dian Larasati and Vladislav Kuboň (2010) "[http://dl.dropbox.com/u/537350/paper/MALINDO-2010-final.pdf A Study of Indonesian-to-Malaysian MT System]". MALINDO 2010: Proceedings of the 4th International MALINDO Workshop. Jakarta, Indonesia, 2 August 2010.<br />
<br />
* Mikel L.Forcada: [http://www.mt-archive.info/MTMarathon-2010-Forcada-ppt.pdf Apertium: free/open-source rule-based machine translation]. Presentation at Fourth Machine Translation Marathon “Open Source Tools for Machine Translation”, 29 January, Dublin, Ireland; 38 slides.<br />
<br />
* Mikel L.Forcada: [http://www.mt-archive.info/Translingual-Europe-2010-Forcada.pdf Free/open-source machine translation: the Apertium platform]. Translingual Europe 2010, Hotel Maritim, Berlin, Germany, Monday June 7th 2010; 17pp.<br />
<br />
==2009==<br />
<br />
* Sánchez-Cartagena, V. M. and Pérez-Ortiz, J. A. (2009) "[http://rua.ua.es/dspace/bitstream/10045/12030/1/paper7.pdf An open-source highly scalable web service architecture for the Apertium machine translation engine]". First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alicante, Spain, pp. 51--58.<br />
<br />
* Ginestí-Rosell, M. and Ramírez-Sánchez, G. and Ortiz-Rojas, S. and Tyers, F. M. and Forcada, M. L. (2009) "[http://sepln.org/revistaSEPLN/revista/43/articulos/art21.pdf Development of a free Basque to Spanish machine translation system]". Procesamiento de Lenguaje Natural No. 43. pp. 187--195<br />
<br />
* Tyers, F. M. (2009) "[http://xixona.dlsi.ua.es/~fran/publications/eamt2009b.pdf Rule-based augmentation of training data in Breton–French statistical machine translation]". Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09. <br />
<br />
* Tyers, F. M. and Wiechetek, L. and Trosterud, T. (2009) "[http://xixona.dlsi.ua.es/~fran/publications/eamt2009a.pdf Developing prototypes for machine translation between two Sámi languages]". Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09. <br />
<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/sanchez-martinez09b.bib bibtex]) Sánchez-Martínez, F. and Forcada, M.L. (2009). "[http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez-martinez09b.pdf Inferring shallow-transfer machine translation rules from small parallel corpora]". In Journal of Artificial Intelligence Research. volume 34, p. 605-635.<br />
<br />
* Sánchez-Martínez, F. and Forcada, M.L. and Way, A. (2009) "[http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez-martinez09d.pdf Hybrid rule-based ‒ example-based MT: Feeding Apertium with sub-sentential translation units]". In Proceedings of the 3rd Workshop on Example-Based Machine Translation, p. 11-18, Dublin, Ireland.<br />
<br />
* Sheikh, Z.M.A.W. and Sánchez-Martínez, F. (2009) "[http://www.dlsi.ua.es/~fsanchez/pub/pdf/zaid09.pdf A trigram part-of-speech tagger for the Apertium free/open-source machine translation platform"]. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, p. 67-74, Alacant, Spain.<br />
<br />
* Tyers, Francis M. and Donnelly, Kevin (2009) "[http://xixona.dlsi.ua.es/~fran/publications/mtm2009.pdf apertium-cy - a collaboratively-developed free RBMT system for Welsh to English]". The Prague Bulletin of Mathematical Linguistics No. 91, pp. 57-66.<br />
<br />
* Unhammer, Kevin; Trosterud, Trond. "[http://rua.ua.es/dspace/handle/10045/12025 Reuse of free resources in machine translation between Nynorsk and Bokmål]". In: Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation / Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis M. Tyers. Alicante : Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, 2009, pp. 35-42<br />
<br />
==2008==<br />
<br />
* Homola, Petr and Kuboň, Vladislav (2008). "[http://www.mt-archive.info/EAMT-2008-Homola.pdf Improving Machine Translation Between Closely Related Romance Languages]". Proceedings of the European Association of Machine Translation, Hamburg.<br />
<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/sanchez08b.bib bibtex]) Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2008). "[http://www.springerlink.com/content/m452802q3536044v/?p=61e26194c87e4a5780c77303b3210210&pi=2 Using target-language information to train part-of-speech taggers for machine translation]". In Machine Translation, volume 22, numbers 1-2, p. 29-66.<br />
<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/thesis/thesis.bib bibtex]) Felipe Sánchez-Martínez (2008). "[http://www.dlsi.ua.es/~fsanchez/pub/thesis/thesis-sin.pdf Using unsupervised corpus-based methods to build rule-based machine translation systems]". PhD thesis, Departament de Llenguatges i Sistemes Infomàtics, Universitat d'Alacant, Spain.<br />
<br />
* Helena de Medeiros Caseli, Maria das Graças Volpe Nunes, Mikel L. Forcada (2008). "[http://www.springerlink.com/content/tv807t35h133510k/?p=ea722c920b5e45d08083cbfdbb7621fd&pi=1 Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation]". In Machine Translation, volume 20, numbers 4, p. 227-245.<br />
<br />
* Carme Armentano Oller and Mikel L. Forcada (2002) “Reutilización de datos lingüísticos para la creación de un sistema de traducción automática para un nuevo par de lenguas”. Procesamiento del lenguaje natural. N. 41 (sept. 2008). ISSN 1135-5948, pp. 243-250<br />
<br />
==2007==<br />
<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/sanchez07c.bib bibtex]) Felipe Sánchez-Martínez, Mikel L. Forcada (2007) "[http://www.dlsi.ua.es/~mlf/docum/sanchezmartinez07p2.pdf Automatic induction of shallow-transfer rules for open-source machine translation]", in Proceedings of TMI, The Eleventh Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007) (Skövde, Sweden, 7-9/09/2007) , p. 181--190<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/sanchez07b.bib bibtex]) Felipe Sánchez-Martínez, Carme Armentano-Oller, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2007) "[http://www.dlsi.ua.es/~mlf/docum/sanchezmartinez07j.pdf Training part-of-speech taggers to build machine translation systems for less-resourced language pairs]", ''Procesamiento del Lenguaje Natural'', (XXIII Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural ,Sevilla, Spain, 10-12/09/2007; accepted) 39, 257--264<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/sanchez07a.bib bibtex]) Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2007) "[http://www.dlsi.ua.es/~mlf/docum/sanchezmartinez07p.pdf Integrating corpus-based and rule-based approaches in an open-source machine translation system]". in Proceedings of METIS-II Workshop: New Approaches to Machine Translation, a workshop at CLIN 17 - Computational Linguistics in the Netherlands ((Leuven, Belgium, 11/01/2007)) , p. 73--8<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/armentano07.bib bibtex]) Carme Armentano-Oller, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Marco A. Montava, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez (2007) "[http://www.dlsi.ua.es/~mlf/docum/armentano07p.pdf Apertium, una plataforma de código abierto para el desarrollo de sistemas de traducción automática]", in Proceedings of FLOSS (Free/Libre/Open Source Systems) International Conference ((7-9/03/2007, Jerez de la Frontera, Spain)) , p. 5--20<br />
<br />
==2006==<br />
<br />
* Carme Armentano-Oller, Mikel L. Forcada (2006) "[http://www.dlsi.ua.es/~mlf/docum/armentano06p2.pdf Open-source machine translation between small languages: Catalan and Aranese Occitan]", in Strategies for developing machine translation for minority languages (5th SALTMIL workshop on Minority Languages) (organized in conjunction with LREC 2006 (22-28.05.2006)) , p. 51-54<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/ramirez06.bib bibtex]) Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2006) "[http://www.dlsi.ua.es/~mlf/docum/ramirezsanchez06p.pdf Opentrad Apertium open-source machine translation system: an opportunity for business and research]", in Proceedings of Translating and the Computer 28 Conference ((London, november 16--17, 2006)) <br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/sanchez06b.bib bibtex]) Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada (2006) "[http://www.dlsi.ua.es/~mlf/docum/sanchezmartinez06p.pdf Speeding up target-language driven part-of-speech tagger training for machine translation]". in Lecture notes in Computer Science vol. 4293: MICAI 2006: Advances in Artificial Intelligence.5th Mexican International Conference on Artificial Intelligence, Apizaco, Mexico, November 13-17, 2006. Proceedings. ((c) Springer-Verlag 2006) , p. 844-854<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/armentano06.bib bibtex]) Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Miriam A. Scalco (2006) "[http://www.dlsi.ua.es/~mlf/docum/armentano06p.pdf Open-source Portuguese-Spanish machine translation]", in In Lecture Notes in Computer Science 3960 (Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006), May 13-17, 2006, ME - RJ / Itatiaia, Rio de Janeiro, Brazil. ((c) Springer-Verlag 2006) , p. 50-59<br />
* Forcada, Mikel L. (2006) "[http://dlsi.ua.es/~mlf/docum/forcada06p2.pdf Open-source machine translation: an opportunity for minor languages]" in B. Williams (ed.): Proceedings of the Workshop "Strategies for developing machine translation for minority languages (5th SALTMIL workshop on Minority Languages)" (organised in conjunction with LREC 2006 (22-28.05.2006)). Genoa, Italy, pp. 1-6.<br />
<br />
==2005==<br />
<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/armentano05p.bib bibtex]) Carme Armentano-Oller, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Boyan Bonev, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez (2005) "[http://www.dlsi.ua.es/~mlf/docum/armentano05p.pdf An open-source shallow-transfer machine translation toolbox: consequences of its release and availability]", in Proceedings of OSMaTran: Open-Source Machine Translation, A workshop at Machine Translation Summit X (Phuket, Thailand, September 12--16, 2005),<br />
* ([http://www.dlsi.ua.es/~fsanchez/pub/bib/corbi05.bib bibtex]) Antonio M. Corbí-Bellot, Mikel L. Forcada, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Sánchez-Ramírez, Felipe Sánchez-Martínez, Iñaki Alegria, Aingeru Mayor, Kepa Sarasola (2005) "[http://www.dlsi.ua.es/~mlf/docum/corbibellot05p.pdf An open-source shallow-transfer machine translation engine for the romance languages of Spain]", in Proceedings of the European Associtation for Machine Translation, 10th Annual Conference (Budapest, Hungary, 30-31.05.2005), p. 79--86<br />
* Ortiz-Rojas, S., Forcada, M. L., and Ramírez-Sánchez, G. (2005) "Construccion y minimización eficiente de transductores de letras a partir de diccionarios con paradigmas". Procesamiento del Lenguaje Natural, 35, 51–57. <br />
<br />
[[Category:Documentation]]<br />
[[Category:Documentation in English]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in&diff=50967Task ideas for Google Code-in2014-11-17T14:33:33Z<p>Ksnmi: /* Data mangling */</p>
<hr />
<div>{{TOCD}}<br />
This is the task ideas page for [http://www.google-melange.com/gci/homepage/google/gci2014 Google Code-in], here you can find ideas on interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.<br />
<br />
The people column lists people who you should get in contact with to request further information. All tasks are 2 hours maximum estimated amount of time that would be spent on the task by an experienced developer, however:<br />
<br />
# '''this does not include time taken to [[Minimal installation from SVN|install]] / set up apertium'''.<br />
# this is the time expected to take by an experienced developer, you may find that you spend more time on the task because of the learning curve. <br />
<br />
<!--Если ты не понимаешь английский язык или предпочитаешь работать над русским языком или другими языками России, смотри: [[Task ideas for Google Code-in/Russian]]--><br />
'''Categories:'''<br />
<br />
* {{sc|code}}: Tasks related to writing or refactoring code<br />
* {{sc|documentation}}: Tasks related to creating/editing documents and helping others learn more<br />
* {{sc|research}}: Tasks related to community management, outreach/marketting, or studying problems and recommending solutions<br />
* {{sc|quality}}: Tasks related to testing and ensuring code is of high quality.<br />
* {{sc|interface}}: Tasks related to user experience research or user interface design and interaction<br />
<br />
You can find descriptions of some of the mentors here: [[List_of_Apertium_mentors]].<br />
<br />
==Task list==<br />
<br />
=== Misc tools ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || Unigram tagging mode for <code>apertium-tagger</code> || Edit the <code>apertium-tagger</code> code to allow for lexicalised unigram tagging. This would basically choose the most frequent analysis for each surface form of a word. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Data format for the unigram tagger || Come up with a binary storage format for the data used for the unigram tagger. It could be based on the existing <code>.prob</code> format. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Add tag combination back-off to unigram tagger. || Modify the unigram tagger to allow for back-off to tag sequence in the case that a given form is not found. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Prototype unigram tagger. || Write a simple unigram tagger in a language of your choice. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Training for unigram tagger || Write a program that trains a model suitable for use with the unigram tagger. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || make voikkospell understand apertium stream format input || Make voikkospell understand apertium stream format input, e.g. ^word/analysis1/analysis2$, voikkospell should only interpret the 'word' part to be spellchecked. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make voikkospell return output in apertium stream format || make voikkospell return output suggestions in apertium stream format, e.g. ^correctword$ or ^incorrectword/correct1/correct2$ || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || libvoikko support for OS X || Make a spell server for OS X's system-wide spell checker to use arbitrary languages through libvoikko. See https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/SpellCheck/Tasks/CreatingSpellServer.html#//apple_ref/doc/uid/20000770-BAJFBAAH for more information || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Ubuntu/debian || document how to set up libreoffice voikko working with a language on Ubuntu and debian || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Fedora || document how to set up libreoffice voikko working with a language on Fedora || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Windows || document how to set up libreoffice voikko working with a language on Windows || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on OS X || document how to set up libreoffice voikko working with a language on OS X || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document how to set up libenchant to work with libvoikko || Libenchant is a spellchecking wrapper. Set it up to work with libvoikko, a spellchecking backend, and document how you did it. You may want to use a spellchecking module available in apertium for testing. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || firefox/iceweasel plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] .<br />
|| [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || chrome/chromium plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] . || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || firefox/iceweasel plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || chrome/chromium plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || firefox/iceweasel plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || chrome/chromium plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|quality}} || make apertium-quality work with python3.3 on all platforms || migrate apertium-quality away from distribute to newer setup-tools so it installs correctly in more recent versions of python (known incompatible: python3.3 OS X, known compatible: MacPorts python3.2) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || Get bible aligner working (or rewrite it) || trunk/apertium-tools/bible_aligner.py - Should take two bible translations and output a tmx file with one verse per entry. There is a standard-ish plain-text bible translation format that we have bible translations in, and we have files that contain the names of verses of various languages mapped to English verse names || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || tesseract interface for apertium languages || Find out what it would take to integrate apertium or voikkospell into tesseract. Document thoroughly available options on the wiki. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|code}} || Syntax tree visualisation using GNU bison || Write a program which reads a grammar using bison, parses a sentence and outputs the syntax tree as text, or graphViz or something. Some example bison code can be found [https://svn.code.sf.net/p/apertium/svn/branches/transfer4 here]. || [[User:Francis Tyers]] [[User:Mlforcada]]<br />
|-<br />
| {{sc|code}} || make concordancer work with output of analyser || Allow [http://pastebin.com/raw.php?i=KG8ydLPZ spectie's concordancer] to accept an optional apertium mode and directory (implement via argparse). When it has these, it should run the corpus through that apertium mode and search against the resulting tags and lemmas as well as the surface forms. E.g., the form алдым might have the analysis via an apertium mode of ^алдым/алд{{tag|n><px1sg}}{{tag|nom}}/ал{{tag|v><tv}}{{tag|ifi><p1}}{{tag|sg}}, so a search for "px1sg" should bring up this word. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || convert a current transducer for a language using lexc+twol to a guesser || Figure out how to generate a guesser for a language module that uses lexc for morphotactics and twol for morphophonology (e.g., apertium-kaz). One approach to investigate would be to generate all the possible archiphoneme representations of a given form and run the lexc guesser on that. || [[User:Firespeaker]] [[User:Flammie]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in hfst || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in HFST. The script should take a language code and create a new directory with a minimal lexc file, a minimal twol file, and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in lttoolbox || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in lttoolbox. The script should take a language code and create a new directory with a minimal dix file and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium bilingual module || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium bilingual module. The script should take two language codes and create a new directory with a minimal dix file, a minimal lrx file, and minimal transfer (.t*x) files, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Write a script to explain an Apertium machine translation in terms of its parts || Write a script (preferably in python3 or bash/equivalent) that takes one text segment ''S'', applies a given Apertium system to it and to all its possible whole-word subsegments ''s'' (perhaps up to a certain maximum length) and outputs a list ''(s,t,i,j,k,l)'' of correspondences so that the result of applying Apertium to ''s'' is ''t'', ''t'' is a whole-word subsegment of ''T'', the Apertium translation of ''S'', ''i'' and ''j'' are the starting position and end position of ''s'' in ''S'' and ''k'' and ''l'' are hte starting position and the end postion of ''t'' in ''T''. The script should read ''S'', ''T'', two language codes and optionally a maximum length and generate the correspondences ''(s,t,i,j,k,l)'' one per line || [[User:mlforcada]]<br />
|}<br />
<br />
=== Website and apy ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || apertium-apy mode for geriaoueg (biltrans in context) || apertium-apy function that accepts a context (e.g., ±n ~words around word) and a position in the context of a word, gets biltrans output on entire context, and returns translation for the word || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || SSL/HTTPS for Apertium.org || The Apertium site itself is equipped with SSL. Get Piwik working on HTTPS as well. After that, default to the HTTPS site via Apache. See [http://sourceforge.net/p/apertium/tickets/41/ ticket 41] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Website translation in [[Html-tools]] (code) || Html-tools should detect when the user wants to translate a website (similar to how Google Translate does it) and switch to an interface (See "Website translation in [[Html-tools]] (interface)" task) and perform the translation. It should also make it so that new pages that the user navigates to are translated. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|interface}} || Website translation in [[Html-tools]] (interface) || Add an interface to Html-tools that shows a webpage in an <iframe> with translation options and a back button to return to text/document translation. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] crashing on iPads when copying text || Make it so that the Apertium site does not crash on iPads when copying text on any of the modes while maintaining semantic HTML. This task requires having access to an iPad. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] copying text on Windows Phone IE || Make it so that the Apertium site allows copying text on WP while maintaining semantic HTML. This task requires having access to an Windows Phone. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[APY]] API keys || Add API key support but don't overengineer it. See [http://sourceforge.net/p/apertium/tickets/31/ ticket 31] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Xavivars]] <br />
|-<br />
| {{sc|code}} || Localisation of tag attributes on [[Html-tools]] || The meta description tag isn't localized as of now since the text is an attribute. Search engines often display this as their snippet. A possible way to achieve this is using data-text="@content@description". See [http://sourceforge.net/p/apertium/tickets/29/ ticket 29] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] font issues || See [http://sourceforge.net/p/apertium/tickets/27/ ticket 27] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Detect target language || When changing the source language, the [[Html-tools]] UI will often show a bunch of greyed out buttons, and the user has to fish for possible languages in the right-hand side drop-down. This is confusing (user might think "are there no languages to translate into?") and annoying. A simple solution is to reorder the list so that all possible target languages are shown first, then the list of greyed-out languages. See [http://sourceforge.net/p/apertium/tickets/25/ ticket 25] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Maintaining order of user interactions on [[Html-tools]] || If a user clicks a new language choice while translation or detection is proceeding (AJAX callback has not yet returned), the original action will not be cancelled. Make it so that the first action is canceled and overridden by the second. See [http://sourceforge.net/p/apertium/tickets/9/ ticket 9] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Drag-n-drop file translation on [[Html-tools]] || See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || More file formats for [[APY]] || APY does not support DOC, XLS, PPT file translation that require the file being converted to the newer XML based formats through LibreOffice or equivalent and then back. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Improved file translation functionality for [[APY]] || APY needs logging and to be non-blocking for file translation. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|interface}} || Abstract the formatting for the [[Html-tools]] interface. || The Html-tools interface should be easily customisable so that people can make it look how they want. The task is to abstract the formatting and make one or more new stylesheets to change the appearance. This is basically making a way of "skinning" the interface. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|interface}} || [[Html-tools]] spell-checker interface || Add an enableable spell-checker module to the [[html-tools]] interface. Get fancy with jquery/etc. so that e.g., misspelled words are underlined in red and recommendations for each word are given in some sort of drop-down menu. Feel free to implement a dummy function for testing spelling to test the interface until the "Html-tools spell-checker code" task is complete. There is a half-done version available from last year that may just need to be cleaned up and integrated into the current html-tools code. See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[Html-tools]] spell-checker code || Add code to the [[html-tools]] interface that allows spell checking to be performed. Should send entire string, and be able to match each returned result to its appropriate input word. Should also update as new words are typed (but [https://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-tools/apertium-html-tools/assets/js/translator.js#l42 not on every keystroke]). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[libvoikko]] support for [[APY]] || Write a function for [[APY]] that checks the spelling of an input string and for each word returns whether the word is correct, and if unknown returns suggestions. Whether segmentation is done by the client or by apertium-apy will have to be figured out. You will also need to add scanning for spelling modes to the initialisation section. Try to find a sensible way to structure the requests and returned data with JSON. Add a switch to allow someone to turn off support for this (use argparse set_false). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] expanding textareas || The input textarea in the html-tools translation interface does not expand depending on the user's input even when there is significant whitespace remaining on the page. Improvements include varying the length of the textareas to fill up the viewport or expanding depending on input. Both the input and output textareas would have to maintain the same length for interface consistency. Different behavior may be desired on mobile. See [http://sourceforge.net/p/apertium/tickets/4/ ticket 4] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Performance tracking in [[APY]] || Add a way for [[APY]] to keep track of number of words in input and time between sending input to a pipeline and receiving output, for the last n (e.g., 100) requests, and write a function to return the average words per second over something<n (e.g., 10) requests. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Make [[APY]] use one lock per pipeline || Make [[APY]] use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Language variant picker in [[Html-tools]] || Displaying language variants as distinct languages in the translator language selector is awkward and repetitive. Allowing users to first select a language and then display radio buttons for choosing a variant below the relevant translation box, if relevant, provides a better user interface. See [http://sourceforge.net/p/apertium/tickets/1/ ticket 1] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Investigate how to implement HTML-translation that can deal with broken HTML || The old Apertium website had a 'surf-and-translate' feature, but it frequently broke on badly-behaved HTML. Investigate how similar web sites deal with broken HTML when rewriting the internal content of a (possible automatically generated) HTML page. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Add permalink capability for generation and analysis [[Html-tools]] || [[Html-tools]] currently has support for permalinks to various translation modes. For this task, you should add similar support for analysis and generation modes. I.e., a person should be able to simply send someone a link for e.g., the Kazakh morphological analyser. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Pair visualisations ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || fix pairviewer's 2- and 3-letter code conflation problems || [[pairviewer]] doesn't always conflate languages that have two codes. E.g. sv/swe, nb/nob, de/deu, da/dan, uk/ukr, et/est, nl/nld, he/heb, ar/ara, eus/eu are each two separate nodes, but should instead each be collapsed into one node. Figure out why this isn't happening and fix it. Also, implement an algorithm to generate 2-to-3-letter mappings for available languages based on having the identical language name in languages.json instead of loading the huge list from codes.json; try to make this as processor- and memory-efficient as possible. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || map support for pairviewer ("pairmapper") || Write a version of [[pairviewer]] that instead of connecting floating nodes, connects nodes on a map. I.e., it should plot the nodes to an interactive world map (only for languages whose coordinates are provided, in e.g. GeoJSON format), and then connect them with straight-lines (as opposed to the current curved lines). Use an open map framework, like [http://leafletjs.com leaflet], [http://polymaps.org polymaps], or [http://openlayers.org openlayers] || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || coordinates for Mongolic languages || Using the map [https://en.wikipedia.org/wiki/File:Linguistic_map_of_the_Mongolic_languages.png Linguistic map of the Mongolic languages.png], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format that can be loaded by pairmapper (or, e.g., converted to kml and loaded in google maps). The file should contain points that are a geographic "center" (locus) for where each Mongolic language on that map is spoken. Use the term "Khalkha" (iso 639-3 khk) for "Mongolisch", and find a better map for Buryat. You can use a capital city for bigger, national languages if you'd like (think Paris as a locus for French). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || draw languages as areas for pairmapper || Make a map interface that loads data (in e.g. GeoJSON or KML format) specifying areas where languages are spoken, as well as a single-point locus for the language, and displays the areas on the map (something like [http://leafletjs.com/examples/choropleth.html the way the states are displayed here]) with a node with language code (like for [[pairviewer]]) at the locus. This should be able to be integrated into pairmapper, the planned map version of pairviewer. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Tatar, Bashqort, and Chuvash || Using the maps listed here, try to define rough areas for where Tatar, Bashqort, and Chuvash are spoken. These areas should be specified in a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. Try to be fairly accurate and detailed. Maps to consult include [https://commons.wikimedia.org/wiki/File:Tatarbashkirs1989ru.PNG Tatarsbashkirs1989ru], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP] || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus Turkic languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Kumyk, Nogay, Karachay, Balkar. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for IE and Mongolic Caucasus-area languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Ossetian, Armenian, Kalmyk. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Avar, Chechen, Abkhaz, Georgian. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Kazakh || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Kazakh is spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Uzbek and Uyghur || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Uzbek and Uyghur are spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference areas Russian is spoken || Assume areas in Central Asia with any sort of measurable Russian population speak Russian. Use the following maps to create a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin: [https://commons.wikimedia.org/wiki/File:Kazakhstan_European_2012_Rus.png Kazakhstan_European_2012_Rus], [https://commons.wikimedia.org/wiki/File:Ethnicrussians1989ru.PNG Ethnicrussians1989ru], [https://commons.wikimedia.org/wiki/File:Lenguas_eslavas_orientales.PNG Lenguas_eslavas_orientales], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP]. Try to cover all the areas where Russian is spoken at least as a major language. || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || split nor into nob and nno in pairviewer || Currently in [[pairviewer]], nor is displayed as a language separately from nob and nno. However, the nor pair actually consists of both an nob and an nno component. Figure out a way for pairviewer (or pairsOut.py / get_all_lang_pairs.py) to detect this split. So instead of having swe-nor, there would be swe-nob and swe-nno displayed (connected seemlessly with other nob-* and nno-* pairs), though the paths between the nodes would each still give information about the swe-nor pair. Implement a solution, trying to make sure it's future-proof (i.e., will work with similar sorts of things in the future). || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || add support to pairviewer for regional and alternate orthograpic modes || Currently in [[pairviewer]], there is no way to detect or display modes like zh_TW. Add suppor to pairsOut.py / get_all_lang_pairs.py to detect pairs containing abbreviations like this, as well as alternate orthographic modes in pairs (e.g. uzb_Latn and uzb_Cyrl). Also, figure out a way to display these nicely in the pairviewer's front-end. Get creative. I can imagine something like zh_CN and zh_TW nodes that are in some fixed relation to zho (think Mickey Mouse configuration?). Run some ideas by your mentor and implement what's decided on. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|}<br />
<br />
=== Begiak ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || Generalise phenny/begiak git plugin || Rename the module to git (instead of github), and test it to make sure it's general enough for at least three common git services (should already be supported, but double check) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin commit info function || Add a function to get the status of a commit by reponame and name (similar to what the svn module does), and then find out why commit 6a54157b89aee88511a260a849f104ae546e3a65 in turkiccorpora resulted in the following output, and fix it: Something went wrong: dict_keys(['commits', 'user', 'canon_url', 'repository', 'truncated']) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin recent function || Find out why the recent function (begiak: recent) returns "ValueError: No JSON object could be decoded (file "/usr/lib/python3.2/json/decoder.py", line 371, in raw_decode)" for one of the repos (no permission) and find a way to fix it so it returns the status instead. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin status || Add a function that lets anyone (not just admin) get the status of the git event server. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || Document phenny/begiak git plugin || Document the module: how to use it with each service it supports, and the various ways the module can be interacted with (by administrators and anyone) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || phenny/begiak svn plugin info function || Find out why the info function ("begiak info [repo] [rev]") doesn't work and fix it. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document any phenny/begiak command that does not have information || Find a command that our IRC bot uses that is not documented, and document how it works both on the [http://wiki.apertium.org/wiki/Begiak Begiak wiki page] and in the code. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count rlx sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in rlx files and output that to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count t*x sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in all .t*x files (for language pairs) and output the sum to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to report the revision of each monolingual file || Make the awikstats module of our IRC bot ([[begiak]]) report each file's svn revision for pairs with their own monodices, e.g. [[Apertium-en-es/stats]]. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue to support nick aliases || Make the tell/ask queue function of our IRC bot ([[begiak]]) support alises for nicks, so that e.g. spectre/spectie/spectei can get tell messages regardless of which nick they were sent to. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue support deleting items from queue || Allow a user who added something to the tell/ask queue of our IRC bot ([[begiak]]) to display a list of the messages s/he has queued and delete one of them. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue split long messages || Make our IRC bot ([[begiak]])'s tell/ask function split overly long messages into multiple ones for display so as to not exceed the max IRC message length. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak blacklist for url interceptor || Modify our IRC bot ([[begiak]])'s url interceptor module so that an optional blacklist (list of url regexes?) can be provided in the config file. The point is to make it not display titles for site urls we might copy/paste a lot and/or that are known not to provide useful information. An example might be ^http(s?)://svn.code.sf.net/ . For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak relevant wiki module handle urls for wikis || Make our IRC bot ([[begiak]])'s url interceptor check whether a url is a link to a known mediawiki site (wikipedia, wiktionary, apertium wiki) and redirect to the appropriate module. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak apertium wiki module search capability || Have our IRC bot ([[begiak]])'s awik plugin search the apertium wiki and return top hit if a page isn't found (like the wikipedia plugin). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak wiki modules tell result || Make a function for our IRC bot ([[begiak]]) that allows someone to point another user to a wiki page (apertium wiki or wikipedia), and have it give them the results (e.g. for mentors to point students to resources). It could be an extra function on the .wik and .awik modules. Make sure it allows for all wiki modes in those modules (e.g., .wik.ru) and is intuitive to use. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}} || find content that phenny/begiak wiki modules don't do a good job with || Identify at least 10 pages or sections on Wikipedia or the apertium wiki that the respective [[begiak]] module doesn't return good output for. These may include content where there's immediately a subsection, content where the first thing is a table or infobox, or content where the first . doesn't end the sentence. Document generalisable scenarios about what the preferred behaviour would be. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || write a mailing list reporter for phenny/begiak || Write a module for our IRC bot ([[begiak]]) that either polls mailing list archives or is triggered by email being sent to a local account. The idea is to have begiak report a short IRC-message-length summary when someone posts to one of our publicly-visible mailing lists, like apertium-stuff or apertium-turkic lists. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make phenny/begiak git and svn modules display urls || When a user asks to display revision information, have [[begiak]] (our IRC bot) include a link to information on the revision. For example, when displaying information for apertium repo revision r57171, include the url http://sourceforge.net/p/apertium/svn/57171/ , maybe even a shortened version. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || greeting function for phenny/begiak || Write a module that has [[begiak]] (our IRC bot) keep track of users, and when a user it hasn't seen before enters a channel it's monitoring, have it greet them with a custom message, such as "Welcome to #apertium, (user)! Please stick around for a while and someone will address any questions you have." You'll have to keep track of users for each channel, and you should make the message enablable by channel. Also, allow a user-specific greeting to be enabled (e.g., for the ap-vbox user). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || fix phenny/begiak seen function || When begiak is restart, the <tt>.seen</tt> command forgets when it's seen everyone. Have the module save the relevant information as needed to a database (using standard phenny methods) that gets reloaded when the module is loaded on a restart of the bot. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || improve phenny/begiak timezone data || Find a source of standard timezone abbreviations and have the time module for [[begiak]] (our IRC bot) scrape and use that data. You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add support for timezone conversion to phenny/begiak || Add timezone conversion to the time plugin for [[begiak]] (our IRC bot). It should accept a time in one timezone and a destination timezone, and convert the time, e.g. ".tz 335EST in CET" should return "835CET". For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add city name support phenny/begiak timezone plugin || Find a source that maps city names to timezone abbreviations and have the .tz command for [[begiak]] (our IRC bot) scrape and use that data (e.g., ".time Barcelona" should give the current time in CET). You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add analysis and generation modes to apertium translation begiak module || Add the ability for the apertium translation module that's part [[begiak]] (our IRC bot) to query morphological analysis and generation modes. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make begiak's version control monitoring channel specific || Our IRC bot ([[begiak]]) currently monitors a series of git and svn repositories. When a commit is made to a repository, the bot displays the commit in all channels. For this task, you should modify both of these modules (svn and git) so that repositories being monitored (listed in the config file) can be specified in a channel-specific way. However, it should default to the current behaviour—channel-specific settings should just override the global monitoring pattern. You should fork [https://github.com/jonorthwash/phenny the bot on github] to work on this task and send a pull request when you're done. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Apertium linguistic data ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the bilingual dictionary of a language pair XX-YY in the incubator by adding 50 word correspondences to it || Languages XX and YY may have rather large dictionaries but a small bilingual dictioanry. Add words to the bilingual dictionary and test that the new vocabulary works. [[/Grow bilingual|Read more]]... || [[User:Mlforcada]] <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair XX-YY by adding 50 words to its vocabulary || Add words to language pair XX-YY and test that the new vocabulary works. [[/Add words|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Xavivars]] [[User:Bech]] [[User:Jimregan|Jimregan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Japerez]] [[User:tunedal]] [[User:Juanpabl]] [[User:Youssefsan|Youssefsan]] [[User:Firespeaker]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Find translation bugs by using LanguageTool, and correct them || The LanguageTool grammar/style checker has great rule sets for Catalan. Run it on output from Apertium translation into Catalan and fix 5 mistakes. [[/Fix using LanguageTool|Read more]]... || <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Add/correct one structural transfer rule to an existing language pair || Add or correct a structural transfer rule to an existing language pair and test that it works. [[/Add transfer rule|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Juanpabl]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 lexical selection rules for a language pair already set up with lexical selection || Add 10 lexical selection rules to improve the lexical selection quality of a pair and test them to ensure that they work. [[/Add lexical-select rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Japerez]] [[User:Firespeaker]] [[User:Raveesh]](more mentors welcome) <br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair to use lexical selection and write 5 rules || First set up a language pair to use the new lexical selection module (this will involve changing configure scripts, makefile and [[modes]] file). Then write 5 lexical selection rules. [[/Setup and add lexical selection|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]] [[User:Fulup|Fulup]] [[User:pankajksharma]] (more mentors welcome) <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 constraint grammar rules to repair part-of-speech tagging errors || Find some tagging errors and write 10 constraint grammar rules to fix the errors. [[/Add constraint-grammar rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fulup|Fulup]] (more mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair such that it uses constraint grammar for part-of-speech tagging || Find a language pair that does not yet use constraint grammar, and set it up to use constraint grammar. After doing this, find some tagging errors and write five rules for resolving them. [[/Setup constraint grammar for a pair|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Compare Apertium with another MT system and improve it || This tasks aims at improving an Apertium language pair when a web-accessible system exists for it in the 'net. Particularly good if the system is (approximately) rule-based such as [http://www.lucysoftware.com/english/machine-translation/lucy-lt-kwik-translator-/ Lucy], [http://www.reverso.net/text_translation.aspx?lang=EN Reverso], [http://www.systransoft.com/free-online-translation Systran] or [http://www.freetranslation.com/ SDL Free Translation]: (1) Install the Apertium language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Collect a corpus of text (newspaper, wikipedia) Segment it in sentences (using e.g., libsegment-java or a similar processor and a [https://en.wikipedia.org/wiki/Segmentation_Rules_eXchange SRX] segmentation rule file borrowed from e.g. OmegaT) and put each sentence in a line. Run the corpus through Apertium and through the other system Select those sentences where both outputs are very similar (e.g, 90% coincident). Decide which one is better. If the other language is better than Apertium, think of what modification could be done for Apertium to produce the same output, and make 3 such modifications.|| [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} What's difficult about this language pair? || For a language pair that is not in trunk or staging such that you know well the two languages involved, write a document describing the main problems that Apertium developers would encounter when developing that language pair (for that, you need to know very well how Apertium works). Note that there may be two such documents, one for A→B and the other for B→A Prepare it in your user space in the Apertium wiki.It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Write a contrastive grammar || Using a grammar book/resource document 10 ways in which the grammar of two languages differ, with no fewer than 3 examples of each difference. Put it on the wiki under Language1_and_Language2/Contrastive_grammar. See [[Farsi_and_English/Pending_tests]] for an example of a contrastive grammar that a previous GCI student made. || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Hand annotate 250 words of running text. || Use [[apertium annotatrix]] to hand-annotate 250 words of running text from Wikipedia for a language of your choice. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || The most frequent Romance-to-Romance transfer rules || Study the .t1x transfer rule files of Romance language pairs and distill 5-10 common rules that are common to all of them, perhaps by rewriting them into some equivalent form || [[User:Mlforcada]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Tag and align Macedonian--Bulgarian corpus || Take a Macedonian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-mk-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Bulgarian inflections || Write a program to extract Bulgarian inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Bulgarian_nouns Category:Bulgarian nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair by allowing for alternative translations || Improve the quality of a language pair by (a) detecting 5 cases where the (only) translation provided by the bilingual dictionary is not adequate in a given context, (b) adding the lexical selection module to the language, and (c) writing effective lexical selection rules to exploit that context to select a better translation || [[User:Francis Tyers]] [[User:Mlforcada]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up (X)HTML formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up (X)HTML formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up wordprocessor (ODT, RTF) formatting || Sometimes, an Apertium language pair takes a valid ODT or RTF source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of ODT or RTF files for testing purposes. Make sure they are opened using LibreOffice/OpenOffice.org (4) translate the valid files with the language pair (5) check if the translated files are also valid ODT or RTF files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up wordprocessor (ODT, RTF) formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up wordprocessor formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Start a language pair involving Interlingua || Start a new language pair involving [https://en.wikipedia.org/wiki/Interlingua Interlingua] using the [http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO Apertium new language HOWTO]. Interlingua is the second most used "artificial" language, after Esperanto). As Interlingua is basically a Romance language, you can use a Romance language as the other language, and Romance-language dictionaries rules may be easily adapted. Include at least 50 very frequent words (including some grammatical words) and at least one noun--phrase transfer rule in the ia→X direction. || [[User:Mlforcada]] [[User:Youssefsan|Youssefsan]] (will reach out also to the interlingua community) <br />
|-<br />
| {{sc|research}} || Document materials for a language not yet on our wiki || Document materials for a language not yet on our wiki. This should look something like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free, etc., as well as some scholarly articles regarding the language, especially if about computational resources. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Corpus Collection for Sindhi language. || (1) Collect a Sindhi monolingual corpus and tag it (some sentences). (2)Look for parallel/comparable corpusof Sindhi & (English or Hindi or Urdu or other) , clean it and mention it document materials wiki page for Sindhi. || [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Macedonian corpus || Take a Albanian--Macedonian corpus, for example SETimes, tag it using the [[apertium-sq-mk]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Serbo-Croatian corpus || Take a Albanian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-sq-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Bulgarian corpus || Take a Albanian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sq-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--English corpus || Take a Albanian--English corpus, for example SETimes, tag it using the [[apertium-sq-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--Serbo-Croatian corpus || Take a Macedonian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-mk-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--English corpus || Take a Macedonian--English corpus, for example SETimes, tag it using the [[apertium-mk-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--Bulgarian corpus || Take a Serbo-Croatian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sh-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--English corpus || Take a Serbo-Croatian--English corpus, for example SETimes, tag it using the [[apertium-sh-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Bulgarian--English corpus || Take a Bulgarian--English corpus, for example SETimes, tag it using the [[apertium-bg-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek noun inflections || Write a program to extract Greek inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_nouns Category:Greek nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek verb inflections || Write a program to extract Greek inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_verbs Category:Greek verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek adjective inflections || Write a program to extract Greek inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_adjectives Category:Greek adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to convert the Giellatekno Faroese CG to Apertium tags || Write a program which converts the tagset of the Giellatekno Faroese constraint grammar. || [[User:Francis Tyers]] [[User:Trondtr]]<br />
|-<br />
| {{sc|quality}} || Import nouns from azmorph into apertium-aze || Take the nouns (excluding proper nouns) from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adjectives from azmorph into apertium-aze || Take the adjectives from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adverbs from azmorph into apertium-aze || Take the adverbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import verbs from azmorph into apertium-aze || Take the verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import misc categories from azmorph into apertium-aze || Take the categories that aren't nouns, proper nouns, adjectives, adverbs, and verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--English sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain text files (eng.FILENAME.txt) and (kaz.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--Russian sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain files (eng.FILENAME.txt) and (rus.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|}<br />
<br />
=== Data mangling ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion || Write a conversion module for an existing dictionary for apertium-dixtools. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion in python || Write a conversion module for an existing free bilingual dictionary to [[lttoolbox]] format using Python. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese noun inflections || Write a program to extract Faroese inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_nouns Category:Faroese nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese verb inflections || Write a program to extract Faroese inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_verbs Category:Faroese verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese adjective inflections || Write a program to extract Faroese inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_adjectives Category:Faroese adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || scraper for all wiktionary pages in a category || a script that returns urls of all pages in a wiktionary category recursively (e.g., http://en.wiktionary.org/wiki/Category:Bashkir_nouns should also include pages from http://en.wiktionary.org/wiki/Category:Bashkir_proper_nouns ) || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Bilingual dictionary from word alignments script || Write a script which takes [[GIZA++]] alignments and outputs a <code>.dix</code> file. The script should be able to reduce the number of tags, and also have some heuristics to test if a word is too-frequently aligned. || [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Scraper for free forum content || Write a script to scrape/capture all freely available content for a forum or forum category and dump it to an xml corpus file or text file. || [[User:Firespeaker]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} scrape a freely available dictionary using tesseract || Use tesseract to scrape a freely available dictionary that exists in some image format (pdf, djvu, etc.). Be sure to scrape grammatical information if available, as well stems (e.g., some dictionaries might provide entries like АЗНА·Х, where the stem is азна), and all possible translations. Ideally it should dump into something resembling [[bidix]] format, but if there's no grammatical information and no way to guess at it, some flat machine-readable format is fine. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Write an aligner for UDHR || Write a script to align two translations of the [[UDHR]] (final destination: trunk/apertium-tools/udhr_aligner.py). It should take two UDHR translations and output a tmx file with one article per entry. It should use the xml formatted UDHRs available from [http://www.unicode.org/udhr/index_by_name.html http://www.unicode.org/udhr/index_by_name.html] as input and output the aligned texts in tmx format. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || script to generate dictionary from IDS data || Write a script that takes two lg_id codes, scrapes those dictionaries at [http://lingweb.eva.mpg.de/ids/ IDS], matches entries, and outputs a dictionary in [[bidix]] format || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Script to convert rapidwords dictionary to apertium bidix || Write a script (preferably in python3) that converts an arbitrary dictionary from [http://rapidwords.net/reports rapidwords.net] to apertium bidix format. Keep in mind that rapidwords dictionaries may contain more than two languages, while apertium dictionaries may only contain two languages, so the script should take an argument allowing the user to specify which languages to extract. Ideally, there should also be an argument that lists the languages available. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert simple bilingual dictionary entries to lttoolbox-style entries || Write a simple converter for lists of bilingual dictionary entries (one per line) so that one can use the shorthand notation <code>perro.n.m:dog.n</code> to generate lttoolbox-style entries of the form <code><e><l>perro<s n="n"/><s n="m"/></l><r>dog<s n="n"/></r></e></code>. You may start from [https://github.com/jimregan/internostrum-to-lttoolbox] if you wish. || [[User:mlforcada]]<br />
|}<br />
<br />
=== Misc ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|documentation}} || Installation instructions for missing GNU/Linux distributions or versions || Adapt installation instructions for a particular GNU/Linux or Unix-like distribution if the existing instructions in the Apertium wiki do not work or have bugs of some kind. Prepare it in your user space in the Apertium wiki. It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Installing Apertium in lightweight GNU/Linux distributions || Give instructions on how to install Apertium in one of the small or lightweight GNU/Linux distributions such as [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz], so that may be used in older machines || [[User:Mlforcada]] [[User:Bech]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome) <br />
|-<br />
| {{sc|documentation}} || Video guide to installation || Prepare a screencast or video about installing Apertium; make sure it uses a format that may be viewed with Free software. When approved by your mentor, upload it to youtube, making sure that you use the HTML5 format which may be viewed by modern browsers without having to use proprietary plugins such as Adobe Flash. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Apertium in 5 slides || Write a 5-slide HTML presentation (only needing a modern browser to be viewed and ready to be effectively "karaoked" by some else in 5 minutes or less: you can prove this with a screencast) in the language in which you write more fluently, which describes Apertium, how it works, and what makes it different from other machine translation systems. || [[User:Mlforcada]] [[User:Firespeaker]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Improved "Become a language-pair developer" document || Read the document [[Become_a_language_pair_developer_for_Apertium]] and think of ways to improve it (don't do this if you have not done any of the language pair tasks). Send comments to your mentor and/or repare it in your user space in the Apertium wiki. There will be a chance to change the document later in the Apertium Wiki. || [[User:Mlforcada]] [[User:Bech]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || An entry test for Apertium || Write 20 multiple-choice questions about Apertium. Each question will give 3 options of which only one is true, so that we can build an "Apertium exam" for future GSoC/GCI/developers. Optionally, add an explanation for the correct answer. || [[User:Mlforcada]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Apertium on Windows (installer) || Make an Apertium installer for Windows; it should at least support Windows 7/8 (x86 and x86-64). Remember to check in the source to SVN and make it easily upgradeable. Adding language pairs should also not be difficult. See the current (non-functional) [[Apertium guide for Windows users]] for inspiration. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|documentation}} || Apertium on Windows (docs) || Document the new Apertium installer for Windows on the [[Apertium guide for Windows users]]. This task requires the "Apertium on Windows (installer)" to be completed || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Light Apertium bootable ISO for small machines || Using [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz] or a similar lightweight GNU/Linux, produce the minimum-possible bootable live ISO or live USB image that contains the OS, minimum editing facilities, Apertium, and a language pair of your choice. Make sure no package that is not strictly necessary for Apertium to run is included.|| [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|- <br />
| {{sc|code}} || Apertium in XLIFF workflows || Write a shell script and (if possible, using the filter definition files found in the documentation) a filter that takes an [https://en.wikipedia.org/wiki/XLIFF XLIFF] file such as the ones representing a computer-aided translation job and populates with translations of all segments that are not translated, marking them clearly as machine-translated. || [[User:Mlforcada]] [[User:Espla]] [[User:Fsanchez]] [[User:Japerez]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up (X)HTML formatting || Sometimes, an Apertium language pair takes a valid HTML/XHTML source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of HTML/XHTML files for testing purposes. Make sure they are valid using an HTML/XHTML validator (4) translate the valid files with the language pair (5) check if the translated files are also valid HTML/XHTML files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || Investigate how orthographic modes on kk.wikipedia.org are implemented || [http://kk.wikipedia.org The Kazakh-language wikipedia] has a menu at the top for selecting alphabet (Кирил, Latın, توتە - for Cyrillic-, Latin-, and Arabic-script modes). This appears to be some sort of plugin that transliterates the text on the fly. Find out what it is and how it works, and then document it somewhere on the wiki. If this has already been documented elsewhere, point a link to that, but you still should summarise in your own words what exactly it is. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a transliteration plugin for mediawiki || Write a plugin similar in functionality (and perhaps implementation) to the way the [http://kk.wikipedia.org Kazakh-language wikipedia]'s orthography changing system works. It should be able to be directed to use any arbitrary mode from an apertium mode file installed in a pre-specified path on a server.|| [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} train tesseract on a language with no available tesseract data || Train tesseract (the OCR software) on a language that it hasn't previously been trained on. We're especially interested in languages with some coverage in apertium. We can provide images of text to train on. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|research}} || using language transducers for predictive text on Android || Investigate what it would take to add some sort of plugin to existing Android predictive text / keyboard framework(s?) that would allow the use of lttoolbox (or hfst? or libvoikko stuff?) transducers to be used to predict text and/or guess swipes (in "swype" or similar). Document your findings on the apertium wiki. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || custom predictive text keyboards for Android || Research and document on apertium's wiki the steps needed to design an application for Android that could load arbitrarily defined / pre-specified keyboard layouts (e.g., say I want to make custom keyboard layouts for [[Kumyk]] and [[Guaraní]], and load either one into the same program) as well as either an lttoolbox-format transducer or a file easily generated from one that could be paired with a keyboard layout and used to predict text in that language. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} identify 75 substitutions for conversion from colloquial Finnish to book Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to come up with 75 examples of differences between colloquial Finnish and book Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} document the correspondences between the tagset used in the RNC tagged corpus and the Apertium tagset for Russian || The Apertium tagset for Russian and the RNC tagset are different, if we were able to make correspondences between them then we could compare our output against theirs. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Disambiguate 500 words of Russian text. || The objective of this task is to disambiguate by hand 500 words of text in Russian. You can find a Wikipedia article you are interested in, or you can be assigned one, you will be given the output of a morphological analyser for Russian, and your task is to select the most adequate analysis in context. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Convert 500 words of Finnish text in colloquial Finnish to literary Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to convert 500 words of text from colloquial Finnish to literary Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Research and document what it would take to migrate from svn to git || For this task, you should research and document succinctly on the [http://wiki.apertium.org/ apertium wiki] all the issues involved in moving our entire svn repository to git. It should cover issues like preserving commit histories and tags/releases, separating repositories for each module (and what constitutes a single module), how to migrate the entire codebase (including issues of timing/logistics), replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address why a problem exists and what sorts of things could be done to remedy it (with fairly specific solutions). You do not need to worry about what a full migration strategy might look like || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Come up with a potential migration strategy for apertium to move from svn to git || For this task, you should propose a hypothetical migration strategy for apertium to move from our current svn repository to a git repository and document the proposal on the [http://wiki.apertium.org/ apertium wiki]. The proposal should address the logistics and timing issues of anything that might come up in a migration of the entire codebase, including preserving commit histories and tags/releases, separating repositories for each module, replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address how to approach each problem and where on the timeline to take care of the issue. You do not need to worry about specific solutions to the various problems. || [[User:Firespeaker]]<br />
|}<br />
<br />
[[Category:Google Code-in]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in&diff=50966Task ideas for Google Code-in2014-11-17T14:12:51Z<p>Ksnmi: /* Data mangling */</p>
<hr />
<div>{{TOCD}}<br />
This is the task ideas page for [http://www.google-melange.com/gci/homepage/google/gci2014 Google Code-in], here you can find ideas on interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.<br />
<br />
The people column lists people who you should get in contact with to request further information. All tasks are 2 hours maximum estimated amount of time that would be spent on the task by an experienced developer, however:<br />
<br />
# '''this does not include time taken to [[Minimal installation from SVN|install]] / set up apertium'''.<br />
# this is the time expected to take by an experienced developer, you may find that you spend more time on the task because of the learning curve. <br />
<br />
<!--Если ты не понимаешь английский язык или предпочитаешь работать над русским языком или другими языками России, смотри: [[Task ideas for Google Code-in/Russian]]--><br />
'''Categories:'''<br />
<br />
* {{sc|code}}: Tasks related to writing or refactoring code<br />
* {{sc|documentation}}: Tasks related to creating/editing documents and helping others learn more<br />
* {{sc|research}}: Tasks related to community management, outreach/marketting, or studying problems and recommending solutions<br />
* {{sc|quality}}: Tasks related to testing and ensuring code is of high quality.<br />
* {{sc|interface}}: Tasks related to user experience research or user interface design and interaction<br />
<br />
You can find descriptions of some of the mentors here: [[List_of_Apertium_mentors]].<br />
<br />
==Task list==<br />
<br />
=== Misc tools ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || Unigram tagging mode for <code>apertium-tagger</code> || Edit the <code>apertium-tagger</code> code to allow for lexicalised unigram tagging. This would basically choose the most frequent analysis for each surface form of a word. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Data format for the unigram tagger || Come up with a binary storage format for the data used for the unigram tagger. It could be based on the existing <code>.prob</code> format. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Add tag combination back-off to unigram tagger. || Modify the unigram tagger to allow for back-off to tag sequence in the case that a given form is not found. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Prototype unigram tagger. || Write a simple unigram tagger in a language of your choice. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Training for unigram tagger || Write a program that trains a model suitable for use with the unigram tagger. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || make voikkospell understand apertium stream format input || Make voikkospell understand apertium stream format input, e.g. ^word/analysis1/analysis2$, voikkospell should only interpret the 'word' part to be spellchecked. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make voikkospell return output in apertium stream format || make voikkospell return output suggestions in apertium stream format, e.g. ^correctword$ or ^incorrectword/correct1/correct2$ || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || libvoikko support for OS X || Make a spell server for OS X's system-wide spell checker to use arbitrary languages through libvoikko. See https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/SpellCheck/Tasks/CreatingSpellServer.html#//apple_ref/doc/uid/20000770-BAJFBAAH for more information || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Ubuntu/debian || document how to set up libreoffice voikko working with a language on Ubuntu and debian || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Fedora || document how to set up libreoffice voikko working with a language on Fedora || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Windows || document how to set up libreoffice voikko working with a language on Windows || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on OS X || document how to set up libreoffice voikko working with a language on OS X || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document how to set up libenchant to work with libvoikko || Libenchant is a spellchecking wrapper. Set it up to work with libvoikko, a spellchecking backend, and document how you did it. You may want to use a spellchecking module available in apertium for testing. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || firefox/iceweasel plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] .<br />
|| [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || chrome/chromium plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] . || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || firefox/iceweasel plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || chrome/chromium plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || firefox/iceweasel plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || chrome/chromium plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|quality}} || make apertium-quality work with python3.3 on all platforms || migrate apertium-quality away from distribute to newer setup-tools so it installs correctly in more recent versions of python (known incompatible: python3.3 OS X, known compatible: MacPorts python3.2) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || Get bible aligner working (or rewrite it) || trunk/apertium-tools/bible_aligner.py - Should take two bible translations and output a tmx file with one verse per entry. There is a standard-ish plain-text bible translation format that we have bible translations in, and we have files that contain the names of verses of various languages mapped to English verse names || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || tesseract interface for apertium languages || Find out what it would take to integrate apertium or voikkospell into tesseract. Document thoroughly available options on the wiki. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|code}} || Syntax tree visualisation using GNU bison || Write a program which reads a grammar using bison, parses a sentence and outputs the syntax tree as text, or graphViz or something. Some example bison code can be found [https://svn.code.sf.net/p/apertium/svn/branches/transfer4 here]. || [[User:Francis Tyers]] [[User:Mlforcada]]<br />
|-<br />
| {{sc|code}} || make concordancer work with output of analyser || Allow [http://pastebin.com/raw.php?i=KG8ydLPZ spectie's concordancer] to accept an optional apertium mode and directory (implement via argparse). When it has these, it should run the corpus through that apertium mode and search against the resulting tags and lemmas as well as the surface forms. E.g., the form алдым might have the analysis via an apertium mode of ^алдым/алд{{tag|n><px1sg}}{{tag|nom}}/ал{{tag|v><tv}}{{tag|ifi><p1}}{{tag|sg}}, so a search for "px1sg" should bring up this word. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || convert a current transducer for a language using lexc+twol to a guesser || Figure out how to generate a guesser for a language module that uses lexc for morphotactics and twol for morphophonology (e.g., apertium-kaz). One approach to investigate would be to generate all the possible archiphoneme representations of a given form and run the lexc guesser on that. || [[User:Firespeaker]] [[User:Flammie]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in hfst || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in HFST. The script should take a language code and create a new directory with a minimal lexc file, a minimal twol file, and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in lttoolbox || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in lttoolbox. The script should take a language code and create a new directory with a minimal dix file and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium bilingual module || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium bilingual module. The script should take two language codes and create a new directory with a minimal dix file, a minimal lrx file, and minimal transfer (.t*x) files, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Write a script to explain an Apertium machine translation in terms of its parts || Write a script (preferably in python3 or bash/equivalent) that takes one text segment ''S'', applies a given Apertium system to it and to all its possible whole-word subsegments ''s'' (perhaps up to a certain maximum length) and outputs a list ''(s,t,i,j,k,l)'' of correspondences so that the result of applying Apertium to ''s'' is ''t'', ''t'' is a whole-word subsegment of ''T'', the Apertium translation of ''S'', ''i'' and ''j'' are the starting position and end position of ''s'' in ''S'' and ''k'' and ''l'' are hte starting position and the end postion of ''t'' in ''T''. The script should read ''S'', ''T'', two language codes and optionally a maximum length and generate the correspondences ''(s,t,i,j,k,l)'' one per line || [[User:mlforcada]]<br />
|}<br />
<br />
=== Website and apy ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || apertium-apy mode for geriaoueg (biltrans in context) || apertium-apy function that accepts a context (e.g., ±n ~words around word) and a position in the context of a word, gets biltrans output on entire context, and returns translation for the word || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || SSL/HTTPS for Apertium.org || The Apertium site itself is equipped with SSL. Get Piwik working on HTTPS as well. After that, default to the HTTPS site via Apache. See [http://sourceforge.net/p/apertium/tickets/41/ ticket 41] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Website translation in [[Html-tools]] (code) || Html-tools should detect when the user wants to translate a website (similar to how Google Translate does it) and switch to an interface (See "Website translation in [[Html-tools]] (interface)" task) and perform the translation. It should also make it so that new pages that the user navigates to are translated. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|interface}} || Website translation in [[Html-tools]] (interface) || Add an interface to Html-tools that shows a webpage in an <iframe> with translation options and a back button to return to text/document translation. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] crashing on iPads when copying text || Make it so that the Apertium site does not crash on iPads when copying text on any of the modes while maintaining semantic HTML. This task requires having access to an iPad. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] copying text on Windows Phone IE || Make it so that the Apertium site allows copying text on WP while maintaining semantic HTML. This task requires having access to an Windows Phone. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[APY]] API keys || Add API key support but don't overengineer it. See [http://sourceforge.net/p/apertium/tickets/31/ ticket 31] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Xavivars]] <br />
|-<br />
| {{sc|code}} || Localisation of tag attributes on [[Html-tools]] || The meta description tag isn't localized as of now since the text is an attribute. Search engines often display this as their snippet. A possible way to achieve this is using data-text="@content@description". See [http://sourceforge.net/p/apertium/tickets/29/ ticket 29] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] font issues || See [http://sourceforge.net/p/apertium/tickets/27/ ticket 27] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Detect target language || When changing the source language, the [[Html-tools]] UI will often show a bunch of greyed out buttons, and the user has to fish for possible languages in the right-hand side drop-down. This is confusing (user might think "are there no languages to translate into?") and annoying. A simple solution is to reorder the list so that all possible target languages are shown first, then the list of greyed-out languages. See [http://sourceforge.net/p/apertium/tickets/25/ ticket 25] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Maintaining order of user interactions on [[Html-tools]] || If a user clicks a new language choice while translation or detection is proceeding (AJAX callback has not yet returned), the original action will not be cancelled. Make it so that the first action is canceled and overridden by the second. See [http://sourceforge.net/p/apertium/tickets/9/ ticket 9] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Drag-n-drop file translation on [[Html-tools]] || See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || More file formats for [[APY]] || APY does not support DOC, XLS, PPT file translation that require the file being converted to the newer XML based formats through LibreOffice or equivalent and then back. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Improved file translation functionality for [[APY]] || APY needs logging and to be non-blocking for file translation. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|interface}} || Abstract the formatting for the [[Html-tools]] interface. || The Html-tools interface should be easily customisable so that people can make it look how they want. The task is to abstract the formatting and make one or more new stylesheets to change the appearance. This is basically making a way of "skinning" the interface. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|interface}} || [[Html-tools]] spell-checker interface || Add an enableable spell-checker module to the [[html-tools]] interface. Get fancy with jquery/etc. so that e.g., misspelled words are underlined in red and recommendations for each word are given in some sort of drop-down menu. Feel free to implement a dummy function for testing spelling to test the interface until the "Html-tools spell-checker code" task is complete. There is a half-done version available from last year that may just need to be cleaned up and integrated into the current html-tools code. See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[Html-tools]] spell-checker code || Add code to the [[html-tools]] interface that allows spell checking to be performed. Should send entire string, and be able to match each returned result to its appropriate input word. Should also update as new words are typed (but [https://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-tools/apertium-html-tools/assets/js/translator.js#l42 not on every keystroke]). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[libvoikko]] support for [[APY]] || Write a function for [[APY]] that checks the spelling of an input string and for each word returns whether the word is correct, and if unknown returns suggestions. Whether segmentation is done by the client or by apertium-apy will have to be figured out. You will also need to add scanning for spelling modes to the initialisation section. Try to find a sensible way to structure the requests and returned data with JSON. Add a switch to allow someone to turn off support for this (use argparse set_false). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] expanding textareas || The input textarea in the html-tools translation interface does not expand depending on the user's input even when there is significant whitespace remaining on the page. Improvements include varying the length of the textareas to fill up the viewport or expanding depending on input. Both the input and output textareas would have to maintain the same length for interface consistency. Different behavior may be desired on mobile. See [http://sourceforge.net/p/apertium/tickets/4/ ticket 4] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Performance tracking in [[APY]] || Add a way for [[APY]] to keep track of number of words in input and time between sending input to a pipeline and receiving output, for the last n (e.g., 100) requests, and write a function to return the average words per second over something<n (e.g., 10) requests. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Make [[APY]] use one lock per pipeline || Make [[APY]] use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Language variant picker in [[Html-tools]] || Displaying language variants as distinct languages in the translator language selector is awkward and repetitive. Allowing users to first select a language and then display radio buttons for choosing a variant below the relevant translation box, if relevant, provides a better user interface. See [http://sourceforge.net/p/apertium/tickets/1/ ticket 1] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Investigate how to implement HTML-translation that can deal with broken HTML || The old Apertium website had a 'surf-and-translate' feature, but it frequently broke on badly-behaved HTML. Investigate how similar web sites deal with broken HTML when rewriting the internal content of a (possible automatically generated) HTML page. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Add permalink capability for generation and analysis [[Html-tools]] || [[Html-tools]] currently has support for permalinks to various translation modes. For this task, you should add similar support for analysis and generation modes. I.e., a person should be able to simply send someone a link for e.g., the Kazakh morphological analyser. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Pair visualisations ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || fix pairviewer's 2- and 3-letter code conflation problems || [[pairviewer]] doesn't always conflate languages that have two codes. E.g. sv/swe, nb/nob, de/deu, da/dan, uk/ukr, et/est, nl/nld, he/heb, ar/ara, eus/eu are each two separate nodes, but should instead each be collapsed into one node. Figure out why this isn't happening and fix it. Also, implement an algorithm to generate 2-to-3-letter mappings for available languages based on having the identical language name in languages.json instead of loading the huge list from codes.json; try to make this as processor- and memory-efficient as possible. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || map support for pairviewer ("pairmapper") || Write a version of [[pairviewer]] that instead of connecting floating nodes, connects nodes on a map. I.e., it should plot the nodes to an interactive world map (only for languages whose coordinates are provided, in e.g. GeoJSON format), and then connect them with straight-lines (as opposed to the current curved lines). Use an open map framework, like [http://leafletjs.com leaflet], [http://polymaps.org polymaps], or [http://openlayers.org openlayers] || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || coordinates for Mongolic languages || Using the map [https://en.wikipedia.org/wiki/File:Linguistic_map_of_the_Mongolic_languages.png Linguistic map of the Mongolic languages.png], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format that can be loaded by pairmapper (or, e.g., converted to kml and loaded in google maps). The file should contain points that are a geographic "center" (locus) for where each Mongolic language on that map is spoken. Use the term "Khalkha" (iso 639-3 khk) for "Mongolisch", and find a better map for Buryat. You can use a capital city for bigger, national languages if you'd like (think Paris as a locus for French). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || draw languages as areas for pairmapper || Make a map interface that loads data (in e.g. GeoJSON or KML format) specifying areas where languages are spoken, as well as a single-point locus for the language, and displays the areas on the map (something like [http://leafletjs.com/examples/choropleth.html the way the states are displayed here]) with a node with language code (like for [[pairviewer]]) at the locus. This should be able to be integrated into pairmapper, the planned map version of pairviewer. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Tatar, Bashqort, and Chuvash || Using the maps listed here, try to define rough areas for where Tatar, Bashqort, and Chuvash are spoken. These areas should be specified in a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. Try to be fairly accurate and detailed. Maps to consult include [https://commons.wikimedia.org/wiki/File:Tatarbashkirs1989ru.PNG Tatarsbashkirs1989ru], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP] || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus Turkic languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Kumyk, Nogay, Karachay, Balkar. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for IE and Mongolic Caucasus-area languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Ossetian, Armenian, Kalmyk. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Avar, Chechen, Abkhaz, Georgian. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Kazakh || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Kazakh is spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Uzbek and Uyghur || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Uzbek and Uyghur are spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference areas Russian is spoken || Assume areas in Central Asia with any sort of measurable Russian population speak Russian. Use the following maps to create a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin: [https://commons.wikimedia.org/wiki/File:Kazakhstan_European_2012_Rus.png Kazakhstan_European_2012_Rus], [https://commons.wikimedia.org/wiki/File:Ethnicrussians1989ru.PNG Ethnicrussians1989ru], [https://commons.wikimedia.org/wiki/File:Lenguas_eslavas_orientales.PNG Lenguas_eslavas_orientales], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP]. Try to cover all the areas where Russian is spoken at least as a major language. || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || split nor into nob and nno in pairviewer || Currently in [[pairviewer]], nor is displayed as a language separately from nob and nno. However, the nor pair actually consists of both an nob and an nno component. Figure out a way for pairviewer (or pairsOut.py / get_all_lang_pairs.py) to detect this split. So instead of having swe-nor, there would be swe-nob and swe-nno displayed (connected seemlessly with other nob-* and nno-* pairs), though the paths between the nodes would each still give information about the swe-nor pair. Implement a solution, trying to make sure it's future-proof (i.e., will work with similar sorts of things in the future). || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || add support to pairviewer for regional and alternate orthograpic modes || Currently in [[pairviewer]], there is no way to detect or display modes like zh_TW. Add suppor to pairsOut.py / get_all_lang_pairs.py to detect pairs containing abbreviations like this, as well as alternate orthographic modes in pairs (e.g. uzb_Latn and uzb_Cyrl). Also, figure out a way to display these nicely in the pairviewer's front-end. Get creative. I can imagine something like zh_CN and zh_TW nodes that are in some fixed relation to zho (think Mickey Mouse configuration?). Run some ideas by your mentor and implement what's decided on. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|}<br />
<br />
=== Begiak ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || Generalise phenny/begiak git plugin || Rename the module to git (instead of github), and test it to make sure it's general enough for at least three common git services (should already be supported, but double check) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin commit info function || Add a function to get the status of a commit by reponame and name (similar to what the svn module does), and then find out why commit 6a54157b89aee88511a260a849f104ae546e3a65 in turkiccorpora resulted in the following output, and fix it: Something went wrong: dict_keys(['commits', 'user', 'canon_url', 'repository', 'truncated']) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin recent function || Find out why the recent function (begiak: recent) returns "ValueError: No JSON object could be decoded (file "/usr/lib/python3.2/json/decoder.py", line 371, in raw_decode)" for one of the repos (no permission) and find a way to fix it so it returns the status instead. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin status || Add a function that lets anyone (not just admin) get the status of the git event server. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || Document phenny/begiak git plugin || Document the module: how to use it with each service it supports, and the various ways the module can be interacted with (by administrators and anyone) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || phenny/begiak svn plugin info function || Find out why the info function ("begiak info [repo] [rev]") doesn't work and fix it. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document any phenny/begiak command that does not have information || Find a command that our IRC bot uses that is not documented, and document how it works both on the [http://wiki.apertium.org/wiki/Begiak Begiak wiki page] and in the code. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count rlx sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in rlx files and output that to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count t*x sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in all .t*x files (for language pairs) and output the sum to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to report the revision of each monolingual file || Make the awikstats module of our IRC bot ([[begiak]]) report each file's svn revision for pairs with their own monodices, e.g. [[Apertium-en-es/stats]]. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue to support nick aliases || Make the tell/ask queue function of our IRC bot ([[begiak]]) support alises for nicks, so that e.g. spectre/spectie/spectei can get tell messages regardless of which nick they were sent to. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue support deleting items from queue || Allow a user who added something to the tell/ask queue of our IRC bot ([[begiak]]) to display a list of the messages s/he has queued and delete one of them. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue split long messages || Make our IRC bot ([[begiak]])'s tell/ask function split overly long messages into multiple ones for display so as to not exceed the max IRC message length. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak blacklist for url interceptor || Modify our IRC bot ([[begiak]])'s url interceptor module so that an optional blacklist (list of url regexes?) can be provided in the config file. The point is to make it not display titles for site urls we might copy/paste a lot and/or that are known not to provide useful information. An example might be ^http(s?)://svn.code.sf.net/ . For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak relevant wiki module handle urls for wikis || Make our IRC bot ([[begiak]])'s url interceptor check whether a url is a link to a known mediawiki site (wikipedia, wiktionary, apertium wiki) and redirect to the appropriate module. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak apertium wiki module search capability || Have our IRC bot ([[begiak]])'s awik plugin search the apertium wiki and return top hit if a page isn't found (like the wikipedia plugin). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak wiki modules tell result || Make a function for our IRC bot ([[begiak]]) that allows someone to point another user to a wiki page (apertium wiki or wikipedia), and have it give them the results (e.g. for mentors to point students to resources). It could be an extra function on the .wik and .awik modules. Make sure it allows for all wiki modes in those modules (e.g., .wik.ru) and is intuitive to use. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}} || find content that phenny/begiak wiki modules don't do a good job with || Identify at least 10 pages or sections on Wikipedia or the apertium wiki that the respective [[begiak]] module doesn't return good output for. These may include content where there's immediately a subsection, content where the first thing is a table or infobox, or content where the first . doesn't end the sentence. Document generalisable scenarios about what the preferred behaviour would be. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || write a mailing list reporter for phenny/begiak || Write a module for our IRC bot ([[begiak]]) that either polls mailing list archives or is triggered by email being sent to a local account. The idea is to have begiak report a short IRC-message-length summary when someone posts to one of our publicly-visible mailing lists, like apertium-stuff or apertium-turkic lists. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make phenny/begiak git and svn modules display urls || When a user asks to display revision information, have [[begiak]] (our IRC bot) include a link to information on the revision. For example, when displaying information for apertium repo revision r57171, include the url http://sourceforge.net/p/apertium/svn/57171/ , maybe even a shortened version. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || greeting function for phenny/begiak || Write a module that has [[begiak]] (our IRC bot) keep track of users, and when a user it hasn't seen before enters a channel it's monitoring, have it greet them with a custom message, such as "Welcome to #apertium, (user)! Please stick around for a while and someone will address any questions you have." You'll have to keep track of users for each channel, and you should make the message enablable by channel. Also, allow a user-specific greeting to be enabled (e.g., for the ap-vbox user). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || fix phenny/begiak seen function || When begiak is restart, the <tt>.seen</tt> command forgets when it's seen everyone. Have the module save the relevant information as needed to a database (using standard phenny methods) that gets reloaded when the module is loaded on a restart of the bot. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || improve phenny/begiak timezone data || Find a source of standard timezone abbreviations and have the time module for [[begiak]] (our IRC bot) scrape and use that data. You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add support for timezone conversion to phenny/begiak || Add timezone conversion to the time plugin for [[begiak]] (our IRC bot). It should accept a time in one timezone and a destination timezone, and convert the time, e.g. ".tz 335EST in CET" should return "835CET". For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add city name support phenny/begiak timezone plugin || Find a source that maps city names to timezone abbreviations and have the .tz command for [[begiak]] (our IRC bot) scrape and use that data (e.g., ".time Barcelona" should give the current time in CET). You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add analysis and generation modes to apertium translation begiak module || Add the ability for the apertium translation module that's part [[begiak]] (our IRC bot) to query morphological analysis and generation modes. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make begiak's version control monitoring channel specific || Our IRC bot ([[begiak]]) currently monitors a series of git and svn repositories. When a commit is made to a repository, the bot displays the commit in all channels. For this task, you should modify both of these modules (svn and git) so that repositories being monitored (listed in the config file) can be specified in a channel-specific way. However, it should default to the current behaviour—channel-specific settings should just override the global monitoring pattern. You should fork [https://github.com/jonorthwash/phenny the bot on github] to work on this task and send a pull request when you're done. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Apertium linguistic data ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the bilingual dictionary of a language pair XX-YY in the incubator by adding 50 word correspondences to it || Languages XX and YY may have rather large dictionaries but a small bilingual dictioanry. Add words to the bilingual dictionary and test that the new vocabulary works. [[/Grow bilingual|Read more]]... || [[User:Mlforcada]] <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair XX-YY by adding 50 words to its vocabulary || Add words to language pair XX-YY and test that the new vocabulary works. [[/Add words|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Xavivars]] [[User:Bech]] [[User:Jimregan|Jimregan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Japerez]] [[User:tunedal]] [[User:Juanpabl]] [[User:Youssefsan|Youssefsan]] [[User:Firespeaker]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Find translation bugs by using LanguageTool, and correct them || The LanguageTool grammar/style checker has great rule sets for Catalan. Run it on output from Apertium translation into Catalan and fix 5 mistakes. [[/Fix using LanguageTool|Read more]]... || <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Add/correct one structural transfer rule to an existing language pair || Add or correct a structural transfer rule to an existing language pair and test that it works. [[/Add transfer rule|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Juanpabl]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 lexical selection rules for a language pair already set up with lexical selection || Add 10 lexical selection rules to improve the lexical selection quality of a pair and test them to ensure that they work. [[/Add lexical-select rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Japerez]] [[User:Firespeaker]] [[User:Raveesh]](more mentors welcome) <br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair to use lexical selection and write 5 rules || First set up a language pair to use the new lexical selection module (this will involve changing configure scripts, makefile and [[modes]] file). Then write 5 lexical selection rules. [[/Setup and add lexical selection|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]] [[User:Fulup|Fulup]] [[User:pankajksharma]] (more mentors welcome) <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 constraint grammar rules to repair part-of-speech tagging errors || Find some tagging errors and write 10 constraint grammar rules to fix the errors. [[/Add constraint-grammar rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fulup|Fulup]] (more mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair such that it uses constraint grammar for part-of-speech tagging || Find a language pair that does not yet use constraint grammar, and set it up to use constraint grammar. After doing this, find some tagging errors and write five rules for resolving them. [[/Setup constraint grammar for a pair|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Compare Apertium with another MT system and improve it || This tasks aims at improving an Apertium language pair when a web-accessible system exists for it in the 'net. Particularly good if the system is (approximately) rule-based such as [http://www.lucysoftware.com/english/machine-translation/lucy-lt-kwik-translator-/ Lucy], [http://www.reverso.net/text_translation.aspx?lang=EN Reverso], [http://www.systransoft.com/free-online-translation Systran] or [http://www.freetranslation.com/ SDL Free Translation]: (1) Install the Apertium language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Collect a corpus of text (newspaper, wikipedia) Segment it in sentences (using e.g., libsegment-java or a similar processor and a [https://en.wikipedia.org/wiki/Segmentation_Rules_eXchange SRX] segmentation rule file borrowed from e.g. OmegaT) and put each sentence in a line. Run the corpus through Apertium and through the other system Select those sentences where both outputs are very similar (e.g, 90% coincident). Decide which one is better. If the other language is better than Apertium, think of what modification could be done for Apertium to produce the same output, and make 3 such modifications.|| [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} What's difficult about this language pair? || For a language pair that is not in trunk or staging such that you know well the two languages involved, write a document describing the main problems that Apertium developers would encounter when developing that language pair (for that, you need to know very well how Apertium works). Note that there may be two such documents, one for A→B and the other for B→A Prepare it in your user space in the Apertium wiki.It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Write a contrastive grammar || Using a grammar book/resource document 10 ways in which the grammar of two languages differ, with no fewer than 3 examples of each difference. Put it on the wiki under Language1_and_Language2/Contrastive_grammar. See [[Farsi_and_English/Pending_tests]] for an example of a contrastive grammar that a previous GCI student made. || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Hand annotate 250 words of running text. || Use [[apertium annotatrix]] to hand-annotate 250 words of running text from Wikipedia for a language of your choice. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || The most frequent Romance-to-Romance transfer rules || Study the .t1x transfer rule files of Romance language pairs and distill 5-10 common rules that are common to all of them, perhaps by rewriting them into some equivalent form || [[User:Mlforcada]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Tag and align Macedonian--Bulgarian corpus || Take a Macedonian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-mk-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Bulgarian inflections || Write a program to extract Bulgarian inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Bulgarian_nouns Category:Bulgarian nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair by allowing for alternative translations || Improve the quality of a language pair by (a) detecting 5 cases where the (only) translation provided by the bilingual dictionary is not adequate in a given context, (b) adding the lexical selection module to the language, and (c) writing effective lexical selection rules to exploit that context to select a better translation || [[User:Francis Tyers]] [[User:Mlforcada]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up (X)HTML formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up (X)HTML formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up wordprocessor (ODT, RTF) formatting || Sometimes, an Apertium language pair takes a valid ODT or RTF source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of ODT or RTF files for testing purposes. Make sure they are opened using LibreOffice/OpenOffice.org (4) translate the valid files with the language pair (5) check if the translated files are also valid ODT or RTF files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up wordprocessor (ODT, RTF) formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up wordprocessor formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Start a language pair involving Interlingua || Start a new language pair involving [https://en.wikipedia.org/wiki/Interlingua Interlingua] using the [http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO Apertium new language HOWTO]. Interlingua is the second most used "artificial" language, after Esperanto). As Interlingua is basically a Romance language, you can use a Romance language as the other language, and Romance-language dictionaries rules may be easily adapted. Include at least 50 very frequent words (including some grammatical words) and at least one noun--phrase transfer rule in the ia→X direction. || [[User:Mlforcada]] [[User:Youssefsan|Youssefsan]] (will reach out also to the interlingua community) <br />
|-<br />
| {{sc|research}} || Document materials for a language not yet on our wiki || Document materials for a language not yet on our wiki. This should look something like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free, etc., as well as some scholarly articles regarding the language, especially if about computational resources. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Corpus Collection for Sindhi language. || (1) Collect a Sindhi monolingual corpus and tag it (some sentences). (2)Look for parallel/comparable corpusof Sindhi & (English or Hindi or Urdu or other) , clean it and mention it document materials wiki page for Sindhi. || [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Macedonian corpus || Take a Albanian--Macedonian corpus, for example SETimes, tag it using the [[apertium-sq-mk]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Serbo-Croatian corpus || Take a Albanian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-sq-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Bulgarian corpus || Take a Albanian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sq-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--English corpus || Take a Albanian--English corpus, for example SETimes, tag it using the [[apertium-sq-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--Serbo-Croatian corpus || Take a Macedonian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-mk-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--English corpus || Take a Macedonian--English corpus, for example SETimes, tag it using the [[apertium-mk-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--Bulgarian corpus || Take a Serbo-Croatian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sh-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--English corpus || Take a Serbo-Croatian--English corpus, for example SETimes, tag it using the [[apertium-sh-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Bulgarian--English corpus || Take a Bulgarian--English corpus, for example SETimes, tag it using the [[apertium-bg-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek noun inflections || Write a program to extract Greek inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_nouns Category:Greek nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek verb inflections || Write a program to extract Greek inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_verbs Category:Greek verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek adjective inflections || Write a program to extract Greek inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_adjectives Category:Greek adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to convert the Giellatekno Faroese CG to Apertium tags || Write a program which converts the tagset of the Giellatekno Faroese constraint grammar. || [[User:Francis Tyers]] [[User:Trondtr]]<br />
|-<br />
| {{sc|quality}} || Import nouns from azmorph into apertium-aze || Take the nouns (excluding proper nouns) from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adjectives from azmorph into apertium-aze || Take the adjectives from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adverbs from azmorph into apertium-aze || Take the adverbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import verbs from azmorph into apertium-aze || Take the verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import misc categories from azmorph into apertium-aze || Take the categories that aren't nouns, proper nouns, adjectives, adverbs, and verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--English sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain text files (eng.FILENAME.txt) and (kaz.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--Russian sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain files (eng.FILENAME.txt) and (rus.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|}<br />
<br />
=== Data mangling ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion || Write a conversion module for an existing dictionary for apertium-dixtools. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion in python || Write a conversion module for an existing free bilingual dictionary to [[lttoolbox]] format using Python. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese noun inflections || Write a program to extract Faroese inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_nouns Category:Faroese nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese verb inflections || Write a program to extract Faroese inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_verbs Category:Faroese verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese adjective inflections || Write a program to extract Faroese inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_adjectives Category:Faroese adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || scraper for all wiktionary pages in a category || a script that returns urls of all pages in a wiktionary category recursively (e.g., http://en.wiktionary.org/wiki/Category:Bashkir_nouns should also include pages from http://en.wiktionary.org/wiki/Category:Bashkir_proper_nouns ) || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Bilingual dictionary from word alignments script || Write a script which takes [[GIZA++]] alignments and outputs a <code>.dix</code> file. The script should be able to reduce the number of tags, and also have some heuristics to test if a word is too-frequently aligned. || [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Scraper for free forum content || Write a script to scrape/capture all freely available content for a forum or forum category and dump it to an xml corpus file or text file. || [[User:Firespeaker]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} scrape a freely available dictionary using tesseract || Use tesseract to scrape a freely available dictionary that exists in some image format (pdf, djvu, etc.). Be sure to scrape grammatical information if available, as well stems (e.g., some dictionaries might provide entries like АЗНА·Х, where the stem is азна), and all possible translations. Ideally it should dump into something resembling [[bidix]] format, but if there's no grammatical information and no way to guess at it, some flat machine-readable format is fine. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Write an aligner for UDHR || Write a script to align two translations of the [[UDHR]] (final destination: trunk/apertium-tools/udhr_aligner.py). It should take two UDHR translations and output a tmx file with one article per entry. It should use the xml formatted UDHRs available from [http://www.unicode.org/udhr/index_by_name.html http://www.unicode.org/udhr/index_by_name.html] as input and output the aligned texts in tmx format. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || script to generate dictionary from IDS data || Write a script that takes two lg_id codes, scrapes those dictionaries at [http://lingweb.eva.mpg.de/ids/ IDS], matches entries, and outputs a dictionary in [[bidix]] format || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert rapidwords dictionary to apertium bidix || Write a script (preferably in python3) that converts an arbitrary dictionary from [http://rapidwords.net/reports rapidwords.net] to apertium bidix format. Keep in mind that rapidwords dictionaries may contain more than two languages, while apertium dictionaries may only contain two languages, so the script should take an argument allowing the user to specify which languages to extract. Ideally, there should also be an argument that lists the languages available. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert simple bilingual dictionary entries to lttoolbox-style entries || Write a simple converter for lists of bilingual dictionary entries (one per line) so that one can use the shorthand notation <code>perro.n.m:dog.n</code> to generate lttoolbox-style entries of the form <code><e><l>perro<s n="n"/><s n="m"/></l><r>dog<s n="n"/></r></e></code>. You may start from [https://github.com/jimregan/internostrum-to-lttoolbox] if you wish. || [[User:mlforcada]]<br />
|}<br />
<br />
=== Misc ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|documentation}} || Installation instructions for missing GNU/Linux distributions or versions || Adapt installation instructions for a particular GNU/Linux or Unix-like distribution if the existing instructions in the Apertium wiki do not work or have bugs of some kind. Prepare it in your user space in the Apertium wiki. It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Installing Apertium in lightweight GNU/Linux distributions || Give instructions on how to install Apertium in one of the small or lightweight GNU/Linux distributions such as [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz], so that may be used in older machines || [[User:Mlforcada]] [[User:Bech]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome) <br />
|-<br />
| {{sc|documentation}} || Video guide to installation || Prepare a screencast or video about installing Apertium; make sure it uses a format that may be viewed with Free software. When approved by your mentor, upload it to youtube, making sure that you use the HTML5 format which may be viewed by modern browsers without having to use proprietary plugins such as Adobe Flash. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Apertium in 5 slides || Write a 5-slide HTML presentation (only needing a modern browser to be viewed and ready to be effectively "karaoked" by some else in 5 minutes or less: you can prove this with a screencast) in the language in which you write more fluently, which describes Apertium, how it works, and what makes it different from other machine translation systems. || [[User:Mlforcada]] [[User:Firespeaker]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Improved "Become a language-pair developer" document || Read the document [[Become_a_language_pair_developer_for_Apertium]] and think of ways to improve it (don't do this if you have not done any of the language pair tasks). Send comments to your mentor and/or repare it in your user space in the Apertium wiki. There will be a chance to change the document later in the Apertium Wiki. || [[User:Mlforcada]] [[User:Bech]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || An entry test for Apertium || Write 20 multiple-choice questions about Apertium. Each question will give 3 options of which only one is true, so that we can build an "Apertium exam" for future GSoC/GCI/developers. Optionally, add an explanation for the correct answer. || [[User:Mlforcada]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Apertium on Windows (installer) || Make an Apertium installer for Windows; it should at least support Windows 7/8 (x86 and x86-64). Remember to check in the source to SVN and make it easily upgradeable. Adding language pairs should also not be difficult. See the current (non-functional) [[Apertium guide for Windows users]] for inspiration. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|documentation}} || Apertium on Windows (docs) || Document the new Apertium installer for Windows on the [[Apertium guide for Windows users]]. This task requires the "Apertium on Windows (installer)" to be completed || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Light Apertium bootable ISO for small machines || Using [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz] or a similar lightweight GNU/Linux, produce the minimum-possible bootable live ISO or live USB image that contains the OS, minimum editing facilities, Apertium, and a language pair of your choice. Make sure no package that is not strictly necessary for Apertium to run is included.|| [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|- <br />
| {{sc|code}} || Apertium in XLIFF workflows || Write a shell script and (if possible, using the filter definition files found in the documentation) a filter that takes an [https://en.wikipedia.org/wiki/XLIFF XLIFF] file such as the ones representing a computer-aided translation job and populates with translations of all segments that are not translated, marking them clearly as machine-translated. || [[User:Mlforcada]] [[User:Espla]] [[User:Fsanchez]] [[User:Japerez]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up (X)HTML formatting || Sometimes, an Apertium language pair takes a valid HTML/XHTML source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of HTML/XHTML files for testing purposes. Make sure they are valid using an HTML/XHTML validator (4) translate the valid files with the language pair (5) check if the translated files are also valid HTML/XHTML files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || Investigate how orthographic modes on kk.wikipedia.org are implemented || [http://kk.wikipedia.org The Kazakh-language wikipedia] has a menu at the top for selecting alphabet (Кирил, Latın, توتە - for Cyrillic-, Latin-, and Arabic-script modes). This appears to be some sort of plugin that transliterates the text on the fly. Find out what it is and how it works, and then document it somewhere on the wiki. If this has already been documented elsewhere, point a link to that, but you still should summarise in your own words what exactly it is. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a transliteration plugin for mediawiki || Write a plugin similar in functionality (and perhaps implementation) to the way the [http://kk.wikipedia.org Kazakh-language wikipedia]'s orthography changing system works. It should be able to be directed to use any arbitrary mode from an apertium mode file installed in a pre-specified path on a server.|| [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} train tesseract on a language with no available tesseract data || Train tesseract (the OCR software) on a language that it hasn't previously been trained on. We're especially interested in languages with some coverage in apertium. We can provide images of text to train on. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|research}} || using language transducers for predictive text on Android || Investigate what it would take to add some sort of plugin to existing Android predictive text / keyboard framework(s?) that would allow the use of lttoolbox (or hfst? or libvoikko stuff?) transducers to be used to predict text and/or guess swipes (in "swype" or similar). Document your findings on the apertium wiki. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || custom predictive text keyboards for Android || Research and document on apertium's wiki the steps needed to design an application for Android that could load arbitrarily defined / pre-specified keyboard layouts (e.g., say I want to make custom keyboard layouts for [[Kumyk]] and [[Guaraní]], and load either one into the same program) as well as either an lttoolbox-format transducer or a file easily generated from one that could be paired with a keyboard layout and used to predict text in that language. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} identify 75 substitutions for conversion from colloquial Finnish to book Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to come up with 75 examples of differences between colloquial Finnish and book Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} document the correspondences between the tagset used in the RNC tagged corpus and the Apertium tagset for Russian || The Apertium tagset for Russian and the RNC tagset are different, if we were able to make correspondences between them then we could compare our output against theirs. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Disambiguate 500 words of Russian text. || The objective of this task is to disambiguate by hand 500 words of text in Russian. You can find a Wikipedia article you are interested in, or you can be assigned one, you will be given the output of a morphological analyser for Russian, and your task is to select the most adequate analysis in context. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Convert 500 words of Finnish text in colloquial Finnish to literary Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to convert 500 words of text from colloquial Finnish to literary Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Research and document what it would take to migrate from svn to git || For this task, you should research and document succinctly on the [http://wiki.apertium.org/ apertium wiki] all the issues involved in moving our entire svn repository to git. It should cover issues like preserving commit histories and tags/releases, separating repositories for each module (and what constitutes a single module), how to migrate the entire codebase (including issues of timing/logistics), replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address why a problem exists and what sorts of things could be done to remedy it (with fairly specific solutions). You do not need to worry about what a full migration strategy might look like || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Come up with a potential migration strategy for apertium to move from svn to git || For this task, you should propose a hypothetical migration strategy for apertium to move from our current svn repository to a git repository and document the proposal on the [http://wiki.apertium.org/ apertium wiki]. The proposal should address the logistics and timing issues of anything that might come up in a migration of the entire codebase, including preserving commit histories and tags/releases, separating repositories for each module, replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address how to approach each problem and where on the timeline to take care of the issue. You do not need to worry about specific solutions to the various problems. || [[User:Firespeaker]]<br />
|}<br />
<br />
[[Category:Google Code-in]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in&diff=50965Task ideas for Google Code-in2014-11-17T13:57:07Z<p>Ksnmi: /* Data mangling */</p>
<hr />
<div>{{TOCD}}<br />
This is the task ideas page for [http://www.google-melange.com/gci/homepage/google/gci2014 Google Code-in], here you can find ideas on interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.<br />
<br />
The people column lists people who you should get in contact with to request further information. All tasks are 2 hours maximum estimated amount of time that would be spent on the task by an experienced developer, however:<br />
<br />
# '''this does not include time taken to [[Minimal installation from SVN|install]] / set up apertium'''.<br />
# this is the time expected to take by an experienced developer, you may find that you spend more time on the task because of the learning curve. <br />
<br />
<!--Если ты не понимаешь английский язык или предпочитаешь работать над русским языком или другими языками России, смотри: [[Task ideas for Google Code-in/Russian]]--><br />
'''Categories:'''<br />
<br />
* {{sc|code}}: Tasks related to writing or refactoring code<br />
* {{sc|documentation}}: Tasks related to creating/editing documents and helping others learn more<br />
* {{sc|research}}: Tasks related to community management, outreach/marketting, or studying problems and recommending solutions<br />
* {{sc|quality}}: Tasks related to testing and ensuring code is of high quality.<br />
* {{sc|interface}}: Tasks related to user experience research or user interface design and interaction<br />
<br />
You can find descriptions of some of the mentors here: [[List_of_Apertium_mentors]].<br />
<br />
==Task list==<br />
<br />
=== Misc tools ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || Unigram tagging mode for <code>apertium-tagger</code> || Edit the <code>apertium-tagger</code> code to allow for lexicalised unigram tagging. This would basically choose the most frequent analysis for each surface form of a word. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Data format for the unigram tagger || Come up with a binary storage format for the data used for the unigram tagger. It could be based on the existing <code>.prob</code> format. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Add tag combination back-off to unigram tagger. || Modify the unigram tagger to allow for back-off to tag sequence in the case that a given form is not found. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Prototype unigram tagger. || Write a simple unigram tagger in a language of your choice. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Training for unigram tagger || Write a program that trains a model suitable for use with the unigram tagger. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || make voikkospell understand apertium stream format input || Make voikkospell understand apertium stream format input, e.g. ^word/analysis1/analysis2$, voikkospell should only interpret the 'word' part to be spellchecked. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make voikkospell return output in apertium stream format || make voikkospell return output suggestions in apertium stream format, e.g. ^correctword$ or ^incorrectword/correct1/correct2$ || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || libvoikko support for OS X || Make a spell server for OS X's system-wide spell checker to use arbitrary languages through libvoikko. See https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/SpellCheck/Tasks/CreatingSpellServer.html#//apple_ref/doc/uid/20000770-BAJFBAAH for more information || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Ubuntu/debian || document how to set up libreoffice voikko working with a language on Ubuntu and debian || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Fedora || document how to set up libreoffice voikko working with a language on Fedora || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Windows || document how to set up libreoffice voikko working with a language on Windows || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on OS X || document how to set up libreoffice voikko working with a language on OS X || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document how to set up libenchant to work with libvoikko || Libenchant is a spellchecking wrapper. Set it up to work with libvoikko, a spellchecking backend, and document how you did it. You may want to use a spellchecking module available in apertium for testing. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || firefox/iceweasel plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] .<br />
|| [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || chrome/chromium plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] . || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || firefox/iceweasel plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || chrome/chromium plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || firefox/iceweasel plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || chrome/chromium plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|quality}} || make apertium-quality work with python3.3 on all platforms || migrate apertium-quality away from distribute to newer setup-tools so it installs correctly in more recent versions of python (known incompatible: python3.3 OS X, known compatible: MacPorts python3.2) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || Get bible aligner working (or rewrite it) || trunk/apertium-tools/bible_aligner.py - Should take two bible translations and output a tmx file with one verse per entry. There is a standard-ish plain-text bible translation format that we have bible translations in, and we have files that contain the names of verses of various languages mapped to English verse names || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || tesseract interface for apertium languages || Find out what it would take to integrate apertium or voikkospell into tesseract. Document thoroughly available options on the wiki. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|code}} || Syntax tree visualisation using GNU bison || Write a program which reads a grammar using bison, parses a sentence and outputs the syntax tree as text, or graphViz or something. Some example bison code can be found [https://svn.code.sf.net/p/apertium/svn/branches/transfer4 here]. || [[User:Francis Tyers]] [[User:Mlforcada]]<br />
|-<br />
| {{sc|code}} || make concordancer work with output of analyser || Allow [http://pastebin.com/raw.php?i=KG8ydLPZ spectie's concordancer] to accept an optional apertium mode and directory (implement via argparse). When it has these, it should run the corpus through that apertium mode and search against the resulting tags and lemmas as well as the surface forms. E.g., the form алдым might have the analysis via an apertium mode of ^алдым/алд{{tag|n><px1sg}}{{tag|nom}}/ал{{tag|v><tv}}{{tag|ifi><p1}}{{tag|sg}}, so a search for "px1sg" should bring up this word. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || convert a current transducer for a language using lexc+twol to a guesser || Figure out how to generate a guesser for a language module that uses lexc for morphotactics and twol for morphophonology (e.g., apertium-kaz). One approach to investigate would be to generate all the possible archiphoneme representations of a given form and run the lexc guesser on that. || [[User:Firespeaker]] [[User:Flammie]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in hfst || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in HFST. The script should take a language code and create a new directory with a minimal lexc file, a minimal twol file, and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in lttoolbox || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in lttoolbox. The script should take a language code and create a new directory with a minimal dix file and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium bilingual module || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium bilingual module. The script should take two language codes and create a new directory with a minimal dix file, a minimal lrx file, and minimal transfer (.t*x) files, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Write a script to explain an Apertium machine translation in terms of its parts || Write a script (preferably in python3 or bash/equivalent) that takes one text segment ''S'', applies a given Apertium system to it and to all its possible whole-word subsegments ''s'' (perhaps up to a certain maximum length) and outputs a list ''(s,t,i,j,k,l)'' of correspondences so that the result of applying Apertium to ''s'' is ''t'', ''t'' is a whole-word subsegment of ''T'', the Apertium translation of ''S'', ''i'' and ''j'' are the starting position and end position of ''s'' in ''S'' and ''k'' and ''l'' are hte starting position and the end postion of ''t'' in ''T''. The script should read ''S'', ''T'', two language codes and optionally a maximum length and generate the correspondences ''(s,t,i,j,k,l)'' one per line || [[User:mlforcada]]<br />
|}<br />
<br />
=== Website and apy ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || apertium-apy mode for geriaoueg (biltrans in context) || apertium-apy function that accepts a context (e.g., ±n ~words around word) and a position in the context of a word, gets biltrans output on entire context, and returns translation for the word || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || SSL/HTTPS for Apertium.org || The Apertium site itself is equipped with SSL. Get Piwik working on HTTPS as well. After that, default to the HTTPS site via Apache. See [http://sourceforge.net/p/apertium/tickets/41/ ticket 41] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Website translation in [[Html-tools]] (code) || Html-tools should detect when the user wants to translate a website (similar to how Google Translate does it) and switch to an interface (See "Website translation in [[Html-tools]] (interface)" task) and perform the translation. It should also make it so that new pages that the user navigates to are translated. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|interface}} || Website translation in [[Html-tools]] (interface) || Add an interface to Html-tools that shows a webpage in an <iframe> with translation options and a back button to return to text/document translation. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] crashing on iPads when copying text || Make it so that the Apertium site does not crash on iPads when copying text on any of the modes while maintaining semantic HTML. This task requires having access to an iPad. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] copying text on Windows Phone IE || Make it so that the Apertium site allows copying text on WP while maintaining semantic HTML. This task requires having access to an Windows Phone. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[APY]] API keys || Add API key support but don't overengineer it. See [http://sourceforge.net/p/apertium/tickets/31/ ticket 31] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Xavivars]] <br />
|-<br />
| {{sc|code}} || Localisation of tag attributes on [[Html-tools]] || The meta description tag isn't localized as of now since the text is an attribute. Search engines often display this as their snippet. A possible way to achieve this is using data-text="@content@description". See [http://sourceforge.net/p/apertium/tickets/29/ ticket 29] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] font issues || See [http://sourceforge.net/p/apertium/tickets/27/ ticket 27] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Detect target language || When changing the source language, the [[Html-tools]] UI will often show a bunch of greyed out buttons, and the user has to fish for possible languages in the right-hand side drop-down. This is confusing (user might think "are there no languages to translate into?") and annoying. A simple solution is to reorder the list so that all possible target languages are shown first, then the list of greyed-out languages. See [http://sourceforge.net/p/apertium/tickets/25/ ticket 25] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Maintaining order of user interactions on [[Html-tools]] || If a user clicks a new language choice while translation or detection is proceeding (AJAX callback has not yet returned), the original action will not be cancelled. Make it so that the first action is canceled and overridden by the second. See [http://sourceforge.net/p/apertium/tickets/9/ ticket 9] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Drag-n-drop file translation on [[Html-tools]] || See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || More file formats for [[APY]] || APY does not support DOC, XLS, PPT file translation that require the file being converted to the newer XML based formats through LibreOffice or equivalent and then back. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Improved file translation functionality for [[APY]] || APY needs logging and to be non-blocking for file translation. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|interface}} || Abstract the formatting for the [[Html-tools]] interface. || The Html-tools interface should be easily customisable so that people can make it look how they want. The task is to abstract the formatting and make one or more new stylesheets to change the appearance. This is basically making a way of "skinning" the interface. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|interface}} || [[Html-tools]] spell-checker interface || Add an enableable spell-checker module to the [[html-tools]] interface. Get fancy with jquery/etc. so that e.g., misspelled words are underlined in red and recommendations for each word are given in some sort of drop-down menu. Feel free to implement a dummy function for testing spelling to test the interface until the "Html-tools spell-checker code" task is complete. There is a half-done version available from last year that may just need to be cleaned up and integrated into the current html-tools code. See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[Html-tools]] spell-checker code || Add code to the [[html-tools]] interface that allows spell checking to be performed. Should send entire string, and be able to match each returned result to its appropriate input word. Should also update as new words are typed (but [https://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-tools/apertium-html-tools/assets/js/translator.js#l42 not on every keystroke]). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[libvoikko]] support for [[APY]] || Write a function for [[APY]] that checks the spelling of an input string and for each word returns whether the word is correct, and if unknown returns suggestions. Whether segmentation is done by the client or by apertium-apy will have to be figured out. You will also need to add scanning for spelling modes to the initialisation section. Try to find a sensible way to structure the requests and returned data with JSON. Add a switch to allow someone to turn off support for this (use argparse set_false). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] expanding textareas || The input textarea in the html-tools translation interface does not expand depending on the user's input even when there is significant whitespace remaining on the page. Improvements include varying the length of the textareas to fill up the viewport or expanding depending on input. Both the input and output textareas would have to maintain the same length for interface consistency. Different behavior may be desired on mobile. See [http://sourceforge.net/p/apertium/tickets/4/ ticket 4] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Performance tracking in [[APY]] || Add a way for [[APY]] to keep track of number of words in input and time between sending input to a pipeline and receiving output, for the last n (e.g., 100) requests, and write a function to return the average words per second over something<n (e.g., 10) requests. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Make [[APY]] use one lock per pipeline || Make [[APY]] use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Language variant picker in [[Html-tools]] || Displaying language variants as distinct languages in the translator language selector is awkward and repetitive. Allowing users to first select a language and then display radio buttons for choosing a variant below the relevant translation box, if relevant, provides a better user interface. See [http://sourceforge.net/p/apertium/tickets/1/ ticket 1] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Investigate how to implement HTML-translation that can deal with broken HTML || The old Apertium website had a 'surf-and-translate' feature, but it frequently broke on badly-behaved HTML. Investigate how similar web sites deal with broken HTML when rewriting the internal content of a (possible automatically generated) HTML page. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Add permalink capability for generation and analysis [[Html-tools]] || [[Html-tools]] currently has support for permalinks to various translation modes. For this task, you should add similar support for analysis and generation modes. I.e., a person should be able to simply send someone a link for e.g., the Kazakh morphological analyser. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Pair visualisations ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || fix pairviewer's 2- and 3-letter code conflation problems || [[pairviewer]] doesn't always conflate languages that have two codes. E.g. sv/swe, nb/nob, de/deu, da/dan, uk/ukr, et/est, nl/nld, he/heb, ar/ara, eus/eu are each two separate nodes, but should instead each be collapsed into one node. Figure out why this isn't happening and fix it. Also, implement an algorithm to generate 2-to-3-letter mappings for available languages based on having the identical language name in languages.json instead of loading the huge list from codes.json; try to make this as processor- and memory-efficient as possible. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || map support for pairviewer ("pairmapper") || Write a version of [[pairviewer]] that instead of connecting floating nodes, connects nodes on a map. I.e., it should plot the nodes to an interactive world map (only for languages whose coordinates are provided, in e.g. GeoJSON format), and then connect them with straight-lines (as opposed to the current curved lines). Use an open map framework, like [http://leafletjs.com leaflet], [http://polymaps.org polymaps], or [http://openlayers.org openlayers] || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || coordinates for Mongolic languages || Using the map [https://en.wikipedia.org/wiki/File:Linguistic_map_of_the_Mongolic_languages.png Linguistic map of the Mongolic languages.png], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format that can be loaded by pairmapper (or, e.g., converted to kml and loaded in google maps). The file should contain points that are a geographic "center" (locus) for where each Mongolic language on that map is spoken. Use the term "Khalkha" (iso 639-3 khk) for "Mongolisch", and find a better map for Buryat. You can use a capital city for bigger, national languages if you'd like (think Paris as a locus for French). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || draw languages as areas for pairmapper || Make a map interface that loads data (in e.g. GeoJSON or KML format) specifying areas where languages are spoken, as well as a single-point locus for the language, and displays the areas on the map (something like [http://leafletjs.com/examples/choropleth.html the way the states are displayed here]) with a node with language code (like for [[pairviewer]]) at the locus. This should be able to be integrated into pairmapper, the planned map version of pairviewer. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Tatar, Bashqort, and Chuvash || Using the maps listed here, try to define rough areas for where Tatar, Bashqort, and Chuvash are spoken. These areas should be specified in a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. Try to be fairly accurate and detailed. Maps to consult include [https://commons.wikimedia.org/wiki/File:Tatarbashkirs1989ru.PNG Tatarsbashkirs1989ru], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP] || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus Turkic languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Kumyk, Nogay, Karachay, Balkar. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for IE and Mongolic Caucasus-area languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Ossetian, Armenian, Kalmyk. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Avar, Chechen, Abkhaz, Georgian. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Kazakh || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Kazakh is spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Uzbek and Uyghur || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Uzbek and Uyghur are spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference areas Russian is spoken || Assume areas in Central Asia with any sort of measurable Russian population speak Russian. Use the following maps to create a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin: [https://commons.wikimedia.org/wiki/File:Kazakhstan_European_2012_Rus.png Kazakhstan_European_2012_Rus], [https://commons.wikimedia.org/wiki/File:Ethnicrussians1989ru.PNG Ethnicrussians1989ru], [https://commons.wikimedia.org/wiki/File:Lenguas_eslavas_orientales.PNG Lenguas_eslavas_orientales], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP]. Try to cover all the areas where Russian is spoken at least as a major language. || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || split nor into nob and nno in pairviewer || Currently in [[pairviewer]], nor is displayed as a language separately from nob and nno. However, the nor pair actually consists of both an nob and an nno component. Figure out a way for pairviewer (or pairsOut.py / get_all_lang_pairs.py) to detect this split. So instead of having swe-nor, there would be swe-nob and swe-nno displayed (connected seemlessly with other nob-* and nno-* pairs), though the paths between the nodes would each still give information about the swe-nor pair. Implement a solution, trying to make sure it's future-proof (i.e., will work with similar sorts of things in the future). || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || add support to pairviewer for regional and alternate orthograpic modes || Currently in [[pairviewer]], there is no way to detect or display modes like zh_TW. Add suppor to pairsOut.py / get_all_lang_pairs.py to detect pairs containing abbreviations like this, as well as alternate orthographic modes in pairs (e.g. uzb_Latn and uzb_Cyrl). Also, figure out a way to display these nicely in the pairviewer's front-end. Get creative. I can imagine something like zh_CN and zh_TW nodes that are in some fixed relation to zho (think Mickey Mouse configuration?). Run some ideas by your mentor and implement what's decided on. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|}<br />
<br />
=== Begiak ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || Generalise phenny/begiak git plugin || Rename the module to git (instead of github), and test it to make sure it's general enough for at least three common git services (should already be supported, but double check) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin commit info function || Add a function to get the status of a commit by reponame and name (similar to what the svn module does), and then find out why commit 6a54157b89aee88511a260a849f104ae546e3a65 in turkiccorpora resulted in the following output, and fix it: Something went wrong: dict_keys(['commits', 'user', 'canon_url', 'repository', 'truncated']) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin recent function || Find out why the recent function (begiak: recent) returns "ValueError: No JSON object could be decoded (file "/usr/lib/python3.2/json/decoder.py", line 371, in raw_decode)" for one of the repos (no permission) and find a way to fix it so it returns the status instead. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin status || Add a function that lets anyone (not just admin) get the status of the git event server. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || Document phenny/begiak git plugin || Document the module: how to use it with each service it supports, and the various ways the module can be interacted with (by administrators and anyone) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || phenny/begiak svn plugin info function || Find out why the info function ("begiak info [repo] [rev]") doesn't work and fix it. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document any phenny/begiak command that does not have information || Find a command that our IRC bot uses that is not documented, and document how it works both on the [http://wiki.apertium.org/wiki/Begiak Begiak wiki page] and in the code. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count rlx sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in rlx files and output that to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count t*x sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in all .t*x files (for language pairs) and output the sum to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to report the revision of each monolingual file || Make the awikstats module of our IRC bot ([[begiak]]) report each file's svn revision for pairs with their own monodices, e.g. [[Apertium-en-es/stats]]. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue to support nick aliases || Make the tell/ask queue function of our IRC bot ([[begiak]]) support alises for nicks, so that e.g. spectre/spectie/spectei can get tell messages regardless of which nick they were sent to. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue support deleting items from queue || Allow a user who added something to the tell/ask queue of our IRC bot ([[begiak]]) to display a list of the messages s/he has queued and delete one of them. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue split long messages || Make our IRC bot ([[begiak]])'s tell/ask function split overly long messages into multiple ones for display so as to not exceed the max IRC message length. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak blacklist for url interceptor || Modify our IRC bot ([[begiak]])'s url interceptor module so that an optional blacklist (list of url regexes?) can be provided in the config file. The point is to make it not display titles for site urls we might copy/paste a lot and/or that are known not to provide useful information. An example might be ^http(s?)://svn.code.sf.net/ . For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak relevant wiki module handle urls for wikis || Make our IRC bot ([[begiak]])'s url interceptor check whether a url is a link to a known mediawiki site (wikipedia, wiktionary, apertium wiki) and redirect to the appropriate module. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak apertium wiki module search capability || Have our IRC bot ([[begiak]])'s awik plugin search the apertium wiki and return top hit if a page isn't found (like the wikipedia plugin). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak wiki modules tell result || Make a function for our IRC bot ([[begiak]]) that allows someone to point another user to a wiki page (apertium wiki or wikipedia), and have it give them the results (e.g. for mentors to point students to resources). It could be an extra function on the .wik and .awik modules. Make sure it allows for all wiki modes in those modules (e.g., .wik.ru) and is intuitive to use. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}} || find content that phenny/begiak wiki modules don't do a good job with || Identify at least 10 pages or sections on Wikipedia or the apertium wiki that the respective [[begiak]] module doesn't return good output for. These may include content where there's immediately a subsection, content where the first thing is a table or infobox, or content where the first . doesn't end the sentence. Document generalisable scenarios about what the preferred behaviour would be. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || write a mailing list reporter for phenny/begiak || Write a module for our IRC bot ([[begiak]]) that either polls mailing list archives or is triggered by email being sent to a local account. The idea is to have begiak report a short IRC-message-length summary when someone posts to one of our publicly-visible mailing lists, like apertium-stuff or apertium-turkic lists. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make phenny/begiak git and svn modules display urls || When a user asks to display revision information, have [[begiak]] (our IRC bot) include a link to information on the revision. For example, when displaying information for apertium repo revision r57171, include the url http://sourceforge.net/p/apertium/svn/57171/ , maybe even a shortened version. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || greeting function for phenny/begiak || Write a module that has [[begiak]] (our IRC bot) keep track of users, and when a user it hasn't seen before enters a channel it's monitoring, have it greet them with a custom message, such as "Welcome to #apertium, (user)! Please stick around for a while and someone will address any questions you have." You'll have to keep track of users for each channel, and you should make the message enablable by channel. Also, allow a user-specific greeting to be enabled (e.g., for the ap-vbox user). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || fix phenny/begiak seen function || When begiak is restart, the <tt>.seen</tt> command forgets when it's seen everyone. Have the module save the relevant information as needed to a database (using standard phenny methods) that gets reloaded when the module is loaded on a restart of the bot. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || improve phenny/begiak timezone data || Find a source of standard timezone abbreviations and have the time module for [[begiak]] (our IRC bot) scrape and use that data. You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add support for timezone conversion to phenny/begiak || Add timezone conversion to the time plugin for [[begiak]] (our IRC bot). It should accept a time in one timezone and a destination timezone, and convert the time, e.g. ".tz 335EST in CET" should return "835CET". For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add city name support phenny/begiak timezone plugin || Find a source that maps city names to timezone abbreviations and have the .tz command for [[begiak]] (our IRC bot) scrape and use that data (e.g., ".time Barcelona" should give the current time in CET). You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add analysis and generation modes to apertium translation begiak module || Add the ability for the apertium translation module that's part [[begiak]] (our IRC bot) to query morphological analysis and generation modes. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make begiak's version control monitoring channel specific || Our IRC bot ([[begiak]]) currently monitors a series of git and svn repositories. When a commit is made to a repository, the bot displays the commit in all channels. For this task, you should modify both of these modules (svn and git) so that repositories being monitored (listed in the config file) can be specified in a channel-specific way. However, it should default to the current behaviour—channel-specific settings should just override the global monitoring pattern. You should fork [https://github.com/jonorthwash/phenny the bot on github] to work on this task and send a pull request when you're done. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Apertium linguistic data ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the bilingual dictionary of a language pair XX-YY in the incubator by adding 50 word correspondences to it || Languages XX and YY may have rather large dictionaries but a small bilingual dictioanry. Add words to the bilingual dictionary and test that the new vocabulary works. [[/Grow bilingual|Read more]]... || [[User:Mlforcada]] <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair XX-YY by adding 50 words to its vocabulary || Add words to language pair XX-YY and test that the new vocabulary works. [[/Add words|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Xavivars]] [[User:Bech]] [[User:Jimregan|Jimregan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Japerez]] [[User:tunedal]] [[User:Juanpabl]] [[User:Youssefsan|Youssefsan]] [[User:Firespeaker]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Find translation bugs by using LanguageTool, and correct them || The LanguageTool grammar/style checker has great rule sets for Catalan. Run it on output from Apertium translation into Catalan and fix 5 mistakes. [[/Fix using LanguageTool|Read more]]... || <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Add/correct one structural transfer rule to an existing language pair || Add or correct a structural transfer rule to an existing language pair and test that it works. [[/Add transfer rule|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Juanpabl]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 lexical selection rules for a language pair already set up with lexical selection || Add 10 lexical selection rules to improve the lexical selection quality of a pair and test them to ensure that they work. [[/Add lexical-select rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Japerez]] [[User:Firespeaker]] [[User:Raveesh]](more mentors welcome) <br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair to use lexical selection and write 5 rules || First set up a language pair to use the new lexical selection module (this will involve changing configure scripts, makefile and [[modes]] file). Then write 5 lexical selection rules. [[/Setup and add lexical selection|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]] [[User:Fulup|Fulup]] [[User:pankajksharma]] (more mentors welcome) <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 constraint grammar rules to repair part-of-speech tagging errors || Find some tagging errors and write 10 constraint grammar rules to fix the errors. [[/Add constraint-grammar rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fulup|Fulup]] (more mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair such that it uses constraint grammar for part-of-speech tagging || Find a language pair that does not yet use constraint grammar, and set it up to use constraint grammar. After doing this, find some tagging errors and write five rules for resolving them. [[/Setup constraint grammar for a pair|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Compare Apertium with another MT system and improve it || This tasks aims at improving an Apertium language pair when a web-accessible system exists for it in the 'net. Particularly good if the system is (approximately) rule-based such as [http://www.lucysoftware.com/english/machine-translation/lucy-lt-kwik-translator-/ Lucy], [http://www.reverso.net/text_translation.aspx?lang=EN Reverso], [http://www.systransoft.com/free-online-translation Systran] or [http://www.freetranslation.com/ SDL Free Translation]: (1) Install the Apertium language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Collect a corpus of text (newspaper, wikipedia) Segment it in sentences (using e.g., libsegment-java or a similar processor and a [https://en.wikipedia.org/wiki/Segmentation_Rules_eXchange SRX] segmentation rule file borrowed from e.g. OmegaT) and put each sentence in a line. Run the corpus through Apertium and through the other system Select those sentences where both outputs are very similar (e.g, 90% coincident). Decide which one is better. If the other language is better than Apertium, think of what modification could be done for Apertium to produce the same output, and make 3 such modifications.|| [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} What's difficult about this language pair? || For a language pair that is not in trunk or staging such that you know well the two languages involved, write a document describing the main problems that Apertium developers would encounter when developing that language pair (for that, you need to know very well how Apertium works). Note that there may be two such documents, one for A→B and the other for B→A Prepare it in your user space in the Apertium wiki.It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Write a contrastive grammar || Using a grammar book/resource document 10 ways in which the grammar of two languages differ, with no fewer than 3 examples of each difference. Put it on the wiki under Language1_and_Language2/Contrastive_grammar. See [[Farsi_and_English/Pending_tests]] for an example of a contrastive grammar that a previous GCI student made. || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Hand annotate 250 words of running text. || Use [[apertium annotatrix]] to hand-annotate 250 words of running text from Wikipedia for a language of your choice. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || The most frequent Romance-to-Romance transfer rules || Study the .t1x transfer rule files of Romance language pairs and distill 5-10 common rules that are common to all of them, perhaps by rewriting them into some equivalent form || [[User:Mlforcada]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Tag and align Macedonian--Bulgarian corpus || Take a Macedonian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-mk-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Bulgarian inflections || Write a program to extract Bulgarian inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Bulgarian_nouns Category:Bulgarian nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair by allowing for alternative translations || Improve the quality of a language pair by (a) detecting 5 cases where the (only) translation provided by the bilingual dictionary is not adequate in a given context, (b) adding the lexical selection module to the language, and (c) writing effective lexical selection rules to exploit that context to select a better translation || [[User:Francis Tyers]] [[User:Mlforcada]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up (X)HTML formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up (X)HTML formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up wordprocessor (ODT, RTF) formatting || Sometimes, an Apertium language pair takes a valid ODT or RTF source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of ODT or RTF files for testing purposes. Make sure they are opened using LibreOffice/OpenOffice.org (4) translate the valid files with the language pair (5) check if the translated files are also valid ODT or RTF files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up wordprocessor (ODT, RTF) formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up wordprocessor formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Start a language pair involving Interlingua || Start a new language pair involving [https://en.wikipedia.org/wiki/Interlingua Interlingua] using the [http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO Apertium new language HOWTO]. Interlingua is the second most used "artificial" language, after Esperanto). As Interlingua is basically a Romance language, you can use a Romance language as the other language, and Romance-language dictionaries rules may be easily adapted. Include at least 50 very frequent words (including some grammatical words) and at least one noun--phrase transfer rule in the ia→X direction. || [[User:Mlforcada]] [[User:Youssefsan|Youssefsan]] (will reach out also to the interlingua community) <br />
|-<br />
| {{sc|research}} || Document materials for a language not yet on our wiki || Document materials for a language not yet on our wiki. This should look something like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free, etc., as well as some scholarly articles regarding the language, especially if about computational resources. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Corpus Collection for Sindhi language. || (1) Collect a Sindhi monolingual corpus and tag it (some sentences). (2)Look for parallel/comparable corpusof Sindhi & (English or Hindi or Urdu or other) , clean it and mention it document materials wiki page for Sindhi. || [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Macedonian corpus || Take a Albanian--Macedonian corpus, for example SETimes, tag it using the [[apertium-sq-mk]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Serbo-Croatian corpus || Take a Albanian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-sq-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Bulgarian corpus || Take a Albanian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sq-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--English corpus || Take a Albanian--English corpus, for example SETimes, tag it using the [[apertium-sq-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--Serbo-Croatian corpus || Take a Macedonian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-mk-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--English corpus || Take a Macedonian--English corpus, for example SETimes, tag it using the [[apertium-mk-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--Bulgarian corpus || Take a Serbo-Croatian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sh-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--English corpus || Take a Serbo-Croatian--English corpus, for example SETimes, tag it using the [[apertium-sh-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Bulgarian--English corpus || Take a Bulgarian--English corpus, for example SETimes, tag it using the [[apertium-bg-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek noun inflections || Write a program to extract Greek inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_nouns Category:Greek nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek verb inflections || Write a program to extract Greek inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_verbs Category:Greek verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek adjective inflections || Write a program to extract Greek inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_adjectives Category:Greek adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to convert the Giellatekno Faroese CG to Apertium tags || Write a program which converts the tagset of the Giellatekno Faroese constraint grammar. || [[User:Francis Tyers]] [[User:Trondtr]]<br />
|-<br />
| {{sc|quality}} || Import nouns from azmorph into apertium-aze || Take the nouns (excluding proper nouns) from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adjectives from azmorph into apertium-aze || Take the adjectives from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adverbs from azmorph into apertium-aze || Take the adverbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import verbs from azmorph into apertium-aze || Take the verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import misc categories from azmorph into apertium-aze || Take the categories that aren't nouns, proper nouns, adjectives, adverbs, and verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--English sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain text files (eng.FILENAME.txt) and (kaz.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--Russian sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain files (eng.FILENAME.txt) and (rus.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|}<br />
<br />
=== Data mangling ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion || Write a conversion module for an existing dictionary for apertium-dixtools. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion in python || Write a conversion module for an existing free bilingual dictionary to [[lttoolbox]] format using Python. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese noun inflections || Write a program to extract Faroese inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_nouns Category:Faroese nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese verb inflections || Write a program to extract Faroese inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_verbs Category:Faroese verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese adjective inflections || Write a program to extract Faroese inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_adjectives Category:Faroese adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || scraper for all wiktionary pages in a category || a script that returns urls of all pages in a wiktionary category recursively (e.g., http://en.wiktionary.org/wiki/Category:Bashkir_nouns should also include pages from http://en.wiktionary.org/wiki/Category:Bashkir_proper_nouns ) || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Bilingual dictionary from word alignments script || Write a script which takes [[GIZA++]] alignments and outputs a <code>.dix</code> file. The script should be able to reduce the number of tags, and also have some heuristics to test if a word is too-frequently aligned. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Scraper for free forum content || Write a script to scrape/capture all freely available content for a forum or forum category and dump it to an xml corpus file or text file. || [[User:Firespeaker]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} scrape a freely available dictionary using tesseract || Use tesseract to scrape a freely available dictionary that exists in some image format (pdf, djvu, etc.). Be sure to scrape grammatical information if available, as well stems (e.g., some dictionaries might provide entries like АЗНА·Х, where the stem is азна), and all possible translations. Ideally it should dump into something resembling [[bidix]] format, but if there's no grammatical information and no way to guess at it, some flat machine-readable format is fine. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Write an aligner for UDHR || Write a script to align two translations of the [[UDHR]] (final destination: trunk/apertium-tools/udhr_aligner.py). It should take two UDHR translations and output a tmx file with one article per entry. It should use the xml formatted UDHRs available from [http://www.unicode.org/udhr/index_by_name.html http://www.unicode.org/udhr/index_by_name.html] as input and output the aligned texts in tmx format. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || script to generate dictionary from IDS data || Write a script that takes two lg_id codes, scrapes those dictionaries at [http://lingweb.eva.mpg.de/ids/ IDS], matches entries, and outputs a dictionary in [[bidix]] format || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert rapidwords dictionary to apertium bidix || Write a script (preferably in python3) that converts an arbitrary dictionary from [http://rapidwords.net/reports rapidwords.net] to apertium bidix format. Keep in mind that rapidwords dictionaries may contain more than two languages, while apertium dictionaries may only contain two languages, so the script should take an argument allowing the user to specify which languages to extract. Ideally, there should also be an argument that lists the languages available. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert simple bilingual dictionary entries to lttoolbox-style entries || Write a simple converter for lists of bilingual dictionary entries (one per line) so that one can use the shorthand notation <code>perro.n.m:dog.n</code> to generate lttoolbox-style entries of the form <code><e><l>perro<s n="n"/><s n="m"/></l><r>dog<s n="n"/></r></e></code>. You may start from [https://github.com/jimregan/internostrum-to-lttoolbox] if you wish. || [[User:mlforcada]]<br />
|}<br />
<br />
=== Misc ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|documentation}} || Installation instructions for missing GNU/Linux distributions or versions || Adapt installation instructions for a particular GNU/Linux or Unix-like distribution if the existing instructions in the Apertium wiki do not work or have bugs of some kind. Prepare it in your user space in the Apertium wiki. It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Installing Apertium in lightweight GNU/Linux distributions || Give instructions on how to install Apertium in one of the small or lightweight GNU/Linux distributions such as [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz], so that may be used in older machines || [[User:Mlforcada]] [[User:Bech]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome) <br />
|-<br />
| {{sc|documentation}} || Video guide to installation || Prepare a screencast or video about installing Apertium; make sure it uses a format that may be viewed with Free software. When approved by your mentor, upload it to youtube, making sure that you use the HTML5 format which may be viewed by modern browsers without having to use proprietary plugins such as Adobe Flash. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Apertium in 5 slides || Write a 5-slide HTML presentation (only needing a modern browser to be viewed and ready to be effectively "karaoked" by some else in 5 minutes or less: you can prove this with a screencast) in the language in which you write more fluently, which describes Apertium, how it works, and what makes it different from other machine translation systems. || [[User:Mlforcada]] [[User:Firespeaker]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Improved "Become a language-pair developer" document || Read the document [[Become_a_language_pair_developer_for_Apertium]] and think of ways to improve it (don't do this if you have not done any of the language pair tasks). Send comments to your mentor and/or repare it in your user space in the Apertium wiki. There will be a chance to change the document later in the Apertium Wiki. || [[User:Mlforcada]] [[User:Bech]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || An entry test for Apertium || Write 20 multiple-choice questions about Apertium. Each question will give 3 options of which only one is true, so that we can build an "Apertium exam" for future GSoC/GCI/developers. Optionally, add an explanation for the correct answer. || [[User:Mlforcada]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Apertium on Windows (installer) || Make an Apertium installer for Windows; it should at least support Windows 7/8 (x86 and x86-64). Remember to check in the source to SVN and make it easily upgradeable. Adding language pairs should also not be difficult. See the current (non-functional) [[Apertium guide for Windows users]] for inspiration. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|documentation}} || Apertium on Windows (docs) || Document the new Apertium installer for Windows on the [[Apertium guide for Windows users]]. This task requires the "Apertium on Windows (installer)" to be completed || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Light Apertium bootable ISO for small machines || Using [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz] or a similar lightweight GNU/Linux, produce the minimum-possible bootable live ISO or live USB image that contains the OS, minimum editing facilities, Apertium, and a language pair of your choice. Make sure no package that is not strictly necessary for Apertium to run is included.|| [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|- <br />
| {{sc|code}} || Apertium in XLIFF workflows || Write a shell script and (if possible, using the filter definition files found in the documentation) a filter that takes an [https://en.wikipedia.org/wiki/XLIFF XLIFF] file such as the ones representing a computer-aided translation job and populates with translations of all segments that are not translated, marking them clearly as machine-translated. || [[User:Mlforcada]] [[User:Espla]] [[User:Fsanchez]] [[User:Japerez]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up (X)HTML formatting || Sometimes, an Apertium language pair takes a valid HTML/XHTML source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of HTML/XHTML files for testing purposes. Make sure they are valid using an HTML/XHTML validator (4) translate the valid files with the language pair (5) check if the translated files are also valid HTML/XHTML files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || Investigate how orthographic modes on kk.wikipedia.org are implemented || [http://kk.wikipedia.org The Kazakh-language wikipedia] has a menu at the top for selecting alphabet (Кирил, Latın, توتە - for Cyrillic-, Latin-, and Arabic-script modes). This appears to be some sort of plugin that transliterates the text on the fly. Find out what it is and how it works, and then document it somewhere on the wiki. If this has already been documented elsewhere, point a link to that, but you still should summarise in your own words what exactly it is. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a transliteration plugin for mediawiki || Write a plugin similar in functionality (and perhaps implementation) to the way the [http://kk.wikipedia.org Kazakh-language wikipedia]'s orthography changing system works. It should be able to be directed to use any arbitrary mode from an apertium mode file installed in a pre-specified path on a server.|| [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} train tesseract on a language with no available tesseract data || Train tesseract (the OCR software) on a language that it hasn't previously been trained on. We're especially interested in languages with some coverage in apertium. We can provide images of text to train on. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|research}} || using language transducers for predictive text on Android || Investigate what it would take to add some sort of plugin to existing Android predictive text / keyboard framework(s?) that would allow the use of lttoolbox (or hfst? or libvoikko stuff?) transducers to be used to predict text and/or guess swipes (in "swype" or similar). Document your findings on the apertium wiki. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || custom predictive text keyboards for Android || Research and document on apertium's wiki the steps needed to design an application for Android that could load arbitrarily defined / pre-specified keyboard layouts (e.g., say I want to make custom keyboard layouts for [[Kumyk]] and [[Guaraní]], and load either one into the same program) as well as either an lttoolbox-format transducer or a file easily generated from one that could be paired with a keyboard layout and used to predict text in that language. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} identify 75 substitutions for conversion from colloquial Finnish to book Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to come up with 75 examples of differences between colloquial Finnish and book Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} document the correspondences between the tagset used in the RNC tagged corpus and the Apertium tagset for Russian || The Apertium tagset for Russian and the RNC tagset are different, if we were able to make correspondences between them then we could compare our output against theirs. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Disambiguate 500 words of Russian text. || The objective of this task is to disambiguate by hand 500 words of text in Russian. You can find a Wikipedia article you are interested in, or you can be assigned one, you will be given the output of a morphological analyser for Russian, and your task is to select the most adequate analysis in context. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Convert 500 words of Finnish text in colloquial Finnish to literary Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to convert 500 words of text from colloquial Finnish to literary Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Research and document what it would take to migrate from svn to git || For this task, you should research and document succinctly on the [http://wiki.apertium.org/ apertium wiki] all the issues involved in moving our entire svn repository to git. It should cover issues like preserving commit histories and tags/releases, separating repositories for each module (and what constitutes a single module), how to migrate the entire codebase (including issues of timing/logistics), replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address why a problem exists and what sorts of things could be done to remedy it (with fairly specific solutions). You do not need to worry about what a full migration strategy might look like || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Come up with a potential migration strategy for apertium to move from svn to git || For this task, you should propose a hypothetical migration strategy for apertium to move from our current svn repository to a git repository and document the proposal on the [http://wiki.apertium.org/ apertium wiki]. The proposal should address the logistics and timing issues of anything that might come up in a migration of the entire codebase, including preserving commit histories and tags/releases, separating repositories for each module, replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address how to approach each problem and where on the timeline to take care of the issue. You do not need to worry about specific solutions to the various problems. || [[User:Firespeaker]]<br />
|}<br />
<br />
[[Category:Google Code-in]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Task_ideas_for_Google_Code-in&diff=50964Task ideas for Google Code-in2014-11-17T13:56:06Z<p>Ksnmi: /* Data mangling */</p>
<hr />
<div>{{TOCD}}<br />
This is the task ideas page for [http://www.google-melange.com/gci/homepage/google/gci2014 Google Code-in], here you can find ideas on interesting tasks that will improve your knowledge of Apertium and help you get into the world of open-source development.<br />
<br />
The people column lists people who you should get in contact with to request further information. All tasks are 2 hours maximum estimated amount of time that would be spent on the task by an experienced developer, however:<br />
<br />
# '''this does not include time taken to [[Minimal installation from SVN|install]] / set up apertium'''.<br />
# this is the time expected to take by an experienced developer, you may find that you spend more time on the task because of the learning curve. <br />
<br />
<!--Если ты не понимаешь английский язык или предпочитаешь работать над русским языком или другими языками России, смотри: [[Task ideas for Google Code-in/Russian]]--><br />
'''Categories:'''<br />
<br />
* {{sc|code}}: Tasks related to writing or refactoring code<br />
* {{sc|documentation}}: Tasks related to creating/editing documents and helping others learn more<br />
* {{sc|research}}: Tasks related to community management, outreach/marketting, or studying problems and recommending solutions<br />
* {{sc|quality}}: Tasks related to testing and ensuring code is of high quality.<br />
* {{sc|interface}}: Tasks related to user experience research or user interface design and interaction<br />
<br />
You can find descriptions of some of the mentors here: [[List_of_Apertium_mentors]].<br />
<br />
==Task list==<br />
<br />
=== Misc tools ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || Unigram tagging mode for <code>apertium-tagger</code> || Edit the <code>apertium-tagger</code> code to allow for lexicalised unigram tagging. This would basically choose the most frequent analysis for each surface form of a word. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Data format for the unigram tagger || Come up with a binary storage format for the data used for the unigram tagger. It could be based on the existing <code>.prob</code> format. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Add tag combination back-off to unigram tagger. || Modify the unigram tagger to allow for back-off to tag sequence in the case that a given form is not found. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Prototype unigram tagger. || Write a simple unigram tagger in a language of your choice. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || Training for unigram tagger || Write a program that trains a model suitable for use with the unigram tagger. || [[User:Francis Tyers|Francis&nbsp;Tyers]]<br />
|-<br />
| {{sc|code}} || make voikkospell understand apertium stream format input || Make voikkospell understand apertium stream format input, e.g. ^word/analysis1/analysis2$, voikkospell should only interpret the 'word' part to be spellchecked. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make voikkospell return output in apertium stream format || make voikkospell return output suggestions in apertium stream format, e.g. ^correctword$ or ^incorrectword/correct1/correct2$ || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || libvoikko support for OS X || Make a spell server for OS X's system-wide spell checker to use arbitrary languages through libvoikko. See https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/SpellCheck/Tasks/CreatingSpellServer.html#//apple_ref/doc/uid/20000770-BAJFBAAH for more information || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Ubuntu/debian || document how to set up libreoffice voikko working with a language on Ubuntu and debian || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Fedora || document how to set up libreoffice voikko working with a language on Fedora || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on Windows || document how to set up libreoffice voikko working with a language on Windows || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document: setting up libreoffice voikko on OS X || document how to set up libreoffice voikko working with a language on OS X || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document how to set up libenchant to work with libvoikko || Libenchant is a spellchecking wrapper. Set it up to work with libvoikko, a spellchecking backend, and document how you did it. You may want to use a spellchecking module available in apertium for testing. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || firefox/iceweasel plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] .<br />
|| [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|interface}} || geriaoueg hover functionality || chrome/chromium plugin which, when enabled, allows one to hover over a word and get a pop-up; interface only. Should be something like [http://www.bbc.co.uk/apps/nr/vocab/cy-en/www.bbc.co.uk/newyddion/] or [http://lingro.com] . || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || firefox/iceweasel plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg language/pair selection || chrome/chromium plugin which queries apertium API for available languages and allows the user to set the language pair in preferences || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || firefox/iceweasel plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]]<br />
|-<br />
| {{sc|code}} || geriaoueg lookup code || chrome/chromium plugin which queries apertium API for a word by sending a context (±n words) and the position of the word in the context and gets translation for language pair xxx-yyy || [[User:Francis Tyers]] [[user:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|quality}} || make apertium-quality work with python3.3 on all platforms || migrate apertium-quality away from distribute to newer setup-tools so it installs correctly in more recent versions of python (known incompatible: python3.3 OS X, known compatible: MacPorts python3.2) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || Get bible aligner working (or rewrite it) || trunk/apertium-tools/bible_aligner.py - Should take two bible translations and output a tmx file with one verse per entry. There is a standard-ish plain-text bible translation format that we have bible translations in, and we have files that contain the names of verses of various languages mapped to English verse names || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || tesseract interface for apertium languages || Find out what it would take to integrate apertium or voikkospell into tesseract. Document thoroughly available options on the wiki. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|code}} || Syntax tree visualisation using GNU bison || Write a program which reads a grammar using bison, parses a sentence and outputs the syntax tree as text, or graphViz or something. Some example bison code can be found [https://svn.code.sf.net/p/apertium/svn/branches/transfer4 here]. || [[User:Francis Tyers]] [[User:Mlforcada]]<br />
|-<br />
| {{sc|code}} || make concordancer work with output of analyser || Allow [http://pastebin.com/raw.php?i=KG8ydLPZ spectie's concordancer] to accept an optional apertium mode and directory (implement via argparse). When it has these, it should run the corpus through that apertium mode and search against the resulting tags and lemmas as well as the surface forms. E.g., the form алдым might have the analysis via an apertium mode of ^алдым/алд{{tag|n><px1sg}}{{tag|nom}}/ал{{tag|v><tv}}{{tag|ifi><p1}}{{tag|sg}}, so a search for "px1sg" should bring up this word. || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || convert a current transducer for a language using lexc+twol to a guesser || Figure out how to generate a guesser for a language module that uses lexc for morphotactics and twol for morphophonology (e.g., apertium-kaz). One approach to investigate would be to generate all the possible archiphoneme representations of a given form and run the lexc guesser on that. || [[User:Firespeaker]] [[User:Flammie]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in hfst || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in HFST. The script should take a language code and create a new directory with a minimal lexc file, a minimal twol file, and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium module in lttoolbox || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium language module written in lttoolbox. The script should take a language code and create a new directory with a minimal dix file and a minimal rlx file, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || script to bootstrap an apertium bilingual module || Write a script (preferably in python3 or bash/equivalent) that creates a vanilla apertium bilingual module. The script should take two language codes and create a new directory with a minimal dix file, a minimal lrx file, and minimal transfer (.t*x) files, and have the minimum resources otherwise to compile and run (autoconfig stuff; autogen.sh; modes.xml; README, AUTHORS, COPYING files; etc). || [[User:Firespeaker]], [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Write a script to explain an Apertium machine translation in terms of its parts || Write a script (preferably in python3 or bash/equivalent) that takes one text segment ''S'', applies a given Apertium system to it and to all its possible whole-word subsegments ''s'' (perhaps up to a certain maximum length) and outputs a list ''(s,t,i,j,k,l)'' of correspondences so that the result of applying Apertium to ''s'' is ''t'', ''t'' is a whole-word subsegment of ''T'', the Apertium translation of ''S'', ''i'' and ''j'' are the starting position and end position of ''s'' in ''S'' and ''k'' and ''l'' are hte starting position and the end postion of ''t'' in ''T''. The script should read ''S'', ''T'', two language codes and optionally a maximum length and generate the correspondences ''(s,t,i,j,k,l)'' one per line || [[User:mlforcada]]<br />
|}<br />
<br />
=== Website and apy ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || apertium-apy mode for geriaoueg (biltrans in context) || apertium-apy function that accepts a context (e.g., ±n ~words around word) and a position in the context of a word, gets biltrans output on entire context, and returns translation for the word || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || SSL/HTTPS for Apertium.org || The Apertium site itself is equipped with SSL. Get Piwik working on HTTPS as well. After that, default to the HTTPS site via Apache. See [http://sourceforge.net/p/apertium/tickets/41/ ticket 41] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Website translation in [[Html-tools]] (code) || Html-tools should detect when the user wants to translate a website (similar to how Google Translate does it) and switch to an interface (See "Website translation in [[Html-tools]] (interface)" task) and perform the translation. It should also make it so that new pages that the user navigates to are translated. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|interface}} || Website translation in [[Html-tools]] (interface) || Add an interface to Html-tools that shows a webpage in an <iframe> with translation options and a back button to return to text/document translation. See [http://sourceforge.net/p/apertium/tickets/50/ ticket 50] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] crashing on iPads when copying text || Make it so that the Apertium site does not crash on iPads when copying text on any of the modes while maintaining semantic HTML. This task requires having access to an iPad. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Fix [[Html-tools]] copying text on Windows Phone IE || Make it so that the Apertium site allows copying text on WP while maintaining semantic HTML. This task requires having access to an Windows Phone. See [http://sourceforge.net/p/apertium/tickets/42 ticket 42] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[APY]] API keys || Add API key support but don't overengineer it. See [http://sourceforge.net/p/apertium/tickets/31/ ticket 31] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Xavivars]] <br />
|-<br />
| {{sc|code}} || Localisation of tag attributes on [[Html-tools]] || The meta description tag isn't localized as of now since the text is an attribute. Search engines often display this as their snippet. A possible way to achieve this is using data-text="@content@description". See [http://sourceforge.net/p/apertium/tickets/29/ ticket 29] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] font issues || See [http://sourceforge.net/p/apertium/tickets/27/ ticket 27] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Detect target language || When changing the source language, the [[Html-tools]] UI will often show a bunch of greyed out buttons, and the user has to fish for possible languages in the right-hand side drop-down. This is confusing (user might think "are there no languages to translate into?") and annoying. A simple solution is to reorder the list so that all possible target languages are shown first, then the list of greyed-out languages. See [http://sourceforge.net/p/apertium/tickets/25/ ticket 25] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Maintaining order of user interactions on [[Html-tools]] || If a user clicks a new language choice while translation or detection is proceeding (AJAX callback has not yet returned), the original action will not be cancelled. Make it so that the first action is canceled and overridden by the second. See [http://sourceforge.net/p/apertium/tickets/9/ ticket 9] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || Drag-n-drop file translation on [[Html-tools]] || See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || More file formats for [[APY]] || APY does not support DOC, XLS, PPT file translation that require the file being converted to the newer XML based formats through LibreOffice or equivalent and then back. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Improved file translation functionality for [[APY]] || APY needs logging and to be non-blocking for file translation. See [http://sourceforge.net/p/apertium/tickets/7/ ticket 7] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|interface}} || Abstract the formatting for the [[Html-tools]] interface. || The Html-tools interface should be easily customisable so that people can make it look how they want. The task is to abstract the formatting and make one or more new stylesheets to change the appearance. This is basically making a way of "skinning" the interface. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|interface}} || [[Html-tools]] spell-checker interface || Add an enableable spell-checker module to the [[html-tools]] interface. Get fancy with jquery/etc. so that e.g., misspelled words are underlined in red and recommendations for each word are given in some sort of drop-down menu. Feel free to implement a dummy function for testing spelling to test the interface until the "Html-tools spell-checker code" task is complete. There is a half-done version available from last year that may just need to be cleaned up and integrated into the current html-tools code. See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[Html-tools]] spell-checker code || Add code to the [[html-tools]] interface that allows spell checking to be performed. Should send entire string, and be able to match each returned result to its appropriate input word. Should also update as new words are typed (but [https://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-tools/apertium-html-tools/assets/js/translator.js#l42 not on every keystroke]). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || [[libvoikko]] support for [[APY]] || Write a function for [[APY]] that checks the spelling of an input string and for each word returns whether the word is correct, and if unknown returns suggestions. Whether segmentation is done by the client or by apertium-apy will have to be figured out. You will also need to add scanning for spelling modes to the initialisation section. Try to find a sensible way to structure the requests and returned data with JSON. Add a switch to allow someone to turn off support for this (use argparse set_false). See [http://sourceforge.net/p/apertium/tickets/6/ ticket 6] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] <br />
|-<br />
| {{sc|code}} || [[Html-tools]] expanding textareas || The input textarea in the html-tools translation interface does not expand depending on the user's input even when there is significant whitespace remaining on the page. Improvements include varying the length of the textareas to fill up the viewport or expanding depending on input. Both the input and output textareas would have to maintain the same length for interface consistency. Different behavior may be desired on mobile. See [http://sourceforge.net/p/apertium/tickets/4/ ticket 4] for details and progress tracking. || [[User:Firespeaker]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Performance tracking in [[APY]] || Add a way for [[APY]] to keep track of number of words in input and time between sending input to a pipeline and receiving output, for the last n (e.g., 100) requests, and write a function to return the average words per second over something<n (e.g., 10) requests. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Make [[APY]] use one lock per pipeline || Make [[APY]] use one lock per pipeline, since we don't need to wait for mk-en just because sme-nob is running. || [[User:Firespeaker]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || Language variant picker in [[Html-tools]] || Displaying language variants as distinct languages in the translator language selector is awkward and repetitive. Allowing users to first select a language and then display radio buttons for choosing a variant below the relevant translation box, if relevant, provides a better user interface. See [http://sourceforge.net/p/apertium/tickets/1/ ticket 1] for details and progress tracking. || [[User:Firespeaker]] [[User:Unhammer]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Investigate how to implement HTML-translation that can deal with broken HTML || The old Apertium website had a 'surf-and-translate' feature, but it frequently broke on badly-behaved HTML. Investigate how similar web sites deal with broken HTML when rewriting the internal content of a (possible automatically generated) HTML page. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Add permalink capability for generation and analysis [[Html-tools]] || [[Html-tools]] currently has support for permalinks to various translation modes. For this task, you should add similar support for analysis and generation modes. I.e., a person should be able to simply send someone a link for e.g., the Kazakh morphological analyser. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Pair visualisations ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || fix pairviewer's 2- and 3-letter code conflation problems || [[pairviewer]] doesn't always conflate languages that have two codes. E.g. sv/swe, nb/nob, de/deu, da/dan, uk/ukr, et/est, nl/nld, he/heb, ar/ara, eus/eu are each two separate nodes, but should instead each be collapsed into one node. Figure out why this isn't happening and fix it. Also, implement an algorithm to generate 2-to-3-letter mappings for available languages based on having the identical language name in languages.json instead of loading the huge list from codes.json; try to make this as processor- and memory-efficient as possible. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || map support for pairviewer ("pairmapper") || Write a version of [[pairviewer]] that instead of connecting floating nodes, connects nodes on a map. I.e., it should plot the nodes to an interactive world map (only for languages whose coordinates are provided, in e.g. GeoJSON format), and then connect them with straight-lines (as opposed to the current curved lines). Use an open map framework, like [http://leafletjs.com leaflet], [http://polymaps.org polymaps], or [http://openlayers.org openlayers] || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || coordinates for Mongolic languages || Using the map [https://en.wikipedia.org/wiki/File:Linguistic_map_of_the_Mongolic_languages.png Linguistic map of the Mongolic languages.png], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format that can be loaded by pairmapper (or, e.g., converted to kml and loaded in google maps). The file should contain points that are a geographic "center" (locus) for where each Mongolic language on that map is spoken. Use the term "Khalkha" (iso 639-3 khk) for "Mongolisch", and find a better map for Buryat. You can use a capital city for bigger, national languages if you'd like (think Paris as a locus for French). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || draw languages as areas for pairmapper || Make a map interface that loads data (in e.g. GeoJSON or KML format) specifying areas where languages are spoken, as well as a single-point locus for the language, and displays the areas on the map (something like [http://leafletjs.com/examples/choropleth.html the way the states are displayed here]) with a node with language code (like for [[pairviewer]]) at the locus. This should be able to be integrated into pairmapper, the planned map version of pairviewer. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Tatar, Bashqort, and Chuvash || Using the maps listed here, try to define rough areas for where Tatar, Bashqort, and Chuvash are spoken. These areas should be specified in a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. Try to be fairly accurate and detailed. Maps to consult include [https://commons.wikimedia.org/wiki/File:Tatarbashkirs1989ru.PNG Tatarsbashkirs1989ru], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP] || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus Turkic languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Kumyk, Nogay, Karachay, Balkar. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for IE and Mongolic Caucasus-area languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Ossetian, Armenian, Kalmyk. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for North Caucasus languages || Using the map [https://commons.wikimedia.org/wiki/File:Caucasus-ethnic_en.svg Caucasus-ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the area(s) the following languages are spoken in: Avar, Chechen, Abkhaz, Georgian. There should be a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Kazakh || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Kazakh is spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference language areas for Central Asian languages: Uzbek and Uyghur || Using the map [https://commons.wikimedia.org/wiki/File:Central_Asia_Ethnic_en.svg Central_Asia_Ethnic_en.svg], write a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin. The file should contain specifications for the areas Uzbek and Uyghur are spoken in, with a certain level of detail (e.g., don't just make a shape matching Kazakhstan for Kazakh) and accuracy (i.e., don't just put a square over Kazakhstan and call it the area for Kazakh). || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || georeference areas Russian is spoken || Assume areas in Central Asia with any sort of measurable Russian population speak Russian. Use the following maps to create a file in [https://en.wikipedia.org/wiki/GeoJSON GeoJSON] (or similar) format for use by pairmapper's languages-as-areas plugin: [https://commons.wikimedia.org/wiki/File:Kazakhstan_European_2012_Rus.png Kazakhstan_European_2012_Rus], [https://commons.wikimedia.org/wiki/File:Ethnicrussians1989ru.PNG Ethnicrussians1989ru], [https://commons.wikimedia.org/wiki/File:Lenguas_eslavas_orientales.PNG Lenguas_eslavas_orientales], [https://commons.wikimedia.org/wiki/File:NarodaCCCP.jpg NarodaCCCP]. Try to cover all the areas where Russian is spoken at least as a major language. || [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || split nor into nob and nno in pairviewer || Currently in [[pairviewer]], nor is displayed as a language separately from nob and nno. However, the nor pair actually consists of both an nob and an nno component. Figure out a way for pairviewer (or pairsOut.py / get_all_lang_pairs.py) to detect this split. So instead of having swe-nor, there would be swe-nob and swe-nno displayed (connected seemlessly with other nob-* and nno-* pairs), though the paths between the nodes would each still give information about the swe-nor pair. Implement a solution, trying to make sure it's future-proof (i.e., will work with similar sorts of things in the future). || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}}, {{sc|code}} || add support to pairviewer for regional and alternate orthograpic modes || Currently in [[pairviewer]], there is no way to detect or display modes like zh_TW. Add suppor to pairsOut.py / get_all_lang_pairs.py to detect pairs containing abbreviations like this, as well as alternate orthographic modes in pairs (e.g. uzb_Latn and uzb_Cyrl). Also, figure out a way to display these nicely in the pairviewer's front-end. Get creative. I can imagine something like zh_CN and zh_TW nodes that are in some fixed relation to zho (think Mickey Mouse configuration?). Run some ideas by your mentor and implement what's decided on. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|}<br />
<br />
=== Begiak ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|quality}} || Generalise phenny/begiak git plugin || Rename the module to git (instead of github), and test it to make sure it's general enough for at least three common git services (should already be supported, but double check) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin commit info function || Add a function to get the status of a commit by reponame and name (similar to what the svn module does), and then find out why commit 6a54157b89aee88511a260a849f104ae546e3a65 in turkiccorpora resulted in the following output, and fix it: Something went wrong: dict_keys(['commits', 'user', 'canon_url', 'repository', 'truncated']) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin recent function || Find out why the recent function (begiak: recent) returns "ValueError: No JSON object could be decoded (file "/usr/lib/python3.2/json/decoder.py", line 371, in raw_decode)" for one of the repos (no permission) and find a way to fix it so it returns the status instead. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak git plugin status || Add a function that lets anyone (not just admin) get the status of the git event server. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || Document phenny/begiak git plugin || Document the module: how to use it with each service it supports, and the various ways the module can be interacted with (by administrators and anyone) || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || phenny/begiak svn plugin info function || Find out why the info function ("begiak info [repo] [rev]") doesn't work and fix it. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || document any phenny/begiak command that does not have information || Find a command that our IRC bot uses that is not documented, and document how it works both on the [http://wiki.apertium.org/wiki/Begiak Begiak wiki page] and in the code. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count rlx sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in rlx files and output that to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to count t*x sizes || Make the awikstats module of our IRC bot ([[begiak]]) count the number of rules in all .t*x files (for language pairs) and output the sum to the relevant statistics page. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak awikstats to report the revision of each monolingual file || Make the awikstats module of our IRC bot ([[begiak]]) report each file's svn revision for pairs with their own monodices, e.g. [[Apertium-en-es/stats]]. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue to support nick aliases || Make the tell/ask queue function of our IRC bot ([[begiak]]) support alises for nicks, so that e.g. spectre/spectie/spectei can get tell messages regardless of which nick they were sent to. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue support deleting items from queue || Allow a user who added something to the tell/ask queue of our IRC bot ([[begiak]]) to display a list of the messages s/he has queued and delete one of them. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak tell/ask queue split long messages || Make our IRC bot ([[begiak]])'s tell/ask function split overly long messages into multiple ones for display so as to not exceed the max IRC message length. This will require you to fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak blacklist for url interceptor || Modify our IRC bot ([[begiak]])'s url interceptor module so that an optional blacklist (list of url regexes?) can be provided in the config file. The point is to make it not display titles for site urls we might copy/paste a lot and/or that are known not to provide useful information. An example might be ^http(s?)://svn.code.sf.net/ . For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak relevant wiki module handle urls for wikis || Make our IRC bot ([[begiak]])'s url interceptor check whether a url is a link to a known mediawiki site (wikipedia, wiktionary, apertium wiki) and redirect to the appropriate module. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak apertium wiki module search capability || Have our IRC bot ([[begiak]])'s awik plugin search the apertium wiki and return top hit if a page isn't found (like the wikipedia plugin). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || phenny/begiak wiki modules tell result || Make a function for our IRC bot ([[begiak]]) that allows someone to point another user to a wiki page (apertium wiki or wikipedia), and have it give them the results (e.g. for mentors to point students to resources). It could be an extra function on the .wik and .awik modules. Make sure it allows for all wiki modes in those modules (e.g., .wik.ru) and is intuitive to use. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|quality}} || find content that phenny/begiak wiki modules don't do a good job with || Identify at least 10 pages or sections on Wikipedia or the apertium wiki that the respective [[begiak]] module doesn't return good output for. These may include content where there's immediately a subsection, content where the first thing is a table or infobox, or content where the first . doesn't end the sentence. Document generalisable scenarios about what the preferred behaviour would be. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || write a mailing list reporter for phenny/begiak || Write a module for our IRC bot ([[begiak]]) that either polls mailing list archives or is triggered by email being sent to a local account. The idea is to have begiak report a short IRC-message-length summary when someone posts to one of our publicly-visible mailing lists, like apertium-stuff or apertium-turkic lists. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make phenny/begiak git and svn modules display urls || When a user asks to display revision information, have [[begiak]] (our IRC bot) include a link to information on the revision. For example, when displaying information for apertium repo revision r57171, include the url http://sourceforge.net/p/apertium/svn/57171/ , maybe even a shortened version. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || greeting function for phenny/begiak || Write a module that has [[begiak]] (our IRC bot) keep track of users, and when a user it hasn't seen before enters a channel it's monitoring, have it greet them with a custom message, such as "Welcome to #apertium, (user)! Please stick around for a while and someone will address any questions you have." You'll have to keep track of users for each channel, and you should make the message enablable by channel. Also, allow a user-specific greeting to be enabled (e.g., for the ap-vbox user). For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || fix phenny/begiak seen function || When begiak is restart, the <tt>.seen</tt> command forgets when it's seen everyone. Have the module save the relevant information as needed to a database (using standard phenny methods) that gets reloaded when the module is loaded on a restart of the bot. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || improve phenny/begiak timezone data || Find a source of standard timezone abbreviations and have the time module for [[begiak]] (our IRC bot) scrape and use that data. You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add support for timezone conversion to phenny/begiak || Add timezone conversion to the time plugin for [[begiak]] (our IRC bot). It should accept a time in one timezone and a destination timezone, and convert the time, e.g. ".tz 335EST in CET" should return "835CET". For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add city name support phenny/begiak timezone plugin || Find a source that maps city names to timezone abbreviations and have the .tz command for [[begiak]] (our IRC bot) scrape and use that data (e.g., ".time Barcelona" should give the current time in CET). You might want to model the scraper and storage after the .iso639 db scraper. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || add analysis and generation modes to apertium translation begiak module || Add the ability for the apertium translation module that's part [[begiak]] (our IRC bot) to query morphological analysis and generation modes. For this task, you should fork [https://github.com/jonorthwash/phenny the bot on github] and send a pull request when you're done. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || make begiak's version control monitoring channel specific || Our IRC bot ([[begiak]]) currently monitors a series of git and svn repositories. When a commit is made to a repository, the bot displays the commit in all channels. For this task, you should modify both of these modules (svn and git) so that repositories being monitored (listed in the config file) can be specified in a channel-specific way. However, it should default to the current behaviour—channel-specific settings should just override the global monitoring pattern. You should fork [https://github.com/jonorthwash/phenny the bot on github] to work on this task and send a pull request when you're done. || [[User:Firespeaker]]<br />
|}<br />
<br />
=== Apertium linguistic data ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the bilingual dictionary of a language pair XX-YY in the incubator by adding 50 word correspondences to it || Languages XX and YY may have rather large dictionaries but a small bilingual dictioanry. Add words to the bilingual dictionary and test that the new vocabulary works. [[/Grow bilingual|Read more]]... || [[User:Mlforcada]] <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair XX-YY by adding 50 words to its vocabulary || Add words to language pair XX-YY and test that the new vocabulary works. [[/Add words|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Xavivars]] [[User:Bech]] [[User:Jimregan|Jimregan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Japerez]] [[User:tunedal]] [[User:Juanpabl]] [[User:Youssefsan|Youssefsan]] [[User:Firespeaker]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Find translation bugs by using LanguageTool, and correct them || The LanguageTool grammar/style checker has great rule sets for Catalan. Run it on output from Apertium translation into Catalan and fix 5 mistakes. [[/Fix using LanguageTool|Read more]]... || <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Add/correct one structural transfer rule to an existing language pair || Add or correct a structural transfer rule to an existing language pair and test that it works. [[/Add transfer rule|Read more]]... || [[User:Mlforcada]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Nikant]] [[User:Fulup|Fulup]] [[User:Juanpabl]] [[User:Raveesh]]<br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 lexical selection rules for a language pair already set up with lexical selection || Add 10 lexical selection rules to improve the lexical selection quality of a pair and test them to ensure that they work. [[/Add lexical-select rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fsanchez]] [[User:Nikant]] [[User:Japerez]] [[User:Firespeaker]] [[User:Raveesh]](more mentors welcome) <br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair to use lexical selection and write 5 rules || First set up a language pair to use the new lexical selection module (this will involve changing configure scripts, makefile and [[modes]] file). Then write 5 lexical selection rules. [[/Setup and add lexical selection|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]] [[User:Fulup|Fulup]] [[User:pankajksharma]] (more mentors welcome) <br />
|-<br />
| {{sc|code}}, {{sc|quality}} || {{sc|multi}} Write 10 constraint grammar rules to repair part-of-speech tagging errors || Find some tagging errors and write 10 constraint grammar rules to fix the errors. [[/Add constraint-grammar rules|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:ilnar.salimzyan]] [[User:Unhammer]] [[User:Fulup|Fulup]] (more mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Set up a language pair such that it uses constraint grammar for part-of-speech tagging || Find a language pair that does not yet use constraint grammar, and set it up to use constraint grammar. After doing this, find some tagging errors and write five rules for resolving them. [[/Setup constraint grammar for a pair|Read more]]... || [[User:Mlforcada]], [[User:Francis Tyers]] [[User:Unhammer]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Compare Apertium with another MT system and improve it || This tasks aims at improving an Apertium language pair when a web-accessible system exists for it in the 'net. Particularly good if the system is (approximately) rule-based such as [http://www.lucysoftware.com/english/machine-translation/lucy-lt-kwik-translator-/ Lucy], [http://www.reverso.net/text_translation.aspx?lang=EN Reverso], [http://www.systransoft.com/free-online-translation Systran] or [http://www.freetranslation.com/ SDL Free Translation]: (1) Install the Apertium language pair, ideally such that the source language is a language you know (L₂) and the target language a language you use every day (L₁). (2) Collect a corpus of text (newspaper, wikipedia) Segment it in sentences (using e.g., libsegment-java or a similar processor and a [https://en.wikipedia.org/wiki/Segmentation_Rules_eXchange SRX] segmentation rule file borrowed from e.g. OmegaT) and put each sentence in a line. Run the corpus through Apertium and through the other system Select those sentences where both outputs are very similar (e.g, 90% coincident). Decide which one is better. If the other language is better than Apertium, think of what modification could be done for Apertium to produce the same output, and make 3 such modifications.|| [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} What's difficult about this language pair? || For a language pair that is not in trunk or staging such that you know well the two languages involved, write a document describing the main problems that Apertium developers would encounter when developing that language pair (for that, you need to know very well how Apertium works). Note that there may be two such documents, one for A→B and the other for B→A Prepare it in your user space in the Apertium wiki.It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Jimregan|Jimregan]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Write a contrastive grammar || Using a grammar book/resource document 10 ways in which the grammar of two languages differ, with no fewer than 3 examples of each difference. Put it on the wiki under Language1_and_Language2/Contrastive_grammar. See [[Farsi_and_English/Pending_tests]] for an example of a contrastive grammar that a previous GCI student made. || [[User:Francis Tyers]] [[User:Firespeaker]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Hand annotate 250 words of running text. || Use [[apertium annotatrix]] to hand-annotate 250 words of running text from Wikipedia for a language of your choice. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || The most frequent Romance-to-Romance transfer rules || Study the .t1x transfer rule files of Romance language pairs and distill 5-10 common rules that are common to all of them, perhaps by rewriting them into some equivalent form || [[User:Mlforcada]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Tag and align Macedonian--Bulgarian corpus || Take a Macedonian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-mk-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Bulgarian inflections || Write a program to extract Bulgarian inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Bulgarian_nouns Category:Bulgarian nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || {{sc|multi}} Improve the quality of a language pair by allowing for alternative translations || Improve the quality of a language pair by (a) detecting 5 cases where the (only) translation provided by the bilingual dictionary is not adequate in a given context, (b) adding the lexical selection module to the language, and (c) writing effective lexical selection rules to exploit that context to select a better translation || [[User:Francis Tyers]] [[User:Mlforcada]] [[User:Unhammer]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up (X)HTML formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up (X)HTML formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up wordprocessor (ODT, RTF) formatting || Sometimes, an Apertium language pair takes a valid ODT or RTF source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of ODT or RTF files for testing purposes. Make sure they are opened using LibreOffice/OpenOffice.org (4) translate the valid files with the language pair (5) check if the translated files are also valid ODT or RTF files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} {{sc|depend}} Make sure an Apertium language pair does not mess up wordprocessor (ODT, RTF) formatting || (Depends on someone having performed the task 'Examples of files where an Apertium language pair messes up wordprocessor formatting' above). The task: (1) run the file through Apertium try to identify where the tags are broken or lost: this is most likely to happen in a structural transfer step; try to identify the rule where the label is broken or lost (2) repair the rule: a conservative strategy is to make sure that all superblanks (<b pos="..."/>) are output and are in the same order as in the source file. This may involve introducing new simple blanks (<b/>) and advancing the output of the superblanks coming from the source. (3) test again (4) Submit a patch to your mentor (or commit it if you have already gained developer access) || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Start a language pair involving Interlingua || Start a new language pair involving [https://en.wikipedia.org/wiki/Interlingua Interlingua] using the [http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO Apertium new language HOWTO]. Interlingua is the second most used "artificial" language, after Esperanto). As Interlingua is basically a Romance language, you can use a Romance language as the other language, and Romance-language dictionaries rules may be easily adapted. Include at least 50 very frequent words (including some grammatical words) and at least one noun--phrase transfer rule in the ia→X direction. || [[User:Mlforcada]] [[User:Youssefsan|Youssefsan]] (will reach out also to the interlingua community) <br />
|-<br />
| {{sc|research}} || Document materials for a language not yet on our wiki || Document materials for a language not yet on our wiki. This should look something like the page on [[Aromanian]]—i.e., all available dictionaries, grammars, corpora, machine translators, etc., print or digital, where available, whether Free, etc., as well as some scholarly articles regarding the language, especially if about computational resources. || [[User:Firespeaker]] [[User:Francis Tyers]] [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Corpus Collection for Sindhi language. || (1) Collect a Sindhi monolingual corpus and tag it (some sentences). (2)Look for parallel/comparable corpusof Sindhi & (English or Hindi or Urdu or other) , clean it and mention it document materials wiki page for Sindhi. || [[User:Raveesh]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Macedonian corpus || Take a Albanian--Macedonian corpus, for example SETimes, tag it using the [[apertium-sq-mk]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Serbo-Croatian corpus || Take a Albanian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-sq-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--Bulgarian corpus || Take a Albanian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sq-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Albanian--English corpus || Take a Albanian--English corpus, for example SETimes, tag it using the [[apertium-sq-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--Serbo-Croatian corpus || Take a Macedonian--Serbo-Croatian corpus, for example SETimes, tag it using the [[apertium-mk-sh]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Macedonian--English corpus || Take a Macedonian--English corpus, for example SETimes, tag it using the [[apertium-mk-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--Bulgarian corpus || Take a Serbo-Croatian--Bulgarian corpus, for example SETimes, tag it using the [[apertium-sh-bg]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Serbo-Croatian--English corpus || Take a Serbo-Croatian--English corpus, for example SETimes, tag it using the [[apertium-sh-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Tag and align Bulgarian--English corpus || Take a Bulgarian--English corpus, for example SETimes, tag it using the [[apertium-bg-en]] pair, and word-align it using GIZA++. || [[User:Francis Tyers]] [[User:Sereni]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek noun inflections || Write a program to extract Greek inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_nouns Category:Greek nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek verb inflections || Write a program to extract Greek inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_verbs Category:Greek verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Greek adjective inflections || Write a program to extract Greek inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Greek_adjectives Category:Greek adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to convert the Giellatekno Faroese CG to Apertium tags || Write a program which converts the tagset of the Giellatekno Faroese constraint grammar. || [[User:Francis Tyers]] [[User:Trondtr]]<br />
|-<br />
| {{sc|quality}} || Import nouns from azmorph into apertium-aze || Take the nouns (excluding proper nouns) from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adjectives from azmorph into apertium-aze || Take the adjectives from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import adverbs from azmorph into apertium-aze || Take the adverbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import verbs from azmorph into apertium-aze || Take the verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|quality}} || Import misc categories from azmorph into apertium-aze || Take the categories that aren't nouns, proper nouns, adjectives, adverbs, and verbs from [https://svn.code.sf.net/p/apertium/svn/branches/azmorph https://svn.code.sf.net/p/apertium/svn/branches/azmorph] and put them into [[lexc]] format in [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze https://svn.code.sf.net/p/apertium/svn/incubator/apertium-aze]. || [[User:Firespeaker]] [[User:Francis Tyers]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--English sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain text files (eng.FILENAME.txt) and (kaz.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || Build a clean Kazakh--Russian sentence-aligned bilingual corpus for testing purposes using official information from Kazakh websites (minimum 50 bilingual sentences). || Download and align the Kazakh and English version of the same page, divide them in sentences, and build two plain files (eng.FILENAME.txt) and (rus.FILENAME.txt) with one sentence per line so that they correspond to each other. || [[User:mlforcada]] [[User:Sereni]]<br />
|}<br />
<br />
=== Data mangling ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion || Write a conversion module for an existing dictionary for apertium-dixtools. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Dictionary conversion in python || Write a conversion module for an existing free bilingual dictionary to [[lttoolbox]] format using Python. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese noun inflections || Write a program to extract Faroese inflection information for nouns from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_nouns Category:Faroese nouns] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese verb inflections || Write a program to extract Faroese inflection information for verbs from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_verbs Category:Faroese verbs] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Write a program to extract Faroese adjective inflections || Write a program to extract Faroese inflection information for adjectives from Wiktionary, see [https://en.wiktionary.org/wiki/Category:Faroese_adjectives Category:Faroese adjectives] || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || scraper for all wiktionary pages in a category || a script that returns urls of all pages in a wiktionary category recursively (e.g., http://en.wiktionary.org/wiki/Category:Bashkir_nouns should also include pages from http://en.wiktionary.org/wiki/Category:Bashkir_proper_nouns ) || [[User:Firespeaker]] [[User:Francis Tyers]] || [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Bilingual dictionary from word alignments script || Write a script which takes [[GIZA++]] alignments and outputs a <code>.dix</code> file. The script should be able to reduce the number of tags, and also have some heuristics to test if a word is too-frequently aligned. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || {{sc|multi}} Scraper for free forum content || Write a script to scrape/capture all freely available content for a forum or forum category and dump it to an xml corpus file or text file. || [[User:Firespeaker]] || [[User:Ksnmi]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} scrape a freely available dictionary using tesseract || Use tesseract to scrape a freely available dictionary that exists in some image format (pdf, djvu, etc.). Be sure to scrape grammatical information if available, as well stems (e.g., some dictionaries might provide entries like АЗНА·Х, where the stem is азна), and all possible translations. Ideally it should dump into something resembling [[bidix]] format, but if there's no grammatical information and no way to guess at it, some flat machine-readable format is fine. || [[User:Firespeaker]] [[User:Francis Tyers]] || [[User:Ksnmi]]<br />
|-<br />
| {{sc|code}} || Write an aligner for UDHR || Write a script to align two translations of the [[UDHR]] (final destination: trunk/apertium-tools/udhr_aligner.py). It should take two UDHR translations and output a tmx file with one article per entry. It should use the xml formatted UDHRs available from [http://www.unicode.org/udhr/index_by_name.html http://www.unicode.org/udhr/index_by_name.html] as input and output the aligned texts in tmx format. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || script to generate dictionary from IDS data || Write a script that takes two lg_id codes, scrapes those dictionaries at [http://lingweb.eva.mpg.de/ids/ IDS], matches entries, and outputs a dictionary in [[bidix]] format || [[User:Francis Tyers]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert rapidwords dictionary to apertium bidix || Write a script (preferably in python3) that converts an arbitrary dictionary from [http://rapidwords.net/reports rapidwords.net] to apertium bidix format. Keep in mind that rapidwords dictionaries may contain more than two languages, while apertium dictionaries may only contain two languages, so the script should take an argument allowing the user to specify which languages to extract. Ideally, there should also be an argument that lists the languages available. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Script to convert simple bilingual dictionary entries to lttoolbox-style entries || Write a simple converter for lists of bilingual dictionary entries (one per line) so that one can use the shorthand notation <code>perro.n.m:dog.n</code> to generate lttoolbox-style entries of the form <code><e><l>perro<s n="n"/><s n="m"/></l><r>dog<s n="n"/></r></e></code>. You may start from [https://github.com/jimregan/internostrum-to-lttoolbox] if you wish. || [[User:mlforcada]]<br />
|}<br />
<br />
=== Misc ===<br />
{|class="wikitable sortable"<br />
! Category !! Title !! Description !! Mentors<br />
|-<br />
| {{sc|documentation}} || Installation instructions for missing GNU/Linux distributions or versions || Adapt installation instructions for a particular GNU/Linux or Unix-like distribution if the existing instructions in the Apertium wiki do not work or have bugs of some kind. Prepare it in your user space in the Apertium wiki. It may be uploaded to the main wiki when approved. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Installing Apertium in lightweight GNU/Linux distributions || Give instructions on how to install Apertium in one of the small or lightweight GNU/Linux distributions such as [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz], so that may be used in older machines || [[User:Mlforcada]] [[User:Bech]] [[User:Youssefsan|Youssefsan]] (alternative mentors welcome) <br />
|-<br />
| {{sc|documentation}} || Video guide to installation || Prepare a screencast or video about installing Apertium; make sure it uses a format that may be viewed with Free software. When approved by your mentor, upload it to youtube, making sure that you use the HTML5 format which may be viewed by modern browsers without having to use proprietary plugins such as Adobe Flash. || [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Apertium in 5 slides || Write a 5-slide HTML presentation (only needing a modern browser to be viewed and ready to be effectively "karaoked" by some else in 5 minutes or less: you can prove this with a screencast) in the language in which you write more fluently, which describes Apertium, how it works, and what makes it different from other machine translation systems. || [[User:Mlforcada]] [[User:Firespeaker]] [[User:Japerez]] (alternative mentors welcome)<br />
|-<br />
| {{sc|documentation}} || Improved "Become a language-pair developer" document || Read the document [[Become_a_language_pair_developer_for_Apertium]] and think of ways to improve it (don't do this if you have not done any of the language pair tasks). Send comments to your mentor and/or repare it in your user space in the Apertium wiki. There will be a chance to change the document later in the Apertium Wiki. || [[User:Mlforcada]] [[User:Bech]] [[User:Firespeaker]]<br />
|-<br />
| {{sc|documentation}} || An entry test for Apertium || Write 20 multiple-choice questions about Apertium. Each question will give 3 options of which only one is true, so that we can build an "Apertium exam" for future GSoC/GCI/developers. Optionally, add an explanation for the correct answer. || [[User:Mlforcada]] [[User:Japerez]]<br />
|-<br />
| {{sc|code}} || Apertium on Windows (installer) || Make an Apertium installer for Windows; it should at least support Windows 7/8 (x86 and x86-64). Remember to check in the source to SVN and make it easily upgradeable. Adding language pairs should also not be difficult. See the current (non-functional) [[Apertium guide for Windows users]] for inspiration. || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|documentation}} || Apertium on Windows (docs) || Document the new Apertium installer for Windows on the [[Apertium guide for Windows users]]. This task requires the "Apertium on Windows (installer)" to be completed || [[User:Francis Tyers]]<br />
|-<br />
| {{sc|code}} || Light Apertium bootable ISO for small machines || Using [https://en.wikipedia.org/wiki/Damn_Small_Linux Damn Small Linux] or [https://en.wikipedia.org/wiki/SliTaz_GNU/Linux SliTaz] or a similar lightweight GNU/Linux, produce the minimum-possible bootable live ISO or live USB image that contains the OS, minimum editing facilities, Apertium, and a language pair of your choice. Make sure no package that is not strictly necessary for Apertium to run is included.|| [[User:Mlforcada]] [[User:Firespeaker]] (alternative mentors welcome)<br />
|- <br />
| {{sc|code}} || Apertium in XLIFF workflows || Write a shell script and (if possible, using the filter definition files found in the documentation) a filter that takes an [https://en.wikipedia.org/wiki/XLIFF XLIFF] file such as the ones representing a computer-aided translation job and populates with translations of all segments that are not translated, marking them clearly as machine-translated. || [[User:Mlforcada]] [[User:Espla]] [[User:Fsanchez]] [[User:Japerez]] (alternative mentors welcome)<br />
|- <br />
| {{sc|quality}} || Examples of minimum files where an Apertium language pair messes up (X)HTML formatting || Sometimes, an Apertium language pair takes a valid HTML/XHTML source file but delivers an invalid HTML/XHTML target file, regardless of translation quality. This can usually be blamed on incorrect handling of superblanks in structural transfer rules. The task: (1) select a language pair (2) Install Apertium locally from the Subversion repository; install the language pair; make sure that it works (3) download a series of HTML/XHTML files for testing purposes. Make sure they are valid using an HTML/XHTML validator (4) translate the valid files with the language pair (5) check if the translated files are also valid HTML/XHTML files; select those that aren't (6) find the first source of non-validity and study it, and strip the source file until you just have a small (valid!) source file with some text around the minimum possible example of problematic tags; save each such file and describe the error. || [[User:Mlforcada]] (alternative mentors welcome)<br />
|-<br />
| {{sc|research}} || Investigate how orthographic modes on kk.wikipedia.org are implemented || [http://kk.wikipedia.org The Kazakh-language wikipedia] has a menu at the top for selecting alphabet (Кирил, Latın, توتە - for Cyrillic-, Latin-, and Arabic-script modes). This appears to be some sort of plugin that transliterates the text on the fly. Find out what it is and how it works, and then document it somewhere on the wiki. If this has already been documented elsewhere, point a link to that, but you still should summarise in your own words what exactly it is. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|code}} || Write a transliteration plugin for mediawiki || Write a plugin similar in functionality (and perhaps implementation) to the way the [http://kk.wikipedia.org Kazakh-language wikipedia]'s orthography changing system works. It should be able to be directed to use any arbitrary mode from an apertium mode file installed in a pre-specified path on a server.|| [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} train tesseract on a language with no available tesseract data || Train tesseract (the OCR software) on a language that it hasn't previously been trained on. We're especially interested in languages with some coverage in apertium. We can provide images of text to train on. || [[User:Firespeaker]] <br />
|-<br />
| {{sc|research}} || using language transducers for predictive text on Android || Investigate what it would take to add some sort of plugin to existing Android predictive text / keyboard framework(s?) that would allow the use of lttoolbox (or hfst? or libvoikko stuff?) transducers to be used to predict text and/or guess swipes (in "swype" or similar). Document your findings on the apertium wiki. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || custom predictive text keyboards for Android || Research and document on apertium's wiki the steps needed to design an application for Android that could load arbitrarily defined / pre-specified keyboard layouts (e.g., say I want to make custom keyboard layouts for [[Kumyk]] and [[Guaraní]], and load either one into the same program) as well as either an lttoolbox-format transducer or a file easily generated from one that could be paired with a keyboard layout and used to predict text in that language. || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} identify 75 substitutions for conversion from colloquial Finnish to book Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to come up with 75 examples of differences between colloquial Finnish and book Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|documentation}} || {{sc|multi}} document the correspondences between the tagset used in the RNC tagged corpus and the Apertium tagset for Russian || The Apertium tagset for Russian and the RNC tagset are different, if we were able to make correspondences between them then we could compare our output against theirs. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Disambiguate 500 words of Russian text. || The objective of this task is to disambiguate by hand 500 words of text in Russian. You can find a Wikipedia article you are interested in, or you can be assigned one, you will be given the output of a morphological analyser for Russian, and your task is to select the most adequate analysis in context. || [[User:Francis Tyers]] [[User:Beboppinbobby]] [[User:Sereni]]<br />
|-<br />
| {{sc|research}} || {{sc|multi}} Convert 500 words of Finnish text in colloquial Finnish to literary Finnish || Colloquial Finnish can be written and pronounced differently to book Finnish (e.g. "ei oo" = "ei ole"; "mä oon" = "minä olen"). The objective of this task is to convert 500 words of text from colloquial Finnish to literary Finnish. || [[User:Francis Tyers]] [[User:Inariksit]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Research and document what it would take to migrate from svn to git || For this task, you should research and document succinctly on the [http://wiki.apertium.org/ apertium wiki] all the issues involved in moving our entire svn repository to git. It should cover issues like preserving commit histories and tags/releases, separating repositories for each module (and what constitutes a single module), how to migrate the entire codebase (including issues of timing/logistics), replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address why a problem exists and what sorts of things could be done to remedy it (with fairly specific solutions). You do not need to worry about what a full migration strategy might look like || [[User:Firespeaker]]<br />
|-<br />
| {{sc|research}}, {{sc|documentation}} || Come up with a potential migration strategy for apertium to move from svn to git || For this task, you should propose a hypothetical migration strategy for apertium to move from our current svn repository to a git repository and document the proposal on the [http://wiki.apertium.org/ apertium wiki]. The proposal should address the logistics and timing issues of anything that might come up in a migration of the entire codebase, including preserving commit histories and tags/releases, separating repositories for each module, replacing all the links to svn on our wiki with links to git, rewriting documentation for users, and anything else you can think of. Each point should address how to approach each problem and where on the timeline to take care of the issue. You do not need to worry about specific solutions to the various problems. || [[User:Firespeaker]]<br />
|}<br />
<br />
[[Category:Google Code-in]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50705PMC election/20142014-10-20T11:08:41Z<p>Ksnmi: /* Calendar */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Final List of Candidates for President and PMC: October 19<br />
<br />
* Elections run from October 20 until October 27.<br />
<br />
* Results proclaimed: October 30.<br />
<br />
* New PMC in office: November 1<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|-<br />
| 36 || sereni || Ekaterina Ageeva<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || xavivars || Xavi Ivars || <br />
|- <br />
| 3 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| 1 || mlforcada || Mikel L. Forcada || --[[User:Mlforcada|Mlforcada]] ([[User talk:Mlforcada|talk]]) 18:05, 11 October 2014 (CEST) -<br />
|-<br />
| 2 || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|-<br />
|4 || mlforcada || Mikel L. Forcada || 07:15, 22 September 2014 (CEST)<br />
|-<br />
|5 || nordfalk || Jacob Nordfalk || 07:24, 22 September 2014 (CEST)<br />
|-<br />
|6 || unhammer || Kevin Brubeck Unhammer || 09:31, 22 September 2014 (CEST)<br />
|-<br />
|7 || sanmarf || Felipe Sánchez Martínez || 17:05, 6 October 2014 (CEST)<br />
|-<br />
|8 || bechapertium || Bernard Chardonneau || 21:58, 6 October 2014 (CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50625PMC election/20142014-10-10T03:29:10Z<p>Ksnmi: /* Census */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|-<br />
| 36 || sereni || Ekaterina Ageeva<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || xavivars || Xavi Ivars || <br />
|- <br />
| 3 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|-<br />
|4 || mlforcada || Mikel L. Forcada || 07:15, 22 September 2014 (CEST)<br />
|-<br />
|5 || nordfalk || Jacob Nordfalk || 07:24, 22 September 2014 (CEST)<br />
|-<br />
|6 || unhammer || Kevin Brubeck Unhammer || 09:31, 22 September 2014 (CEST)<br />
|-<br />
|7 || sanmarf || Felipe Sánchez Martínez || 17:05, 6 October 2014 (CEST)<br />
|-<br />
|8 || bechapertium || Bernard Chardonneau || 21:58, 6 October 2014 (CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50225PMC election/20142014-09-23T02:53:08Z<p>Ksnmi: /* Census */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|-<br />
| 36 || sereni || Ekaterina Ageeva<br />
|-<br />
| 37 || eltorre || Daniel Torregrosa Rivero<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || xavivars || Xavi Ivars || <br />
|- <br />
| 3 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|-<br />
|4 || mlforcada || Mikel L. Forcada || 07:15, 22 September 2014 (CEST)<br />
|-<br />
|5 || nordfalk || Jacob Nordfalk || 07:24, 22 September 2014 (CEST)<br />
|-<br />
|6 || unhammer || Kevin Brubeck Unhammer || 09:31, 22 September 2014 (CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50222PMC election/20142014-09-22T05:26:56Z<p>Ksnmi: /* PMC */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|-<br />
| 36 || sereni || Ekaterina Ageeva<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || xavivars || Xavi Ivars || <br />
|- <br />
| 3 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|-<br />
|4 || mlforcada || Mikel L. Forcada || 07:15, 22 September 2014 (CEST)<br />
|-<br />
|5 || nordfalk || Jacob Nordfalk || 07:24, 22 September 2014 (CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50219PMC election/20142014-09-22T04:06:37Z<p>Ksnmi: /* Census */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|-<br />
| 36 || sereni || Ekaterina Ageeva<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || xavivars || Xavi Ivars || <br />
|- <br />
| 3 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50218PMC election/20142014-09-22T03:31:05Z<p>Ksnmi: /* Election board */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || xavivars || Xavi Ivars || <br />
|- <br />
| 3 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50217PMC election/20142014-09-22T03:30:28Z<p>Ksnmi: /* Election board */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|- <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 3 || xavivars || Xavi Ivars || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50216PMC election/20142014-09-22T03:30:10Z<p>Ksnmi: /* Election board */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
| 1 || ksnmi|| Akshay Minocha || <br />
|-<br />
| 2 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 3 || xavivars || Xavi Ivars || <br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50215PMC election/20142014-09-22T03:29:03Z<p>Ksnmi: /* PMC */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|-<br />
| <s>18</s> || <s>sortiz</s>|| <s>Sergio Ortiz</s> ||<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 29 || ksnmi|| Akshay Minocha || <br />
|- <br />
| 22 || xavivars || Xavi Ivars || <s>(backup)</s><br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
|1 || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
|2 || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
|3 || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50214PMC election/20142014-09-22T03:28:35Z<p>Ksnmi: /* PMC */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|-<br />
| <s>18</s> || <s>sortiz</s>|| <s>Sergio Ortiz</s> ||<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 29 || ksnmi|| Akshay Minocha || <br />
|- <br />
| 22 || xavivars || Xavi Ivars || <s>(backup)</s><br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
| - || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|-<br />
| - || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
|-<br />
| - || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50213PMC election/20142014-09-22T03:28:09Z<p>Ksnmi: /* PMC */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|-<br />
| <s>18</s> || <s>sortiz</s>|| <s>Sergio Ortiz</s> ||<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 29 || ksnmi|| Akshay Minocha || <br />
|- <br />
| 22 || xavivars || Xavi Ivars || <s>(backup)</s><br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
| - || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
| - || sortiz || Sergio Ortiz Rojas || 17:42, 10 September 2014 (CEST)<br />
| - || jezral || Tino Didriksen || 9:51, 10 September 2014(CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50212PMC election/20142014-09-22T03:24:01Z<p>Ksnmi: /* Census */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
Draft Census for 2014 - Please confirm - <br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|-<br />
| <s>18</s> || <s>sortiz</s>|| <s>Sergio Ortiz</s> ||<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 29 || ksnmi|| Akshay Minocha || <br />
|- <br />
| 22 || xavivars || Xavi Ivars || <s>(backup)</s><br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
| - || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=PMC_election/2014&diff=50211PMC election/20142014-09-22T03:22:53Z<p>Ksnmi: /* Census */</p>
<hr />
<div>== 2014 ==<br />
=== Calendar ===<br />
<br />
* Designation of an Election Board: September 8.<br />
<br />
* Election board publishes temporary census: September 22.<br />
<br />
* Election board publishes definitive census after solving amends: October 1.<br />
<br />
* Candidates come forward until: October 8.<br />
<br />
* Candidates are proclaimed: October 9.<br />
<br />
* Elections run from October 10 until October 16.<br />
<br />
* Results proclaimed: October 19.<br />
<br />
* New PMC in office: October 27.<br />
<br />
<br />
=== Census ===<br />
This is the temporary census for 2014 (based on [[PMC_election/2013|2013 election]] census)<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! SF Username !! Full name <br />
|-<br />
| 1 || artetxem || Mikel Artetxe<br />
|-<br />
| 2 || atoral || Antonio Toral<br />
|-<br />
| 3 || bechapertium || Bernard Chardonneau<br />
|-<br />
| 4 || eltorre || Daniel Torregrosa<br />
|-<br />
| 5 || espla || Miquel Esplà<br />
|-<br />
| 6 || fulupjakez || Fulup Jakez<br />
|-<br />
| 7 || g-ramirez || Gema Ramírez Sánchez<br />
|-<br />
| 8 || hectoralos || Hèctor Alòs i Font<br />
|-<br />
| 9 || japerez || Juan Antonio Pérez Ortiz<br />
|-<br />
| 10 || jernejvicic || Jernej Vicic<br />
|-<br />
| 11 || jimregan || Jimmy O Regan<br />
|-<br />
| 12 || juanpabl || Juan Pablo Martínez Cortés<br />
|-<br />
| 13 || mginesti || Mireia Ginestí<br />
|-<br />
| 14 || mlforcada || Mikel L. Forcada<br />
|-<br />
| 15 || nordfalk || Jacob Nordfalk<br />
|-<br />
| 16 || sanmarf || Felipe Sánchez Martínez<br />
|-<br />
| 17 || selimcan || Ilnar Salimzyan<br />
|-<br />
| 18 || sortiz || Sergio Ortiz<br />
|-<br />
| 19 || spectre360 || Francis Tyers<br />
|-<br />
| 20 || tunedal || Per Tunedal<br />
|-<br />
| 21 || unhammer || Kevin Brubeck Unhammer<br />
|-<br />
| 22 || xavivars || Xavi Ivars<br />
|-<br />
| 23 || tasunke || Stephen Tigner<br />
|-<br />
| 24 || vitaka || Víctor Manuel Sánchez Cartagena<br />
|-<br />
| 25 || jonorthwash || Jonathan North Washington<br />
|-<br />
| 26 || keld || Keld Simonsen<br />
|-<br />
| 27 || fpetkovski || Filip Petkovski<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma<br />
|-<br />
| 29 || ksnmi || Akshay Minocha<br />
|-<br />
| 30 || aida27 || Aida Sundetova<br />
|-<br />
| 31 || tachyonsvyd || Aboobacker MK<br />
|-<br />
| 32 || kindleton || Raveesh Motlani<br />
|- <br />
| 33 || sushain97 || Sushain K. Cherivirala <br />
|-<br />
| 34 || jezral || Tino Didriksen<br />
|-<br />
| 35 || serpis || Cristina Guillem Carbonell<br />
|}<br />
<br />
=== Election board ===<br />
<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! comment <br />
|-<br />
| <s>18</s> || <s>sortiz</s>|| <s>Sergio Ortiz</s> ||<br />
|-<br />
| 28 || pankajsharma92 || Pankaj Kumar Sharma || <br />
|- <br />
| 29 || ksnmi|| Akshay Minocha || <br />
|- <br />
| 22 || xavivars || Xavi Ivars || <s>(backup)</s><br />
|}<br />
<br />
=== Candidates === <br />
==== President ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|-<br />
| - || -|| -|| -<br />
|}<br />
<br />
==== PMC ====<br />
{| class="wikitable" border="1"<br />
|-<br />
! !! Username !! Full name !! Date <br />
|- <br />
| - || spectre360|| Francis M. Tyers|| 10:12, 10 September 2014 (CEST)<br />
|}<br />
<br />
[[Category:Project Management Committee]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Automatic_text_normalisation&diff=47785Automatic text normalisation2014-03-23T13:34:30Z<p>Ksnmi: </p>
<hr />
<div><br />
==General ideas==<br />
<br />
* Diacritic restoration<br />
* Reduplicated character reduction<br />
** How to learn language specific settings? -- e.g. in English certain consonants can double, but others cannot, same goes for vowels. Can we learn these by looking at a corpus ?<br />
<br />
==Code switching==<br />
<br />
* For the language subpart... we can actually train and keep copies of most frequently corrected words across languages and then refer to that list... <br />
** Maybe this will be too heavy for the on the run application ( needs discussion )<br />
* Is it possible to identify sub-spans of text ? e.g. <br />
** ''LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!''<br />
** '''[en''' LOL rte showin dáil in irish 4'''] [ga''' seachtan na gaeilge, an ceann comhairle'''] [en''' hasnt a scooby wots bein sed! his face is classic ha!''']'''<br />
* Ideas: You will rarely have single word spans of X-Y-X-Y-X-Y e.g. "la family está in la house." "la família está in the house." is probably a more frequent structure. <br />
** So we can probably do this to a certain extent LR in a single pass. <br />
** We probably shouldn't consider a single word code switching, but perhaps a span of 2-3+<br />
** It's like a state machine, you are in state "en", and you see something that makes you flip to state "ga", then you see another thing that makes you flip to state "en". <br />
** It could also be that at some point you are not sure, so what you should do is keep both options open, e.g. you would keep adding to en/ga.<br />
<br />
<br />
<br />
==To do list== <br />
*Feed charlifter with n-grams ( works best with a trigram model ). This would improve the diacritics at the moment<br />
*Make list of most frequently occurring non dictionary words (ga), these might be abbreviations. Check for these words<br />
*add most frequently occuring english abbreviations to the list<br />
*'''From Comments''' = tu -> tú, not tuilleadh<br />
*change some_known capitals for diff. languages<br />
*suggestions for including spelling correction<br />
**Example, Taisbeánta should be Taispeána<br />
**repetitions haha hehe can be included for this as well <br />
** thought for such a single repitition<br />
** should all replacements go through n-gram verification?<br />
** Words like CAP REM mean should stay the same. I'm over-reaching because of the trie implementation.. need to weigh down<br />
** Scope for addition of rules... vowels are not repeated<br />
** mhoiiiiiilllllll -> mhoill is the correct form.. I got mhoil.. Will have to look in the ngram model for this<br />
** only characters which get repeated are ll nn rr <br />
[[Category:Development]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Automatic_text_normalisation&diff=47784Automatic text normalisation2014-03-23T13:32:05Z<p>Ksnmi: </p>
<hr />
<div><br />
==General ideas==<br />
<br />
* Diacritic restoration<br />
* Reduplicated character reduction<br />
** How to learn language specific settings? -- e.g. in English certain consonants can double, but others cannot, same goes for vowels. Can we learn these by looking at a corpus ?<br />
<br />
==Code switching==<br />
<br />
* For the language subpart... we can actually train and keep copies of most frequently corrected words across languages and then refer to that list... <br />
** Maybe this will be too heavy for the on the run application ( needs discussion )<br />
* Is it possible to identify sub-spans of text ? e.g. <br />
** ''LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!''<br />
** '''[en''' LOL rte showin dáil in irish 4'''] [ga''' seachtan na gaeilge, an ceann comhairle'''] [en''' hasnt a scooby wots bein sed! his face is classic ha!''']'''<br />
* Ideas: You will rarely have single word spans of X-Y-X-Y-X-Y e.g. "la family está in la house." "la família está in the house." is probably a more frequent structure. <br />
** So we can probably do this to a certain extent LR in a single pass. <br />
** We probably shouldn't consider a single word code switching, but perhaps a span of 2-3+<br />
** It's like a state machine, you are in state "en", and you see something that makes you flip to state "ga", then you see another thing that makes you flip to state "en". <br />
** It could also be that at some point you are not sure, so what you should do is keep both options open, e.g. you would keep adding to en/ga.<br />
<br />
<br />
<br />
==To do list== <br />
*Feed charlifter with n-grams ( works best with a trigram model ). This would improve the diacritics at the moment<br />
*Make list of most frequently occurring non dictionary words (ga), these might be abbreviations. Check for these words<br />
*add most frequently occuring english abbreviations to the list<br />
*'''From Comments''' = tu -> tú, not tuilleadh<br />
*change some_known capitals for diff. languages<br />
*suggestions for including spelling correction<br />
**Example, Taisbeánta should be Taispeána<br />
**repetitions haha hehe can be included for this as well <br />
** thought for such a single repitition<br />
** should all replacements go through n-gram verification?<br />
** Words like CAP REM mean should stay the same. I'm over-reaching because of the trie implementation.. need to weigh down<br />
** Scope for addition of rules... vowels are not repeated<br />
** mhoiiiiiilllllll -> mhoill is the correct form.. I got mhoil.. Will have to look in the ngram model for this<br />
<br />
[[Category:Development]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Automatic_text_normalisation&diff=47783Automatic text normalisation2014-03-23T13:31:42Z<p>Ksnmi: </p>
<hr />
<div><br />
==General ideas==<br />
<br />
* Diacritic restoration<br />
* Reduplicated character reduction<br />
** How to learn language specific settings? -- e.g. in English certain consonants can double, but others cannot, same goes for vowels. Can we learn these by looking at a corpus ?<br />
<br />
==Code switching==<br />
<br />
* For the language subpart... we can actually train and keep copies of most frequently corrected words across languages and then refer to that list... <br />
** Maybe this will be too heavy for the on the run application ( needs discussion )<br />
* Is it possible to identify sub-spans of text ? e.g. <br />
** ''LOL rte showin dáil in irish 4 seachtan na gaeilge, an ceann comhairle hasnt a scooby wots bein sed! his face is classic ha!''<br />
** '''[en''' LOL rte showin dáil in irish 4'''] [ga''' seachtan na gaeilge, an ceann comhairle'''] [en''' hasnt a scooby wots bein sed! his face is classic ha!''']'''<br />
* Ideas: You will rarely have single word spans of X-Y-X-Y-X-Y e.g. "la family está in la house." "la família está in the house." is probably a more frequent structure. <br />
** So we can probably do this to a certain extent LR in a single pass. <br />
** We probably shouldn't consider a single word code switching, but perhaps a span of 2-3+<br />
** It's like a state machine, you are in state "en", and you see something that makes you flip to state "ga", then you see another thing that makes you flip to state "en". <br />
** It could also be that at some point you are not sure, so what you should do is keep both options open, e.g. you would keep adding to en/ga.<br />
<br />
==To do list== <br />
<br />
==To do list== <br />
*Feed charlifter with n-grams ( works best with a trigram model ). This would improve the diacritics at the moment<br />
*Make list of most frequently occurring non dictionary words (ga), these might be abbreviations. Check for these words<br />
*add most frequently occuring english abbreviations to the list<br />
*'''From Comments''' = tu -> tú, not tuilleadh<br />
*change some_known capitals for diff. languages<br />
*suggestions for including spelling correction<br />
**Example, Taisbeánta should be Taispeána<br />
**repetitions haha hehe can be included for this as well <br />
** thought for such a single repitition<br />
** should all replacements go through n-gram verification?<br />
** Words like CAP REM mean should stay the same. I'm over-reaching because of the trie implementation.. need to weigh down<br />
** Scope for addition of rules... vowels are not repeated<br />
** mhoiiiiiilllllll -> mhoill is the correct form.. I got mhoil.. Will have to look in the ngram model for this<br />
<br />
[[Category:Development]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Automatic_text_normalisation&diff=47778Automatic text normalisation2014-03-23T12:55:21Z<p>Ksnmi: </p>
<hr />
<div><br />
==General ideas==<br />
<br />
* Diacritic restoration<br />
* Reduplicated character reduction<br />
** How to learn language specific settings? -- e.g. in English certain consonants can double, but others cannot, same goes for vowels. Can we learn these by looking at a corpus ?<br />
<br />
==Code switching==<br />
<br />
* For the language subpart... we can actually train and keep copies of most frequently corrected words across languages and then refer to that list... <br />
** Maybe this will be too heavy for the on the run application ( needs discussion )<br />
<br />
<br />
==To do list== <br />
*Feed charlifter with n-grams ( works best with a trigram model ). This would improve the diacritics at the moment<br />
*Make list of most frequently occurring non dictionary words, these might be abbreviations.. <br />
*add most frequently occuring english abbreviations to the list</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=Automatic_text_normalisation&diff=47777Automatic text normalisation2014-03-23T12:52:57Z<p>Ksnmi: </p>
<hr />
<div><br />
==General ideas==<br />
<br />
* Diacritic restoration<br />
* Reduplicated character reduction<br />
** How to learn language specific settings? -- e.g. in English certain consonants can double, but others cannot, same goes for vowels. Can we learn these by looking at a corpus ?<br />
<br />
==Code switching==<br />
<br />
* For the language subpart... we can actually train and keep copies of most frequently corrected words across languages and then refer to that list... <br />
** Maybe this will be too heavy for the on the run application ( needs discussion )</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47619User:Ksnmi/Application2014-03-21T09:19:21Z<p>Ksnmi: /* WorkPlan */</p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English, the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the initial phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff, HCI, Carnegie Mellon University.<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A commitment of 30 hours a week would not be a problem, I might even work on some more so as to complete the weekly goals as described in the Workplan. <br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish Research Paper(s) from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
https://github.com/akshayminocha5/apertium-non-standard-input-task<br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|Deliverable #1<br />15 June<br />
|<br />
Compile the Weekly codes for 7 languages and Produce the deliverable.<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
!colspan="2" style="text-align: right"|Deliverable #2<br />19 July<br />
|<br />
Compile the Weekly codes for about 7 languages and Produce the deliverable.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
! 12 !! 11 August - 18 August<br />
|<br />
Integrating and testing support for the set of deliverables, along with information from last week.<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />18 August - 22 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other reqd. deliverables.<br />
|}<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47559User:Ksnmi/Application2014-03-21T00:00:33Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English, the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the initial phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff, HCI, Carnegie Mellon University.<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A commitment of 30 hours a week would not be a problem, I might even work on some more so as to complete the weekly goals as described in the Workplan. <br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish Research Paper(s) from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
https://github.com/akshayminocha5/apertium-non-standard-input-task<br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47469User:Ksnmi/Application2014-03-20T11:29:59Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English, the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the inital phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A commitment of 30 hours a week would not be a problem, I might even work on some more so as to complete the weekly goals as described in the Workplan.<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
https://github.com/akshayminocha5/apertium-non-standard-input-task<br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47468User:Ksnmi/Application2014-03-20T11:08:37Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English, the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the inital phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A commitment of 30 hours a week would not be a problem, I might even work on some more so as to complete the weekly goals as described in the Workplan.<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
https://github.com/akshayminocha5/apertium-non-standard-input-task<br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47463User:Ksnmi/Application2014-03-20T10:10:11Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English, the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the inital phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A commitment of 30 hours a week would not be a problem, I might even work on some more so as to complete the weekly goals as described in the Workplan.<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47461User:Ksnmi/Application2014-03-20T09:13:06Z<p>Ksnmi: /* Literature Review */</p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the inital phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A committment of 30 hours a week would not be a problem, I might even work on, some more so as to complete the weekly goals as describes in the Workplan.<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47460User:Ksnmi/Application2014-03-20T09:09:16Z<p>Ksnmi: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
**I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS. In Proceedings for ETRA 2014, Florida (Recognition of translator expertise using sequences of fixations and keystrokes, Pascual Martinez-Gomez, Akshay Minocha, Jin Huang, Michael Carl, Srinivas Bangalore, Akiko Aizawa )<br />
**I had been associated to SketchEngine as a Programmer for almost a year. - My work during the inital phase there resulted in a publication at ACL SIGWAC 2013, which was about building Quality Web Corpus from Feeds Information.<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A committment of 30 hours a week would not be a problem, I might even work on, some more so as to complete the weekly goals as describes in the Workplan.<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47459User:Ksnmi/Application2014-03-20T08:40:10Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
<br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS.<br />
<br />
**I had been associated to SketchEngine as a Programmer for almost a year.<br />
<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
<br />
**A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
*'''List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.'''<br />
<br />
**I am dedicated towards this project and am very excited about working on it. A committment of 30 hours a week would not be a problem, I might even work on, some more so as to complete the weekly goals as describes in the Workplan.<br />
<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47456User:Ksnmi/Application2014-03-20T08:23:10Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.''' - [[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
*'''List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.'''<br />
** I am a fourth year student at the International Institute of Information Technology, Hyderabad, India pursuing my Dual Degree (Btech+MS) in Computer Science and Digital Humanities. I have been inclined towards the linguistic studies even for my MS research. <br />
I have been associated with various kinds of projects, Some important one's are listed below -<br/><br />
<br />
**I'm an active member of the Translation Process Research Community from Europe and under the guidance of Michael Carl, Copenhagen Business school and Srinivas Bangalore from AT&T, USA. I have completed a project which models translators as expert or novice, based on their eye movements tracked while they are performing a translation Task, last summer at CBS.<br />
<br />
**I had been associated to SketchEngine as a Programmer for almost a year.<br />
<br />
**Was an initial contributor to the Health triangulation System, under the guidance of Dr. Jennifer Mankoff<br />
<br />
**Have worked vigorously on Corpus Linguistics and Data Mining tasks. to support other contributions.<br />
<br />
A detailed CV can be found at [[http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf This Link]]<br />
<br />
<br />
<br />
<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47455User:Ksnmi/Application2014-03-20T08:02:35Z<p>Ksnmi: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
[[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47454User:Ksnmi/Application2014-03-20T07:57:43Z<p>Ksnmi: /* Introduction */</p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. Having learned from the process in English the language independent module will be a good contribution. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
[[http://wiki.apertium.org/wiki/User:Ksnmi/Application#WorkPlan Workplan]]<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47453User:Ksnmi/Application2014-03-20T07:40:48Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
== WorkPlan ==<br />
<br />
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.<br />
My WorkPlan on the Timeline is as suggested below - <br />
<br />
{|class="wikitable"<br />
! week<br />
! dates<br />
!style="width: 25%"| goals<br />
! eval<br />
!style="width: 25%"| accomplishments<br />
!style="width: 35%"| notes<br />
|-<br />
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April<br />
|<br />
Dates 22 March - 20 April Work on English, Irish and Spanish. This along with discussion from mentors will give me an idea of the non-standard features in full, not limiting to just English. Ask for a language priority list <br />
|-<br />
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May<br />
|<br />
*Discuss the availability of the data in particular languages, start fetching tasks on sources<br />
where we have a dearth of resource and language models. <br />
*Get to know the mentors <br />
*Solve the issues related to the improvement of the pipeline and including into Apertium<br />
*See solutions for building binaries of the data being used<br />
|-<br />
! 1 !! 19 - 24 May<br />
|<br />
*language_priority_list[1]<br />
*Build Non standard data<br />
*Bootstrap for other resources<br />
*Check Corpus availability ( from the Community Bonding Period )<br />
*Implement new features solving non-standard issues<br />
*Produce results <br />
*Compare Using Apertium and other MT systems for improvement<br />
*Repeat Steps until verified<br />
|-<br />
! 2 !! 25 - 31 May<br />
|<br />
*language_priority_list[2] & language_priority_list[3]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 3 !! 1 - 7 June<br />
|<br />
*language_priority_list[4] & language_priority_list[5]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 4 !! 8 - 14 June<br />
|<br />
*language_priority_list[6] & language_priority_list[7]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 5 !! 15 - 21 June<br />
|<br />
*language_priority_list[8] & language_priority_list[9]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
!colspan="2" style="text-align: right"|midterm eval<br />23 - 27 June<br />
|<br />
|-<br />
! 6 !! 29 June - 5 July<br />
|<br />
*language_priority_list[10] & language_priority_list[11]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
<br />
|-<br />
! 7 !! 6 - 12 July<br />
|<br />
*language_priority_list[12] & language_priority_list[13]<br />
*Repeat Steps from Week1 for these languages<br />
*Improve and add non-standard->standard features to the pipeline as we go forward<br />
|-<br />
! 8 !! 13 - 19 July<br />
|<br />
*adding main monolingual languages worked on ( English,Spanish,Irish )<br />
*The one's which were worked on earlier ( in the initial phase )Improvement on these to be made and verified.<br />
|-<br />
! 9 !! 20 - 26 July<br />
|<br />
*Collecting results from all the comparison MT tasks and starting with the outline of the research paper<br />
|-<br />
! 10 !! 27 July - 2 August<br />
|<br />
*Improving/adding module efficiencies as and where necessary.<br />
*Since at the end all major issues in all the languages must have been understood.<br />
|-<br />
! 11 !! 3 - 10 August<br />
|<br />
Working on binaries to be implemented efficiently on Apertium<br />
|-<br />
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August<br />
|<br />
*Continuing work from last week<br />
*Making Documentation and other Deliverables<br />
|}<br />
<br />
<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47451User:Ksnmi/Application2014-03-20T06:23:21Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47449User:Ksnmi/Application2014-03-20T06:18:53Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
== Primary Goal ==<br />
<br />
*Build non-standard to standard support for at least 15 languages.<br />
*Language Priority list to be decided by mentor.<br />
*Integrate the whole support efficiently with Apertium <br />
*Test it with other MT systems <br />
*Publish a Research Paper from the results of this work<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
== Prototype of the Toolkit == <br />
<br />
Too see the processing of the Prototype I created Please check the Image Below <br/><br />
[[File:Prototype_Processing_Flowchart.png|800px|link=http://wiki.apertium.org/wiki/File:Prototype_Processing_Flowchart.png]]<br />
<br/><br />
<br />
The functional Prototype involving the whole processing, and right now correctly working for English is available at the github link -><br />
<LINK TO GITHUB REPO ><br />
<br />
Below is the description of the working of the Modules - <br />
<br />
=== Step.1 Language Identification ===<br />
<br />
We would like to identify the source language, for further processing, right now the prototype includes supports for English(since the resources for the same are available when I built it) If we want to extend the support we may simply put resources in the specified format. <br/> We can use langid.py which supports language identification for around 97 different languages and all the 32 languages used in apertium are listed here too ''(Lui, Marco, and Timothy Baldwin. "langid. py: An Off-the-shelf Language Identification Tool." )'' I've implemented the same.<br />
<br />
<br />
=== Step.2 Removing Emoticons ( as Regular Expressions ) ===<br />
<br />
This would solve the symbolic emoticon representations. Like :) or even a series of repetitive emoticons like :):):):)<br />
A basic regular expression that looks for emoticons in three parts -> eyes, nose and mouth.<br />
<br />
=== Step.3 Superblank addition for Hashtag and other content specific terms/symbols ===<br />
Hashtags/Other symbols and '/' are to be put in superblanks. <br/> We are using '/' as superblanks because, in later stages it is being used as a delimiter<br />
<br />
=== Step.4 Tokenize ===<br />
<br />
Tokenize the text, the aim is to tokenize the text so that we can use the information in the further steps. A basic tokenizer is implemented for the proper processing of the non-standard data in the next steps.<br />
<br />
=== Step.5 Removing Emoticons ( as per-token ) ===<br />
<br />
There are a few emoticons listed in http:web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt which were extracted from the twitter database and they are not reducible to regular expressions hence we propose to handle them on a per token basis. <br/> Emoticons are mainly language independent and hence these steps are rather consistent with different language pairs.<br />
<br />
=== Step.6 Substituting Abbreviation (CONVENTIONAL) list. ===<br />
<br />
At this moment I have replaced the most frequently occurring abbreviations whose information is with us. We use this resource as a substitution list. <br/> Since, english resources are widely available online this substitution list was easy. <br/> I had a Discussion with Francis regarding such a resource for other languages, I suggest we would make a list of a few abbreviations and then automatically come up with high frequency words appearing in non standard texts which are not in the language’s dictionary. We can ask people in the community to help us figure out whether they are really abbreviations. The number of words each person would have to check is very less and hence this will be an easy and a productive process. <br />
<br />
=== Step.7 Handling Extended Words === <br />
<br />
From this step onwards, I chose to add a list for the following steps which would contain information per token in the form of <br/><br />
<br />
^original/candidate1/candidate2/…$<br />
<br />
Now we can safely improve the quality of the extended words Words with same characters being repeated >=3 in succession would be thought of as being extended words. The solution to this module is very similar to the exercise given and pointed out earlier in 2.2<br/><br />
<br />
=== Step.8 Apostrophe correction ===<br />
<br />
If word exists both in wordlist and in a apostrophe error list then we need to do further processing to find out which one to use - Classical disambiguation example, would be <br />
<br />
she’ll vs shell | hell vs. he’ll<br />
<br />
Since results of this module will be more accurate if we include the POS information as well. So the ideal way to go about this is either HMM or CG style. <br/><br />
<br />
In the prototype made, I have used ngram(trigrams,bigrams) information to disambiguate the use. <br/><br />
<br />
if word doesn’t exist in wordlist only in the apostrophe error list, then replace it with the mapping <br/><br />
<br />
For example, <br/><br />
input -> do’nt<br/><br />
Step 1 -> dont<br/><br />
Step 2 -> dont not in wordlist but dont in apostrophe error list<br/><br />
Step 3 -> replace from mapping in apostrophe_map_error <br/> dont -> don’t<br/><br />
<br />
=== Step.9 Solving issue for Abbreviation ( NON-CONVENTIONAL ) === <br />
<br />
It was very usually noted that people have a habit of referring to long words in abbreviated form. These words are incomplete but are an originating subsequence for word which should have been present <br />
<br />
rehab -> rehabilitation<br />
<br />
Steps <br/><br />
<br />
*Build a trie for the language wordlist.<br />
*If word not in wordlist, check trie for suggestions<br />
*Use n-gram to come up with the best suggestion. <br />
<br />
For later, Module efficiency can be increased by using HMM and POS information. <br />
<br />
=== Step.10 Checking words for Capitalisation ===<br />
<br />
In this module a simple heuristic is implemented where the words and the sentence ending punctuator information (!.?) is used to correct the capitalisation of the tokens. <br />
<br />
=== Step.11 Selecting the best candidate ===<br />
<br />
At the moment, the best candidate is the last word in the token suggestion list of the input. <br />
We can use this to re-construct the sentence. <br/> Replacing the superblanks that were set prior to the processing<br />
<br />
<br />
=== Step.12 Addition of escape sequence ===<br />
<br />
Addition of the escape sequences and also according to the Apertium Stream Format.<br />
<br />
<br />
== Outline of Research Paper == <br />
<br />
*We make a test corpus of ~3000 words of non-standard text:<br/> IRC/Facebook/Twitter. <br />
*This is translated to another language<br />
*We evaluate the translation quality of: <br/><br />
**Apertium<br />
**Moses (Europarl)<br />
**Apertium + program<br />
**Moses (Europarl) + program<br />
<br />
We can repeat the task above for the language-support built by us towards the end of the project and then share our results with the other people in the community by publishing a research paper out of the whole work.<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=File:Prototype_Processing_Flowchart.png&diff=47448File:Prototype Processing Flowchart.png2014-03-20T05:42:32Z<p>Ksnmi: </p>
<hr />
<div></div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47447User:Ksnmi/Application2014-03-20T05:33:50Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
=== Spelling mistakes ===<br />
<br />
These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. The above algorithm works decently, but this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> But the results were not convincing when we used this spell corrector. <br/> <br />
You can see the results ( https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE&usp=drive_web#gid=4 ) Sheet Name - '''Test'''. Some more efforts and comparisons were made in the same sheet. You can have a look at them too. <br/><br />
<br />
I plan to include ''''hfst-ospell'''' for the languages to correct the issues on spelling correction, although it was decided to down-weigh the contribution of this module with the earlier method.<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
=== Handling Links ===<br />
<br />
Imp from the perspective of Internal Handling with respect to the Apertium translator.<br />
<br />
Hyperlinks should be treated as superblanks rather than being translated. <br/> Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br />
As machine translation on the links changes the purpose of the same. (For example, say en->es translation of the text on Apertium -<br/><br />
<br />
http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Bull<br />
<br />
The above example would re-direct us to an undesirable page.<br />
<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47446User:Ksnmi/Application2014-03-20T05:21:37Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
=== Use of content specific terms ===<br />
<br />
Such as RT (ReTweet) and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br/> Terms such as these constitute an exhaustive list. After trial and error we seemed convinced that they should be processed as ''''superblanks''''.<br />
<br />
=== Use of Emoticons === <br />
<br />
People use emoticons very frequently in posts. These have to be ignored. Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following - http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt<br />
<br />
=== Use of Repetitive or Extended Words ===<br />
<br />
This is the most commonly occurring issue in the non-standard text. Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. Resources for other languages can be put in and it would work the same way. <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as “uuuu” is given which would standardise to “you” so “uuuu”->”u”->”you” Hence abbreviation processing should always be after this step. Preferably at the end. <br/> #Punctuation repetition is not a problem for us. <br/> Since Apertium handles ‘!!!’ similar to ‘!’<br />
<br />
<br />
=== Handline repetitive expressions ( with spaces ) ===<br />
<br />
Expressions such as - <br/><br />
<br />
'''“ ha ha ha ha”'''<br />
'''“ he he he he”''' <br />
<br />
May produce errors in translation.<br />
Translating “he he” on apertium en-es will give us “él él” <br/> The solution to this is simple. After rectifying handful such expressions we can make a list for them and trim spaces in between so they can become non-functional while translation.<br />
<br />
=== Handling Hashtags === <br />
<br />
'''#ThisIsDoneNow. ''' <br/> We are seeing hashtags as expressions or terms which are trending at the moment. They may also be seen as an identification to a particular topic on the web. <br/> Earlier I wrote a heuristic script for hashtag disambiguation but at the moment, we are not doing it and processing hashtags as superblanks. <br/> </br> Things that I noticed while processing hashtags <br />
*Cases in Hashtags -><br />
**Words are separated by Capitalis <br/> For example, #ForLife -> For Life<br />
**Words are not separated by Capitals <br/> For example, #Fridayafterthenext -> Friday after the next<br />
'''Solution -'''<br />
*Hashtag disambiguation can be easilydone by any of the two ways We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> <br />
So Words in hashtags should be represented as a ‘lone sentence’. <br />
Example, “Today comes monday again, #whereismyextrasunday” -> Today comes monday again. “Where is my extra Sunday”<br />
<br />
=== Abbreviation and Acronyms (CONVENTIONAL) ===<br />
<br />
In the tweets by matching the most frequently occurring non dictionary words, I came up with the list of a few abbreviations.<br />
For English These are - http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt The solution to improve translation due to the occurrence of these is simple. When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as <br />
<br />
r->are, u->you, 2->to <br />
<br />
are also included. This list can be increased by further analysing the data.<br />
<br />
=== Abbreviations (UNCONVENTIONAL) ===<br />
<br />
Examples, <br />
rehab -> rehabilitation<br />
betw -> between<br />
<br />
These are words just in the shortened forms. <br/> A suggested solution which is implemented in the prototype can be seen. It shows the use of dictionary helping in predicting the best grammatical word to fit with the best possible chance.<br />
<br />
=== Apostrophe correction ===<br />
<br />
dont | do’nt | dont’ -> don’t<br />
shell -> she’ll | shell (disambiguation)<br />
<br />
A module for the same has been built to correct the apostrophes it was built using most commonly used apostrophe words in English (refer [http:/web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt] ) <br />
<br />
=== Diacritics restoration === <br />
<br />
Since we want to make our toolkit language independent our aim is to solve all the different kinds of problems faced. If we consider languages apart from English, diacritics errors are the most frequently occurring errors that are seen. <br/> Example, <br/><br />
¿Qué le pasó a mi discográfia de RAMMSTEIN? #MeCagoEnLaPuta <br />
discográfia -> discografía (´ in the wrong place)<br />
<br />
For diacritic restoration, charlifter ( http://code.google.com/p/charlifter-l10n/ ) can be included in the pipeline, which would restore the original text to its correct format in terms of the diacritic problem. <br />
<br />
<br />
=== Spacing and hyphen variation & optional hyphen ===<br />
<br />
Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus (either what apertium is currently using or we can come up with something real quick using the technique described in my paper) ''Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. '' <br/> For extending language support to different languages, In case the corpus is not present, we can generate some million words of corpus to help us build characteristics of the language model. <br/> With this we can use a trigram based model( or higher n-gram) or use POS information and user a more efficient HMM model to predict the most probably occurring word, by training on the reference corpus. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br />
<br />
<br />
<br />
<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
<br/><br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
<br/><br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
<br/><br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
<br/><br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47445User:Ksnmi/Application2014-03-20T04:50:57Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - [https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 Link] <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
Please Check it out on the github link -> [ https://github.com/akshayminocha5/non_standard_repitition_check GitHub-Non-Standard-reduction ]<br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47444User:Ksnmi/Application2014-03-20T04:48:40Z<p>Ksnmi: </p>
<hr />
<div>== Introduction ==<br />
<br />
This section contains some points of introduction from my side.<br />
<br />
<br />
*'''Name''' : Akshay Minocha<br />
<br />
*'''E-mail address''' : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*'''Other information that may be useful to contact you''': nick on the #apertium channel: '''''ksnmi'''''<br />
<br />
*'''Why is it you are interested in machine translation?'''<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*'''Why is it that they are interested in the Apertium project?'''<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*'''Which of the published tasks are you interested in? What do you plan to do?'''<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*'''Include a proposal, including'''<br />
**'''Reasons why Google and Apertium should sponsor it''' - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*'''And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.'''<br />
Link to the workplan<br />
<br />
<br />
== Coding Challenges ==<br />
<br />
=== Analysing the issues in non-standard data ===<br />
I created a random set of 50 non-standard tweets and analysed them individually to see what goes wrong while performing the translation task. <br/> Details of the analysis can be found on the following link - [https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 Link] <br/> In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011). <br/> '''TranslationAnalysisSheet'''<br />
=== The Extended word reduction task ''(Mailing list)'' ===<br />
At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/><br />
<br />
=== Corpus Creation ===<br />
<br />
'''Separate task on Corpus Creation for English''' -> <br />
<br />
* With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
*Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
*Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
== Non Standard features in the Text == <br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, The prototype later will describe how I plan to use the modules below. For some the order wont be important as we aim to make the whole structure regardless of the input language.<br />
<br />
== Literature Review == <br />
<br />
There are many sites <ref> http://transl8it.com/ </ref>, <ref> http://www.lingo2word.com/translate.php </ref>, <ref> http://www.dtxtrapp.com/ </ref> on the internet that offer SMS English to English translation services. However the technology behind these sites is simple and uses straight dictionary substitution, with no language model or any other approach to help them disambiguate between possible word substitutions. <ref> Raghunathan, Karthik, and Stefan Krawczyk. CS224N: Investigating SMS text normali(z)ation using statistical machine translation. Technical Report, 2009.</ref> <br/><br />
<br />
*There have been a few attempts to improve the machine translation task for non-standard data. One of the preliminary research include <ref> Jehl, Laura Elisabeth. "Machine translation for twitter." (2010). </ref> Where the comparison between the linguistic characteristics of Europarl data and Twitter data is made. The methodology suggested relies heavily on the in-domain data to improve on the quality for further steps. The Evaluation metric shows an improvement 0.57% BLEU score corressponding to the set of improvement on a set of 600 sentences. Major suggestion from this research - t hashtags, @usernames, URLs should not be treated like regular words. This was the mistake we were doing earlier and didn’t help much on the translation task. They also follow the technique of putting in xml markup in the source text to work on it like super blanks. <br/> '''Issues in this case -''' <br />
**Working on building a bi-lingual resource from in-domain data. <br />
**Other sources of non standard data don’t see to get a significant improvement <br />
**BLEU score improvement marginal<br />
<br/><br />
*This is a standard research<ref>Sproat, Richard, et al. "Normalis(z)ation of non-standard words." Computer Speech & Language 15.3 (2001): 287-333.</ref> on the Non-Standard Words, (NSW) It suggests that Non-standard words are more ambiguous with respect to ordinary words in the ways of pronunciation and interpretation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. They have generally categorized numbers, abbreviations, other markup, url’s, handled capitalisation, etc. A very interesting method on tree based abbreviation model has been suggested in the research, which can give us ideas on improving our current abbreviation model or just have another addition to it in the model. This includes suggestion for vowel Dropping, shortened words and first syllable usage. <br/> The issue with most of the research is the limitation to a particular language in this case English. They have standardized the most common points of english leaving scope for a lot of improvement. In our kind of processing at the moment we are not considering any specific markup techniques within the pipeline but this paper shows some promising work on the same which can be useful for developers, and other users who want to analyse the data in more detail. Such a convention can be added easily after conducting experiments and seeing results.<br />
<br/><br />
*This <ref> Pennell, Deana, and Yang Liu. "A Character-Level Machine Translation Approach for Normalis(z)ation of SMS Abbreviations." IJCNLP. 2011. </ref> is a completely different approach where the author tries to solve the problem by proposing a character level machine translation approach. The issue here is accuracy, they have used the Jazzy spell checker<ref> Mindaugas Idzelis. 2005. Jazzy: The java open source spell checker. </ref> as baseline and the compared it with previous such research. The issue here is the huge resource being used up in training and tuning the MT system and also, such a system would have complications being included on the run with Apertium. <br />
*This research <ref> Lopez, Adam, and Matt Post. "Beyond bitext: Five open problems in machine translation." </ref> is more idea centric, where the author says the MT research is far from complete and we face many challenges. With our project we aim to target these problems specifically. Translation of Informal text and Translation of low resource language pairs are the ones which concern Apertium and us the most. <br />
*This research <ref> Lo, Chi-kiu, and Dekai Wu. "Can informal genres be better translated by tuning on automatic semantic metrics." Proceedings of the 14th Machine Translation Summit (MTSummit-XIV) (2013). </ref> identifies the difficulties faced not only by the Translation community with the web forum data and other informal genres but also by people working on semantic role labelling, and probably many more who rely on data analytics, etc. <br/> They propose that evaluation of systems which are MEANT tuned performed significantly better than other systems tuned according to BLEU and TER. With our module the Error analysis suggested would improve on the system, because of a significant rise in the number of known words, grammar and word sense, the semantic parser being used here would perform better. <br />
*This research<ref>Pennell, Deana L., and Yang Liu. "Normalis(z)ation of informal text." Computer Speech & Language 28.1 (2014): 256-277.</ref> idea’s to the approach very similar to ours. But they have focussed mainly on the abbreviated word re-modelling and expansion, by implementing a character based translation model. <br />
*Inspired by this <ref>S. Bangalore, V. Murdock, G. Riccardi - Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system, 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002), pp. 1–7<br />
</ref> research, Srinivas Bangalore has suggested a method of bootstrapping from the data on the chat forums and other informal sources. So that we can build up abbreviation resources for a particular language. The way we can proceed with this task in Apertium, as I had suggested before was to first take in a list of few abbreviations and the use them to suggest what other more frequent words in the data might also count as abbreviations. This resource can be verified and then included for building up the system for the particular language.<br />
<br />
<br />
==References==<br />
<references/><br />
<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47442User:Ksnmi/Application2014-03-20T03:22:48Z<p>Ksnmi: </p>
<hr />
<div><br />
<br />
<br />
== Some details about myself ==<br />
*Name : Akshay Minocha<br />
<br />
*E-mail address : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*Other information that may be useful to contact you: nick on the #apertium channel: ksnmi <br />
<br />
*Why is it you are interested in machine translation?<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*Why is it that they are interested in the Apertium project?<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analysing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*Which of the published tasks are you interested in? What do you plan to do?<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*Include a proposal, including<br />
**a title,<br />
**reasons why Google and Apertium should sponsor it - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.<br />
#Draft version at the moment ( 13th March, 2014 )<br />
<br />
== Coding Task == <br />
<br />
*'''Points and my progress on the Coding Task that was posted on the Ideas page of this project''' -><br />
**A test corpus from tweets collected earlier, has been collected. Some general trends were seen in the case of non-standard input. Most frequented sample set is put on [https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 Link]<br />
**In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011), the translation by apertium and also my comment on each of the translations.<br />
<br />
<br />
== Corpus Creation == <br />
<br />
'''Separate task on Corpus Creation''' -> <br />
*I created several types of non-standard corpus for the purpose of analysis, and have taken the above set of 50 tweets from random parts of these.<br />
** With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
**Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
**Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, These are if handled in the sequence of their mention below, would create the most effective standard text -><br />
<br />
*'''Use of content specific terms''' -> <br />
**Such as RT (ReTweet) @<referral> and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br />
<br />
*'''Handling Links(Imp) ->'''<br />
**Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br/> Suggestion -> Links at the moment are not being ignored. They are marked with a <br/> *(unknown) This should be noted and corrected. As machine translation on the links <br/> changes the purpose of the same.<br/> For example, say en->es translation of <br/> http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Toro ) <br/> Current translation by Apertium -> <br/> http://en.wikipedia.org/wiki/Rojo_Bull <br/> which is incorrect. The above example would re-direct us to an undesirable page.<br />
<br />
*'''Use of Emoticons ->'''<br />
**People use emoticons very frequently in posts. These have to be ignored.<br/> Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following -> [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons most commonly used (546) ] (Already mentioned above) <br/> '''Solution''' -> <br/> If we want the expression not to be lost in translation then these can be kept as it is. Otherwise if apertium treats them as punctuators we should remove them. <br/> Since the popular one’s include characters and words as well. We WON’T be using regular expressions which would limit our reach. <br/><br />
<br />
*'''Use of Repetitive or Extended Words -> '''<br />
**This is the most commonly occurring issue in the non-standard text. <br/> Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. <br/>It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as <br/> “uuuu” is given which would standardize to “you” so “uuuu”->”u”->”you”<bt/> Hence abbreviation processing should always be after this step. Preferably at the end.<br />
**Punctuation repetition is not a problem for us. <br/> Since Apertium handles '''!!!''' similar to '''!''<br />
<br />
*'''Handling of Hashtags ->'''<br />
**'''Cases in Hashtags ->'''<br />
***Words are separated by Capitals <br/> For example, #ForLife -> For Life<br />
***Words are not separated by Capitals <br/> For example, #Fridayafterthenext <br />
**'''Solution''' - <br/> Hashtag disambiguation can be easily done by any of the two ways -> We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> It is Important to separate the words mentioned in the hashtags. Hashtags are supposed to convey the emotion or the summary of the tweet. Hence most frequent not in context to the grammatical surroundings. <br />
**So Words in hashtags should be represented as a ‘lone sentence’. <br/> Example, “Today comes monday again, #whereismyextrasunday” -> <br/> Today comes monday again. “Where is my extra Sunday”<br />
<br />
*'''Abbreviation and Acronyms ->'''<br />
**In the tweets by matching the most frequently occurring non dictionary words, I came up<br />
with the list of a few abbreviations.<br/> These are -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt English_abbreviations_list_non_standard] <br/> The solution to improve translation due to the occurrence of these is simple. <br/> When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as r->are, u->you, 2->to are also included. This list can be increased by further analysing the data. <br/><br />
<br />
*'''Spelling mistakes ->'''<br />
**These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. <br/> Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. Although this algorithm worked well, and is also implemented by the PyEnchant library on python <br/> >> d = enchant.request_dict("en_US") <br/> >> d.suggest("Helo") <br/> ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"] <br/> this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> Alternately from the large bag of words we can also probabilistically find out the most likely spelling for the word. <br />
<br />
*'''Apostrophe correction ->'''<br />
** There are some words where we can predict easily whether the apostrophe exists or not <br/> for example - theyll -> they’ll <br/> or im -> i’m <br/> but ambiguity exists in words like - > <br/> hell -> he’ll or hell ? <br/> shell -> she’ll or shell ? <br/> Here the apostrophe makes a difference in the total sense of the words as they are two completely different words. <br/> This can be improved by using the predicting mechanism discussed where the trigram probabilities of the text from the standard corpus will be compared and the results will be reported. <br/> List of apostrophe occurrences from a standard corpus collected by me earlier -> [http://web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt List of apostrophe occurrences_standard_English]<br />
<br />
*'''Spacing and hyphen variation & optional hyphen -> '''<br />
***Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus ( either what apertium is currently using or we can come up with something real quick using the technique described in my paper) -> ( Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. ) <br/> With this we can use a trigram based model( or higher n-gram) to predict the most probably occurring word. We can also train on the reference corpus to predict the word. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br/><br />
== Conclusion ==<br />
The project in effectively important since non standard text is not handled by many MT systems, and it is important because we have to go with the trend of the language used today to convey the meaning intact to a different native speaker.<br />
<br />
[[Category:GSoC 2014 Student proposals|Ksnmi]]</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi/Application&diff=47116User:Ksnmi/Application2014-03-14T18:21:48Z<p>Ksnmi: Created page with "*Name : Akshay Minocha *E-mail address : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in *Other information that may be useful to contact you: nick on the #ape..."</p>
<hr />
<div>*Name : Akshay Minocha<br />
<br />
*E-mail address : akshayminocha5@gmail.com | akshay.minocha@students.iiit.ac.in<br />
<br />
*Other information that may be useful to contact you: nick on the #apertium channel: ksnmi <br />
<br />
*Why is it you are interested in machine translation?<br />
** I'm interested in Language, and machine translation is a part of handling the language change. I have been working with understanding both theoretically as well as through building MT systems the methods involved in the Translation process. <br />
<br />
*Why is it that they are interested in the Apertium project?<br />
** This current project on "non-standard text input" has everything I love to work on. From, Informal data from Twitter/IRC/etc., handling noise removal, building FST's, analyzing data and at the end building Machine Translation systems. I believe that this approach can be standardized for many source languages with the approach I have in mind. For the time being I'm sticking to English but changing ways is easy as well as interesting. Also, the translation quality we are working on should be intact when we are giving back to the community, well at least this is an important step. This is also one of the kind of projects whose implementation will help translation on all the language pairs on apertium at the end.<br />
<br />
*Which of the published tasks are you interested in? What do you plan to do?<br />
**I initially want to start working on the English and Español as the source language since we have plenty of informal data on social media available for these languages. After completing this task we can include the pair and build a standard to improve translation quality for other languages too.<br />
<br />
*Include a proposal, including<br />
**a title,<br />
**reasons why Google and Apertium should sponsor it - I'd love to work on the current project with the set of mentors. This project is important because the MT community on an open level should also welcome the change in the use of language, in the form of the popular non standard text, by the people. This will extend our reach to several people and practically increase the efficiency of the translation task no doubt.<br />
<br />
*And a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.<br />
#Draft version at the moment ( 13th March, 2014 )<br />
<br />
*'''Points and my progress on the Coding Task that was posted on the Ideas page of this project''' -><br />
**A test corpus from tweets collected earlier, has been collected. Some general trends were seen in the case of non-standard input. Most frequented sample set is put on [https://docs.google.com/spreadsheet/ccc?key=0ApJ82JmDw6DHdDBad1ZXay1LZDhQckpxcXZmQTl1VVE#gid=2 Link]<br />
**In the above link you will find details of the authenticity of the tweets ( collected for an earlier project hence the year 2011), the translation by apertium and also my comment on each of the translations.<br />
<br />
'''Separate task on Corpus Creation''' -> <br />
*I created several types of non-standard corpus for the purpose of analysis, and have taken the above set of 50 tweets from random parts of these.<br />
** With special symbols. The number of tweets were high and the list of emoticons from this was considerable. Ended up finding around 545 most frequently used emoticons ( list of emoticons from the twitter dataset can be found here [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons_NON_Standard] <br/> 'Number of Posts'' -> 475,179 <br/> ''Link'' -> [https://www.dropbox.com/s/lg3uizuefw978tr/emoticon_tweets Emoticon_dataset ]<br />
**Abbreviations are the words which are not in the dictionary but which are used on social platforms specially like Twitter where the users face a crunch in the limit of the characters. <br/> Around 100 most Common abbreviations from tweets collected over a period of time are listed in the following link -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt Abbreviations_english ] <br/>''Number of Posts'' -> 94,290 <br/> ''Link'' -> [https://www.dropbox.com/s/3cvvw7oewvvm0gs/abbreviations_non_standard_english abbreviations_english_dataset]<br />
**Repetitive or Extended words and punctuators -> Using a simple algorithm, I separated these occurrences. By generating a word list we know how the trend of using these words are. Also helps us to standardize it for further processing. <br/> ''Number of Posts'' -> 411,404 <br/>''Link'' -> [https://www.dropbox.com/s/yoe24xobmf4uyjo/extended_words_non_standard Extended_words_dataset]<br />
<br />
I analysed the most common categories of non-standard text occurrences and have summed it up below, These are if handled in the sequence of their mention below, would create the most effective standard text -><br />
<br />
*'''Use of content specific terms''' -> <br />
**Such as RT (ReTweet) @<referral> and hashtags in the case of twitter. These have to be Ignored and we should understand that this does not affect the translation quality much. The random ( any position ) use of the above however affects the machine translation system which are ahead in the pipeline for the processing of the text. Links are also present in most of the tweets. <br />
<br />
*'''Handling Links(Imp) ->'''<br />
**Not only for non standard but also for normal standard input this needs to taken into account in case of apertium at the moment.<br/> Suggestion -> Links at the moment are not being ignored. They are marked with a <br/> *(unknown) This should be noted and corrected. As machine translation on the links <br/> changes the purpose of the same.<br/> For example, say en->es translation of <br/> http://en.wikipedia.org/wiki/Red_Bull -> http://en.wikipedia.org/wiki/Rojo_Toro ) <br/> Current translation by Apertium -> <br/> http://en.wikipedia.org/wiki/Rojo_Bull <br/> which is incorrect. The above example would re-direct us to an undesirable page.<br />
<br />
*'''Use of Emoticons ->'''<br />
**People use emoticons very frequently in posts. These have to be ignored.<br/> Analysing the symbols which were present in the set of tweets, I found out that the most commonly occurring emoticons are the following -> [http://web.iiit.ac.in/~akshay.minocha/emoticons_list_non_standard.txt Emoticons most commonly used (546) ] (Already mentioned above) <br/> '''Solution''' -> <br/> If we want the expression not to be lost in translation then these can be kept as it is. Otherwise if apertium treats them as punctuators we should remove them. <br/> Since the popular one’s include characters and words as well. We WON’T be using regular expressions which would limit our reach. <br/><br />
<br />
*'''Use of Repetitive or Extended Words -> '''<br />
**This is the most commonly occurring issue in the non-standard text. <br/> Task given by Francis earlier on the mailing list was to standardise the output according to a dictionary. At the moment this works for English using the wordlist generated from the English dictionary. <br/> The dictionary can be replaced by any other word list and the output will work properly accordingly. <br/> Sample Input1 -> <br/> Helllooo''\n''i''\n''completely''\n''loooooove''\n''youuu''\n''!!!''\n''nooooo''\n''doubt''\n''about''\n''that''\n''!!!!!!!!''\n'';)''\n''<br/> Output2 (at the end of the processing) <br/> ^Helllooo/Hello$''\n''^i/i$''\n''^completely/completely$''\n''^loooooove/love$''\n''^youuu/you$''\n''^!!!/!!!$''\n''^nooooo/no$''\n''^doubt/doubt$''\n''^about/about$''\n''^that/that$''\n''^!!!!!!!!/!!!!!!!!$''\n''^;)/;)$''\n'' <br/> Our final aim is to -> reduce these words in a similar fashion as described above and then match them. <br/>It is to be noted that in the dictionary the abbreviations and acronyms should also be added externally. In many cases repetition such as <br/> “uuuu” is given which would standardize to “you” so “uuuu”->”u”->”you”<bt/> Hence abbreviation processing should always be after this step. Preferably at the end.<br />
**Punctuation repetition is not a problem for us. <br/> Since Apertium handles '''!!!''' similar to '''!''<br />
<br />
*'''Handling of Hashtags ->'''<br />
**'''Cases in Hashtags ->'''<br />
***Words are separated by Capitals <br/> For example, #ForLife -> For Life<br />
***Words are not separated by Capitals <br/> For example, #Fridayafterthenext <br />
**'''Solution''' - <br/> Hashtag disambiguation can be easily done by any of the two ways -> We need to break it into separate words by using recurring references to the dictionary or FST’s. I think the later will be much easier. <br/> It is Important to separate the words mentioned in the hashtags. Hashtags are supposed to convey the emotion or the summary of the tweet. Hence most frequent not in context to the grammatical surroundings. <br />
**So Words in hashtags should be represented as a ‘lone sentence’. <br/> Example, “Today comes monday again, #whereismyextrasunday” -> <br/> Today comes monday again. “Where is my extra Sunday”<br />
<br />
*'''Abbreviation and Acronyms ->'''<br />
**In the tweets by matching the most frequently occurring non dictionary words, I came up<br />
with the list of a few abbreviations.<br/> These are -> [http://web.iiit.ac.in/~akshay.minocha/abbreviations_english.txt English_abbreviations_list_non_standard] <br/> The solution to improve translation due to the occurrence of these is simple. <br/> When we know what their full form is, we can simply trade places as the final step of the processing towards standard input. <br/> Abbreviation of single character representations such as r->are, u->you, 2->to are also included. This list can be increased by further analysing the data. <br/><br />
<br />
*'''Spelling mistakes ->'''<br />
**These include spelling mistakes on purpose as well as the errors that arise due to vowel dropping. <br/> Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into another, with the allowable edit operations being insertion, deletion, or substitution of a single character. Although this algorithm worked well, and is also implemented by the PyEnchant library on python <br/> >> d = enchant.request_dict("en_US") <br/> >> d.suggest("Helo") <br/> ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"] <br/> this was a bit non-accurate as it did not consider the "transposition" action which is defined in the following link - ( http://norvig.com/spell-correct.html). Peter Norvig in the spell correct link, shows us how easily we can build a spelling correction script by using a large standard corpora for a particular language. <br/> Building a spelling corrector for a language becomes easy be it by any of the above ways. It solves both the problems. <br/> Alternately from the large bag of words we can also probabilistically find out the most likely spelling for the word. <br />
<br />
*'''Apostrophe correction ->'''<br />
** There are some words where we can predict easily whether the apostrophe exists or not <br/> for example - theyll -> they’ll <br/> or im -> i’m <br/> but ambiguity exists in words like - > <br/> hell -> he’ll or hell ? <br/> shell -> she’ll or shell ? <br/> Here the apostrophe makes a difference in the total sense of the words as they are two completely different words. <br/> This can be improved by using the predicting mechanism discussed where the trigram probabilities of the text from the standard corpus will be compared and the results will be reported. <br/> List of apostrophe occurrences from a standard corpus collected by me earlier -> [http://web.iiit.ac.in/~akshay.minocha/apostrophe_list.txt List of apostrophe occurrences_standard_English]<br />
<br />
*'''Spacing and hyphen variation & optional hyphen -> '''<br />
***Since we are proposing a proper mechanism to figure out a solution. One way is to come up with the creation of a reference corpus ( either what apertium is currently using or we can come up with something real quick using the technique described in my paper) -> ( Feed Corpus: An Ever Growing Up-To-Date Corpus, Minocha, Akshay and Reddy, Siva and Kilgarriff, Adam, ACL SIGWAC, 2013. ) <br/> With this we can use a trigram based model( or higher n-gram) to predict the most probably occurring word. We can also train on the reference corpus to predict the word. <br/> After creating the Standard text, the only way to verify our level of success would be to check and compare our system against the other machine translation systems available like Moses, train them on different sets and check our accuracy. <br/><br />
<br />
The project in effectively important since non standard text is not handled by many MT systems, and it is important because we have to go with the trend of the language used today to convey the meaning intact to a different native speaker.</div>Ksnmihttps://wiki.apertium.org/w/index.php?title=User:Ksnmi&diff=47087User:Ksnmi2014-03-13T12:48:40Z<p>Ksnmi: Created page with "I'm Akshay Minocha [http://web.iiit.ac.in/~akshay.minocha] and my research interest lie in a variety of things - from Corpus Linguistics, Machine Translation, Translation Pro..."</p>
<hr />
<div>I'm Akshay Minocha [http://web.iiit.ac.in/~akshay.minocha] and my research interest lie in a variety of things - from <br />
Corpus Linguistics, Machine Translation, Translation Process Research, Semantic modelling and Sentiment Analysis. <br />
<br />
<br />
My project involvements and work experience can be found in my resume on the following link -> [http://web.iiit.ac.in/~akshay.minocha/Akshay_Minocha_CV.pdf]<br />
<br />
Interested at the moment to make the world a better place by identifying and correcting translations for the non-standard input used in abundance on social media and informal conversations.</div>Ksnmi