Difference between revisions of "User:Littleowl/Littleowl ff"

From Apertium
[http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/littleowl/t127056958979 GSoC on-line application]
 

Revision as of 17:24, 9 April 2010

Apertium: Format filters (LaTeX)

GSoC on-line application

Abstract

My proposal is to facilitate the translation of LaTeX documents with Apertium by developing a deformatter (Apertium-deslatex) and a reformatter (Apertium-relatex) that convert between the LaTeX format and the Apertium stream format.

I have excluded the MediaWiki format from my proposal because of the complexity and difficulties discussed on the project mailing list. However, I would be keen to include other formats, such as PDF, as long as they fit within the task and its deadlines.

Content

Name: Carles Sanz Casañas

E-mail address: carles.sanz@pangea.org

Other information that may be useful to contact you:

Why is it you are interested in machine translation?

I live in Catalonia, where there are two official languages, Catalan and Spanish. Documentation is therefore always written in Catalan or Spanish, or in English for international purposes. I believe that machine translation systems are key tools for improving communication within and between organizations in Catalonia and around the world.

Why is it that you are interested in the Apertium project?

I am really interested in Apertium because it is an open-source platform for the purpose mentioned above. I also like the democratic spirit of open-source projects, and I would be very excited to take the opportunity to collaborate on this kind of project.

Which of the published tasks are you interested in? What do you plan to do?

Title

Format filters (LaTeX)

Why Google and Apertium should sponsor it

It will make Apertium capable of dealing with at least one more format: files marked up with LaTeX.

Apertium uses its own format in order to translate documents, the Apertium stream format[1]. Apertium can currently deal with texts in RTF, HTML, DOCX, WXML, PPTX, XLSX, XpressTag and ODT format by means of a format definition file.

For example, to deal with HTML there are programs, built from a format specification, that de-format HTML into the Apertium stream format (Apertium-deshtml) and re-format the translated stream back into HTML (Apertium-rehtml)[2].

The main idea of this proposal is to develop the same structure and rules for LaTeX documents, creating Apertium-deslatex and Apertium-relatex to convert between the LaTeX format specification[3] and the Apertium stream format.
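The deformatter/reformatter pair described above can be illustrated with a minimal sketch. It assumes a simplified view of the stream format in which format markup is wrapped in square brackets and literal brackets or backslashes in the text are backslash-escaped; the real Apertium stream format[1] has more rules. The function names `deformat` and `reformat` are hypothetical and are not part of Apertium.

```python
import re

# Assumption: markup is a LaTeX command name (\textbf, \section*) or a brace.
MARKUP = re.compile(r"\\[A-Za-z]+\*?|[{}]")


def escape(text: str) -> str:
    """Backslash-escape characters that are reserved in this simplified stream."""
    return re.sub(r"([\\\[\]])", r"\\\1", text)


def deformat(latex: str) -> str:
    """Wrap LaTeX markup in [ ] so that only plain text is left to translate."""
    out = []
    pos = 0
    for match in MARKUP.finditer(latex):
        out.append(escape(latex[pos:match.start()]))
        out.append("[" + match.group(0) + "]")
        pos = match.end()
    out.append(escape(latex[pos:]))
    return "".join(out)


def reformat(stream: str) -> str:
    """Undo deformat: unwrap bracketed markup and remove the escapes."""
    out = []
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            out.append(stream[i + 1])       # escaped literal character
            i += 2
        elif ch == "[":
            end = stream.index("]", i)      # markup never contains "]"
            out.append(stream[i + 1:end])
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    source = r"\textbf{Hello} world"
    print(deformat(source))
    print(reformat(deformat(source)) == source)
```

The property that matters for a format filter is the round trip: reformatting the deformatted text must give back the original document when the translator changes nothing.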

How and who it will benefit in society

As mentioned above, Apertium can currently deal with many formats, but it cannot yet deal with LaTeX. LaTeX is widely used in academia, in the commercial world and by other professionals[4]. This work would allow them to translate their documents automatically with Apertium.

Work plan

LaTeX uses a markup language in order to describe document structure and presentation. What LaTeX does is convert the source text, combined with the markup, into a high quality document. For the purpose of analogy, web pages work in a similar way: the HTML is used to describe the document, but it is the browser that presents it in its full glory - with different colours, fonts, sizes, etc.

Having said that, there is more than one way to approach this task. This proposal presents two options and leaves the choice of the final implementation to discussion with the Apertium community and mentors. I am happy with either solution.

Option 1

The first option is the ideal one. It consists of implementing a full parser from LaTeX to the Apertium stream format and back. It includes two main sub-tasks, the deformatter and the reformatter.

However, the scope of this option is too large because of the complexity of the markup language. My proposal is therefore to focus on a core subset of LaTeX. Future tasks within the Apertium project could cover the remaining features of the LaTeX markup language.

I would exclude LaTeX documents with mathematics, algorithms, pseudocode, graphics creation, error or warning messages, etc., and would initially focus on the following core subset:

  • Input File Structure (document, begin, end)
  • Spaces
  • Special characters
  • Comments
  • Text formatting[5]
  • Page layout[6]
  • etc.(*)

(*) The list of features needs further study and feedback from the Apertium community, scheduled at the beginning of the task (60-90 hours).
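As a rough illustration of how the core subset above might be recognised, the following sketch splits a LaTeX source into typed tokens (comments, environment delimiters, escaped special characters, commands, braces and text). The token names and the `tokenize` function are hypothetical, whitespace handling is left out, and a real deformatter would need a more careful grammar.

```python
import re

# Assumption: these categories cover only the core subset listed above.
TOKEN_SPEC = [
    ("comment", r"%[^\n]*"),                    # % up to the end of the line
    ("env", r"\\(?:begin|end)\{[^}]*\}"),       # \begin{document}, \end{itemize}
    ("special", r"\\[#$%&_{}]"),                # escaped special characters
    ("command", r"\\[A-Za-z]+\*?|\\\\"),        # \textbf, \section*, \\
    ("brace", r"[{}]"),                         # grouping braces
    ("text", r"[^\\%{}]+"),                     # translatable text
    ("other", r"."),                            # fallback, e.g. a lone backslash
]
TOKEN = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(latex: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs whose texts concatenate back to the input."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN.finditer(latex)]


if __name__ == "__main__":
    sample = "\\begin{document}\nHello \\textbf{world} 100\\% % note\n\\end{document}"
    for kind, text in tokenize(sample):
        print(f"{kind:8} {text!r}")
```

Keeping the tokenizer lossless (the token texts concatenate back to the input) means that anything it does not understand is still passed through unchanged rather than dropped.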

Option 2

The second option consists of reusing previous Apertium developments together with existing open-source tools that are compatible with the Apertium license.

Given the previous work on the HTML deformatter/reformatter and the extensive work on parsers within the open-source community, I propose to reuse this know-how and build both the LaTeX deformatter and reformatter on top of existing developments. The proposed structure for the format filter is therefore the following:

  • Apertium-deslatex:
    1. A latex2html parser[7][8] generates HTML output from the LaTeX file.
    2. Apertium-deshtml turns that HTML into the source document for the Apertium translation engine.
  • Apertium-relatex:
    1. Apertium-rehtml creates an HTML file from the translated document.
    2. An html2latex parser[9] generates the final LaTeX output from that HTML.
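The two chains above are simple compositions of existing stages, which could be sketched as follows. The four `toy_*` functions are hypothetical stand-ins that only handle bold text; in a real filter they would be replaced by calls to a latex2html tool, Apertium-deshtml, Apertium-rehtml and an html2latex tool, whose actual interfaces are not assumed here.

```python
import re
from typing import Callable

Stage = Callable[[str], str]


def compose(*stages: Stage) -> Stage:
    """Chain stages so that the output of each one feeds the next."""
    def run(data: str) -> str:
        for stage in stages:
            data = stage(data)
        return data
    return run


# Toy stand-ins (assumptions, not the real tools): they only know \textbf / <b>.
def toy_latex_to_html(latex: str) -> str:
    return re.sub(r"\\textbf\{([^}]*)\}", r"<b>\1</b>", latex)


def toy_deshtml(html: str) -> str:
    return re.sub(r"(</?b>)", r"[\1]", html)


def toy_rehtml(stream: str) -> str:
    return re.sub(r"\[(</?b>)\]", r"\1", stream)


def toy_html_to_latex(html: str) -> str:
    return re.sub(r"<b>([^<]*)</b>", r"\\textbf{\1}", html)


# Option 2: each LaTeX filter is a chain of an external converter and an HTML filter.
deslatex = compose(toy_latex_to_html, toy_deshtml)
relatex = compose(toy_rehtml, toy_html_to_latex)

if __name__ == "__main__":
    source = r"\textbf{Hello} world"
    stream = deslatex(source)
    print(stream)
    print(relatex(stream))
```

Because each stage is an ordinary function from text to text, an external converter can be swapped or refined without touching the rest of the chain.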

Because of the external parsers, the output may have formatting issues, but this approach would make it easy to translate full LaTeX documents. I believe this methodology either leaves room to implement more format filters within the scope of the task, or gives enough time to refine the internal and external resources involved in order to fix formatting or page layout issues.

Week plan

The generic weekly work plan for both options includes a period of study at the beginning of the task, followed by implementation of the format filter or filters, testing and documentation.

  • Week 1: Study of Apertium stream format.
  • Week 2: Analysis of existing de-formatter and re-formatter examples
  • Week 3: Specification of Format filter to be added (LaTeX)
  • Week 4: Basic integration of a new format filter into Apertium

Deliverable #1 Goal: Specification and analysis of the LaTeX format and its basic Apertium integration

  • Week 5: Integration of new format filter
  • Week 6: Integration of new format filter
  • Week 7: Integration of new format filter
  • Week 8: Pre-release

Deliverable #2 Goal: Integration of full format filter specified previously

  • Week 9: Testing
  • Week 10: Testing
  • Week 11: Documentation: User guide
  • Week 12: General documentation of the project

Project completed Goal: Release of format filters deformatter and reformatter

List your skills and give evidence of your qualifications

I hold a degree in Computer Science and Engineering from the Barcelona School of Informatics (www.fib.upc.edu). I also have a postgraduate degree in Open Source from the Technical University of Catalonia Foundation (www.fundacio.upc.edu).

I am currently a student in the Master in Business Administration at IESE Business School in Barcelona. Previously I worked at an open-source company for two years. My aim in doing the Master is to further develop my business administration skills in order to contribute successfully to the open-source community from the private sector in the future.

I have a strong background in open-source projects. On the one hand, my final degree project at the Barcelona School of Informatics was an open-source project that received an award from the Government of Catalonia and the Computer Science and Engineering Association of Catalonia. On the other hand, I worked at an open-source company for two years where, among many other projects, we carried out the first migration of a council to open source in Catalonia. During these periods I gained excellent skills in scripting languages such as Perl and PHP and in web development. I have also worked with LaTeX for academic purposes.

List any non-Summer-of-Code plans you have for the Summer

I am currently unemployed and looking forward to collaborating on the Summer of Code project before the Master resumes this September. I therefore have full availability to work on this GSoC task.

References

[1] Apertium Project. Apertium stream format. http://wiki.apertium.org/wiki/Apertium_stream_format

[2] Apertium Project. Format handling. http://wiki.apertium.org/wiki/Format_handling

[3] Wikipedia. LaTeX. http://en.wikipedia.org/wiki/LaTeX

[4] The Comprehensive TeX Archive Network (The CTAN team). What are TeX, LaTeX, and friends?. http://www.ctan.org/what_is_tex.html

[5] Wikibooks. LaTeX Formatting. http://en.wikibooks.org/wiki/LaTeX/Formatting

[6] Wikibooks. Page Layout. http://en.wikibooks.org/wiki/LaTeX/Page_Layout

[7] Blog of Craig Small. Latex 2 html tools. http://enc.com.au/docs/latexhtml.html

[8] Blog of Jason Blevins. Tools for converting latex to xml. http://jblevins.org/log/xml-tools

[9] SourceForge. html2latex. http://html2latex.sourceforge.net