User talk:Khannatanmai/GSoC2020 Final Report


Announcement Email

Hey guys! The markup handling project reached all of its goals this week. While it will continue to be improved, it is now ready to be tested with real-world data.

We have updated so that if you translate a document/webpage on it, it will use the new system to translate it. still uses the old way to do it, so you can compare the translation outputs. The best way to check the impact of the project is to translate webpages with lots of links, markup, etc.

If you use Apertium from source,

apertium -f html/odt/docx/pptx -d . eng-spa

also uses the updated system with wblanks (wordbound blanks). You will also need Transfuse. Make sure apertium, lttoolbox, etc. are all at the latest commit.

Once this system has undergone real-world testing, we can update the main as well.

I also want to explain very briefly what was done. Markup handling has been a problem in Apertium for a long time. It was done using superblanks, which encapsulate markup information during the translation process. This works well to protect the formatting of the document. However, languages represent information differently, and during translation words and phrases move around, get deleted, split, merge, etc. The markup information on the words needs to stick with the words, otherwise we end up with erroneous markup in the translation, which is what happened:

Spanish Input: <i>El perro</i> <b>blanco</b>
English Output: <i>The white</i> <b>dog</b>

As part of this project, a new kind of blank was proposed: a wordbound blank. It contains any information that needs to stay attached to a word/phrase during the entire translation process. After modifying most modules in the pipeline to work with these wblanks, writing new de/reformatters (Transfuse), and adding markup in wblanks, the translation we have is:

Spanish Input: <i>El perro</i> <b>blanco</b>
English Output: <i>The</i> <b>white</b> <i>dog</i>
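
To make the difference concrete, here is a toy Python sketch of the two behaviours. It is only an illustration of the idea, not Apertium's stream format and not how Transfuse or any pipeline module is actually implemented. The function names (render, translate_positional, translate_wordbound), the hard-coded dictionary, and the reordering are all made up for this one sentence.

```python
from itertools import groupby

# Toy model (an assumption for illustration): every word is a (word, tag)
# pair, where tag is the inline markup it sits inside, or None.

def render(tokens):
    """Turn (word, tag) pairs back into markup, one element per run of equal tags."""
    parts = []
    for tag, run in groupby(tokens, key=lambda t: t[1]):
        text = " ".join(word for word, _ in run)
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return " ".join(parts)

# Source sentence: <i>El perro</i> <b>blanco</b>
source = [("El", "i"), ("perro", "i"), ("blanco", "b")]

# Hypothetical lexical transfer and reordering, hard-coded for this example.
translation = {"El": "The", "perro": "dog", "blanco": "white"}
order = [0, 2, 1]  # noun-adjective becomes adjective-noun

def translate_positional(tokens):
    """Markup stays where it was in the sentence; the words move underneath it."""
    words = [translation[tokens[i][0]] for i in order]
    tags = [tag for _, tag in tokens]
    return list(zip(words, tags))

def translate_wordbound(tokens):
    """Markup is bound to each word and travels with it through reordering."""
    return [(translation[tokens[i][0]], tokens[i][1]) for i in order]

print(render(translate_positional(source)))  # <i>The white</i> <b>dog</b>
print(render(translate_wordbound(source)))   # <i>The</i> <b>white</b> <i>dog</i>
```

The first line printed reproduces the erroneous markup from the old approach, and the second matches the output with wordbound blanks: once the tag is attached to the word rather than to a position, the reformatter only has to reopen and close elements around each run of words.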

It should prove immensely useful for users of the Apertium MT system who translate HTML or other formatted documents such as odt, docx, and pptx.

For more details about the project, about wordbound blanks, and about the new way of doing markup handling, check out: Project Report, Development of wordbound blanks.

I’d like to thank Tino Didriksen for not only being an active mentor, but for participating in the project as well. A major chunk of this project - Transfuse, deformatters, reformatters, getting all of it integrated with Apertium, and lots more - was done by him.

Hope this proves to be useful :))

Some more links to understand the problem:

Thanks and Regards, तन्मय खन्ना Tanmai Khanna