Translating subtitles intro

Using the Apertium engine to translate subtitles for films seems to be a very good use for the diverse range of languages and open configuration of Apertium.

This post starts with a general introduction to subtitling. If you know what you want, please skip the opening sections.

Subtitling films

If you know anything about digital film formats, you will know this is an area of many configuration options, and that, at base, some difficult mathematics and computing power are required for more advanced tasks. Compared to this, producing film subtitles is surprisingly easy.

But first, why would you wish to make subtitles at all? Films come with subtitles, but these are usually targeted at the region the film is distributed in. If you live in Europe and have an interest in Japanese manga/anime, you may be able to buy the DVDs, but will probably be offered a choice of Chinese or Japanese subtitles. And there will be some films and distribution media where the subtitles do not cover a language. Speakers of Gaelic may buy a film on DVD, legitimately and legally, targeted at their area (Great Britain), yet it is unlikely the film will offer a Gaelic translation.

Also, many DVDs do not provide subtitles in the language spoken in the film, yet studies have shown that same-language subtitles can have a big impact on literacy and second-language learning. Subtitles can also help people with visual or hearing impairments.

For more on the interesting choices and nature of subtitling, see Wikipedia - Subtitle (captioning)

At the time of writing, there is much Open Source activity providing for these needs. A web search such as,

"open source film subtitles"

will find websites where subtitles for films can be posted and downloaded. So how is this done?

The format

Several formats have been used for subtitling films. However, the format you will meet often is called 'SubRip'.

SubRip is the name of a Windows program used for 'ripping' (reading) subtitles from Audio Visual footage. For details see the SubRip Website. But the name 'SubRip' is also used for the file format used by the program. The format is very popular.

There are several reasons why SubRip is popular. A SubRip file,

  • Can be located in a simple file, away from the Audio Visual file. For details, see the next section
  • Is text based, which means it can be edited with a text editor, or even (if nothing else is available), a word processor
  • Is Open Source
  • Has massive support e.g. in video players, and even in Firefox

The SubRip format

The SubRip format has been described as "perhaps the most basic of all subtitle formats" :)

SubRip format files usually use the file extension '.srt'.

The file is plain text and contains a set of stanzas, each made up of a sequence number, a time range, and one or more lines of text. They look like this,

1
00:20:41,150 --> 00:20:45,109
- I've got a new boat
- I'm sorry to hear that. What's her name?

2
00:20:41,150 --> 00:20:45,109
True Love 2.

Unofficially, the SubRip format also accepts a small group of HTML-like tags for text formatting, such as those for bold, italic, and underlined text.

The only unusual point is that each stanza must be separated from the next by a blank line.
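
To make the layout concrete, here is a minimal Python sketch that splits SubRip text into stanzas. It assumes the usual layout, where each stanza starts with a sequence number line followed by the timing line; the function name `parse_srt` and the sample text are only for illustration, and this is not a full parser.

```python
import re

# Timing line, e.g. "00:20:41,150 --> 00:20:45,109"
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Split SubRip text into (number, start, end, text_lines) tuples."""
    stanzas = []
    # Stanzas are separated from each other by a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # not a complete stanza
        match = TIMING.match(lines[1].strip())
        if not lines[0].strip().isdigit() or match is None:
            continue  # skip anything that does not look like a stanza
        stanzas.append((int(lines[0]), match.group(1), match.group(2), lines[2:]))
    return stanzas


sample = """1
00:20:41,150 --> 00:20:45,109
- I've got a new boat
- I'm sorry to hear that. What's her name?

2
00:20:41,150 --> 00:20:45,109
True Love 2.
"""

for number, start, end, lines in parse_srt(sample):
    print(number, start, end, lines)
```

Blocks that do not match the assumed layout are skipped rather than reported, which keeps the sketch short but would hide problems in a damaged file.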

Text encoding

One small issue with the SubRip format is the encoding of the text. Officially, the encoding should be UTF-8 (as in much Open Source software). However, many people are not aware of UTF-8 and, since the SubRip format has become popular, many users work on Windows computers with programs which are not UTF-8 aware. So users have often generated files in 'Latin-1'. These will almost always work on playback, but can cause errors if loaded into programs for editing ('Character not recognised', etc.).

See the SubRip program on Wikipedia.
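
If an editor complains about the encoding, one way round is to re-save the file as UTF-8 before working on it. The following Python sketch tries UTF-8 first and falls back to Latin-1. The fallback is an assumption: a file in some other legacy encoding would be read without error but with wrong characters. The function names are only for illustration.

```python
def read_subtitles(path):
    """Return the file's text, trying UTF-8 first and falling back to Latin-1."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        # "utf-8-sig" also drops a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: a file that is not valid UTF-8 is Latin-1.
        # Latin-1 maps every byte to a character, so this never fails,
        # but the result is wrong if the file uses another encoding.
        return raw.decode("latin-1")


def save_as_utf8(text, path):
    """Write the text back as UTF-8, leaving line endings untouched."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


# Usage (file names are placeholders):
# save_as_utf8(read_subtitles("film.srt"), "film.utf8.srt")
```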

Using SubRip files

If you have a SubRip format file for a language translation, using it is very easy. Simply place the file in the same folder as the video, then play using a SubRip-aware Open Source player such as MPlayer or VLC. Even Windows Media Player can handle SubRip.

See this list, Wikipedia, SubRip Compatibility, and individual instructions for your media player.

For some people, this is not the solution they want! They want a DVD they can show to their family! The first thing to say here is that the way DVDs are organised is fairly complex, because they must include alternative soundtracks, navigation menus, and extras, as well as the main film/Audio Visual file. This almost always means using a DVD creation/burning tool. However, most DVD creators can handle adding SubRip translation files.

If you only need to create an AV file with translated subtitles, and are working in Open Source, then the well-known Handbrake program can do that easily.

Using Apertium to create a translated SubRip file

If you are reading this, then you have a special interest. You would like subtitles, and can find them, but they are not in the language you want. And Apertium has a language pair that can translate the subtitles. But how do you do it?

The best method would be if Apertium accepted the SubRip file directly to translate,

cat | apertium -nu eng-esp >

but Apertium has no input/output format handling for SubRip. So we need to do a little more.

The Apertium project has produced several methods,

Apertium Subtitles

A Java program. Like other Apertium Java programs, it seems to have problems running on recent operating system releases. However, if you can get it running, it will work cross-platform. See Apertium Subtitles.

Use Gaupol and a plugin

See the next section for notes on Gaupol.

At the time of writing the plugin for Gaupol is very out-of-date. It may be re-written sometime... See Translating_subtitles.

Use Translation Tools

'Translation Tools' is a massive toolkit for translating files to and from the localising '.po' format.

At the time of writing, Translation Tools, in repository releases, has deprecated SubRip format handling. If you have an old version, this is the tool for the job. See Translating_subtitles.

By hand, or commandline tools

This is not as difficult as it sounds. All you need to do (see Format_handling) is insert Apertium superblank markup, which is square brackets, into the SubRip file, so that Apertium ignores the stanza numbers and timings, e.g.

[1
00:20:41,150 --> 00:20:45,109]
- I've got a new boat
- I'm sorry to hear that. What's her name?

[2
00:20:41,150 --> 00:20:45,109]
True Love 2.

For a one-off job, you could do this in a text editor which supports regular-expression search and replace. Then run the result through Apertium, here using a pair 'eng-esp',

cat srcFile | apertium -un eng-esp > dstFile

Then remove the superblank brackets from the destination file. This can be done with any text editor, and restores the original layout (shown here with the text lines left untranslated),

1
00:20:41,150 --> 00:20:45,109
- I've got a new boat
- I'm sorry to hear that. What's her name?

2
00:20:41,150 --> 00:20:45,109
True Love 2.

That's it. Done. And, of course, if you are comfortable with the commandline, you could create a 'sed' script doing a similar job, to help with repeated work.
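
As an alternative to a 'sed' script, the mark, translate, and unmark steps can be sketched in Python. This is only a sketch under some assumptions: each stanza starts with a sequence number line followed by the timing line, the brackets are placed around exactly those two lines, Apertium passes the bracketed text through unchanged, and the `apertium` command is installed with the pair named above. The function names are made up for illustration.

```python
import re
import subprocess

TIMING = r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}"
# A stanza header: the sequence number line followed by the timing line.
HEADER = re.compile(r"^(\d+\n" + TIMING + r")[ \t]*$", re.MULTILINE)
# The same header once it has been wrapped in square brackets.
WRAPPED = re.compile(r"^\[(\d+\n" + TIMING + r")\]", re.MULTILINE)


def add_superblanks(text):
    """Wrap each number-and-timing header in square brackets."""
    text = text.replace("\r\n", "\n")  # work with Unix line endings
    return HEADER.sub(r"[\1]", text)


def remove_superblanks(text):
    """Undo add_superblanks on the translated text."""
    return WRAPPED.sub(r"\1", text)


def translate(text, pair="eng-esp"):
    """Pipe text through the same apertium command as shown above."""
    result = subprocess.run(
        ["apertium", "-un", pair],
        input=text, capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


sample = """1
00:20:41,150 --> 00:20:45,109
- I've got a new boat
- I'm sorry to hear that. What's her name?

2
00:20:41,150 --> 00:20:45,109
True Love 2.
"""

# Show the marked-up text; this step does not need Apertium.
print(add_superblanks(sample))

# Full workflow, only if Apertium and the language pair are installed:
# print(remove_superblanks(translate(add_superblanks(sample))))
```

Keeping the three steps as separate functions means the marking and unmarking can be checked on their own, without Apertium, before running a whole film's subtitles through the translator.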

Machine translation, and fixing other errors

We now have a working subtitle translation workflow. For some needs, that will be enough, especially because Apertium translations, when kept within limits, are very accurate.

However, sometimes they are not accurate, and a purely automated workflow like this can produce unusual or wrong translations. Very good translations, even of subtitles, are usually the result of correction by a human editor. To do this, you need a subtitle editor which can handle the SubRip file format.

Another problem you may encounter is that the subtitle file does not match the Audio visual file precisely. Perhaps the AV file is too short, so the subtitle file shows the subtitles too late. Or perhaps the subtitle file was not accurate to begin with, and you wish to correct subtitles which show too early, or too late. Again, to fix these, you need a subtitle editor.

There are many subtitle editors available, and subtitle editing is built into some major film editing programs. However, if you need to fix only the subtitle file and wish to work in Open Source, at the time of writing try Gaupol. Gaupol is a SubRip editor, is available in repositories, is easy to use, and can even run previews against AV files.
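
If the only timing problem is a constant offset, where every subtitle shows the same amount too early or too late, the fix can also be scripted. The following Python sketch shifts every timestamp by a number of milliseconds. It assumes that only timing lines contain '-->', and the function names are only for illustration. Timings that drift over the length of the film need a proper editor.

```python
import re

# One SubRip timestamp, e.g. "00:20:41,150"
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(match, offset_ms):
    """Return the matched timestamp moved by offset_ms milliseconds."""
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    total = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis + offset_ms
    total = max(total, 0)  # clamp, since times cannot go below zero
    hours, rest = divmod(total, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def shift_subtitles(text, offset_ms):
    """Shift all timing lines; positive values make subtitles appear later."""
    shifted = []
    for line in text.splitlines(keepends=True):
        if "-->" in line:  # assumption: only timing lines contain "-->"
            line = TIMESTAMP.sub(lambda m: shift_timestamp(m, offset_ms), line)
        shifted.append(line)
    return "".join(shifted)


sample = """1
00:20:41,150 --> 00:20:45,109
True Love 2.
"""

# Make the subtitles appear one and a half seconds later.
print(shift_subtitles(sample, 1500))
```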