Apertium for Dummies

Revision as of 17:58, 19 January 2017 by Rcrowther (talk | contribs) (New general introduction page)

Apertium is a piece of software, a program, for translating one language to another.

The central program of Apertium is what is called an 'engine'. It only talks in computer code, or takes input and output in forms very close to computer code.

If this sounds terrifying, please hang in there! To use Apertium, you may not need to know anything about computer coding. How much you need to know depends on your interests. Building in Apertium needs an interest and an enthusiasm for languages, and a will to edit; with that, you are ready to go.

Perhaps you need to be convinced of the seriousness of Apertium? The project was founded initially to provide a Spanish/Catalan converter. Since then, it has been funded by a government, been moved forward by Google Summer of Code bursts, been the subject of a stream of academic papers, and been put to use around Wikimedia. Your question is better stated as, "Will Apertium meet my needs?" See further down this page, or roam through the Wiki.

The many uses of an engine

When you have an engine (electric, petrol...), it's nearly useless. All it can do is spin a shaft. An engine becomes valuable when you connect it to something. So it can drive a conveyor belt, or wheels on a car, or pump liquid round a refrigerator.

The Apertium engine is the same. Even people who know the insides of Apertium would not use the engine of Apertium for translation. We may be able to make it translate, but that is tedious. Instead, we join the engine to other software/programs. Then Apertium can translate where, and in the way, we would like it to.

That may sound odd. How can there be 'different ways' of translation? Well, here are some of the many, many ways that Apertium has been used for translating,

  • as a plugin to tools for professional translators
  • behind an application on mobile phones
  • during IRC (internet chat)
  • for film subtitles
  • driving webpages (e.g. the front page of the Apertium website)

...and many, many more. When you think about it, there are many places and ways translation is used. To see more of the ways Apertium has been used, see Tools.

The (not so) complicated construction of the engine

The central engine of Apertium is not one program. This can come as a surprise to some people, who think a computer program is like a book. Computer people sometimes use the word 'monolithic' to describe this idea. But Apertium is not a book. It is more like a series of graphic novels.

When you look closer, Apertium is eight programs joined together. Sometimes more programs are used, sometimes fewer. The structure is deliberately loose so that other programs can be inserted into the chain. For example, in the past, programs have been inserted into the main chain, then removed. Yes, that is true: the chain is not the same as it was when Apertium started, back in the late 2000s.

These central programs on the chain are often called 'lt-tools'. If you see that name in the documentation, or in downloads, now you know what it means. To see the current chain, look at the workflow diagram.

So what joins these programs together? Nothing more than an agreement. Each program will accept text in an agreed form, and output text in (generally) the same form. For some detail, see Apertium_stream_format.
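To make the idea of an 'agreed form' less abstract, here is a small sketch in Python. It is not Apertium code. It assumes a simplified unit of the shape ^surface/lemma<tag><tag>$, modelled loosely on the stream format, and the names parse_units and format_units are invented for this page. The point is only that a program on the chain reads text in the agreed form and writes text in the same form, so the next program can pick it up.

```python
import re

# Assumed, simplified unit form: ^surface/lemma<tag><tag>$
UNIT = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")


def parse_units(stream):
    """Read text in the agreed form into (surface, lemma, tags) tuples."""
    units = []
    for surface, lemma, tags in UNIT.findall(stream):
        units.append((surface, lemma, re.findall(r"<([^>]+)>", tags)))
    return units


def format_units(units):
    """Write the tuples back out as text in the same agreed form."""
    return " ".join(
        "^{}/{}{}$".format(
            surface, lemma, "".join("<{}>".format(tag) for tag in tags)
        )
        for surface, lemma, tags in units
    )


stream = "^cats/cat<n><pl>$ ^sleep/sleep<vblex><pres>$"
units = parse_units(stream)
print(units)
print(format_units(units))
```

Because the output has the same shape as the input, a stage like this could be dropped into a chain, or taken out again, without the neighbouring stages noticing.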

If you have never seen computer programs working like this, passing text from one program to another, then the process can seem loose, strange, even 'unprofessional'. Surely it will break? All I can say is that your computer probably starts itself by reading configuration files, then skipping from program to program passing information in this way. And you can see advantages. To change how Apertium works, you need only to edit text files. You could, if you had no other way, edit using a word processor (for various reasons, nobody will recommend that...).

And languages...

The central engine is no use without information. To translate, Apertium uses large configuration files called 'dictionaries'. When words come in to Apertium, they are recognised ('analysed') by comparing to a dictionary of words. Then the words are translated by a bilingual dictionary. Then the translated words are sent to a third dictionary which makes ('generates') the final result.
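The three steps above (analyse, translate, generate) can be sketched with three tiny Python dictionaries. This is a toy under stated assumptions: real Apertium dictionaries are XML files, and the names ANALYSER, BILINGUAL, GENERATOR and translate_word, the English and French entries, and the '*', '@' and '#' failure marks are all choices made for this illustration.

```python
# Toy monolingual dictionary (English): surface form -> (lemma, number)
ANALYSER = {
    "cat": ("cat", "sg"),
    "cats": ("cat", "pl"),
    "dog": ("dog", "sg"),
    "dogs": ("dog", "pl"),
}

# Toy bilingual dictionary: English lemma -> French lemma
BILINGUAL = {"cat": "chat", "dog": "chien"}

# Toy generation dictionary (French): (lemma, number) -> surface form
GENERATOR = {
    ("chat", "sg"): "chat",
    ("chat", "pl"): "chats",
    ("chien", "sg"): "chien",
    ("chien", "pl"): "chiens",
}


def translate_word(word):
    """Analyse, translate, then generate one word."""
    analysis = ANALYSER.get(word)
    if analysis is None:
        return "*" + word  # this sketch's mark for "not recognised"
    lemma, number = analysis
    target_lemma = BILINGUAL.get(lemma)
    if target_lemma is None:
        return "@" + lemma  # this sketch's mark for "no translation"
    # '#' is this sketch's mark for "could not generate a form"
    return GENERATOR.get((target_lemma, number), "#" + target_lemma)


print(translate_word("cats"))
print(translate_word("bird"))
```

Notice that the word 'cats' never appears in the bilingual dictionary. Only the base word 'cat' is translated; the plural is carried through as a tag and put back by the third dictionary.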

When dictionaries are gathered together to translate from one language into another, they are called a 'pair'. You will find this word everywhere on the Apertium website.

These dictionaries must be created, and this is where the people interested in language start. A very small Apertium dictionary, which will translate a word or two in most sentences, has about 200 words. A dictionary which covers the most frequent words in a language usually has about 800 words. And a dictionary which will cover most words in common usage will have about 2000 words.

Beyond that, translating by using a dictionary is usually not very accurate. To make a good translator, you will need to tackle words that do not translate with the same meanings. This is the problem of ambiguity. Also you will need to make specialized translations of groups of words, and reorder words. These improvements to the translated text are made using configuration and coding. To know more about them, see the section after the next section.

For people who want to translate

I'm sorry, but there is no super-powerful all-in-one editor that will let you work on any-computer-in-the-world to edit Apertium files using point-type-and-click. Attempts have been made, and maybe a super-editor will appear, but you need to consider that such a program would be difficult to maintain. And it sometimes seems to be a rule of computing that coding which works anywhere, works badly on everything.

But much time, thought and enthusiasm has been spent on Apertium to make the workflow easy. If you can handle looking at HTML, then you can handle looking at XML. XML is the language Apertium dictionaries and rules are written in. XML is not a pretty language, and writing code in XML is tedious. But at least the Apertium programs are consistent. Once you have learned how to make an XML dictionary, then you will be able to make a language pair. And you will be able to edit an Apertium dictionary and code.

And Apertium provides a collection of powerful tools. I'm not going to explain here about apertium-viewer, or the introduction to modes, the manual, or the many other tools and help you can get. But Apertium is well-equipped or, as computer people like to say, a mature project.

Translation is not a dictionary

Let's take a sentence in casual French,

Donc, c'est bon, non?

If we translate this into English by translating each word, we get,

Thus, it is good, no?

We can guess from this result the intention of the original text. However, it is not good English. Sometimes a translation by this method is so bad it makes no sense. It can be easy to fool translators. For example, try this (common, but eccentric) American English sentence in some translators,

table the motion

and see what happens.

However, nearly every good translator will make a better translation of the French sentence than the result above. How do they do this?

First, let's look at what can go wrong.


Handling ambiguous words is one of the differences between a dictionary and a good translator. The Apertium project has a little hint-rule: if one word will cover the source words, use it. So the English have many words for rain ('mizzle', 'spit', 'downpour'...), but the French 'il pleut' ('it is raining') can be used for all of them.

But sometimes the target language has many words for one word. So the word will not 'disambiguate' easily. The English use the word 'weary', which can mean tired of someone, lacking spirit, or physically tired ('exhausted'). French (the language of precision?) has different words for all these meanings. And a good translator must choose. So if you make a good language pair, you need to tell Apertium how and where to use each word.

If you'd like to see the current Apertium solution (for many situations, though not all), try the lexical selection module. This module is one of the easiest modules to understand in Apertium, a later addition made after much theorising and practice.
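Here is a rough Python sketch of the idea of choosing a translation by looking at nearby words. It is not the lexical selection module and does not use its rule format. The example word ('bank', which French splits into 'banque' for money and 'rive' for a river), the names DEFAULT, RULES and select, and the two-word window are all assumptions made for this page.

```python
# Default translation, used when no rule matches
DEFAULT = {"bank": "banque"}

# Each rule: (source word, context words that trigger it, translation)
RULES = [
    ("bank", {"river", "stream", "fishing"}, "rive"),
]


def select(words, i, window=2):
    """Pick a translation for words[i] by looking at its neighbours."""
    word = words[i]
    # Neighbours up to `window` words to the left and to the right
    nearby = set(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
    for source, context, translation in RULES:
        if source == word and nearby & context:
            return translation
    # No rule fired: fall back to the default, or leave the word as it is
    return DEFAULT.get(word, word)


print(select("the river bank flooded".split(), 2))
print(select("the bank closed early".split(), 1))
```

Telling Apertium 'how and where to use each word' comes down to writing rules of this general kind: a word, a context, and the translation to prefer in that context.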

Word groups, word reordering

I'm joining these ideas together. If you get anywhere with Apertium you, like a good linguist, will find many categories within this idea.

Let's look at that sentence above,

Donc, c'est bon, non?

In English, you would be briefer than that. Also, a good translation would not translate through a dictionary to 'it/this'. A good translation would use the word 'that', would be short, and would not ask a question,

That's good.

If you are interested, have a look at multi-words. Or the chunker modules, Chunker, Interchunk and Post-chunk, in an introduction to the chunker module. Be warned, the chunker stages are powerful, and use computer code. But you do not need them to make a start.

Right. That covers what can go wrong with a simple dictionary translation.

How a (rule-based) translator spots difficult translations

Here is a sentence used to teach American-English,

The cat sat on the mat

A human being, like you and me, would start to understand this sentence in complex ways. We have an idea of what makes a cat, "small, furry, sharp claws". We probably remember a cat in this way. If we memorise the translation in another language, we probably construct a complex series of associations.

Computers are bad at this kind of information gathering. Slow, and not sophisticated. But they are good at information ordering. They can put a word into a category, then look at the next word. It's as simple as that. And a lot of errors can be spotted this way. It's not the way a human being works, but it is effective. It is the base of how Apertium (though not all other translators) spots problems.

Let's look at the sentence above, which can cause trouble for translators,

table the motion

The method given above will spot that this sentence is unusual. If 'table' is a thing (a 'noun'), then it cannot be followed by words that are another thing ('noun'), 'the motion'. The translator may not know what to do about fixing this, but it can see that there is a problem.

Errors can be spotted this way, and they can be fixed this way. Apertium is a rule-based translator, and so the language pairs fix many problems by looking at the surrounding words. The lexical selection module (which can fix ambiguous translations) works by looking at surrounding words. And rules in the chunker stages (which can rearrange word order) mostly work like this too.
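The 'put a word into a category, then look at the next word' method can be sketched in a few lines of Python. This is a deliberately crude toy, not how Apertium stores categories or rules: the word list CATEGORIES, the single rule in SUSPICIOUS, and the name spot_problems are assumptions for illustration, and the rule itself would wrongly flag plenty of normal English.

```python
# Toy word list: each word gets exactly one category
CATEGORIES = {
    "the": "det",
    "table": "noun",
    "motion": "noun",
    "cat": "noun",
    "mat": "noun",
    "sat": "verb",
    "on": "prep",
}

# Pairs of neighbouring categories this sketch treats as a warning sign
SUSPICIOUS = {("noun", "det")}


def spot_problems(sentence):
    """Return the neighbouring word pairs whose categories look wrong."""
    words = sentence.lower().split()
    tags = [CATEGORIES.get(word, "unknown") for word in words]
    problems = []
    for i in range(len(tags) - 1):
        if (tags[i], tags[i + 1]) in SUSPICIOUS:
            problems.append((words[i], words[i + 1]))
    return problems


print(spot_problems("The cat sat on the mat"))
print(spot_problems("table the motion"))
```

The first sentence passes without comment. The second is flagged at 'table the', which is the cue that 'table' is probably not a noun here.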

If you are very, very interested in this, you could start with the Chomsky Hierarchy. However, there is no need to know about this to contribute or build in Apertium. You need other skills, like a love of language and a feeling about how language works. You do not need complicated mathematical notation.

But there is another method to fix translation errors, and it is worth talking about.

Translation methodology

The units of translation

The first translator programs analysed text by breaking it into words. Nowadays, some translators do not do this. It has been found that translators can often be more efficient and effective if they do not deal with text-as-words. They may also use sentences, phrases, or units of sound. The units may not be parts of old-style grammar.

Apertium, a rule-based translation engine, by default breaks text into words. But it can handle other units. It has built-in features for multi-word units, and can be very good at handling word re-ordering.

From here on, I'll use a muddle of words to refer to the idea that a translator may not always translate using one word. But Apertium mainly uses words.

If you are interested in the theory (which you will sometimes see mentioned on the Apertium mailing lists or on the Wiki), try N-grams. This dives into the names linguists use in their day-jobs. It is intimidating, but you will then have a hint of what they are talking about.
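An N-gram is less frightening than the name suggests: it is a run of N neighbouring units. A minimal Python sketch follows, using words as the units. The function name ngrams is just a label chosen here.

```python
def ngrams(words, n):
    """Return every run of n neighbouring words, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "the cat sat on the mat".split()
print(ngrams(words, 2))  # pairs of words, often called 'bigrams'
print(ngrams(words, 3))  # runs of three, often called 'trigrams'
```

With n set to 1 you are back to single words, which is where Apertium mainly works.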

Two approaches

Speaking in general, there are two approaches to making a translation. The first is called 'statistical analysis' and is used by Google Translate, Microsoft's translation platform, and others. Texts in one language, and correct translations in another, are fed into a program. The program analyses the texts, works out which words (or sentences/phrases) are joined together, then the chances that words (or sentences/phrases) are translated in one way or another. The analysis is given to the translator. When a translator sees incoming text, it tries to spot the sequences of text, then looks at the statistical results to make a guess at the translation.
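To give a flavour of the statistical approach, here is a very small Python sketch. It only counts which words turn up together in matching sentences, then guesses the most frequent partner. Real statistical engines use far more sophisticated models and far more text; the four-line CORPUS and the names train and guess are assumptions made for this page.

```python
from collections import Counter, defaultdict

# Tiny made-up parallel text: (English, French)
CORPUS = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]


def train(pairs):
    """Count how often each source word appears alongside each target word."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        for s in source.split():
            for t in target.split():
                counts[s][t] += 1
    return counts


def guess(counts, word):
    """Guess the target word seen most often alongside `word`."""
    if word not in counts:
        return None  # never seen in training, so there is nothing to guess from
    return counts[word].most_common(1)[0][0]


counts = train(CORPUS)
print(guess(counts, "cat"))
print(guess(counts, "the"))
print(guess(counts, "bird"))
```

Nobody told the program that 'cat' means 'chat'. It worked that out because the two words keep appearing together, which is also why this approach needs so much text to work well.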

The second method is called 'rule-based'. People work out rules for which words can be near each other. They tell the translator the rules. This is the method used by Apertium.

The two methods give different kinds of translations. Statistical translators,

  • are 'smooth'. They try to translate, and make many guesses
  • are tedious to make. You can wait days for the texts to be processed (for the translator to be 'trained')
  • need a lot of material. Google reckon a good translation requires about 2,000,000 translated words (they got much of their core material from an agreement with the United Nations)

Rule-based translators,

  • are 'accurate'. When they work, they are correct. When they go wrong they can be amusing
  • require effort and knowledge to make. You need to write the dictionaries and rules. You do need to be able to speak the language, or have someone to help, to make a good pair
  • can be created with little information, helped by ingenuity and literary skills. They can be updated quickly

Most translation engines have a trace of both methods. Statistical engines such as Google Translate need some rules to generate statistics. And Apertium had a statistical module for disambiguation. That is now removed from the main chain, yet the 'weighting' feature in the lexical selector is arguably a kind of statistical selector.

Incidentally, if you want to experiment with statistical translation, the University of Edinburgh develop a powerful Free Software translation engine called Moses.

After this description, I think you can see that it is no surprise Google use statistical translation. Yet there is a place for rule-based translation, and rule-based translation is arguably a more suitable methodology for agile Open Source projects.

Final thought, the minority languages of the world

Sad but true: in our world, languages, like the natural environment, are disappearing. In the last two hundred years or so, many languages have become extinct, and many more are under threat.

If you would like to see lists, there are official lists; check the lists of endangered languages. So do not fool yourself, there is an endangered language near you. I come from the UK, where all island languages, most of which are not from the same origins as English, are officially 'endangered'. English aside, all UK national languages are at least 'vulnerable'. The languages of the West, North, and Lowland Scottish, in some places developed away from 'standard' English, are now ruins. If you are feeling philosophical, try Tyranny of the Majority, or On Liberty by John Stuart Mill.

Apertium was originally created for the translation of Spanish to Catalan, and Catalan to Spanish. But hard work on simplifying the coding techniques, the speed of production of language pairs (including very small pairs), and the generosity of the organisation mean Apertium development has continued. And the project has found a place. While commercial translators, and translators powered by great resources, such as Google Translate, have a major position, the Apertium project continues to give birth to unusual, threatened, or, in one recent example, crisis-driven translation pairs. If you need a top-quality translation engine with a startlingly wide array of implementations for usage, or if you simply want to make a pair of your own, Apertium is a good choice.