User:Rcrowther

(to be placed below 'Apertium for Dummies' diagram?)

Notes

If the language pair is used in reverse, lang Y -> lang X, then the monodix for lang Y works as an analyser (Left->Right/LR), the bidex is read Right->Left, and the monodix for lang X works as a generator (Right->Left/RL).
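A rough sketch of how this works out when a pair is built (the xxx/yyy file names below follow the usual placeholder convention and are assumptions, not taken from a real pair's build files): the same dictionaries are compiled once in each direction with lt-comp,

lt-comp lr apertium-xxx-yyy.xxx.dix xxx-yyy.automorf.bin     # xxx monodix read LR: analyser for xxx -> yyy
lt-comp rl apertium-xxx-yyy.yyy.dix xxx-yyy.autogen.bin      # yyy monodix read RL: generator for xxx -> yyy
lt-comp lr apertium-xxx-yyy.xxx-yyy.dix xxx-yyy.autobil.bin  # bidex read LR for xxx -> yyy
lt-comp lr apertium-xxx-yyy.yyy.dix yyy-xxx.automorf.bin     # same yyy monodix read LR: analyser for yyy -> xxx
lt-comp rl apertium-xxx-yyy.xxx.dix yyy-xxx.autogen.bin      # same xxx monodix read RL: generator for yyy -> xxx
lt-comp rl apertium-xxx-yyy.xxx-yyy.dix yyy-xxx.autobil.bin  # same bidex read RL for yyy -> xxx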

If you are creating a new pair, the necessary modules are the two monodixes and the bidex. Other modules (e.g. Lexical Selection, Chunker stages, Post Generator) are for refining translation results.
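As a sketch, using the same (assumed) naming convention as the compile commands above, a minimal xxx-yyy pair therefore needs only three dictionary files,

apertium-xxx-yyy.xxx.dix        the monodix for lang xxx
apertium-xxx-yyy.yyy.dix        the monodix for lang yyy
apertium-xxx-yyy.xxx-yyy.dix    the bidex, pairing xxx entries with yyy entries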

In the Wiki, you may find references to the Lexical Selection module being placed *before* the Lexical Transfer ('translation') module. That was the module's original position; it is now placed *after* the Lexical Transfer module. This decision is final (if software is ever 'final'...).

Many parts of the Wiki refer to Constraint Grammars (vislcg3/CG-3, sometimes HFST) for text disambiguation. These codebases can be used as modules in the Apertium workflow, but are not part of the Apertium project; they are maintained elsewhere, and the grammar data they use is sometimes maintained elsewhere too. Such a module would usually be placed after Morphological Analysis but before Lexical Transfer. Apertium pairs can be developed by inserting Constraint Grammars, but this would be unusual, as most of the same effects can be achieved using the Lexical Selection and/or Chunker modules. The Apertium modules are not as powerful as a Constraint Grammar (or need a lot of work to be that powerful), but offer much faster processing and are easier to read and maintain.
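A rough sketch of where such a module would sit in the pipeline (placeholder names again; cg-proc is the external CG-3 tool that reads the Apertium stream, and the .rlx.bin name is only the common convention, so treat the details as assumptions),

lt-proc xxx-yyy.automorf.bin |     # Morphological Analysis
  cg-proc xxx-yyy.rlx.bin |        # Constraint Grammar disambiguation (external tool)
  apertium-pretransfer |
  lt-proc -b xxx-yyy.autobil.bin   # Lexical Transfer (bidex lookup)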

Also mentioned in the Wiki is a 'POS Tagger' step. Like a constraint grammar, this module was/is placed after Morphological Analysis but before Lexical Transfer, and like a constraint grammar it is used for word disambiguation. However, a constraint grammar works from hand-written rules which decide which reading should be chosen, whereas the POS Tagger adds special tags to the incoming text and builds its data by being 'trained', i.e. by statistical analysis. Development in Apertium has revealed that the POS Tagger module, though powerful, offers little improvement in translation quality, and it has not been used in new pairs for some time.

The Post Generator module is sometimes referred to as an 'orthographical' analyser/text-modification module. It was originally provided to convert Spanish-like 'de el' into 'del'. The word 'orthographical' suggests the Post Generator can handle word compressions/apostrophes such as the common English form "John's house". However, when used in this way the module shows several unexpected behaviours; not bugs, but it is not flexible. The module continues to perform useful work converting 'de el' into 'del' or, in English, choosing the correct form of the 'a'/'an' determiner ('a house', but 'an apple'). It is not, however, suitable for general orthography; apostrophes, for example, are often handled in a bidex.
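As a sketch of how the module is invoked (placeholder names; the post-generation dictionary and its compiled form follow the usual naming convention, which is an assumption here), it runs after the generator, using lt-proc in post-generation mode,

lt-comp lr apertium-xxx-yyy.post-yyy.dix xxx-yyy.autopgen.bin   # compile the post-generation dictionary
lt-proc -p xxx-yyy.autopgen.bin                                 # run it over the generator's output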

Several modules can do the work of other modules. For example, the first chunker module, Chunker (sometimes called, confusingly, 'IntraChunk'), is a very powerful module that can perform the work of the Lexical Selector. Indeed, in several language pairs Chunker rules do lexical selection. However, Chunker code is clumsy and difficult to read. The current Lexical Selection module is clean, fast, and offers what computer programmers call 'separation of concerns': if the Lexical Selector can do the work then, to help developers and future readers, the code is better placed there. The same is true of the analysing Monodix (which could do disambiguation work) and InterChunk (which could do Post Generator work).

(example from a new proposed page 'Apertium workflow reference', or similar title)


Lexical selection

Used later in development, to patch ambiguous translations.

Chooses between possible translations by matching on the surrounding words. If the dictionary already defaults to a single translation, configuring this module serves no purpose. But if the bilingual dictionary is ambiguous, this module can make better decisions than a simple default.

For tricky situations, the lexical selection module allows rules to be weighted, so the most likely guess wins.

Example

English speakers use the word "know". They have other words ("understand"), but they use "know" widely. Translated into French, "know" becomes several words: "savoir", "connaître", "reconnaître", "s'informer" (and more). The translator must choose. But "savoir", in French, is not "reconnaître" (if the translation were always the same, a default could simply be set in the bilingual dictionary).

But we know that if what follows is a pronoun or a proper noun, a person or a thing, then "know" probably means "connaître", not "savoir" (the situation can be more complex than this, but it is a start). The lexical selector allows us to write rules like, 'if "know" is followed by a pronoun, translate as "connaître"'.

Note that the rules match words in the source language, but select the translation in the target language.
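A minimal sketch of such a rule in the .lrx format (the lemmas and tags are illustrative assumptions, not taken from a real English-French pair),

<rules>
  <rule>
    <!-- the word we are selecting for: English "know" as a verb -->
    <match lemma="know" tags="vblex.*">
      <select lemma="connaître"/>
    </match>
    <!-- context: the following word must be a pronoun -->
    <match tags="prn.*"/>
  </rule>
</rules>

Rules can also carry a weight attribute (e.g. <rule weight="0.8">), which is the weighting mentioned above for cases where several rules could apply.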

Typical stream output

???

Tools used

To compile a file of rules,

lrx-comp

To run the compiled lexical rules,

lrx-proc -b

Auto Mode

xxx-yyy-lex.mode

Configuration Files

Found in a bilingual directory,

apertium-xxx-yyy.xxx-yyy.lrx

compiles to,

xxx-yyy.autolex.bin
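
As a sketch of the compile step (assuming the placeholder names above and lrx-comp's usual rule-file-then-output-file argument order),

lrx-comp apertium-xxx-yyy.xxx-yyy.lrx xxx-yyy.autolex.bin

The resulting xxx-yyy.autolex.bin is what lrx-proc (above) reads when the pair's modes are run.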

Links

Constraint-based lexical selection module