User:Rcrowther

Notes

If the language pair is used in reverse, lang Y -> lang X, then the monodix for lang Y works as an analyser (Left->Right/LR), the bidex works Right->Left, and lang X works as a generator (Right->Left/RL).

If you are creating a new pair, the necessary modules are the two monodix and the bidex. Other modules (e.g. Lexical selection, Chunker stages, Post Generator) are for refining translation results.

In the Wiki, you may find references to the Lexical Selection module being placed *before* the Lexical Transfer ('translation') module. This was the original position. The position of the module is now *after* the Lexical Transfer module. This decision is final (if software is ever 'final'...).

Many parts of the Wiki refer to Constraint Grammars (vislcg3 CG-3, sometimes HFST) for text disambiguation. These codebases can be used as modules in the Apertium workflow, but are not part of the Apertium project. They are maintained elsewhere. Also, the grammar information they use is sometimes maintained elsewhere. The modules would usually be placed after Morphological Analysis but before Lexical Transfer. Apertium pairs can be developed by inserting Context Grammars, but this would be unusual, as most of the same effects can be achieved using the Lexical Selection and/or Chunker modules. The Apertium modules are not as powerful as a Constraint Grammar (or need a lot of work to be that powerful), but offer much faster processing and are easy to read and maintain.

Also mentioned in the Wiki is a step 'POS Tagger'. Like a constraint grammar, this module was/is placed after Morphological Analysis but before Lexical Transfer. Like a constraint grammar, the POS Tagger is used for word disambiguation. However, a constraint grammar works by constructing rules which decide which word should be chosen. The POS Tagger works/worked by adding special tags to the incoming text, after which the module builds data by being 'trained' i.e. statistical analysis. Development in Apertium has revealed that the POS Tagger module, though powerful, offers little improvement in translation quality. The module has not been used in new pairs for some time.

The Post Generator module is sometimes referred to as an 'orthographical' analyser/text-modification module. The module was originally provided to convert Spanish-like 'de el' into 'del'. The word 'orthographical' suggests the Post Generator module can handle word compressions/apostrophes such as the common English form "John's house". However, when used in this way, the Post Generator module has several unexpected behaviours. Not bugs, but the module does not perform in a flexible way. The module continues to perform useful work converting 'de el' into 'del' or, in English, placing the correct form of 'a'/'an' determiners ('a house', but 'an apple'). However, the module is not suitable for general orthography. Apostrophes, for example, are often handled in a bidex.

Several modules can do the work of other modules. For example, the first chunker module, Chunker (sometimes called, confusingly, 'IntraChunk') is a very powerful module that can perform the work of the Lexical Selector. Indeed, in several language pairs Chunker rules do lexical selection. However, Chunker code is clumsy and difficult to read. The current Lexical Selection module is clean, fast, and offers what computer programmers call 'separation of concerns'. That is, if the Lexical Selector can do the work then, to help developers and future readers, the code is better placed there. The same is true of the analysing Monodix (which could do disambiguation work) and InterChunk (which could do Post Generator work).

User:Rcrowther

Notes

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools