
User:Rcrowther

   
'''(proposed page 'Apertium workflow reference', or similar title)'''
   
= Workflow reference guide =
This is not a guide to every Apertium feature. Some features work by default, or need only simple configuration, such as input and output formatting. Other features are advanced or optional, such as using a Constraint Grammar. These have been omitted for clarity.
 
   
It is a guide to the overall structure, with references to the workflow.
   
Apertium functions as a set of modules called 'lt-tools'. Each module is loosely connected to the next by a text stream; see [[Apertium_stream_format]] (add a general introduction somewhere else??? R.C.).
 
   
Following the introductions to each module is a technical description. These describe the action of each module with more precision. They may also introduce more technical language which linguists and/or computer coders would use. They are also terser :)
 
 
Modes are a newer feature in Apertium. A mode is a script which chains several modules together with commonly used configuration. Each script processes text through some, but not all, of the modules, and the last module in the chain is configured to produce useful debugging output. So, for example, you can use the 'chunker' mode script to see what the output looks like after the 'Chunker' module. The modes listed here are as found in a bilingual dictionary folder. For more information, see [[Modes introduction]]. Another way to view how <code>apertium</code> translates text is the Java tool [[Apertium-viewer]] (this tool can run the modes, presenting the results of the various stages in the translation workflow visually).
 
   
References to 'xxx' and 'yyy' stand for language codes, for example 'en-es': 'English' to 'Spanish'.
   
==The workflow==
 
===Morphological Analyser===
 
Reads text from an input (maybe after a deformatter). The text is analysed and converted into Apertium stream format.
 
   
Words in the text stream are checked against the dictionaries. If a word is recognised, it is reduced to what is called a 'lemma': a fixed representation of the word. For words with only one form, the lemma is the same as the word itself.
   
The recognition of words can be widened by the use of a paradigm. The English verb 'buy' may be recognised, in the source text, as the surface forms 'buy', 'buys', 'bought'. Now you see the reason for the 'lemma'; whatever the original form, the analyser reduces the word to the lemma 'buy'. Then the dictionary/analyser adds a set of tags to the lemma to explain how the word 'buy' was used---if it was in the past tense ('bought'), if it was attached to a third person ('buys'), and so on. The lemma and tags are rendered to the stream.
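The reduction to a lemma plus tags can be sketched in a few lines of Python (an illustration only, not Apertium code; the paradigm table is hand-written for this page, with tag names following the stream examples on this page):

```python
# Illustrative sketch (not Apertium code): a toy paradigm for the verb
# 'buy', mapping each surface form to the shared lemma plus usage tags.
PARADIGM_BUY = {
    "buy":    ("buy", ["vblex", "inf"]),
    "buys":   ("buy", ["vblex", "pres", "p3", "sg"]),
    "bought": ("buy", ["vblex", "past"]),
}

def analyse(surface_form):
    """Reduce a surface form to its lemma and tags, as the analyser does."""
    lemma, tags = PARADIGM_BUY[surface_form]
    return "^%s/%s%s$" % (surface_form, lemma, "".join("<%s>" % t for t in tags))

print(analyse("bought"))  # ^bought/buy<vblex><past>$
```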
   
For analysis, the dictionary is used in the left-to-right direction: 'surface form' -> 'lexical unit'.
   
====Technical Description====
Segment the text (expanding elisions, marking set phrases etc.) and look up the segments in the dictionary, write the baseform ('lemma') and tags for all matches (lemma and tags together are called a 'lexical unit').
   
   
====Example====
 
: "AN ANT is a wise creature for itself"
 
   
is broken into words (square brackets for illustration only). This is a simple example with no complications,
   
: "[AN] [ANT] [is] [a] [wise] [creature] [for] [itself]"
   
then analysed. Note in the stream output below the dictionary recognises all these words. For example, 'is' as a surface form of the verb 'to be',
   
: is/be<vbser><pres><p3><sg>
and 'creature' as a surface form of the noun-group 'creature'/'creatures', written as the lemma 'creature',
   
: creature/creature<n><sg>
   
This is the English dictionary, which is mature. Let us say that the dictionary did not recognise the word 'ant' (it does, but we pretend). The morphological analyser would then output the word with the 'not-recognised' symbol, '*',
   
: ^ant/*ant$
   
The Apertium default is to pass unrecognised words straight through.
   
====Typical stream output====
<pre>
^An/a<det><ind><sg>$ ^ant/ant<n><sg>$ ^is/be<vbser><pres><p3><sg>$ ^a/a<det><ind><sg>$ ^wise/wise<adj><sint>$ ^creature/creature<n><sg>$ ^for/for<cnjadv>/for<pr>$ ^itself/itself<prn><ref><p3><nt><sg>$^./.<sent>$
 
</pre>
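As an illustration of the stream format (plain Python, not part of lt-tools), here is a toy parser for lexical units like those above:

```python
import re

# Toy parser for lexical units in the Apertium stream format shown above.
# Each unit looks like ^surface/lemma<tag1><tag2>$ and may carry several
# analyses separated by '/'. Illustration only, not lt-tools code.
UNIT = re.compile(r"\^([^$]*)\$")

def parse_unit(unit):
    surface, *analyses = unit.split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<")[0]
        tags = re.findall(r"<([^>]+)>", analysis)
        parsed.append((lemma, tags))
    return surface, parsed

stream = "^ant/ant<n><sg>$ ^is/be<vbser><pres><p3><sg>$"
for match in UNIT.finditer(stream):
    print(parse_unit(match.group(1)))
# ('ant', [('ant', ['n', 'sg'])])
# ('is', [('be', ['vbser', 'pres', 'p3', 'sg'])])
```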
 
 
====Tool used====
 
: lt-proc
 
 
Note this option/switch,
 
 
:'-a', --analysis (default behaviour)
 
 
For this stage, you will often see this option used (the mode scripts all configure with this switch),
 
 
: -w, --dictionary-case: use dictionary case instead of surface case
 
 
====Auto Mode====
 
In a monodix,
 
 
: xxx-morph
 
 
Other modes exist which output results after the 'POS Tagger' or 'Constraint Grammar' modules are used. If these modules are not used (they are unusual), these modes will output the same as the 'morph' mode.
 
 
In a bidex,
 
 
: xxx-yyy-pgen
 
 
There is also a mode 'pretransfer'. The Pretransfer module adds some fixes before the bidex. The output will often be the same as 'pgen'.
 
 
====Configuration Files====
 
The monolingual dictionary,
 
 
: apertium-xxx-yyy.xxx.dix
 
 
====Other Files====
 
The .acx file is used to define equivalent characters (and can grab apostrophes). Not necessary for making a basic pair,
 
 
: apertium-xxx-yyy.xxx.acx
 
 
====Links====
 
* [[Monodix_basics]]
 
* [[Apertium_New_Language_Pair_HOWTO]]
 
* [[Acx]]
 
 
 
 
 
 
 
 
===Lexical transfer/Translation/bidex lookup===
 
The translation.
 
 
The incoming stream of lemmas/basewords is looked up in the bilingual dictionary. Tags can be matched, but most dictionaries simply match a source lemma to a destination lemma.
 
 
Bilingual dictionaries are *mostly* symmetrical (it is possible to define special cases). They are used in the direction of translation. For our example, the en-es dictionary, the dictionary would be used left->right to translate English to Spanish, and right->left to translate Spanish to English.
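A toy sketch of the lookup (illustration only; the mini dictionary below is invented for this page):

```python
# Toy bilingual lookup (illustration only): map a source lemma to a target
# lemma while carrying the tags across, as the bidex stage does.
BIDIX_EN_ES = {  # hypothetical mini en-es dictionary
    ("know", "vblex"): "saber",
    ("counsellor", "n"): "asesor",
    ("a", "det"): "uno",
}

def lexical_transfer(lemma, tags):
    # Unknown lemmas pass straight through, as in the Apertium default.
    target = BIDIX_EN_ES.get((lemma, tags[0]), lemma)
    return target, tags

print(lexical_transfer("know", ["vblex", "past"]))  # ('saber', ['vblex', 'past'])
```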
 
 
====Technical Description====
 
Look up source-language baseword to find target-language equivalent (i.e. map SL to TL)
 
 
 
====Example====
 
The source text,
 
 
: "I knew a counsellor"
 
 
has now been analysed and original surface forms are gone, replaced by lexical units (an 'analysis'),
 
 
: I(pronoun, singular) know('know', past tense) a(determiner, singular) counsellor('counsellor', singular)
 
 
In the stream,
 
 
: ^prpers<prn><subj><p1><mf><sg>$ ^know<vblex><past>$ ^a<det><ind><sg>$ ^counsellor<n><sg>$^.<sent>$
 
 
(note how the word 'knew' has now become the lemma 'know', marked as past tense)
 
 
The lexical units are now looked up in the bilingual dictionary, and translated. At this point, the translation is into lexical units, not surface forms. The word 'know' is looked up, not 'knew', and we get the translated lemma (Apertium has marked that in a later stage the verb must be put in the past tense). So the Spanish bilingual dictionary generates,
 
 
: (mark/tags for the singular pronoun, because Spanish doesn't translate with this word) saber('know', past tense) uno(determiner, singular) asesor('counsellor', singular)
 
 
====Typical stream output====
 
This is 'biltrans' mode output, which can be overwhelming, but shows the bidex action in detail,
 
 
<pre>
 
^prpers<prn><subj><p1><mf><sg>/prpers<prn><tn><p1><mf><sg>$ ^know<vblex><past>/saber<vblex><past>/conocer<vblex><past>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^counsellor<n><sg>/asesor<n><GD><sg>$^.<sent>/.<sent>$
 
</pre>
 
 
====Tool used====
 
This stage is preceded by 'apertium-pretransfer', a tool that runs some fixes on tagger and other input.
 
 
This tool runs the bilingual dictionary transfer,
 
 
: lt-proc -b
 
 
The option/switch,
 
 
: -b, --bilingual: lexical transfer
 
 
====Auto Mode====
 
Tells you exactly what is going into the bilingual dictionary,
 
 
: xxx-yyy-pretransfer
 
 
Displays both the source and translation (informative, but overwhelming for text of any length),
 
 
: xxx-yyy-biltrans
 
 
====Configuration Files====
 
The bilingual dictionary,
 
 
: apertium-xxx-yyy.xxx-yyy.dix
 
 
====Links====
 
* [[Apertium_New_Language_Pair_HOWTO]]
 
* [[Bilingual_dictionary]]
 
 
 
 
 
===Lexical selection===
 
Used to patch ambiguous translations.
 
 
Configuring this stage is not necessary for making a basic pair.
 
 
The module chooses between possible translations by matching on surrounding words. Generally, if you can find a single target word which covers several senses of the source word, use that. But sometimes the default translation produces nonsense, because the source language uses a word in an unusual way, or because there are many possible translations. This module can fix such errors.
 
 
For tricky situations, the Lexical Selection module allows weighting the rules to guess at a translation (this is crazy linguistic programming, but the option is there).
 
 
The Lexical Selection module, and its position after the bidex translation, is the result of a lot of experimentation with Apertium. The Constraint Grammar module and the POS Tagger were other attempts to perform disambiguation, see [[]]. However, the Lexical Selection module, though its action is crude, covers nearly all necessary disambiguation cases. It is also much faster, and easier to read.
 
 
Note: for those making a new language pair, the monodix and bidix must recognise the alternative word forms to be chosen, or the Lexical Selector will be unable to select. The module needs a lemma; it cannot use words which were not recognised on input.
 
 
====Technical Description====
 
Disambiguate target-language basewords/lemmas by choosing between alternatives. The choice is made by checking surrounding lemmas for hints about the intended translation.
 
 
 
====Example====
 
The English use the word "know". The English language has other words ("understand"), but the English use "know" widely. Translated into French, "know" becomes several words: "savoir", "connaître", "reconnaître", "s'informer" (and more). The translator must choose. But "savoir", in French, is not "reconnaître" (if one translation always worked, an acceptable default could be set in the bilingual dictionary).
 
 
But we know that if what follows is a pronoun or a proper noun, a person or thing, then "know" probably means "connaître", not "savoir" (the situation can be more complex than this, but it is a start). The Lexical Selection module allows us to write rules like: 'if "know" is followed by a pronoun, translate using "connaître"'.
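Such a rule can be sketched in Python (an illustration of the logic only, not the lrx rule syntax):

```python
# Sketch of a lexical-selection rule (illustration only, not lrx syntax):
# 'know' is translated as 'connaître' when the next source-language unit
# is a pronoun or proper noun; otherwise the default 'savoir' is kept.
def select_know(default, next_tags):
    if next_tags and next_tags[0] in ("prn", "np"):
        return "connaître"
    return default

print(select_know("savoir", ["prn", "subj"]))  # connaître
print(select_know("savoir", ["n", "sg"]))      # savoir
```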
 
 
Note that the rules match on source-language context, but select between target-language translations.
 
 
====Typical stream output====
 
The module only substitutes words. It will not change the stream form, which is the same as the output from the bidex. But see the available mode.
 
 
====Tool used====
 
 
: lrx-proc
 
 
====Auto Mode====
 
 
: xxx-yyy-lex.mode
 
 
====Configuration Files====
 
 
: apertium-xxx-yyy.xxx-yyy.lrx
 
 
====Links====
 
* [[How_to_get_started_with_lexical_selection_rules]]
 
* [[Constraint-based lexical selection module]]
 
 
 
 
 
===Chunker/Structural Transfer Stage 1/Intra chunk===
 
Identifies words, and groups of words, which may need their order altered, or tags added, to help the target dictionary.
 
 
Configuring this stage is not necessary for making a basic pair.
 
 
This stage can be used by itself, ignoring the following two 'chunker' stages. In documentation this is called 'shallow transfer'. However, as Apertium has developed, a move has been made to three-stage transfer. In a template build, all three files are present.
 
 
Chunking involves marking the stream with special marks, which gather lemma/tag forms into groups. These 'chunks' can then be moved round in later chunker stages.
 
 
While all three transfer stages have a similar syntax, and appear to work in much the same way, there are, depending on the stage, some limitations on their actions. Please see the [http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf Apertium 2.0: Official documentation].
 
 
When used as a 'shallow' one-stage transfer, this stage marks the stream with chunk boundaries, modifies tags, and reorders and modifies within the chunks. In a three-stage transfer, this stage is usually limited to marking chunk boundaries, and adding/removing tags.
 
 
You may wonder, if this stage can do most of the work, why the other stages exist. There are two good reasons. First, there is the computer-code reason: it is best to separate different jobs into different areas. If you do not split the work, you may find the computer code becomes a complex mess. Second, you may find that using the Chunker for all 'chunk' work, and maybe disambiguation too, makes some rules impossible to write. For example, if the chunker is busy reorganising words into Subject-Object-Verb order, it will not be able to add extra tags, check for internal inconsistencies, etc.
 
 
====Technical Description====
 
Flag grammatical divergences between SL and TL (e.g. gender or number agreement) by creating a sequence of chunks identified by special temporary markers in the text stream.
 
 
 
====Example====
 
: "violets; the daisy; lilies of all natures; the cherry-tree in blossom; honeysuckles; cherry-tree in fruit; figs in fruit; lavender in flowers; herba muscaria; the lime-tree in blossom; plums of all sorts in fruit; pears; apricocks; berberries; grapes; apples; poppies of all colors; peaches; nectarines; quinces..."
 
 
In English none of the items in this list of flowers and trees has a gender. But in many languages all the items will have gender. Preceding lexical units such as equivalents to 'a' or 'the' will need to agree. In en-es, the first transfer module has a rule to assemble the two lexical units into a 'chunk'. This,
 
 
: "a poppy"
 
 
becomes a 'chunk', and the chunk is given a tag to say the chunk is singular and the gender is feminine (this information is supplied by the bilingual dictionary, and clipped in the module),
 
 
: <f><sg>{a poppy}
 
 
If the gender information is not supplied, the gender is set to <GD>. This is one of a few standard common tags: <GD>, gender to be determined, and <ND>, number to be determined (singular/plural). (The <SN> tag seen above marks a noun-phrase chunk.)
 
 
You can invent your own tags to pass information along.
 
 
Another example:
 
 
: "I may give only this advice"
 
 
If this is being translated to a Subject-Object-Verb language (English is a Subject-Verb-Object language, but many languages are not), it will need rearranging. If the words are grammatically simple, they may be left as the morphological analysis found them. More complex word sequences are gathered into a chunk and, if necessary, tagged,
 
 
: I <a verb>{may give only} <some object>{this advice}
 
 
The "may" and "only" words also need to be handled. They could be cut, which is also a job for this module. If they are not cut, then they will need to be tagged, so they can be re-ordered. To see what happens next, look at the next module, 'interchunk'.
 
 
====Typical stream output====
 
From,
 
 
: 'a poppy',
 
 
The curly brackets are the stream representation for a 'chunk',
 
 
<pre>
 
^Det_nom<SN><DET><f><sg>{^uno<det><ind><3><4>$ ^amapola<n><3><4>$}$
 
</pre>
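The construction of a chunk like the one above can be sketched in Python (illustration only, not apertium-transfer code):

```python
# Toy chunker step (illustration only, not apertium-transfer code): wrap
# a determiner-noun pair in a chunk, hoisting its tags onto the chunk
# head. The '3' and '4' slots inside point back at the head's tag slots.
def make_chunk(name, head_tags, units):
    inner = " ".join("^%s$" % u for u in units)
    tags = "".join("<%s>" % t for t in head_tags)
    return "^%s%s{%s}$" % (name, tags, inner)

chunk = make_chunk("Det_nom", ["SN", "DET", "f", "sg"],
                   ["uno<det><ind><3><4>", "amapola<n><3><4>"])
print(chunk)
# ^Det_nom<SN><DET><f><sg>{^uno<det><ind><3><4>$ ^amapola<n><3><4>$}$
```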
 
 
====Tool used====
 
: apertium-transfer
 
 
====Auto Mode====
 
: xxx-yyy-chunker
 
 
====Configuration Files====
 
: apertium-xxx-yyy.xxx-yyy.t1x
 
 
====Links====
 
* [[Chunking]]
 
* [[Chunking:_A_full_example]]
 
 
 
 
 
 
===Transfer 2/InterChunk===
 
In a three-stage transfer, the second transfer stage orders chunked and tagged items from stage one.
 
 
Configuring this stage is not necessary for making a basic pair.
 
 
Detection is of patterns/sequences of chunks. This module cannot match words inside chunks, only the marks added to the chunks.
 
 
For language pairs with no major reordering between chunks, this module is not needed. If the 't2x' file is not configured, the module passes data unaltered. For example, 'en-es' has a Postchunk module (see next section), but not an Interchunk module.
 
 
====Technical Description====
 
Reorder or modify chunk sequences (e.g. transfer noun gender to related adjectives).
 
 
 
====Example====
 
From the previous example,
 
 
: "I may give only this advice"
 
 
If this is being translated to a Subject-Object-Verb language (English is a Subject-Verb-Object language, but many languages are not), it will need rearranging. At the very least, the target dictionary will need,
 
 
: I advice give
 
 
and it is in this module the words are rearranged.
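The reordering can be sketched in Python (illustration only; the role labels 'subj', 'verb' and 'obj' are invented for this page):

```python
# Sketch of an interchunk-style rule (illustration only; the role labels
# are invented for this page): emit subject, verb and object chunks in
# Subject-Object-Verb order.
def reorder_svo_to_sov(chunks):
    by_role = dict(chunks)
    return [by_role["subj"], by_role["obj"], by_role["verb"]]

chunks = [("subj", "I"), ("verb", "may give only"), ("obj", "this advice")]
print(" ".join(reorder_svo_to_sov(chunks)))  # I this advice may give only
```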
 
 
====Typical stream output====
 
The module only reorders chunks. It has no effect on the form of the stream. But see 'mode'.
 
 
====Tool used====
 
: apertium-interchunk
 
 
====Auto Mode====
 
: xxx-yyy-interchunk
 
 
====Configuration Files====
 
: apertium-xxx-yyy.xxx-yyy.t2x
 
 
====Links====
 
* [[Chunking:_A_full_example]]
 
 
 
 
===Transfer 3/PostChunk===
 
In a three-stage transfer, the third transfer stage controls how chunks are resolved and written out.
 
 
Configuring this stage is not necessary for making a basic pair.
 
 
Detection is not by pattern matching, it is by the name/lemma of the chunk itself. Position marks refer to the words/lexical units inside the chunks. The module will not write chunks, only lexical units and blanks.
 
 
So PostChunk is less abstracted than Transfer 2/InterChunk processing.
 
 
For language pairs with no rewriting of chunks, this module is not needed. If the 't3x' file is not configured, the module defaults to resolving and removing chunk data.
 
 
====Technical Description====
 
Substitute fully-tagged target-language forms into the chunks.
 
 
 
====Example====
 
Reducing a previous example, text arrives, prepared by the Chunker, labelled as feminine and singular. The following stripped version of the input shows the 'chunk' marks,
 
 
: ^<f><sg>{^uno<det><ind>$ ^amapola<n>$}$
 
 
Now the postchunk module must render this. In English, the chunk had no gender, so neither did the 'a' word/determiner. Now it has a gender, and the chunker stages have defined where these tags should be applied. In some cases of translation, the chunks may also be reordered.
 
 
====Typical stream output====
 
From,
 
 
: "a poppy"
 
 
Chunk marks have been removed from the stream (compare to 'chunker' output above), and tags distributed,
 
 
<pre>
 
^Uno<det><ind><f><sg>$ ^amapola<n><f><sg>$
 
</pre>
 
 
The output looks much the same as before the chunker stages. However, tags may have been added or deleted, and chunks of words reordered, to suit the target language.
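The distribution of chunk tags can be sketched in Python (illustration only, not apertium-postchunk code):

```python
# Sketch of the postchunk step (illustration only): copy the chunk's
# gender and number tags onto each lexical unit inside it, then drop
# the chunk wrapper, as in the stream output above.
def resolve_chunk(chunk_tags, units):
    suffix = "".join("<%s>" % t for t in chunk_tags)
    return " ".join("^%s%s$" % (u, suffix) for u in units)

print(resolve_chunk(["f", "sg"], ["uno<det><ind>", "amapola<n>"]))
# ^uno<det><ind><f><sg>$ ^amapola<n><f><sg>$
```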
 
 
====Tool used====
 
: apertium-postchunk
 
 
====Auto Mode====
 
: xxx-yyy-postchunk
 
 
====Configuration Files====
 
: apertium-xxx-yyy.xxx-yyy.t3x
 
 
====Links====
 
* [[Chunking:_A_full_example]]
 
 
 
 
 
 
===Morphological Generator===
 
'Generate' the surface forms of the translated words.
 
 
At this point, the text stream contains target-language lemmas and tags, perhaps modified and prepared by the Lexical Selector and/or chunker stages. But this is not the final form. The Morphological Generator stage takes the lemma and tags, then generates the target-language surface form, e.g. it takes the lemma 'knife' and the tag '<pl>' (for plural), then generates 'knives'.
 
 
For this, the target-language monodix is used in the direction right to left (the reverse of the analyser's left-to-right reading): 'surface form' <- 'lexical unit'.
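The generation step can be sketched in Python (illustration only; the mini dictionary below is invented for this page):

```python
# Toy generator (illustration only): turn a lemma plus tags back into a
# surface form, the reverse direction of the analyser's dictionary.
GENERATION = {  # hypothetical mini target-language entries
    ("knife", "sg"): "knife",
    ("knife", "pl"): "knives",
}

def generate(lemma, tags):
    number = "pl" if "pl" in tags else "sg"
    return GENERATION[(lemma, number)]

print(generate("knife", ["n", "pl"]))  # knives
```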
 
 
====Technical Description====
 
Use the lemma and tags ('lexical unit') to deliver the correct target-language surface form.
 
 
 
====Example====
 
The output, now translated,
 
 
: "He that travels into a country, before he has some entrance into the language, goes to school, and not to travel"
 
 
becomes,
 
 
: "Él aquello viaja a un país, antes de quei tiene alguna entrada a la lengua, va a escuela, y no para viajar"
 
 
This may be changed, for a few surface forms, by the Post Generator, and then formatted.
 
 
====Typical stream output====
 
The translation, now stripped of stream formatting, but before Post Generator and formatting. See the example above.
 
 
====Tool used====
 
: lt-proc -g
 
 
The switch/option,
 
 
: -g, --generation: morphological generation
 
 
====Auto Mode====
 
Post Chunker debug output (before Post Generator),
 
 
: xxx-yyy-dgen
 
 
====Configuration Files====
 
: apertium-xxx-yyy.yyy.dix
 
 
====Links====
 
* [[Monodix_basics]]
 
* [[Apertium_New_Language_Pair_HOWTO]]
 
 
 
 
 
 
===PostGenerator===
 
Corrects or localises spelling where the adjustment relies on the next word.
 
 
Configuring this stage is not necessary for making a basic pair.
 
 
The post-generator uses a dictionary very similar to a mono-dictionary, used as a generator. So it is capable of creating, modifying and removing text. It can use paradigms. However, please read the rest of this section! The generator must be triggered using an '<a/>' tag in the bidex.
 
 
The module was originally provided to convert Spanish-like 'de el' into 'del'. It also performs a good job on placing 'a'/'an' determiners before English nouns ('an apple'). Here you can see the two main features of the post-generator. First, it works on the text as generated, so can be used when the form of a word is closely related to the final form of the following word. Note that 'de el' and 'a'/'an' could not be handled earlier in the text stream/Apertium workflow, because we have no idea what the surface forms will be. These forms are only available after the generating monodix. The second feature is that, in general, the post-generator works by inflecting/selecting/replacing a word based on the following word.
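Both classic fixes can be sketched in Python (a rough illustration of the behaviour, not the post-generator dictionary format):

```python
import re

# Sketch of the post-generator's two classic fixes (illustration only):
# 'de el' -> 'del' in Spanish, and 'a' -> 'an' before a vowel in English.
# Both look across the word boundary at the *following* word.
def postgenerate(text):
    text = re.sub(r"\bde el\b", "del", text)
    text = re.sub(r"\ba (?=[aeiou])", "an ", text)
    return text

print(postgenerate("a apple in de el garden"))  # an apple in del garden
```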
 
 
The post-generator is sometimes referred to in documentation as intended for 'orthography'. Orthography is conventions of spelling, hyphenation, and other graphical display features i.e. the language side of typography. Perhaps that was the original intention for the Post Generator, but the module at the time of writing is unsuitable for many orthographic tasks. It displays several unexpected behaviours. Attempts at elision and compression, other than a 'de el'->'del' style of elision across the forward word boundary, are likely to fail. However, the module is so useful for these two cases alone that it is an established stage in the workflow.
 
 
====Technical Description====
 
Make final adjustments where a generated surface form relies on the next surface form.
 
 
 
====Example====
 
The example from the manual is Spanish,
 
 
: "de el"
 
 
which becomes,
 
 
: "del"
 
 
And the template includes an example in English,
 
 
: "a"
 
 
which becomes "an" before a vowel,
 
 
: "an apple" (but "a car")
 
 
Both examples go beyond pure orthography: they depend on the final surface forms, and on those forms being next to each other.
 
 
The Post Generator handles difficult cases. For example, we translate into English,
 
 
: Un peligro inminente
 
 
The Post Generator will successfully handle the determiner, translating to,
 
 
: An imminent danger
 
 
It would be very difficult to handle this action earlier in the Apertium workflow. It could also confuse intentions in the code, and maybe limit other work we needed to do.
 
 
But the Post Generator is not useful for some actions. English sometimes hyphenates groups of words,
 
 
: "But all this while, when I speak of vain-glory..."
 
 
Other common hyphenated groups are "misty-eyed", "follow-up", and "computer-aided". Finding a rule for this form of hyphenation is not easy. Let us imagine a rule exists. Unfortunately, the Post Generator could not handle the insertion of the hyphen, because it is made to recognise the following blank and either replace the first word, or the words as a whole. Language pair en-es handles the above examples by recognising these word groups in the initial monodix, not by Post Generator manipulation.
 
 
For the same reason, the Post-Generator can not handle another orthographic action in English; the use of the apostrophe. For example,
 
 
: "that is but a circle of tales"
 
 
will often become,
 
 
: that's but a circle of tales
 
 
The rule is clear and, for this example, reaches across a following blank. But the Post Generator can give unintended results when manipulating a single letter (such as an 's'), and is more predictable with a full replacement. Also, there are many such compressions and elisions in English ('where is' -> 'where's', etc.), so it may be better to handle these with more general rules earlier in the workflow, or to consider if the translation is better without them.
 
 
====Typical stream output====
 
The translation, now stripped of stream formatting e.g.
 
 
: "that is but a circle of tales"
 
 
gives,
 
 
<pre>
 
Aquello es pero un círculo de cuentos
 
</pre>
 
 
====Tool used====
 
: lt-proc -p
 
 
The switch/option,
 
 
: '-p', --postgeneration
 
 
====Auto Mode====
 
Input,
 
 
: xxx-yyy-dgen
 
 
 
which is the debug output of the Post Chunker. Output, excluding formatting, is the finished product, so,
 
 
: xxx-yyy
 
 
====Configuration Files====
 
The files are in the mono-dictionary folders. For the source language,
 
 
: apertium-xxx.post-xxx.dix
 
 
 
For the target language,
 
 
: apertium-yyy.post-yyy.dix
 
 
====Links====
 
The post generator usually only contains a handful of rules, often common, so is not covered by much documentation. For depth, try the [http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf Apertium 2.0: Official documentation] (Sect. 3.1.2). For a quick-reference example,
 
 
* [[Post-generator]]
 
 
==Older/unusual modules==
 
These modules may only be used for special cases, or have been deprecated, but are present in some existing pairs. They are included in the binary .deb distribution.
 
 
===Constraint grammar===
 
Also called 'Visl cg3' and 'CG3'.
 
 
The 'Constraint Grammar' module is not used often. In Apertium, it is preferred, if possible, to use the 'Lexical Selection' module. The module is placed after the morphological analysis (the first monodix) and attempts to disambiguate text.
 
 
A Constraint Grammar module is a very powerful way of disambiguating input text. It uses rules (compare to the 'POS tagger' module below, which uses statistics). The Constraint Grammar module can tear text apart and restructure, like the 'chunker' modules. Also, it is a project which extends outside Apertium, so there may be Constraint Grammar code available to disambiguate languages.
 
 
While reusing and building on the work of others is a generous idea, the Apertium module has problems. The grammar is not consistent with other Apertium grammar, the power of the module leads to obscure code, and the module uses a lot of computing power. So we are not writing about it. If you are using a language pair which contains Constraint Grammar, or have a difficult language to translate, see the link below. Or ask on IRC.
 
 
====Technical Description====
 
Resolve ambiguous segments (i.e. where there is more than one match) by choosing a match by rule-building.
 
 
 
====Links====
 
[http://wiki.apertium.org/wiki/Constraint_Grammar Constraint Grammar]
 
 
 
 
 
===POS tagger and lexical selection===
 
The 'POS Tagger' module is not used often. In Apertium, it is preferred, if possible, to use the 'Lexical Selection' module. The module is placed after the morphological analysis (the first monodix) and attempts to disambiguate text.
 
 
The module disambiguates text by testing word placements against statistics. The statistics are generated by feeding pre-translated text into the module. Another way of saying this; 'Ok, this word is ambiguous. Now, the module has been told about many real-world translations. When it saw those translations, how often, and where, was the ambiguous word translated to one form or another? Based on that, what is the best translation?'.
 
 
This module is not under the direct control of a configuration file; it is trained on input data. If the 'POS tagger' module is not trained, data passes through unchanged.
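The idea of training on tagged text and then choosing the statistically likeliest reading can be sketched in Python. The training data and the simple bigram model below are invented for illustration; apertium-tagger itself uses a more sophisticated statistical model.

```python
from collections import Counter

# Toy sketch of statistical disambiguation (not apertium-tagger's model):
# learn bigram counts of (previous_tag, tag) from hand-tagged training
# sentences, then choose the reading with the highest count in context.

def train(tagged_sentences):
    bigrams = Counter()
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for _word, tag in sent:
            bigrams[(prev, tag)] += 1
            prev = tag
    return bigrams

def disambiguate(bigrams, ambiguous_sentence):
    prev, out = "<s>", []
    for readings in ambiguous_sentence:
        best = max(readings, key=lambda tag: bigrams[(prev, tag)])
        out.append(best)
        prev = best
    return out

training = [
    [("the", "det"), ("stage", "n"), ("is", "vbser")],
    [("the", "det"), ("love", "n"), ("is", "vbser")],
]
model = train(training)
# 'n' follows 'det' in the training data, so the noun reading wins.
print(disambiguate(model, [["det"], ["n", "vblex"], ["vbser"]]))
```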
 
 
The tagger has some configuration. The TSX format configures the tagging process, identifying tags from the initial analysis and mapping them to 'coarse tags'. These coarse tags are used for building the statistics.
 
 
'Coarse tags' are little more than an overall name (e.g. "the word 'shrink' as a noun"). The coarse tagging process can also accept or reject some adjacent tag names (so 'shrink' after 'the' is a noun).
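The collapse from fine analyses to coarse tags can be sketched like this. The categories and the mapping below are invented for the example; they are not a real TSX configuration.

```python
# Illustrative sketch: collapsing fine analysis strings into coarse
# categories, as a TSX file does for the tagger. The category names and
# the mapping are examples, not a real TSX configuration.

COARSE = {
    "n": "NOUN",      # any noun reading
    "vblex": "VERB",  # any lexical-verb reading
    "det": "DET",     # any determiner reading
}

def coarsen(analysis):
    """Map a fine analysis like 'stage<n><sg>' to a coarse tag
    by looking at its first morphological tag."""
    first_tag = analysis.split("<")[1].rstrip(">")
    return COARSE.get(first_tag, "OTHER")

print(coarsen("stage<n><sg>"))      # NOUN
print(coarsen("love<vblex><inf>"))  # VERB
```

The statistics are then gathered over these few coarse categories rather than over every fine tag combination, which keeps the model small.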
 
 
====Technical Description====
 
Resolves ambiguous segments (i.e. where there is more than one match) by choosing a match through statistical training.
 
 
 
====Example====
 
: "THE stage is more beholding to love"
 
 
This group of words can be read in several ways. 'The stage' may refer to a 'stage' mentioned earlier in the text, for example 'the meeting'. And 'more beholding' may mean 'a place that creates love'. In the original text, 'the stage' means 'the theatre', and 'more beholding' means 'a place for love to be presented'.
 
 
It is very unlikely that general translation software has been trained on this old form of English. In one translation engine, a round trip from English to Spanish and back produced "The scenario is looking more to love". Using a POS-tagging/statistics-based analysis, the engine tried to guess the intention of the words,
 
 
: "(means "steps in a process")[THE stage] is (means "to look towards")[more beholding] to love"
 
 
Here the engine failed, but it will often succeed.
 
 
====Typical stream output====
 
The format of the output is the standard Apertium text stream, but the tagger changes the morphological output both in quality of analysis and in volume. Here is the morphological (pre-tagger) analysis of the above example, using the current English monodix,
 
 
: ^THE/THE<det><def><sp>$ ^stage/stage<n><sg>/stage<vblex><inf>/stage<vblex><imp>$ ^is/be<vbser><pres><p3><sg>$ ^more/more<adv>/more<preadv>/more<n><sg>/more<det><qnt><sp>$ ^beholding/*beholding$ ^to/to<pr>$ ^love/love<n><sg>/love<vblex><inf>/love<vblex><pres>/love<vblex><imp>$^./.<sent>$
 
 
but the tagger may have decided how to handle ambiguities and added new tags. Here is the output from the bidix tagger mode using the es-en pair,
 
 
: ^the<det><def><sp>$ ^stage<n><sg>$ ^be<vbser><pri><p3><sg>$ ^more<preadv>$ ^*beholding$ ^to<pr>$ ^love<vblex><inf>$^.<sent>$
 
 
You can see that some ambiguities have been resolved: the words 'stage' and 'love', which can each be a noun or a verb, are now single readings.
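The stream lines above can be split mechanically. Here is a minimal parser sketch; it handles only the plain <code>^…$</code> units shown here, not escaped characters or superblanks, so it is an illustration rather than a full stream-format parser.

```python
import re

# Minimal sketch of splitting Apertium stream units such as
# ^stage/stage<n><sg>/stage<vblex><inf>$ into surface form and analyses.
# Handles only plain units: no escaped characters, no superblanks.

def parse_units(stream):
    units = []
    for body in re.findall(r"\^(.*?)\$", stream):
        parts = body.split("/")
        units.append({"surface": parts[0], "analyses": parts[1:]})
    return units

line = "^stage/stage<n><sg>/stage<vblex><inf>$ ^to/to<pr>$"
for unit in parse_units(line):
    print(unit["surface"], len(unit["analyses"]))
```

Counting the analyses per unit in this way is a quick measure of how much ambiguity is left at each stage of the pipeline.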
 
 
====Tool used====
 
(not directly configurable)
 
 
: apertium-tagger
 
 
====Auto Mode====
 
In a monodix,
 
: xxx-tagger
 
 
In a bidix,
 
: xxx-yyy-tagger
 
 
====Configuration Files====
 
For the tagger,
 
: apertium-xxx-yyy.xxx.tsx
 
 
(a new language-pair template contains no .tsx files)
 
 
====Links====
 
[[Tagger training]]
 
[[TSX format]]
 
 
 
==References==
 
Text examples from,
 
* Francis Bacon (1625). [http://www.gutenberg.org/files/575/575-h/575-h.htm#link2H_4_0037 THE ESSAYS OR COUNSELS, CIVIL AND MORAL]. Project Gutenberg
 

Revision as of 10:27, 24 April 2017

== For those who want to install Apertium locally, and developers ==

How to install Apertium core[1] and language data on your system (developers may also want to consider their operating environment[2]).

=== Installing: a summary ===

Most people will need to:

* Install Apertium core by packaging or in a virtual environment
* For translators: install language data/dictionaries/pairs from repositories, using packaging (including hints about the private repository)
* For language developers: install language data/dictionaries/pairs by compiling

=== Alternatives ===

==== Installing Apertium core by compiling ====

Apertium maintains a private repository that is up-to-date and reliable. If you do not want to work on the core, or develop languages, please use either packaging or a virtual environment. The packages stay up-to-date and are stable; a compile will waste your time.

However, if you are planning to work on Apertium core, or have an operating system not covered above, go right ahead and install Apertium core by compiling[3].

=== Notes ===

# Apertium is a big system. There are many plugins, scripts, and extension projects. The core, the code which translates, is a multi-step set of tools joined by a stream format and, nowadays, invoked by scripts called 'modes'. You may also see the names 'lt-toolbox'/'lt-tools', 'apertium-lex-tools', and the simple title 'apertium'. These refer to groupings of the tools. Packaged or compiled, these tools can be installed as one unit. From here on, we call them 'Apertium core'.
# Apertium is written to be platform-independent. However, it can be difficult to maintain platform-independence over a project this wide. If you intend to do something deep with Apertium, you will gain more help from the tools if you use Ubuntu, or a similar Debian-based operating system. In no way does this mean that the Apertium project favours this platform.
# Most people know the word 'install'. It means 'put code in my operating system'. When developing, it is not usual to fully 'install': you get the code working enough to get results. This is relevant to Apertium, which needs a rapid cycle of re-compiles. If you follow instructions to compile code, you will be discouraged from 'installing' builds. When we use the word 'install', we mean 'get code working on my computer'.

=== Installation Videos ===

Most of these videos have been produced by Google Code-In students.
