Difference between revisions of "Format handling"

From Apertium
(New page: 3.6.2 Data: format specification rules This section describes how the de-formatter and re-formatter are gener- ated from a format specification in XML. Rules for format, like linguisti...)
 
(Link to French page)
 
(61 intermediate revisions by 3 users not shown)
Latest revision as of 09:49, 6 October 2014

En français

Format handling in Apertium is done by dedicated programs that encapsulate and de-encapsulate formatting information in "superblanks",[1] which are delimited by the characters [ and ]. For example, when processing HTML, the program apertium-deshtml encapsulates the formatting information and apertium-rehtml de-encapsulates (restores) it:

$ echo "<em>this is</em> a <b>test</b>" | apertium-deshtml
[<em>]this is[<\/em> ]a[ <b>]test.[][<\/b>]

$ echo "<em>this is</em> a <b>test</b>" | apertium-deshtml | apertium-rehtml
<em>this is</em> a <b>test</b>

To make the whole translator use a particular format, pass the -f option, e.g. apertium -f html oc-ca infile outfile.
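
The encapsulation idea can be modelled in a few lines of Python. This is only a rough sketch, not how apertium-deshtml is implemented: the function names encapsulate and restore are made up for illustration, every <...> tag is simply wrapped in [ and ], and the real tool's escaping, its absorption of neighbouring whitespace into the superblank, and the artificial sentence-end mark (the .[] in the output above) are all left out.

```python
import re

# Hypothetical, simplified model of superblanks (not Apertium's own code).
TAG = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
SUPERBLANK = re.compile(r"\[(<[^>]+>)\]")  # a tag wrapped in [ and ]


def encapsulate(text: str) -> str:
    """Wrap every tag in [ ] so that it is treated as format, not as text."""
    return TAG.sub(lambda m: "[" + m.group(0) + "]", text)


def restore(text: str) -> str:
    """Remove the [ ] around every encapsulated tag."""
    return SUPERBLANK.sub(r"\1", text)
```

With this model, encapsulate('<em>this is</em> a <b>test</b>') gives '[<em>]this is[</em>] a [<b>]test[</b>]', and restore() turns that back into the original string.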

Official formats handled by Apertium

Currently, deformatters and reformatters are available for:

  • plain text: -f txt (apertium-destxt)
  • HTML: -f html or -f html-noent (apertium-deshtml)
    • html prints non-ASCII chars using entities, html-noent keeps non-ASCII as-is; the difference is in the reformatter
  • RTF: -f rtf (apertium-desrtf)
  • OpenOffice.org Writer ODT: -f odt (apertium-desodt)
  • Microsoft Word DOCX, WXML: -f wxml (apertium-deswxml)
  • Microsoft PowerPoint PPTX: -f pptx (apertium-despptx)
  • Microsoft Excel XLSX: -f xlsx (apertium-desxlsx)
  • QuarkXPress XpressTag: -f xpresstag (apertium-desxpresstag)
  • MediaWiki: -f wikimedia (apertium-desmediawiki -- still a work in progress, see Translating wikimedia)

There is as yet no built-in handling of gettext .po files or subtitle formats, but see Translating gettext and Translating subtitles for very simple solutions. More formats are listed at Tips for translators.

Some "special" features and gotchas:

  • apertium-destxt adds a full stop before any line break that is not followed by text, so you sometimes get two full stops; apertium-deshtml does the same with paragraph markup. To avoid this, you currently need to patch apertium (https://gist.github.com/unhammer/7689291; some discussion at http://thread.gmane.org/gmane.comp.nlp.apertium/1481).
  • apertium-deshtml and the other XML-based formatters accept the tag <apertium-notrans> to mean "don't translate this"; if you have text that is not markup but that you don't want translated, wrap it in that element like this:
    text to be translated<apertium-notrans>don't translate me</apertium-notrans> translate again
  • To translate document formats like ODT, you may have to pass the infile/outfile on the command line instead of piping, e.g. do apertium -f odt oc-ca in.odt out.odt, not cat in.odt | apertium -f odt oc-ca > out.odt.
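
When the translator is called from a script, the file-based invocation above can be wrapped as follows. This is a minimal sketch, assuming the apertium executable is on the PATH; the helper names apertium_command and translate_file are hypothetical, and only the -f option and the argument order shown above are used.

```python
import subprocess
from typing import List


def apertium_command(fmt: str, pair: str, infile: str, outfile: str) -> List[str]:
    """Build the argument list for: apertium -f FMT PAIR INFILE OUTFILE."""
    return ["apertium", "-f", fmt, pair, infile, outfile]


def translate_file(fmt: str, pair: str, infile: str, outfile: str) -> None:
    """Run apertium on named files instead of pipes (hypothetical helper)."""
    # check=True raises CalledProcessError if apertium exits with an error.
    subprocess.run(apertium_command(fmt, pair, infile, outfile), check=True)


if __name__ == "__main__":
    # Prints the same command line as the ODT example above.
    print(" ".join(apertium_command("odt", "oc-ca", "in.odt", "out.odt")))
```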

Formats handled by separate packages

Other deformatters and reformatters were written directly in C or C++ without using XML files, so they do not follow the format specification described in the following sections. They are therefore distributed in separate packages.

  • apertium-mediawiki is a package written in C++ that handles the format of wikimedia documents, with better support for links.
  • apertium-c-formatters is a package written in C that handles the following formats:
    • man pages
    • mnemonic files (an alternative to .po files for multilingual user interfaces)

Limitations

The Apertium tools cannot deal with reordered superblanks. The following example shows what happens when superblanks sit between reordered words or chunks:

$ echo '<i>Perro</i> <b>blanco</b>' |apertium es-en -f html
<i>White</i> <b>dog</b>

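The formatting stays in its original position while the words are reordered, so the italics that marked Perro ("dog") end up on White. This is currently a hard problem to fix and will require changes to both the transfer engine and the transfer rules.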

Read more at Reordering superblanks.

Format specification

This section and those that follow apply only to formats officially supported by Apertium.

This section describes how the de-formatter and re-formatter are generated from a format specification in XML. Rules for format, like linguistic data, are specified in XML, and they contain regular expressions in flex syntax. The specification is divided into three parts (see its DTD in Appendix A.6):

  • Configuration options. Here one specifies the maximum length of a non-extensive superblank, the input and output encodings, whether case must be considered, and the regular expressions for escape characters and space characters.
  • Format rules. These describe the set of tags belonging to a specific format which the de-formatter has to include in a block of format. These tags may optionally indicate a sentence end, in which case the de-formatter inserts an artificial punctuation mark (followed by an empty block of format, as explained in the previous section). One has to specify the priority with which the rules are applied; when this is not relevant, all the rules can be given the same priority by assigning them the same value (any number). Everything that is not specified as format is left unencapsulated and is therefore treated as translatable text.
  • Replacement rules. These replace special characters in the text. A regular expression recognises a set of special characters, which are then replaced with the specified characters. In HTML, for example, characters given as entities or as numeric codes have to be replaced with the corresponding character; cami&oacute;n, for instance, corresponds to camión.

Root of the specification file

The name attribute of the root element <format> contains the name of the format.

<?xml version="1.0" encoding="UTF-8"?>
<format name="html">
  <options>
    ...
  </options>
  <rules>
    ...
  </rules>
</format>

The root element must contain the options and the rules, which are described next.

Options

The element <largeblocks> specifies, through its attribute size, the maximum length of a non-extensive superblank. The elements <input> and <output> specify, through the attribute encoding, the input and output encodings of the text. The element <escape-chars> specifies, by means of a regular expression in its attribute regexp, which characters must be escaped with a backslash. The element <space-chars> specifies the set of characters that must be considered blanks. Finally, the element <case-sensitive> specifies whether case is relevant in the attributes of the format specification that contain regular expressions.

Example
  <options>
    <largeblocks size="8192"/>
    <input encoding="UTF-8"/>
    <output encoding="UTF-8"/>
    <escape-chars regexp='[\[\]^$\\]'/>
    <space-chars regexp='[ \n\t\r]'/>
    <case-sensitive value="no"/>
  </options>
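
What <escape-chars> asks for can be pictured with a short Python sketch. It reuses the character class from the example above, written with a plain ASCII caret, and puts a backslash in front of every matching character, so that, for instance, a literal [ in the text is not confused with the start of a superblank. The function names escape and unescape are illustrative only and are not part of Apertium.

```python
import re

# Character class taken from the <escape-chars> example: [ ] ^ $ and backslash.
ESCAPE_CHARS = re.compile(r"[\[\]^$\\]")


def escape(text: str) -> str:
    """Prefix every character matched by ESCAPE_CHARS with a backslash."""
    return ESCAPE_CHARS.sub(lambda m: "\\" + m.group(0), text)


def unescape(text: str) -> str:
    """Undo escape(): drop the backslash in front of each escaped character."""
    return re.sub(r"\\([\[\]^$\\])", r"\1", text)
```

For example, escape("price [in $]") returns the string price \[in \$\], and unescape() recovers the original text.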

Rules

There are format rules and replacement rules.

  <rules>
    <format-rule ... >
      ...
    </format-rule>
    ...
    <replacement-rule>
      ...
    </replacement-rule>
    ...
  </rules>

The two types are described in the following subsections.

Format rules

The de-formatter encapsulates in blocks of format the tags matched by the attribute regexp of these rules. If a rule covers a begin tag and an end tag, and everything between them is format, a regexp must be given for both begin and end:

  <format-rule eos="no" priority="1">
    <begin regexp='"\&lt;!--"'/>
    <end regexp='"--\&gt;"'/>
  </format-rule>

Otherwise only one begin-end element is used:

  <format-rule eos="yes" priority="3">
    <begin-end regexp='"&lt;"[/]?"li"[^&gt;]*"&gt;"'/>
  </format-rule>

In addition, the attribute priority tells the system in which order the format rules must be applied (the absolute value is not relevant, only the ordering that results from the values). The attribute eos indicates, with yes or no, whether the block of format that contains the detected pattern must be preceded by an artificial punctuation mark.[2]
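
How priority and eos interact can be illustrated with a small Python model. It is a sketch under simplifying assumptions, not the generated flex scanner: the FormatRule class and the deformat function are invented for the illustration, the flex patterns are transcribed by hand into Python regular expressions, rules are simply tried in ascending priority order at each position, case is ignored (as with case-sensitive value="no"), and an eos rule emits the artificial mark as .[] in front of its block.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FormatRule:
    """Hypothetical stand-in for one <format-rule> element."""
    priority: int               # 1 is the highest priority and is tried first
    eos: bool                   # True models eos="yes"
    begin: str                  # <begin> pattern, or the whole <begin-end> pattern
    end: Optional[str] = None   # <end> pattern; None for a <begin-end> rule


RULES: List[FormatRule] = [
    FormatRule(4, False, r"<[a-zA-Z][^>]*>"),   # general tag rule
    FormatRule(3, True, r"</?li[^>]*>"),        # <li> marks a sentence end
    FormatRule(1, False, r"<!--", r"-->"),      # comments, begin/end form
]


def deformat(text: str, rules: List[FormatRule] = RULES) -> str:
    """Wrap everything matched by a rule in [ ]; leave the rest as text."""
    ordered = sorted(rules, key=lambda r: r.priority)
    out: List[str] = []
    pos = 0
    while pos < len(text):
        for rule in ordered:
            start = re.compile(rule.begin, re.IGNORECASE).match(text, pos)
            if start is None:
                continue
            end = start.end()
            if rule.end is not None:
                # Everything up to and including the end pattern is format.
                closing = re.compile(rule.end, re.IGNORECASE).search(text, end)
                end = closing.end() if closing else len(text)
            mark = ".[]" if rule.eos else ""
            out.append(mark + "[" + text[pos:end] + "]")
            pos = end
            break
        else:
            # No rule matched here: the character is translatable text.
            out.append(text[pos])
            pos += 1
    return "".join(out)
```

In this model, deformat("<ul><li>one<!-- <b>x</b> -->two</li>") yields [<ul>].[][<li>]one[<!-- <b>x</b> -->]two.[][</li>]. The <li> tag matches both the priority-3 rule and the general priority-4 rule; the priority-3 rule wins, so its block is preceded by the artificial mark. The <b> tags inside the comment are swallowed by the begin/end rule.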

Replacement rules

Replacement rules are used to replace special characters in the text. The regular expression in the attribute regexp recognises a set of special characters, which are replaced with the specified characters in the text to be translated. The correspondence between original and replacement characters is stated in the attributes source and target of the replace elements, of which there can be several:

  <replacement-rule regexp='"&amp;"[^;]+;'>
    <replace source="&amp;Agrave;" target="À"/>
    <replace source="&amp;#192;" target="À"/>
    <replace source="&amp;#xC0;" target="À"/>
    <replace source="&amp;#xc0;" target="À"/>
    <replace source="&amp;Aacute;" target="Á"/>
    <replace source="&amp;#193;" target="Á"/>
    <replace source="&amp;#xC1;" target="Á"/>
    <replace source="&amp;#xc1;" target="Á"/>
    ...
  </replacement-rule>
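
The effect of such a rule can be sketched in Python. The regular expression finds every code that starts with & and ends with ;, and a lookup table plays the role of the replace elements. The names ENTITY, REPLACEMENTS and apply_replacements are hypothetical, and leaving unknown codes untouched is an assumption of this sketch, not documented behaviour.

```python
import re

# Python equivalent of the flex pattern "&"[^;]+;
ENTITY = re.compile(r"&[^;]+;")

# One entry per <replace source="..." target="..."/> element.
REPLACEMENTS = {
    "&Agrave;": "À", "&#192;": "À", "&#xC0;": "À", "&#xc0;": "À",
    "&Aacute;": "Á", "&#193;": "Á", "&#xC1;": "Á", "&#xc1;": "Á",
}


def apply_replacements(text: str) -> str:
    """Replace each recognised code by its target character."""
    # Assumption: a code with no entry in the table is kept as it is.
    return ENTITY.sub(lambda m: REPLACEMENTS.get(m.group(0), m.group(0)), text)
```

For example, apply_replacements("&Agrave; la carte, &#xc1;gua &amp; more") returns "À la carte, Água &amp; more": the two listed codes are replaced and &amp;, which has no entry in this small table, is left alone.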

Regular expressions of regexp attributes

The regular expressions use flex syntax. As an example of a format specification, we give the one for HTML; the explanation in the following paragraphs can be followed by looking at Figure 3.49. In the first place, there is the format rule that specifies all HTML tags in a general way: it treats as an HTML tag everything that begins with the sign < and ends with the sign >.

This rule has priority 4, the lowest, so that the more specific rules are applied preferentially: before a tag is handled in a general way by this rule, the higher-priority rules are tried.

In the case of HTML:

  • Priority 1: the highest priority is for comments, <!-- ... -->.
  • Priority 2: the begin and end marks <script> </script> and <style> </style>, where everything enclosed by them is considered to be format.
  • Priority 3: tags that indicate end of sentence (artificial punctuation), such as <br/>, </hr>, </p>, etc.
  • Priority 4: the general rule for HTML tags described above.

Last of all come the replacement rules, which replace all the codes that begin with &, as specified in the regular expression. Each replacement is then defined: &Agrave;, as well as &#192;, &#xC0; and &#xc0;, are replaced with À. The remaining special characters are declared in the same way.

Example
<?xml version="1.0" encoding="UTF-8"?>
<format name="html">
  <options>
    <largeblocks size="8192"/>
    <input encoding="UTF-8"/>
    <output encoding="UTF-8"/>
    <escape-chars regexp='[\[\]^$\\]'/>
    <space-chars regexp='[ \n\t\r]'/>
    <case-sensitive value="no"/>
  </options>
  <rules>
    <format-rule eos="no" priority="1">
      <begin regexp='"&lt;!--"'/>
      <end regexp='"--&gt;"'/>
    </format-rule>
    <format-rule eos="no" priority="2">
      <begin regexp='"&lt;script"[^&gt;]*"&gt;"'/>
      <end regexp='"&lt;/script"[^&gt;]*"&gt;"'/>
    </format-rule>
    <format-rule eos="no" priority="2">
      <begin regexp='"&lt;style"[^&gt;]*"&gt;"'/>
      <end regexp='"&lt;/style"[^&gt;]*"&gt;"'/>
    </format-rule>
    <format-rule eos="yes" priority="3">
      <begin-end regexp='"&lt;"[/]?"br"[^&gt;]*"&gt;"'/>
    </format-rule>
    <!-- Here come more declarations of format-rule eos="yes" -->
    <!-- ... -->
    <format-rule eos="no" priority="4">
      <begin-end regexp='"&lt;"[a-zA-Z][^&gt;]*"&gt;"'/>
    </format-rule>
    <replacement-rule regexp='"&amp;"[^;]+;'>
      <replace source="&amp;Agrave;" target="À"/>
      <replace source="&amp;#192;" target="À"/>
      <replace source="&amp;#xC0;" target="À"/>
      <replace source="&amp;#xc0;" target="À"/>
      <!-- Here come more replace elements -->
      <!-- ... -->
    </replacement-rule>
  </rules>
</format>

Notes

  1. Also referred to as superblancos.
  2. In all these cases, the typical entities &lt; and &gt; are used to represent the characters < and > respectively.

See also

  • Apertium stream format
  • Tips for translators