Format handling

From Apertium
Revision as of 13:37, 30 April 2008 by Francis Tyers

3.6.2 Data: format specification rules

This section describes how the de-formatter and re-formatter are generated from a format specification in XML.

Rules for format, like linguistic data, are specified in XML, and they contain regular expressions with flex syntax. The specification is divided into three parts (see its DTD in Appendix A.6):

   • Configuration options. Here one specifies the maximum length of a
     non-extensive superblank, the input and output encodings, whether
     case must be considered, and the regular expressions for escape
     characters and space characters.
   • Format rules. These describe the set of tags belonging to a specific
     format which have to be included in a block of format by the
     de-formatter. These tags may, optionally, indicate a sentence end, in
     which case the de-formatter will insert an artificial punctuation mark
     (followed by an empty block of format, as explained in the previous
     section). One has to specify the priority of application of the rules,
     although, when this is not relevant, it is possible to give all the
     rules the same priority by assigning them the same value (any number).
     Everything that is not specified as format will be left without
     encapsulation and, therefore, will be considered translatable text.
   • Replacement rules. These allow special characters in the text to be
     replaced. A regular expression recognizes a set of special characters,
     which are then replaced with the specified characters. For example, in
     HTML, characters specified in hexadecimal have to be replaced with
     the corresponding entity or ASCII character. For example, an
     entity-encoded form such as cami&oacute;n corresponds to camión.
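The camión example can be sketched as a replacement rule. This is only a minimal illustration modelled on the HTML specification discussed later in this section. The choice of these four source codes is an assumption. The fragment also assumes that & is written as the entity &amp; in the actual file so that it is well-formed XML, whereas the listings in this section show such characters unescaped.

```xml
<!-- Hypothetical fragment: maps four encodings of ó (U+00F3) to the character itself. -->
<replacement-rule regexp='"&amp;"[^;]+;'>
  <replace source="&amp;oacute;" target="ó"/>  <!-- named entity -->
  <replace source="&amp;#243;" target="ó"/>    <!-- decimal character reference -->
  <replace source="&amp;#xF3;" target="ó"/>    <!-- hexadecimal, upper case -->
  <replace source="&amp;#xf3;" target="ó"/>    <!-- hexadecimal, lower case -->
</replacement-rule>
```

With a rule of this kind, the text cami&oacute;n would reach the translator as camión.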
   Rules are described in more detail next.
   • Root of the specification file. The attribute name contains the name
      of the format.
      <?xml version="1.0" encoding="ISO-8859-1"?>
      <format name="html">
         <options>
         ...
         </options>
         <rules>
         ...
         </rules>
      </format>
   It has to include the options and rules, an example of which is presented next:

   • Options.
         <options>
           <largeblocks size="8192"/>
           <input encoding="ISO-8859-1"/>
           <output encoding="ISO-8859-1"/>
           <escape-chars regexp='[\[\]^$\\]'/>
           <space-chars regexp='[ \n\t\r]'/>
           <case-sensitive value="no"/>
         </options>
   The element <largeblocks> specifies the maximum length of a non-extensive superblank, through the value of the attribute size. The elements <input> and <output> specify the input and output encoding of the text, through the attribute encoding.

   The element <escape-chars> specifies, by means of a regular expression declared in the value of the attribute regexp, which characters must be escaped with a backslash. The element <space-chars> specifies the set of characters that must be considered as blanks.

   Finally, the element <case-sensitive> specifies whether case is relevant in the format specification attributes that contain regular expressions.

   • Rules. There are format rules and replacement rules.

    <rules>
       <format-rule ... >
         ...
       </format-rule>
       ...
       <replacement-rule>
         ...
       </replacement-rule>
       ...
    </rules>
 The two types are described in the following points.

   • Format rules. The de-formatter will encapsulate in blocks of format
     the tags indicated by these rules in the field regexp. If they are begin
     and end tags, and everything delimited by them is format, one has
     to specify a regexp both for begin and for end:
       <format-rule eos="no" priority="1">
         <begin regexp='"\<!--"'/>
         <end regexp='"--\>"'/>
       </format-rule>
 Otherwise only one begin-end element is used:
       <format-rule eos="yes" priority="3">
         <begin-end regexp='"<"[/]?"li"[^>]*">"'/>
       </format-rule>
 In addition, the attribute priority tells the system in which order
 the format rules must be applied (the absolute value is not relevant,
 only the order resulting from the values). The attribute eos indicates,
 with yes or no, whether the block of format that contains the detected
 pattern must be preceded by an artificial punctuation mark.[1]
   • Replacement rules. These are used to replace special characters in
      the text. The regular expression in the attribute regexp recognizes
      a set of special characters, which are replaced with the specified
      characters in the text to be translated. The correspondence between
      original and replacement characters is stated in the attributes
      source and target of the replace elements, of which there may be
      several:
           <replacement-rule regexp='"&"[^;]+;'>
              <replace source="&Agrave;" target="À"/>
              <replace source="&#192;" target="À"/>
              <replace source="&#xC0;" target="À"/>
              <replace source="&#xc0;" target="À"/>
              <replace source="&Aacute;" target="Á"/>
              <replace source="&#193;" target="Á"/>
              <replace source="&#xC1;" target="Á"/>
              <replace source="&#xc1;" target="Á"/>
              ...
           </replacement-rule>
   • Regular expressions of regexp attributes. They have the syntax
      used in flex [9].
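To illustrate the flex syntax inside a regexp attribute, here is a sketch of one more format rule, modelled on the li rule above. The choice of the p tag and of priority 3 are assumptions made for illustration, not rules taken from the original specification. The characters < and > are written as the entities &lt; and &gt;, as the footnote says is done in the actual file.

```xml
<!-- Hypothetical rule: treats <p ...> and </p> as format and marks a sentence end. -->
<format-rule eos="yes" priority="3">
  <!-- Quoted strings match literally; [/]? is an optional slash;
       [^>]* is any run of characters other than >. -->
  <begin-end regexp='"&lt;"[/]?"p"[^&gt;]*"&gt;"'/>
</format-rule>
```

Read as a flex pattern, this matches <p>, </p> and <p class="x">. Like the li rule, it also matches any tag whose name merely begins with the same letters (for example <pre>), so a real specification may need a stricter expression.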
   As an example of a format specification, we will give that for HTML. The explanation given in the following paragraphs can be followed by looking at Figure 3.49.

   In the first place, we find the format rule that specifies all the HTML tags in a general way: it considers as an HTML tag everything that begins with the sign < and ends with the sign >. This rule has the lowest priority (4) so that the more specific rules are applied preferentially: before a tag is treated in this general way, the higher-priority rules are tried. In the case of HTML, the highest priority (1) is for comments (<!-- ... -->). The beginning and end marks <script> </script> and <style> </style>, where everything enclosed by them is considered to be format, have priority 2. Priority 3 is for tags that indicate end of sentence (artificial punctuation), such as <br>, etc.

   Last of all are the replacement rules, which replace all the codes that begin with & and end with ;, as specified in the regular expression. Then, each one of the replacements is defined: &Agrave;, as well as &#192;, &#xC0; and &#xc0;, are replaced with À. The remaining special characters are declared in the same way.

<?xml version="1.0" encoding="ISO-8859-1"?>
<format name="html">
  <options>
    <largeblocks size="8192"/>
    <input encoding="ISO-8859-1"/>
    <output encoding="ISO-8859-1"/>
    <escape-chars regexp='[\[\]^$\\]'/>
    <space-chars regexp='[ \n\t\r]'/>
    <case-sensitive value="no"/>
  </options>
  <rules>
    <format-rule eos="no" priority="1">
      <begin regexp='"<!--"'/>
      <end regexp='"-->"'/>
    </format-rule>
    <format-rule eos="no" priority="2">
      <begin regexp='"<script"[^>]*">"'/>
      <end regexp='"</script"[^>]*">"'/>
    </format-rule>
    <format-rule eos="no" priority="2">
      <begin regexp='"<style"[^>]*">"'/>
      <end regexp='"</style"[^>]*">"'/>
    </format-rule>
    <format-rule eos="yes" priority="3">
      <begin-end regexp='"<"[/]?"br"[^>]*">"'/>
    </format-rule>
    <format-rule eos="no" priority="4">
      <begin-end regexp='"<"[a-zA-Z][^>]*">"'/>
    </format-rule>
    <replacement-rule regexp='"&"[^;]+;'>
      <replace source="&Agrave;" target="À"/>
      <replace source="&#192;" target="À"/>
      <replace source="&#xC0;" target="À"/>
      <replace source="&#xc0;" target="À"/>
      ...
    </replacement-rule>
  </rules>
</format>

  1. In all these cases, the typical entities &lt; and &gt; are used to represent the characters < and > respectively.