Capitalization restoration


This page describes the modules for extracting and restoring capitalization information which were added in apertium 3.9.0.

Using these modules, pipelines can be changed so that every step after the analyzer operates solely on dictionary case and there is no need to track the proper placement of case in transfer.

Extracting Capitalization

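The program apertium-extract-caps stores information about the source language capitalization in word-bound blanks with the prefix c:. In the example below, the pattern before the slash records the case of the surface form and the pattern after it records the case of the dictionary form.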

echo '^THE/the<det>$ ^Green/green<adj>$ ^stuff/stuff<n>$ ^from/from<pr>$ ^Target/Target<np>$' | apertium-extract-caps
[[c:AA/aa]]^the<det>$ [[c:Aa/aa]]^green<adj>$ [[c:aa/aa]]^stuff<n>$ [[c:aa/aa]]^from<pr>$ [[c:Aa/Aa]]^Target<np>$
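
As a rough illustration of the example above, the following Python sketch reproduces the same transformation. It is not the apertium-extract-caps implementation: the names case_pattern and extract_caps are hypothetical, and the classification rules (in particular, treating a single capital letter as Aa) are assumptions inferred from the sample output.

```python
import re

# Matches one lexical unit of the form ^surface/analysis$ (single analysis only).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def case_pattern(word: str) -> str:
    """Classify a word as AA, Aa, or aa (assumed rules, see lead-in)."""
    letters = [ch for ch in word if ch.isalpha()]
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "AA"
    if letters and letters[0].isupper():
        return "Aa"
    return "aa"


def extract_caps(stream: str) -> str:
    """Replace ^surface/analysis$ with [[c:SURF/DIX]]^analysis$."""

    def rewrite(match: re.Match) -> str:
        surface, analysis = match.group(1), match.group(2)
        lemma = analysis.split("<", 1)[0]  # text before the first tag
        blank = f"[[c:{case_pattern(surface)}/{case_pattern(lemma)}]]"
        return f"{blank}^{analysis}$"

    return LEXICAL_UNIT.sub(rewrite, stream)


if __name__ == "__main__":
    print(extract_caps("^THE/the<det>$ ^Green/green<adj>$ ^Target/Target<np>$"))
```

The sketch drops the surface form, which corresponds to running the real tool without -s.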

To add it to a modes.xml file, put the following after the part-of-speech tagger:

<program name="apertium-extract-caps"/>

And make the following changes to the preceding programs, if applicable:

  • Add -w to lt-proc or hfst-proc
  • Remove -w and -n from cg-proc
  • Add -p to apertium-tagger

If a later stage of the pipeline needs the surface forms, add the -s option to apertium-extract-caps.

Restoring Capitalization

The program apertium-restore-caps uses the information from apertium-extract-caps and the dictionary case of the target language forms to determine the proper capitalization of the target language surface forms according to a set of rules whose format is described below.

# supposing an input like "Long cat"
echo '[[c:aa/aa]]^gata<n>/gata$ [[c:Aa/aa]]^larga<adj>/larga$' | apertium-restore-caps abc-xyz.crx.bin
[[c:aa/aa]]Gata[[/]] [[c:Aa/aa]]larga[[/]]

This module converts from stream format to plain text with blanks, so all modules before it must output stream format and none of the modules after it should expect stream format. Specifically:

  • Generation should be done with lt-proc -b or hfst-proc -b rather than -g
  • Postgeneration using lt-proc -p should come after
  • Postgeneration using lsx-proc -p should come before
  • Preferences should come before, since apertium-restore-caps expects only 1 surface form

To add it, include the following in modes.xml:

<program name="apertium-restore-caps">
  <file name="abc-xyz.crx.bin"/>
</program>

And add this recipe to Makefile.am:

$(PREFIX1).crx.bin: $(BASENAME).$(PREFIX1).crx
	apertium-validate-crx $<
	apertium-compile-caps $< $@

And also add $(PREFIX1).crx.bin to TARGETS_COMMON.

Capitalization Restoration Rule Format

The format of the CRX rules is modeled after the LRX format.

CRX files follow this format:

<?xml version="1.0"?>
<capitalization>
  <rules>
    <rule weight="0.1">
      <match select="aa"/>
    </rule>
    <rule weight="1.0">
      <or>
        <match lemma="iPhone" select="dix"/>
        <match lemma="iPod" select="dix"/>
      </or>
    </rule>
    <rule weight="0.9">
      <match tags="np.*" select="Aa"/>
    </rule>
  </rules>
</capitalization>

Rule weights may be any positive real number. If two rules give conflicting instructions for a particular word, the one with the higher weight will be used (or the higher sum, if multiple rules agree).
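
The weighting described above can be pictured as a vote per word. The following Python sketch is only an illustration of that description, not the actual apertium-restore-caps code; the function name resolve_case and the (select, weight) pair representation are hypothetical, and tie-breaking is left unspecified because the text does not define it.

```python
from collections import defaultdict


def resolve_case(votes):
    """Pick the select value with the highest total weight.

    `votes` is a non-empty list of (select, weight) pairs, one per rule
    that matched the word. Rules that agree have their weights summed.
    """
    totals = defaultdict(float)
    for select, weight in votes:
        totals[select] += weight
    # Ties are resolved arbitrarily here; the real behaviour is not documented above.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # A proper noun matched by the 0.1 "aa" rule and the 0.9 "Aa" rule.
    print(resolve_case([("aa", 0.1), ("Aa", 0.9)]))
```

With the sample CRX file, a word such as iPhone tagged np would collect aa at 0.1, Aa at 0.9, and dix at 1.0, so dix would win under this reading.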

<match/>

The <match/> tag matches a single token in the input stream.

Given an input like

[[c:AA/Aa]]^xyz<n><sg>/abc$

The various pieces of this can be matched with the following attributes:

Pattern attributes for <match/>

| Attribute | Description | Example Input | Example Rule | Notes |
|-----------|-------------|---------------|--------------|-------|
| lemma | Target language lemma | xyz | lemma="xyz" | Case-sensitive; cannot be combined with trglem; use * for a prefix or suffix match |
| tags | Target language tags | <n><sg> | tags="n.sg" | ? matches any single tag, * matches 0 or more tags, and + matches 1 or more tags |
| surface | Target language surface form | abc | surface="abc" | Case-sensitive; cannot be combined with trgsurf; use * for a prefix or suffix match |
| srcsurf | Source language surface case | AA | srcsurf="AA" | Currently can only be AA, Aa, or aa |
| srclem | Source language dictionary case | Aa | srclem="Aa" | Currently can only be AA, Aa, or aa |
| trglem | Target language dictionary case | aa (case of xyz) | trglem="aa" | Cannot be combined with lemma |
| trgsurf | Target language surface case | aa (case of abc) | trgsurf="aa" | Cannot be combined with surface |
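
To make the mapping between the example input and the attributes concrete, here is a small Python sketch that splits such a token into its parts. It is an illustration only: parse_token and the returned dictionary keys are hypothetical names, and the regular expression assumes exactly one c: blank and one surface form per token.

```python
import re

# [[c:SRCSURF/SRCLEM]]^lemma<tag>...</surface$
TOKEN = re.compile(
    r"\[\[c:([^/\]]+)/([^\]]+)\]\]"  # source surface case / source dictionary case
    r"\^([^<$/]*)"                   # target language lemma
    r"((?:<[^>]+>)*)"                # target language tags
    r"/([^$]*)\$"                    # target language surface form
)


def parse_token(token: str) -> dict:
    """Split one token into the pieces that <match/> attributes refer to."""
    match = TOKEN.fullmatch(token)
    if match is None:
        raise ValueError(f"not a recognised token: {token!r}")
    srcsurf, srclem, lemma, tags, surface = match.groups()
    return {
        "srcsurf": srcsurf,
        "srclem": srclem,
        "lemma": lemma,
        "tags": re.findall(r"<([^>]+)>", tags),
        "surface": surface,
    }


if __name__ == "__main__":
    print(parse_token("[[c:AA/Aa]]^xyz<n><sg>/abc$"))
```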

In trglem and trgsurf, A matches capital letters, a matches any other letters, * matches any characters at all, and a literal space matches spaces. Each of these must match at least 1 character.

Thus words with unusual dictionary capitalization can be matched with

<or>
  <match trglem="aA"/>   <!-- nnnTTT (I can't think of a real example right now) -->
  <match trglem="*aA"/>  <!-- NooJ -->
  <match trglem="aA*"/>  <!-- iPhone -->
  <match trglem="*aA*"/> <!-- LaTeX -->
</or>
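
One way to read these case patterns is as a small regular expression over character classes. The Python sketch below follows the description above literally and is not taken from the Apertium source; char_class and case_pattern_matches are hypothetical names, and treating digits and punctuation as matchable only by * is an assumption.

```python
import re


def char_class(ch: str) -> str:
    """Reduce a character to the class symbol it belongs to."""
    if ch.isupper():
        return "A"
    if ch.isalpha():
        return "a"
    if ch == " ":
        return " "
    return "?"  # assumption: anything else is matched only by *


# Each pattern symbol must consume at least one character.
PIECES = {"A": "A+", "a": "a+", " ": " +", "*": ".+"}


def case_pattern_matches(pattern: str, word: str) -> bool:
    """Check whether `word` fits a trglem/trgsurf style case pattern."""
    regex = "".join(PIECES[symbol] for symbol in pattern)
    classes = "".join(char_class(ch) for ch in word)
    return re.fullmatch(regex, classes) is not None


if __name__ == "__main__":
    for pattern, word in [("*aA", "NooJ"), ("aA*", "iPhone"), ("*aA*", "LaTeX")]:
        print(pattern, word, case_pattern_matches(pattern, word))
```

Under this reading, iPhone fits aA* but not aA, because the trailing lowercase letters need something to consume them.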

Finally, the output case of a word can be determined using the select attribute. The allowed values are dix (use the output of the generator directly), AA (make the word all-caps), Aa (capitalize the word), and aa (lowercase the word).
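
The effect of the four select values can be sketched in a few lines of Python. This is a simplified illustration, not the module's code: apply_select is a hypothetical name, and leaving the rest of the word untouched for Aa is an assumption, since the text only says the word is capitalized.

```python
def apply_select(surface: str, select: str) -> str:
    """Apply a select value to a generated surface form."""
    if select == "dix":
        return surface  # keep the generator output as is
    if select == "AA":
        return surface.upper()
    if select == "Aa":
        # Assumption: only the first character is changed.
        return surface[:1].upper() + surface[1:]
    if select == "aa":
        return surface.lower()
    raise ValueError(f"unknown select value: {select!r}")


if __name__ == "__main__":
    for select in ("dix", "AA", "Aa", "aa"):
        print(select, apply_select("gata", select))
```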

<begin/>

The <begin/> tag matches the beginning of the input or the position immediately after a null character. For example, the following rule capitalizes the first word of each sentence:

<rule weight="1.0">
  <or>
    <begin/>
    <match tags="sent"/>
  </or>
  <match select="Aa"/>
</rule>
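
The behaviour of this rule can be approximated in Python as follows. This is a hedged sketch rather than a description of the real matcher: capitalize_sentence_starts is a hypothetical name, tokens are represented as (surface, tags) pairs for convenience, and null characters in the stream are not modelled.

```python
def capitalize_sentence_starts(tokens):
    """Capitalize every word that follows the start of input or a sent token.

    `tokens` is a list of (surface, tags) pairs, where tags is a list of
    tag names such as ["n", "sg"] or ["sent"].
    """
    result = []
    at_start = True  # plays the role of <begin/>
    for surface, tags in tokens:
        if at_start:
            surface = surface[:1].upper() + surface[1:]
        result.append(surface)
        # The next token is a sentence start if this one is tagged sent.
        at_start = "sent" in tags
    return result


if __name__ == "__main__":
    stream = [("gata", ["n"]), ("larga", ["adj"]), (".", ["sent"]), ("perro", ["n"])]
    print(capitalize_sentence_starts(stream))
```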

<or>

<or>
  <match tags="sent"/>
  <match tags="cm"/>
</or>

This will match either a token tagged sent (such as a period) or a token tagged cm (a comma).

<repeat>

<repeat from="1" upto="3">
  <match tags="sent"/>
</repeat>

This will match 1, 2, or 3 consecutive tokens tagged sent (for example, periods). Both the from and upto attributes are required.