Capitalization restoration
This page describes the modules for extracting and restoring capitalization information which were added in apertium 3.9.0.
Using these modules, pipelines can be changed so that every step after the analyzer operates solely on dictionary case and there is no need to track the proper placement of case in transfer.
Extracting Capitalization
The program apertium-extract-caps stores information about the source language capitalization in word-bound blanks with the prefix c:.
echo '^THE/the<det>$ ^Green/green<adj>$ ^stuff/stuff<n>$ ^from/from<pr>$ ^Target/Target<np>$' | apertium-extract-caps
[[c:AA/aa]]^the<det>$ [[c:Aa/aa]]^green<adj>$ [[c:aa/aa]]^stuff<n>$ [[c:aa/aa]]^from<pr>$ [[c:Aa/Aa]]^Target<np>$
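As a rough illustration of the transformation shown above, the following Python sketch reproduces it for this example. It is not the implementation of apertium-extract-caps: the names TOKEN, case_class, and extract_caps are hypothetical, and the classification rule (more than one letter and all uppercase gives AA, an initial capital gives Aa, anything else gives aa) is an assumption inferred from the sample output.

```python
import re

# Hypothetical pattern for a tagger-output token of the form ^surface/analysis$
TOKEN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def case_class(word):
    """Classify a word as AA, Aa, or aa (assumed rule, see lead-in)."""
    letters = [ch for ch in word if ch.isalpha()]
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "AA"
    if letters and letters[0].isupper():
        return "Aa"
    return "aa"


def extract_caps(stream):
    """Replace each ^surface/analysis$ with [[c:S/L]]^analysis$."""
    def rewrite(m):
        surface, analysis = m.group(1), m.group(2)
        lemma = analysis.split("<", 1)[0]  # text before the first tag
        return f"[[c:{case_class(surface)}/{case_class(lemma)}]]^{analysis}$"
    return TOKEN.sub(rewrite, stream)


if __name__ == "__main__":
    print(extract_caps("^THE/the<det>$ ^Target/Target<np>$"))
```

The first code in each blank describes the surface form and the second describes the dictionary form, which is why Target yields Aa/Aa while Green yields Aa/aa.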
To add it to a modes.xml file, put the following after the part-of-speech tagger:
<program name="apertium-extract-caps"/>
And make the following changes to the preceding programs, if applicable:
- Add -w to lt-proc or hfst-proc
- Remove -w and -n from cg-proc
- Add -p to apertium-tagger
If a later stage of the pipeline needs the surface forms, add the -s option to apertium-extract-caps.
Restoring Capitalization
The program apertium-restore-caps uses the information from apertium-extract-caps and the dictionary case of the target language forms to determine the proper capitalization of the target language surface forms according to a set of rules whose format is described below.
# supposing an input like "Long cat"
echo '[[c:aa/aa]]^gata<n>/gata$ [[c:Aa/aa]]^larga<adj>/larga$' | apertium-restore-caps abc-xyz.crx.bin
[[c:aa/aa]]Gata[[/]] [[c:Aa/aa]]larga[[/]]
This module converts from stream format to plain text with blanks, so all modules before it must output stream format and none of the modules after it should expect stream format. Specifically:
- Generation should be done with lt-proc -b or hfst-proc -b rather than -g
- Postgeneration using lt-proc -p should come after
- Postgeneration using lsx-proc -p should come before
- Preferences should come before, since apertium-restore-caps expects only 1 surface form
To add it, include the following in modes.xml:
<program name="apertium-restore-caps">
  <file name="abc-xyz.crx.bin"/>
</program>
And add this recipe to Makefile.am:
$(PREFIX1).crx.bin: $(BASENAME).$(PREFIX1).crx
	apertium-validate-crx $<
	apertium-compile-caps $< $@
And also add $(PREFIX1).crx.bin to TARGETS_COMMON.
Capitalization Restoration Rule Format
The format of the CRX rules is modeled after the LRX format.
CRX files follow this format:
<?xml version="1.0"?>
<capitalization>
  <rules>
    <rule weight="0.1">
      <match select="aa"/>
    </rule>
    <rule weight="1.0">
      <or>
        <match lemma="iPhone" select="dix"/>
        <match lemma="iPod" select="dix"/>
      </or>
    </rule>
    <rule weight="0.9">
      <match tags="np.*" select="Aa"/>
    </rule>
  </rules>
</capitalization>
Rule weights may be any positive real number. If two rules give conflicting instructions for a particular word, the one with the higher weight will be used (or the higher sum, if multiple rules agree).
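The weighting scheme can be pictured with a small Python sketch. It is an illustration of the description above rather than the code used by apertium-restore-caps: the function name resolve and the list-of-pairs input are hypothetical, and the behaviour on ties or when no rule applies is not specified on this page, so the choices made here are assumptions.

```python
from collections import defaultdict


def resolve(votes):
    """Pick the select value with the highest total weight.

    votes is a list of (select, weight) pairs, one per rule that matched
    the word. Rules that agree have their weights summed.
    """
    if not votes:
        return None  # assumption: the page does not say what happens here
    totals = defaultdict(float)
    for select, weight in votes:
        totals[select] += weight
    # assumption: on an exact tie, the value seen first wins
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # A proper noun whose lemma is "iPhone" matches all three example rules.
    print(resolve([("aa", 0.1), ("dix", 1.0), ("Aa", 0.9)]))
```

With the example file above, a word tagged np whose lemma is iPhone receives aa with weight 0.1, dix with weight 1.0, and Aa with weight 0.9, so dix is chosen and the generator output is used unchanged.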
<match/>
The <match/> tag matches a single token in the input stream.
Given an input like
[[c:AA/Aa]]^xyz<n><sg>/abc$
The various pieces of this can be matched with the following attributes:
Attribute | Description | Example Input | Example Rule | Notes |
---|---|---|---|---|
lemma | Target language lemma | xyz | lemma="xyz" | Case-sensitive and cannot be combined with trglem; use * to get prefix or suffix match |
tags | Target language tags | <n><sg> | tags="n.sg" | ? matches any single tag, * 0 or more, and + 1 or more |
surface | Target language surface form | abc | surface="abc" | Case-sensitive and cannot be combined with trgsurf; use * to get prefix or suffix match |
srcsurf | Source language surface case | AA | srcsurf="AA" | Currently can only be AA, Aa, or aa |
srclem | Source language dictionary case | Aa | srclem="Aa" | Currently can only be AA, Aa, or aa |
trglem | Target language dictionary case | aa (case of xyz) | trglem="aa" | Cannot be combined with lemma |
trgsurf | Target language surface case | aa (case of abc) | trgsurf="aa" | Cannot be combined with surface |
In trglem and trgsurf, A will match capital letters, a will match any other letters, * will match any characters at all, and a space will match spaces. All of these must match at least 1 character.
Thus words with unusual dictionary capitalization can be matched with
<or>
  <match trglem="aA"/> <!-- nnnTTT (I can't think of a real example right now) -->
  <match trglem="*aA"/> <!-- NooJ -->
  <match trglem="aA*"/> <!-- iPhone -->
  <match trglem="*aA*"/> <!-- LaTeX -->
</or>
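To see why each of these patterns matches its example word, here is a small Python matcher that follows a literal reading of the description: every pattern symbol consumes one or more characters of its class. The names CLASSES and matches are hypothetical, the use of a literal space as the symbol for spaces is an assumption, and the real matcher may treat edge cases (digits, single-letter words) differently.

```python
# Hypothetical mapping from pattern symbol to a per-character test.
CLASSES = {
    "A": str.isupper,                                # capital letters
    "a": lambda ch: ch.isalpha() and not ch.isupper(),  # any other letters
    "*": lambda ch: True,                            # any character at all
    " ": str.isspace,                                # spaces (assumed symbol)
}


def matches(pattern, word):
    """Return True if word fits pattern, each symbol taking >= 1 character."""
    if not pattern:
        return not word  # both exhausted means success
    test = CLASSES[pattern[0]]
    # Try every non-empty run of characters accepted by the first symbol.
    for end in range(1, len(word) + 1):
        if not test(word[end - 1]):
            break
        if matches(pattern[1:], word[end:]):
            return True
    return False


if __name__ == "__main__":
    for pattern, word in [("*aA", "NooJ"), ("aA*", "iPhone"), ("*aA*", "LaTeX")]:
        print(pattern, word, matches(pattern, word))
```

Under this reading, aA* fits iPhone because a takes "i", A takes "P", and * takes "hone", while the plain aA pattern does not fit iPhone since nothing is left to consume the trailing lowercase letters.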
Finally, the output case of a word can be determined using the select attribute. The allowed values are dix (use the output of the generator directly), AA (make the word all-caps), Aa (capitalize the word), and aa (lowercase the word).
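The four select values can be summarized as a short Python function. This is a sketch only: apply_select is a hypothetical name, and for Aa it assumes that only the first character is uppercased while the rest of the word is left as generated, which this page does not specify.

```python
def apply_select(word, select):
    """Apply a select value to a generated surface form."""
    if select == "dix":
        return word                        # keep the generator output as is
    if select == "AA":
        return word.upper()                # all-caps
    if select == "Aa":
        return word[:1].upper() + word[1:]  # assumption: rest is untouched
    if select == "aa":
        return word.lower()                # lowercase
    raise ValueError(f"unknown select value: {select!r}")


if __name__ == "__main__":
    print(apply_select("gata", "Aa"))
    print(apply_select("iPhone", "dix"))
```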
<begin/>
The <begin/> tag matches the beginning of the input or immediately after the null character. For example, the following rule sets the first word of a sentence to uppercase:
<rule weight="1.0">
  <or>
    <begin/>
    <match tags="sent"/>
  </or>
  <match select="Aa"/>
</rule>
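The effect of this rule on a token sequence can be sketched in Python as follows. The representation of tokens as (surface, tags) pairs and the name capitalize_sentence_starts are hypothetical, and the sketch ignores rule weights and the null character, so it shows only the intent of the rule, not how apertium-restore-caps evaluates it.

```python
def capitalize_sentence_starts(tokens):
    """Capitalize the token at the start of input and after each <sent> token.

    tokens is a list of (surface, tags) pairs, where tags is a list of
    tag names without angle brackets.
    """
    out = []
    at_start = True  # plays the role of <begin/>
    for surface, tags in tokens:
        if at_start:
            surface = surface[:1].upper() + surface[1:]  # select="Aa"
        out.append(surface)
        at_start = tags == ["sent"]  # plays the role of <match tags="sent"/>
    return out


if __name__ == "__main__":
    tokens = [("gata", ["n"]), ("larga", ["adj"]), (".", ["sent"]), ("el", ["det"])]
    print(capitalize_sentence_starts(tokens))
```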
<or>
<or>
  <match tags="sent"/>
  <match tags="cm"/>
</or>
This will match either a period or a comma.
<repeat>
<repeat from="1" upto="3">
  <match tags="sent"/>
</repeat>
This will match 1, 2, or 3 period tokens. Both from and upto are required.