Capitalization restoration
Revision as of 18:32, 22 December 2022 by Popcorndude (start documenting new capitalization module)
This page describes the modules for extracting and restoring capitalization information which were added in apertium 3.9.0.
Using these modules, pipelines can be changed so that every step after the analyzer operates solely on dictionary case and there is no need to track the proper placement of case in transfer.
Extracting Capitalization
The program apertium-extract-caps stores information about the source language capitalization in word-bound blanks with the prefix c:.
echo '^THE/the<det>$ ^Green/green<adj>$ ^stuff/stuff<n>$ ^from/from<pr>$ ^Target/Target<np>$' | apertium-extract-caps
[[c:AA/aa]]^the<det>$ [[c:Aa/aa]]^green<adj>$ [[c:aa/aa]]^stuff<n>$ [[c:aa/aa]]^from<pr>$ [[c:Aa/Aa]]^Target<np>$
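As a rough illustration of the blank format, the following Python sketch reproduces the example above. It is not the actual implementation of apertium-extract-caps: the names case_pattern and extract_caps are hypothetical, the sketch assumes exactly one analysis per lexical unit, and its handling of single-letter words is a guess. It only shows that each blank appears to record the case pattern of the surface form, then a slash, then the case pattern of the lemma.

```python
import re

# Matches one lexical unit of the form ^surface/lemma<tags>$
# (assumption: a single analysis per unit, as in the example above).
LU = re.compile(r"\^([^/^$]*)/([^<$/]*)((?:<[^>]+>)*)\$")


def case_pattern(word: str) -> str:
    """Return 'AA' for all caps, 'Aa' for an initial capital, else 'aa'."""
    letters = [ch for ch in word if ch.isalpha()]
    # Assumption: a lone capital letter counts as 'Aa', not 'AA'.
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "AA"
    if letters and letters[0].isupper():
        return "Aa"
    return "aa"


def extract_caps(stream: str) -> str:
    """Drop surface forms and prefix each unit with a [[c:...]] blank."""
    def repl(match: re.Match) -> str:
        surface, lemma, tags = match.groups()
        blank = f"[[c:{case_pattern(surface)}/{case_pattern(lemma)}]]"
        return f"{blank}^{lemma}{tags}$"

    return LU.sub(repl, stream)


if __name__ == "__main__":
    print(extract_caps("^THE/the<det>$ ^Green/green<adj>$ ^Target/Target<np>$"))
```

Note that the lemma pattern is kept separately from the surface pattern, so a proper noun such as Target is recorded as Aa/Aa while a capitalized common word such as Green is recorded as Aa/aa.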
To add it to a modes.xml file, put the following after the part-of-speech tagger:
<program name="apertium-extract-caps"/>
And make the following changes to the preceding programs, if applicable:
- Add -w to lt-proc or hfst-proc
- Remove -w and -n from cg-proc
- Add -p to apertium-tagger
If a later stage of the pipeline needs the surface forms, add the -s option to apertium-extract-caps.