Capitalization restoration

From Apertium
Revision as of 18:32, 22 December 2022 by Popcorndude (talk | contribs) (start documenting new capitalization module)

This page describes the modules for extracting and restoring capitalization information, which were added in Apertium 3.9.0.

With these modules, a pipeline can be set up so that every step after the analyzer operates solely on dictionary case, and transfer no longer needs to track where capitalization belongs.

Extracting Capitalization

The program apertium-extract-caps stores information about the source language capitalization in word-bound blanks with the prefix c:.

echo '^THE/the<det>$ ^Green/green<adj>$ ^stuff/stuff<n>$ ^from/from<pr>$ ^Target/Target<np>$' | apertium-extract-caps
[[c:AA/aa]]^the<det>$ [[c:Aa/aa]]^green<adj>$ [[c:aa/aa]]^stuff<n>$ [[c:aa/aa]]^from<pr>$ [[c:Aa/Aa]]^Target<np>$
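
The example above suggests that each blank holds the case pattern of the surface form, then a slash, then the case pattern of the lemma, with AA for all caps, Aa for an initial capital, and aa for lowercase. The following Python sketch reproduces that example under this reading. It is an illustration inferred from the sample output, not the actual implementation of apertium-extract-caps; the function names case_pattern and caps_blank are hypothetical, and the treatment of single-letter words is an assumption.

```python
def case_pattern(word: str) -> str:
    """Classify a word as 'AA', 'Aa', or 'aa' (categories inferred from the example)."""
    letters = [ch for ch in word if ch.isalpha()]
    # Assumption: a single capital letter counts as 'Aa', not 'AA'.
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "AA"
    if letters and letters[0].isupper():
        return "Aa"
    return "aa"


def caps_blank(surface: str, lemma: str) -> str:
    """Build a word-bound blank of the form [[c:SURFACE/LEMMA]]."""
    return f"[[c:{case_pattern(surface)}/{case_pattern(lemma)}]]"


# (surface form, lemma, tags) for each word of the example input
words = [
    ("THE", "the", "<det>"),
    ("Green", "green", "<adj>"),
    ("stuff", "stuff", "<n>"),
    ("from", "from", "<pr>"),
    ("Target", "Target", "<np>"),
]

line = " ".join(f"{caps_blank(s, l)}^{l}{t}$" for s, l, t in words)
print(line)
```

Run on the five words of the example, the sketch prints the same line as the sample output above.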

To add it to a modes.xml file, put the following after the part-of-speech tagger:

<program name="apertium-extract-caps"/>

Also make the following changes to the preceding programs, where applicable:

  • Add -w to lt-proc or hfst-proc
  • Remove -w and -n from cg-proc
  • Add -p to apertium-tagger

If a later stage of the pipeline needs the surface forms, add the -s option to apertium-extract-caps.

Restoring Capitalization