Letter case handling

From Apertium

The same input word can reach a lexical processing module written in different letter cases. The most frequent cases are:

  1. The whole word is in lower case.
    e.g. beer
  2. The whole word is in upper case.
    e.g. IBM
  3. The first letter is capitalised and the rest is in lower case (the typical case for proper nouns).
    e.g. Peter
  4. The word contains a mixture of cases.
    e.g. LaTeX

The entries in the dictionary can also be written in any of these ways. The way a word is written in the dictionary is used to discard possible analyses of the input word, according to the following rules:

  1. If the input letter is upper case and the current analysis state has matching transitions in lower case, these transductions are made.
  2. If the input letter is lower case and the current state has no matching transitions in lower case, no transduction is made.

Thanks to this policy, a surface form that is not capitalised cannot be analysed as a proper noun.
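
The two rules can be illustrated with a small Python sketch. It is only an approximation: it compares whole strings letter by letter instead of walking the states of a transducer, and the names letter_matches and form_matches are hypothetical, not part of lttoolbox.

```python
def letter_matches(input_ch: str, dict_ch: str) -> bool:
    """Return True if an input letter may follow a dictionary letter."""
    if input_ch == dict_ch:
        return True
    # Rule 1: an upper-case input letter may use a lower-case transition.
    # Rule 2: a lower-case input letter never matches an upper-case one.
    return input_ch.isupper() and input_ch.lower() == dict_ch


def form_matches(surface: str, dict_form: str) -> bool:
    """Return True if the whole surface form is accepted by the entry."""
    return len(surface) == len(dict_form) and all(
        letter_matches(s, d) for s, d in zip(surface, dict_form)
    )


if __name__ == "__main__":
    for surface, entry in [("BEER", "beer"), ("peter", "Peter"), ("PETER", "Peter")]:
        print(surface, entry, form_matches(surface, entry))
```

Under these assumptions "peter" is rejected by the entry "Peter" because its first letter is lower case, while "PETER" is accepted.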

The case of an input word is preserved in the output of the translator unless it is decided not to do so. The case can be changed in the structural transfer module. This option is useful, for example, when words are reordered or when a word is added before a capitalised word at the beginning of a sentence, as in the translation of the Catalan phrase Vindran into English: They will come.

Examples

Take the example words above and a dictionary for which lt-expand produces the following output:

beer:beer<n><sg>
IBM:IBM<np><org><sg>
Peter:Peter<np><ant><m><sg>
LaTeX:LaTeX<np><al><sg>

The following table gives the analyses that would be output in regular case-handling mode.

Input    Dictionary    Output
beer     beer          ^beer/beer<n><sg>$
BEER     beer          ^BEER/BEER<n><sg>$
Beer     beer          ^Beer/Beer<n><sg>$
beeR     beer          ^beeR/beer<n><sg>$
BeeR     beer          ^BeeR/BEER<n><sg>$
BeEr     beer          ^BeEr/Beer<n><sg>$
IBM      IBM           ^IBM/IBM<np><org><sg>$
ibm      IBM           ^ibm/*ibm$
Ibm      IBM           ^Ibm/*Ibm$
IBm      IBM           ^IBm/*IBm$
Peter    Peter         ^Peter/Peter<np><ant><m><sg>$
peter    Peter         ^peter/*peter$
PEter    Peter         ^PEter/PEter<np><ant><m><sg>$
PETER    Peter         ^PETER/PETER<np><ant><m><sg>$
LaTeX    LaTeX         ^LaTeX/LaTeX<np><al><sg>$
LateX    LaTeX         ^LateX/*LateX$
Latex    LaTeX         ^Latex/*Latex$
latex    LaTeX         ^latex/*latex$
LATEX    LaTeX         ^LATEX/LATEX<np><al><sg>$
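
For an entry stored entirely in lower case, such as beer, the rows above are consistent with a simple heuristic that looks only at the first and last letter of the input. The Python sketch below shows that heuristic. The function name copy_case is hypothetical, and the sketch is not taken from the lt-proc source. It makes no claim about entries that contain capitals of their own, such as LaTeX.

```python
def copy_case(surface: str, lemma: str) -> str:
    """Copy the case pattern of `surface` onto a lower-case `lemma`."""
    if not surface:
        return lemma
    first_upper = surface[0].isupper()
    last_upper = surface[-1].isupper()
    if first_upper and last_upper:
        # e.g. BEER or BeeR: treat the word as written in capitals
        return lemma.upper()
    if first_upper:
        # e.g. Beer or BeEr: treat the word as capitalised
        return lemma[:1].upper() + lemma[1:]
    # e.g. beer or beeR: keep the lemma as it is in the dictionary
    return lemma


if __name__ == "__main__":
    for word in ["beer", "BEER", "Beer", "beeR", "BeeR", "BeEr"]:
        print(f"^{word}/{copy_case(word, 'beer')}<n><sg>$")
```

Run as a script, this prints the six beer rows of the table in the same ^input/lemma<tags>$ format.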

Keeping dictionary case

Passing the -w (or --dictionary-case) option to lt-proc turns off letter case normalisation, so that e.g. "BeeR" gets the analysis ^BeeR/beer<n><sg>$. This is useful in connection with Constraint Grammar. If lt-proc normalises case, rules that refer to the lemma "beer" also have to match case variants such as "Beer" and "BEER", typically by using the case insensitivity option, which slows down analysis. With -w, the lemma keeps its dictionary case after analysis.

However, we do want letter case normalisation before transfer. Fortunately cg-proc can do this for us: just pass the -w (or --wordform-case) option to cg-proc. The end result should be the same as when running lt-proc alone.

How acronyms are dealt with

The cg-proc -w option already outputs this:

in: JEG/jeg<prn>, out: JEG/JEG<prn>  
in: JeG/jeg<prn>, out: JeG/JEG<prn> 
in: jeG/jeg<prn>, out: jeG/jeg<prn> 
in: Jeg/jeg<prn>, out: Jeg/Jeg<prn> 
in: jeg/jeg<prn>, out: jeg/jeg<prn>

But we can't just look at the first and last character if the lemma is, e.g., an acronym. Instead, we have to look at the first lowercase character in the lemma (baseform):

  1. in: bcg-vaksine/BCG-vaksine<n><m><sg><ind> out: bcg-vaksine/BCG-vaksine
  2. in: BCG-vaksine/BCG-vaksine<n><m><sg><ind> out: bcg-vaksine/BCG-vaksine
  3. in: BCG-VAKSINE/BCG-vaksine<n><m><sg><ind> out: bcg-vaksine/BCG-VAKSINE
  4. in: Bcg-vaksine/BCG-vaksine<n><m><sg><ind> out: bcg-vaksine/BCG-vaksine
  5. in: Bcg-Vaksine/BCG-vaksine<n><m><sg><ind> out: bcg-vaksine/BCG-Vaksine

In example 3 above, the first lowercase character of the lemma is the 'v'. If the character at that position in the word form is uppercase and the final character is uppercase too, we uppercase the lemma. If that character is uppercase while the final one is lowercase, as in example 5, we capitalise.
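
The rule can be sketched in Python as follows. This is an illustration, not the cg-proc implementation: the name restore_case is hypothetical, and the sketch assumes that word form and lemma line up character by character, which holds for the examples above but not for inflected forms in general.

```python
def restore_case(wordform: str, lemma: str) -> str:
    """Copy the case of `wordform` onto `lemma`, ignoring capitals
    that the lemma already has in the dictionary (e.g. acronyms)."""
    # Position of the first lower-case character in the lemma, if any.
    pivot = next((i for i, ch in enumerate(lemma) if ch.islower()), None)
    if pivot is None or pivot >= len(wordform):
        return lemma  # nothing to compare against: keep dictionary case
    if not wordform[pivot].isupper():
        return lemma  # the word form is lower case where it matters
    if wordform[-1].isupper():
        return lemma.upper()  # e.g. BCG-VAKSINE
    # Capitalise at the pivot only, e.g. BCG-Vaksine.
    return lemma[:pivot] + lemma[pivot].upper() + lemma[pivot + 1:]


if __name__ == "__main__":
    forms = ["bcg-vaksine", "BCG-vaksine", "BCG-VAKSINE", "Bcg-vaksine", "Bcg-Vaksine"]
    for form in forms:
        print(f"{form}/{restore_case(form, 'BCG-vaksine')}")
```

For a lemma with no capitals of its own, such as jeg, the first lowercase character is the first character, so the sketch falls back to the plain first-and-last-character check.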