Difference between revisions of "Letter case handling"

Revision as of 18:34, 3 September 2011

The same input word can reach a lexical processing module written in different letter cases. The most frequent cases are:

The whole word is in lower case.
e.g. beer
The whole word is in upper case.
e.g. IBM
The first letter is capitalised and the rest is in lower case (typical case for proper nouns)
e.g. Peter
The word mixes upper and lower case.
e.g. LaTeX

The transductions in the dictionary can also be written in any of these ways. The way in which a word is written in the dictionary is used to discard possible analyses of the word, according to the following rules:

If the input letter is upper case and the current analysis state has concordant transitions in lower case, these transductions are made.
If the input letter is lower case and the current state has no concordant transitions in lower case, the transductions are not made.

Thanks to this policy, a surface form that is not capitalised cannot be analysed as a proper noun.
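
The two rules can be illustrated with a minimal Python sketch. It is a simplification, not the analyser's actual code: the real module walks a transducer state by state, whereas this sketch compares a whole surface form against a single dictionary form. The function names `letter_matches` and `word_matches` are hypothetical.

```python
def letter_matches(input_ch: str, dict_ch: str) -> bool:
    """Return True if an input letter may follow a dictionary transition."""
    if input_ch == dict_ch:
        return True
    # An upper-case input letter may fall back to a lower-case transition;
    # a lower-case input letter never matches an upper-case transition.
    return input_ch.isupper() and input_ch.lower() == dict_ch


def word_matches(surface: str, dict_form: str) -> bool:
    """Compare a surface form with one dictionary form, letter by letter."""
    return len(surface) == len(dict_form) and all(
        letter_matches(s, d) for s, d in zip(surface, dict_form)
    )


for surface, dict_form in [("BEER", "beer"), ("peter", "Peter"), ("PETER", "Peter")]:
    print(surface, dict_form, word_matches(surface, dict_form))
```

Under these assumptions "BEER" is accepted for the entry "beer", while "peter" is rejected for the entry "Peter", which is why a non-capitalised form never receives the proper-noun analysis.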

The case of an input word is kept in the output of the translator unless it is explicitly changed. The case can be changed in the structural transfer module; this is useful, for example, when words are reordered or when a word is added before a capitalised word at the beginning of a sentence, as in the translation of the Catalan phrase Vindran into English: They will come.

Examples

Take the example words above and a dictionary for which lt-expand produces the following output:

beer:beer<n><sg>
IBM:IBM<np><org><sg>
Peter:Peter<np><ant><m><sg>
LaTeX:LaTeX<np><al><sg>

The following table gives the analyses that would be output in regular case-handling mode.

Input	Dictionary	Output
beer	beer	`^beer/beer<n><sg>$`
BEER	beer	`^BEER/BEER<n><sg>$`
Beer	beer	`^Beer/Beer<n><sg>$`
beeR	beer	`^beeR/beer<n><sg>$`
BeeR	beer	`^BeeR/BEER<n><sg>$`
BeEr	beer	`^BeEr/Beer<n><sg>$`
IBM	IBM	`^IBM/IBM<np><org><sg>$`
ibm	IBM	`^ibm/*ibm$`
Ibm	IBM	`^Ibm/*Ibm$`
IBm	IBM	`^IBm/*IBm$`
Peter	Peter	`^Peter/Peter<np><ant><m><sg>$`
peter	Peter	`^peter/*peter$`
PEter	Peter	`^PEter/Peter<np><ant><m><sg>$`
PETER	Peter	`^PETER/PETER<np><ant><m><sg>$`
LaTeX	LaTeX	`^LaTeX/LaTeX<np><al><sg>$`
LateX	LaTeX	`^LateX/*LateX$`
Latex	LaTeX	`^Latex/*Latex$`
latex	LaTeX	`^latex/*latex$`
LATEX	LaTeX	`^LATEX/LATEX<np><al><sg>$`

Keeping dictionary case

With the -w (or --dictionary-case) option, lt-proc does not normalise letter case, so e.g. "BeeR" gets the analysis ^BeeR/beer<n><sg>$. This is useful in connection with Constraint Grammar. If lt-proc normalised case, rules that refer to the lemma "beer" would also have to match "BEER" and "Beer", typically by using the case insensitivity option, which slows down analysis. With -w, the lemma keeps dictionary case after analysis.

However, we do want letter case normalisation before transfer. Fortunately cg-proc can do this for us: just pass the -w (or --wordform-case) option to cg-proc. The end result should be the same as when running lt-proc alone without -w.
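
As a rough illustration, the two options might be combined in a shell pipeline like the one assembled below. The file names `analyser.bin` and `grammar.bin` and the helper `analysis_pipeline` are placeholders chosen for this sketch; only the two -w flags come from the text above.

```python
import shlex


def analysis_pipeline(analyser_bin: str, grammar_bin: str) -> str:
    """Build a shell pipeline that keeps dictionary case until after CG."""
    # lt-proc -w: keep the lemma in dictionary case during analysis.
    lt_proc = ["lt-proc", "-w", analyser_bin]
    # cg-proc -w: re-apply the word form's case to the lemma afterwards.
    cg_proc = ["cg-proc", "-w", grammar_bin]
    return " | ".join(shlex.join(cmd) for cmd in (lt_proc, cg_proc))


# Placeholder file names; substitute the compiled files of a real language pair.
print(analysis_pipeline("analyser.bin", "grammar.bin"))
# prints: lt-proc -w analyser.bin | cg-proc -w grammar.bin
```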

How acronyms are dealt with

The cg-proc -w option already outputs this:

in: JEG/jeg<prn>, out: JEG/JEG<prn>  
in: JeG/jeg<prn>, out: JeG/JEG<prn> 
in: jeG/jeg<prn>, out: jeG/jeg<prn> 
in: Jeg/jeg<prn>, out: Jeg/Jeg<prn> 
in: jeg/jeg<prn>, out: jeg/jeg<prn>
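
The five mappings above are consistent with a rule that inspects only the first and last characters of the word form. A minimal Python sketch of such a rule follows; the function name `first_last_case` is hypothetical and the code is an approximation of the behaviour, not taken from cg-proc.

```python
def first_last_case(wordform: str, lemma: str) -> str:
    """Apply the word form's case to the lemma using first/last characters."""
    if not wordform or not lemma or not wordform[0].isupper():
        # Word form starts in lower case: keep the lemma as it is.
        return lemma
    if wordform[-1].isupper():
        # First and last characters are upper case: upper-case the lemma.
        return lemma.upper()
    # Only the first character is upper case: capitalise the lemma.
    return lemma[0].upper() + lemma[1:]


for wordform in ["JEG", "JeG", "jeG", "Jeg", "jeg"]:
    print(wordform, first_last_case(wordform, "jeg"))
```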

But looking only at the first and last characters is not enough if the lemma is, e.g., an acronym; we have to look at the first lowercase character in the lemma (baseform):

1. in: bcg-vaksine/BCG-vaksine<n><m><sg><ind> out: bcg-vaksine/BCG-vaksine
2. in: BCG-vaksine/BCG-vaksine<n><m><sg><ind> out: BCG-vaksine/BCG-vaksine
3. in: BCG-VAKSINE/BCG-vaksine<n><m><sg><ind> out: BCG-VAKSINE/BCG-VAKSINE
4. in: Bcg-vaksine/BCG-vaksine<n><m><sg><ind> out: Bcg-vaksine/BCG-vaksine
5. in: Bcg-Vaksine/BCG-vaksine<n><m><sg><ind> out: Bcg-Vaksine/BCG-Vaksine

So in example 3 above, the first lowercase character of the lemma is the 'v'. If the word form has that character in upper case and its final character is also upper case, we uppercase the lemma. If that character is upper case while the final one is lower case, as in example 5 above, we capitalise.
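
A minimal Python sketch of this acronym-aware rule is given below. It assumes that the word form and the lemma line up character by character, so that the position of the first lowercase character in the lemma can be used as an index into the word form; that does not hold for every inflected form. The function name `restore_case` is hypothetical.

```python
def restore_case(wordform: str, lemma: str) -> str:
    """Apply the word form's case to the lemma, pivoting on the first
    lowercase character of the lemma instead of its first character."""
    # Position of the first lowercase character in the lemma, if any.
    pivot = next((i for i, ch in enumerate(lemma) if ch.islower()), None)
    if pivot is None or pivot >= len(wordform):
        # Nothing to pivot on (e.g. an all-caps lemma): leave the lemma alone.
        return lemma
    if not wordform[pivot].isupper():
        # The pivot character is lower case in the word form: keep the lemma.
        return lemma
    if wordform[-1].isupper():
        # Pivot and final character are upper case: upper-case the lemma.
        return lemma.upper()
    # Only the pivot is upper case: capitalise the lemma at that position.
    return lemma[:pivot] + lemma[pivot].upper() + lemma[pivot + 1:]


for wordform in ["bcg-vaksine", "BCG-VAKSINE", "Bcg-Vaksine"]:
    print(wordform, restore_case(wordform, "BCG-vaksine"))
```

For an ordinary lemma such as "jeg" the pivot is the first character, so the sketch gives the same results as a rule that looks only at the first and last characters.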
