Code style
Contents
C++
Which features/libraries to prefer/Semantics
New code should prefer modern C++ using C++03. Here, modern C++ is defined in opposition to "C with classes" style. (Note, there is quite a bit of existing code which would probably qualify as "C with classes".) In practice, for our purposes, This means:
Do
- Use const and references where possible
- Prefer C++ casts over C casts
- Prefer the C++ stdlib over the C standard library
- Prefer containers over home made data structures.
Don't
- Use void*
String encoding
Situation
- There are lots of wstrings kept in memory about place. These are UTF-16 on Windows and UTF-32 on Linux.
- There are also char* and strings which are UTF-8 encoded kept in memory in some places.
- Mixing wide and narrow character streams is forbidden by the standard (http://stackoverflow.com/questions/8947949/mixing-cout-and-wcout-in-same-program) and also can also cause real problems in practice (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42552 https://sourceforge.net/p/apertium/tickets/106/).
Approach
- Use UTF-8 for serialising to files.
- Use wcerr for outputting errors.
- In terms of internal encodings, go with the flow for now.
- You might have to use either wcout or cout depending on the situation but they shouldn't ever mix within the execution of a program. To help with this, follow the following rule: If you're expecting the program's output to be piped - don't output to wcout at all, just cout. If you're expecting the the output to end up on a terminal, use wcout.
- If you have a UTF-8 string rather than an 7-bit clean ANSI string, you should UtfConverter::fromUtf8 it before outputting.
Alternatives
It might be nice in future to use utf-8 everywhere inside Apertium and re-encode strings only when necessary at API boundaries. (Eg for stdio, re-encode to wstring and use wcout/wcerr only in environments with non-utf8 locale otherwise use cout/cerr). This could then be wrapped in a thin portable abstraction. There's some information about this way of doing things here: http://utf8everywhere.org/
I'd like to use $LIB eg Boost:Wurble or remove ifdefs by using eg gnulib
No.
Please?
No.
Why not?
It's going to make it impossible to build for language pair authors.
But if it's a dependency that's built in tree it's just a matter of getting the source there. It could be bootstrapped with a simple script or even (ick) vendorised into Subversion.
It could be considered if the benefit were great enough and there was sufficient confidence it wouldn't make building harder, but the balance of argument is heavily against it. Implement whatever you need from scratch if it's small, or find a small library that does just what's needed to vendorise. Double check the functionality isn't already a utility in Apertium or Lttoolbox or part of an existing dependency.
Formatting/Syntax
This is less important. Currently through the code base there are:
- Sergio's style; e.g. fst_processor.cc
- m5w's style; e.g. basic_stream_tagger_trainer.cc
- felipe's style; e.g. apertium_tagger.cc
Most code is Sergio's style. See Emacs_C_style_for_Apertium_hacking for an older attempt at formalising it.
Wiki TODO: Document this below.
Possible project TODO: Run clang-format
m5w
- loosely following the Clang standards for naming and such
- strictly for indenting -- clang-format
- default to Clang naming style, unless I'm talking about some kind of STL class or STL imitation
- example of imitation is the serialiser class for pairs which are named as such: first_Type, second_Type so we can do can do pair.first and .second
- keep the STL format for the part of the name that's STL
- but if it's followed by something not STL, then I finish out with an underscore and then go to CamelCase
- another divergence is with variables that don't have a whole lot of information e.g. something being passed to a serialiser: they get bland names such as Stream and SerialisedType
- since those are already used as type names -- just add a trailing underscore
Sergio
- Braces on separate lines
- camelCase methods
- snake_case local and member variables
- UpperCamelCase class names
Python
PEP-8?