Which features/libraries to prefer/Semantics
New code should prefer modern C++ using C++03. Here, modern C++ is defined in opposition to the "C with classes" style. (Note that quite a bit of existing code would probably qualify as "C with classes".) In practice, for our purposes, this means:
- Use const and references where possible
- Prefer C++ casts over C casts
- Prefer the C++ stdlib over the C standard library
- Prefer containers over home made data structures.
- Avoid void*
- There are lots of wstrings kept in memory about the place. These are UTF-16 on Windows and UTF-32 on Linux.
- There are also UTF-8 encoded char* and strings kept in memory in some places.
- Mixing wide and narrow character streams is forbidden by the standard (http://stackoverflow.com/questions/8947949/mixing-cout-and-wcout-in-same-program) and can also cause real problems in practice (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42552 https://sourceforge.net/p/apertium/tickets/106/).
- Use UTF-8 for serialising to files.
- Use wcerr for outputting errors.
- In terms of internal encodings, go with the flow for now.
- You might have to use either wcout or cout depending on the situation, but they should never mix within a single execution of a program. To help with this, follow this rule: if you expect the program's output to be piped, don't output to wcout at all, just cout. If you expect the output to end up on a terminal, use wcout.
- If you have a UTF-8 string rather than a 7-bit clean ASCII string, convert it with UtfConverter::fromUtf8 before outputting.
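The last rule in the list above depends on knowing whether a narrow string is 7-bit clean. A minimal sketch of such a check follows. `isSevenBitClean` is a hypothetical helper name invented for this example, not part of Apertium or Lttoolbox, and UtfConverter::fromUtf8 is only mentioned in a comment because it lives in the existing code base.

```cpp
#include <string>

// Hypothetical helper (not an Apertium API): returns true when every byte
// of the string is below 0x80, i.e. the string is 7-bit clean.
bool isSevenBitClean(const std::string &text)
{
  for (std::string::const_iterator it = text.begin(); it != text.end(); ++it)
  {
    // Cast to unsigned char first: plain char may be signed.
    if (static_cast<unsigned char>(*it) >= 0x80)
    {
      return false;
    }
  }
  return true;
}

// A string that fails this check holds multi-byte UTF-8 sequences and,
// per the rule above, should go through UtfConverter::fromUtf8 before
// being output.
```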
It might be nice in future to use UTF-8 everywhere inside Apertium and re-encode strings only when necessary at API boundaries (e.g. for stdio, re-encode to wstring and use wcout/wcerr only in environments with a non-UTF-8 locale, and otherwise use cout/cerr). This could then be wrapped in a thin portable abstraction. There's some information about this way of doing things here: http://utf8everywhere.org/
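Returning to the language-feature preferences listed at the start of this section (const and references, C++ casts, the C++ stdlib, containers), a small sketch may make them concrete. The function name `averageLength` is made up for this example, and the code sticks to C++03 features.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical example function: average length of the words in a container.
// - takes a std::vector by const reference instead of a raw array plus length
// - uses static_cast instead of a C-style cast
double averageLength(const std::vector<std::string> &words)
{
  if (words.empty())
  {
    return 0.0;
  }

  std::size_t total = 0;
  for (std::vector<std::string>::const_iterator it = words.begin();
       it != words.end(); ++it)
  {
    total += it->size();
  }
  return static_cast<double>(total) / static_cast<double>(words.size());
}
```

Passing the container by const reference avoids a copy and documents that the function does not modify its argument.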
Exceptions are in use, but the code doesn't follow the rules that would make it exception safe. The strictest type of exception safety means that every new/delete pair is either managed by a smart pointer (auto_ptr, unique_ptr) or placed in a constructor/destructor pair. Currently MorphoStream doesn't meet this standard. This isn't (yet) a problem, since most exceptions result in the program ending (and the whole process being deallocated by the OS).
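As an illustration of the constructor/destructor rule, here is a minimal C++03-compatible sketch. `ScopedBuffer`, `live_count` and `fillOrThrow` are hypothetical names invented for this example and do not exist in Apertium. The point is only that the delete[] still runs while the stack unwinds.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical RAII wrapper: new[] in the constructor, delete[] in the
// destructor, so the buffer is released even if an exception is thrown.
class ScopedBuffer
{
public:
  explicit ScopedBuffer(std::size_t size) : data(new int[size]), length(size)
  {
    ++live_count;
  }

  ~ScopedBuffer()
  {
    delete[] data;
    --live_count;
  }

  int *get() { return data; }
  std::size_t size() const { return length; }

  // Number of buffers currently alive; only here to make the effect visible.
  static int live_count;

private:
  // Copying is disabled (C++03 idiom: declared private, never defined).
  ScopedBuffer(const ScopedBuffer &);
  ScopedBuffer &operator=(const ScopedBuffer &);

  int *data;
  std::size_t length;
};

int ScopedBuffer::live_count = 0;

// Allocates a buffer, then optionally throws. Either way the ScopedBuffer
// destructor runs when the function is left.
void fillOrThrow(bool fail)
{
  ScopedBuffer buffer(16);
  buffer.get()[0] = 42;
  if (fail)
  {
    throw std::runtime_error("simulated failure");
  }
}
```

With a bare new[]/delete[] pair inside fillOrThrow, the throw would skip the delete[] and leak the buffer. Here the destructor releases it on both the normal and the exceptional path.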
I'd like to use $LIB, e.g. Boost:Wurble, or remove ifdefs by using e.g. gnulib
It's going to make it impossible to build for language pair authors.
But if it's a dependency that's built in tree it's just a matter of getting the source there. It could be bootstrapped with a simple script or even (ick) vendorised into Subversion.
It could be considered if the benefit were great enough and there was sufficient confidence it wouldn't make building harder, but the balance of argument is heavily against it. Implement whatever you need from scratch if it's small, or find a small library that does just what's needed to vendorise. Double check the functionality isn't already a utility in Apertium or Lttoolbox or part of an existing dependency.
This is less important. Currently through the code base there are:
- Sergio's style; e.g. fst_processor.cc
- m5w's style; e.g. basic_stream_tagger_trainer.cc
- felipe's style; e.g. apertium_tagger.cc
Most code is Sergio's style. See Emacs_C_style_for_Apertium_hacking for an older attempt at formalising it.
Wiki TODO: Document this below.
Possible project TODO: Run clang-format
- loosely following the Clang standards for naming and such
- strictly for indenting -- clang-format
- default to Clang naming style, unless I'm talking about some kind of STL class or STL imitation
- an example of imitation is the serialiser class for pairs, whose members are named first_Type and second_Type so we can do pair.first and pair.second
- keep the STL format for the part of the name that's STL
- but if it's followed by something not STL, then I finish out with an underscore and then go to CamelCase
- another divergence is with variables that don't have a whole lot of information e.g. something being passed to a serialiser: they get bland names such as Stream and SerialisedType
- since those are already used as type names -- just add a trailing underscore
- Braces on separate lines
- camelCase methods
- snake_case local and member variables
- UpperCamelCase class names
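The last four conventions in the list above can be seen together in a short sketch. `WordCounter` and its members are invented for illustration and are not taken from the Apertium sources.

```cpp
#include <string>

// UpperCamelCase class name
class WordCounter
{
public:
  WordCounter() : word_count(0)
  {
  }

  // camelCase method name; braces on separate lines
  void addLine(const std::string &line)
  {
    bool in_word = false; // snake_case local variable
    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
      const bool is_space = (line[i] == ' ' || line[i] == '\t');
      if (!is_space && !in_word)
      {
        ++word_count; // a new word starts here
      }
      in_word = !is_space;
    }
  }

  int getWordCount() const
  {
    return word_count;
  }

private:
  int word_count; // snake_case member variable
};
```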