Boost.Locale
The C++ standard library offers a simple and powerful way to provide locale-specific information. This is done via the std::locale class, a container that holds all the required information about a specific culture, such as number formatting patterns, date and time formatting, currency, and case conversion.
All this information is provided by facets, special classes derived from the std::locale::facet base class. Such facets are packed into the std::locale class and allow you to provide arbitrary information about the locale. The std::locale class keeps reference counters on installed facets and can be copied efficiently.
Each facet installed in a std::locale object can be fetched using the std::use_facet function. For example, the std::ctype<Char> facet provides rules for case conversion, so you can use it to convert a character to upper case.
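A minimal sketch of that lookup, assuming the plain char specialization std::ctype<char> and the classic "C" locale as the default. The helper names upper_char and upper_string are illustrative only and handle single-byte characters.

```cpp
#include <locale>
#include <string>

// Upper-case a single character using the ctype facet stored in `loc`.
// The helper names here are illustrative, not part of any library.
char upper_char(char c, const std::locale& loc = std::locale::classic())
{
    return std::use_facet<std::ctype<char> >(loc).toupper(c);
}

// Upper-case a whole string with the range overload of ctype::toupper.
std::string upper_string(std::string text,
                         const std::locale& loc = std::locale::classic())
{
    if (!text.empty()) {
        const std::ctype<char>& facet = std::use_facet<std::ctype<char> >(loc);
        facet.toupper(&text[0], &text[0] + text.size());
    }
    return text;
}
```

Passing a different std::locale as the second argument selects that locale's case-conversion rules instead.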
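A locale object can be imbued into an iostream so that the stream formats information according to that locale. For example, printing the number 1345.45 to a stream imbued first with a US English locale and then with a German locale would display:
1,345.45
1.345,45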
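The two results shown above can be reproduced with a sketch like the following. Named system locales may not be installed on every machine, so this version does not construct them by name. Instead it assumes two hand-written std::numpunct facets, us_style_punct and german_style_punct, which are hypothetical stand-ins for the punctuation rules of the real locales.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: '.' as decimal point, ',' between groups of three digits.
struct us_style_punct : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return '.'; }
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Hypothetical facet: ',' as decimal point, '.' between groups of three digits.
struct german_style_punct : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

// Format `value` through a stream that has been imbued with `loc`.
std::string format_number(double value, const std::locale& loc)
{
    std::ostringstream out;
    out.imbue(loc);   // from now on the stream formats numbers using `loc`
    out << value;
    return out.str();
}
```

A locale carrying one of these facets can be built with std::locale(std::locale::classic(), new us_style_punct); the locale takes ownership of the facet and deletes it when the last copy goes away.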
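You can also create your own facets and install them into existing locale objects. For example, a facet could describe the measurement units a culture uses for distances. Such a facet is provided to a locale by constructing a new locale from an existing one plus the facet. Code that prints a distance can then fetch the facet from the stream's locale and format the value according to the correct locale.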
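The three steps just described (define a facet, install it into a locale, use it when printing) might look like the sketch below. The measure class, its two units, and the distance_in helper are assumptions made for illustration; only std::locale::facet, std::locale::id, and std::use_facet come from the standard library.

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical custom facet: which unit a culture uses for distances.
class measure : public std::locale::facet {
public:
    enum measure_type { meters, inches };

    // Every facet type needs a static id so a locale can index it.
    static std::locale::id id;

    explicit measure(measure_type m, std::size_t refs = 0)
        : std::locale::facet(refs), type_(m) {}

    // Convert a value given in meters into the unit of this facet.
    double from_metric(double meters_value) const
    {
        return type_ == inches ? meters_value / 0.0254 : meters_value;
    }

    std::string name() const { return type_ == inches ? "in" : "m"; }

private:
    measure_type type_;
};

std::locale::id measure::id;

// Print a distance using whatever measure facet the stream's locale carries.
void print_distance(std::ostream& out, double meters_value)
{
    const measure& m = std::use_facet<measure>(out.getloc());
    out << m.from_metric(meters_value) << " " << m.name();
}

// Install the facet into a copy of the classic locale and format one value.
std::string distance_in(measure::measure_type unit, double meters_value)
{
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new measure(unit)));
    print_distance(out, meters_value);
    return out.str();
}
```

Note that std::use_facet throws std::bad_cast if the stream's locale has no measure facet installed, so real code would either install a default or check with std::has_facet first.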
This technique was adopted by the Boost.Locale library in order to provide powerful and correct localization. Instead of using the very limited C++ standard library facets, it uses ICU under the hood to create its own much more powerful ones.
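There are numerous issues in the standard library that prevent the use of its full power:

Setting the global locale has side effects. Suppose a program sets the system's default locale as the global locale and then writes the numbers 1.1 and 1.3, separated by a comma, to a file test.csv. What would be the content of test.csv? It may be "1.1,1.3" or it may be "1,1,1,3" rather than what you had expected. The global locale also affects printf and libraries like boost::lexical_cast, giving incorrect or unexpected formatting. In fact many third-party libraries are broken in such a situation. Boost.Locale, by contrast, never changes basic number formatting, even when it uses std based localization backends, so by default numbers are always formatted using the C-style locale. Localized number formatting requires specific flags.

Number formatting is broken on some locales. Some locales use a non-breaking space as the thousands separator, so in the ru_RU.UTF-8 locale the number 1024 should be displayed as "1 024", where the space is the Unicode character with code point U+00A0. Unfortunately many libraries don't handle this correctly. For example, GCC and SunStudio output only "\xC2", the first byte of the UTF-8 sequence "\xC2\xA0" that represents this code point, and so actually generate invalid UTF-8.

Locale names are not standardized. For example, under MSVC you need to provide the name en-US or English_USA.1252, whereas on POSIX platforms it would be en_US.UTF-8 or en_US.ISO-8859-1.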
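The test.csv problem from the first issue can be reproduced without any installed system locale. The sketch below is an illustration under assumptions: it writes to a string stream instead of a file, and it uses a hypothetical comma_decimal_punct facet to stand in for a system locale whose decimal point is a comma.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a locale that uses ',' as decimal point.
struct comma_decimal_punct : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Write "1.1,1.3" as a CSV line while `global_locale` is the global locale.
std::string csv_line_with_global(const std::locale& global_locale)
{
    // std::locale::global returns the previous global locale.
    const std::locale previous = std::locale::global(global_locale);

    std::ostringstream csv;        // new streams copy the current global locale
    csv << 1.1 << "," << 1.3;

    std::locale::global(previous); // restore, so callers are not affected
    return csv.str();
}
```

With the classic locale as the global one the line is a valid two-column record. With the comma facet installed globally the same code emits four comma-separated fields, even though the stream itself was never imbued explicitly.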