Improper locale definitions

A C++ implementation defines both:
1) the named locales that are available and
2) the definitions of the facets contained in each locale.

The C++ standard has no say in the matter (except that each implementation must define the classic "C" locale).

Thus, even if a particular locale is defined by an implementation, there is no guarantee that its contained facets encapsulate conventions appropriate for the culture indicated by the locale's name.

For example, consider coliru. It defines the "de_DE.utf8" named locale. This is expected to be the locale that encapsulates German conventions. ("de" stands for Deutsch, the German name for "German".) However, this isn't the case:

#include <locale>
#include <iostream>

using namespace std;


int main()
{
    locale german {"de_DE.utf8"};
    
    const numpunct<char>& fac = 
             use_facet<numpunct<char>> (german);

    cout << "decimal point in German: "
         << fac.decimal_point() << endl << endl;
    
    cout << "true in German: " << fac.truename() << endl;
    cout << "false in German: " << fac.falsename() << endl ;
}


http://coliru.stacked-crooked.com/a/8e9de2cb88d03c6f

Here's the output:


decimal point in German: ,

true in German: true
false in German: false


We see the following:
1) The decimal point in German is defined correctly as , rather than .
2) The German names for "true" and "false" are "wahr" and "falsch", but the locale defines them as "true" and "false" instead.

Thus, we really can't depend on an implementation's definition of a POSIX or Microsoft locale.

To get a locale's correct definition, the following are perhaps the only options:
1) Use boost.locale in conjunction with ICU.
2) If interested in a certain set of conventions, write custom facets, generate a new locale from an existing one plus the custom facets and use the custom locale.

The 2nd option would be more efficient in space.

By the way, if I need to create a new locale containing 2 new facets, are 2 steps required?
1) Create locale1 from an existing one, adding the 1st new facet.
2) Create locale2 from locale1, adding the 2nd new facet.

This is because the available locale ctor allows us to add only 1 facet pointer at a time.

Thanks.
The German names for "true" and "false" are "wahr" and "falsch", but the locale defines them as "true" and "false" instead.

It's a bit of an oddity, but numpunct's truename/falsename are required to always return "true"/L"true" and "false"/L"false", in every system-supplied locale. You can of course override that behavior with your own facet, but by default they are locale-independent, like isxdigit.

To get a locale's correct definition, the following are perhaps the only options:
1) Use boost.locale in conjunction with ICU.
2) If interested in a certain set of conventions, write custom facets, generate a new locale from an existing one plus the custom facets and use the custom locale.

3) file a bug report to your C library vendor.

2) If interested in a certain set of conventions, write custom facets, generate a new locale from an existing one plus the custom facets and use the custom locale.


In reality, this isn't as simple as it sounds. Imagine if I get a "german" locale:

 
locale german {"de_DE.utf8"};


However, in reality, I could never be sure that the locale in fact encapsulates German conventions. For instance, the collate<> facet might not sort, taking into account the German character set (which includes diacritical marks such as umlauts, as well as the "sharp S", the Eszett character).

Then, detailed testing of each locale would be necessary, which would be difficult.

It would be like finding out that the standard sort() algorithm in fact doesn't work for a particular implementation.

Also, we would need to manually define the semantics for collate<> for the German locale.



3) file a bug report to your C library vendor.


It would take time for the vendor to implement a solution.
I could never be sure that the locale in fact encapsulates German conventions. For instance, the collate<> facet might not sort, taking into account the German character set

That would be a really major bug in your C library! Works for me on Linux (with glibc) just fine. If you're worried, set up a unit test as part of CI.

3) file a bug report to your C library vendor.
It would take time for the vendor to implement a solution.

If you don't report, who will? GNU libc is actually pretty good there: they even sort collation units correctly (three-character cluster "dzs" in Hungarian locales sorts after "dz" and before "g"), although it has its issues, see https://sourceware.org/bugzilla/show_bug.cgi?id=14095