Coding in Unicode

I was looking into the source of a physics model in VS2008 today and noticed that non-ASCII characters were being used in identifiers.


CL_γ = CL_α + β;
ξ_0 = CL_γ * cos( 2*π*ω0 + φ );

Is it part of the C++ standard to support wide-characters or is this a compiler-specific feature? Does anyone else use this for math or to perhaps code in Chinese/Japanese? I'm curious here.

helios (17607)

I believe the standard does say at one point that identifiers can only consist of Latin alphanumeric characters and underscores. It doesn't leave this detail up to the implementation.
So, this would be an extension.

Generally speaking, using non-ASCII characters in a source file is a bad idea if you're going for portability, because each compiler decides for itself which source encodings (e.g. UTF-8, UTF-16, Shift JIS) it accepts and how it maps them to characters.

JLBorges (13770)

> Is it part of the C++ standard to support wide-characters

Yes. Though the terminology used by the standard is UCS (aka ISO 10646).
http://en.wikipedia.org/wiki/Universal_Character_Set

In particular, the standard has no requirement that the encoding or the glyphs corresponding to the ASCII character set must be used by an implementation.

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
...
Footnote: The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

The universal-character-name construct provides a way to name other characters. ...

And:

An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in E.2.

Footnote: On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name may be used in forming valid external identifiers. ...
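
To make the quoted rule concrete, here is a minimal sketch of the original snippet with every Greek letter spelled as a universal-character-name, so the source file itself stays within the basic source character set. The function names lift_gamma and xi_0 and their parameter lists are made up for illustration and are not from the original physics model. It assumes a C++11 or later compiler that accepts universal-character-names in identifiers.

```cpp
#include <cmath>

// A universal-character-name (UCN) names a character by its ISO 10646
// code point. U+03B3 is GREEK SMALL LETTER GAMMA.
static_assert(U'\u03B3' == 0x03B3, "UCN names the code point U+03B3");

// CL_γ = CL_α + β, written with UCNs:
//   \u03B1 = α, \u03B2 = β, \u03B3 = γ
// (lift_gamma is a hypothetical name chosen for this example.)
double lift_gamma(double CL_\u03B1, double \u03B2) {
    const double CL_\u03B3 = CL_\u03B1 + \u03B2;
    return CL_\u03B3;
}

// ξ_0 = CL_γ * cos(2*π*ω_0 + φ), written with UCNs:
//   \u03BE = ξ, \u03C0 = π, \u03C9 = ω, \u03C6 = φ
// (xi_0 is a hypothetical name chosen for this example.)
double xi_0(double CL_\u03B3, double \u03C9_0, double \u03C6) {
    const double \u03C0 = std::acos(-1.0);
    const double \u03BE_0 = CL_\u03B3 * std::cos(2 * \u03C0 * \u03C9_0 + \u03C6);
    return \u03BE_0;
}
```

This form is much harder to read than the raw Greek letters, which is probably why code like the snippet in the question relies on the compiler accepting the characters directly.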

> Does anyone else use this for math or to perhaps code in Chinese/Japanese?

Yes, people do. Though, usually for code that is used internationally, there would be a coding guideline which says something like: 'The characters used for identifiers should be limited to the characters in the basic source character set.'


Stewbond (2827)

Thanks both. That satisfies my interests.
