Dereferencing string pointers

First post !!!

Hi,

I'm new to C++ but not to programming, so I'm posting my question in the beginners forum as a best guess.

My question is about a piece of code for a Windows registry wrapper I'm writing, but it mostly concerns how VC++ behaves when dereferencing string pointers.
In the code below, I would like to know what rules determine the number of bytes read by a dereference.
For example, if lpData is an LPWSTR (a pointer to WCHAR) and I use *(lpData + dwDataSize), it reads 2 bytes, but *(unsigned*)(lpData + dwDataSize) reads 4 bytes.

1) What should I know about it?
2) Is it a behavior I can trust to be the same all the time?

BOOL HlpRegSetValueSzW(HKEY PredefinedKey, LPCWSTR lpSubKey, LPCWSTR lpValueName, LPCWSTR lpData, DWORD dwType)
{
	HKEY hKey = NULL;
	DWORD dwDataSize = 0;                                 // very important initialization = 0

	// Check if dwType is a supported string value types and while we're at it, let's compute the byte size
	// including the string null terminator(s) of lpData that is needed for the cbData parameter of RegSetValueEx
	
	if ((dwType == REG_SZ) || (dwType == REG_EXPAND_SZ))  // one null-char terminator strings
	{
		while (*(lpData + dwDataSize))                    // read 2 bytes (if 0 then we've found the null char)
		{
			dwDataSize++;                                 // at loop exit dwDataSize holds the string length
		}
		dwDataSize = (dwDataSize + 1) * sizeof(WCHAR);    // add 1 null-char and multiply by 2 for bytes count
	}
	else if (dwType == REG_MULTI_SZ)                      // two null-chars terminator strings
	{
		while (*(unsigned*)(lpData + dwDataSize))         // read 4 bytes (if 0 then we've found the 2 null chars)
		{
			dwDataSize++;
		}
		dwDataSize = (dwDataSize + 2) * sizeof(WCHAR);    // add 2 null-chars and multiply by 2 for bytes count
	}
	else
	{
		return FALSE;
	}

...
From a clarity standpoint... foo[index] is much clearer than *(foo + index). Every time I see the latter, I die a little inside.

That said...

> 1) What should I know about it
> 2) Is it a behavior I can trust to be the same all the time


You are assuming sizeof(WCHAR)*2 == sizeof(unsigned), which is probably true for whatever compiler you're using, but is not guaranteed. So this code will probably work fine for now, but might break if compiled under a different configuration.
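
To make that assumption visible, here is a minimal sketch. It uses char16_t as a portable stand-in for WCHAR, and the helper names bytes_read and unsigned_covers_two_chars are made up for illustration. The point is that the number of bytes read by a dereference comes from the pointee type alone, and that the size relationship the cast relies on can be checked at compile time.

```cpp
#include <cstddef>

// Hypothetical helper: the number of bytes one dereference of a T* reads.
// It depends only on the pointee type, never on the pointer's value.
template <typename T>
constexpr std::size_t bytes_read(const T*)
{
    return sizeof(T);
}

// Hypothetical helper: true only when a single 'unsigned' load covers exactly
// two characters of type CharT, which is what the cast in the question assumes.
template <typename CharT>
constexpr bool unsigned_covers_two_chars()
{
    return sizeof(unsigned) == 2 * sizeof(CharT);
}

// Turning the assumption into a compile-time check makes a mismatched
// configuration fail to build instead of silently misbehaving.
// char16_t stands in for WCHAR here (an assumption for portability).
static_assert(unsigned_covers_two_chars<char16_t>(),
              "one unsigned load must cover exactly two 16-bit characters");
```

With a check like this in place, the cast still depends on the sizes matching, but a configuration where they do not match is rejected by the compiler rather than discovered at run time.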


As for whether or not you can "trust" it... I guess it depends. Realistically it'll likely always work, but that isn't guaranteed by the language... so it's conceivable that it might not work, even if that's extremely unlikely. Personally, I would say "no, don't trust it", but it's a judgement call.


I would probably change this to use array indexing rather than pointer math, and avoid casts completely:

if ((dwType == REG_SZ) || (dwType == REG_EXPAND_SZ))
{
    while ( lpData[dwDataSize] )
    {
        dwDataSize++;
    }
    dwDataSize = (dwDataSize + 1) * sizeof(WCHAR);
}
else if (dwType == REG_MULTI_SZ)
{
    while ( lpData[dwDataSize] || lpData[dwDataSize+1] )
    {
        dwDataSize++;
    }
    dwDataSize = (dwDataSize + 2) * sizeof(WCHAR);
}
Disch,

You're probably right about readability, but I found no other way to get what I wanted in the assembly. My function doesn't need optimizing, of course, but for later coding I want to have an idea of what ends up in the assembly.

This...

{
    while ( lpData[dwDataSize] || lpData[dwDataSize+1] )
    {
        dwDataSize++;
    }
    dwDataSize = (dwDataSize + 2) * sizeof(WCHAR);
}


is translated in the assembly into two "cmp word ptr" instructions, while mine becomes a single "cmp dword ptr". That was the goal.

Does that matter? If you prefer control over the assembly more than readable code, then why are you using C++? Why not use assembly? ;P

IMO readable code is the most important. You should not sacrifice readability for performance unless it's a significant difference. I highly doubt the extra cmp will make any impact on the resulting program.


But like I say... it's a preference. That's my preference. Maybe yours is different. And that's fine. There's a tradeoff here and it's your decision as to which route you feel best suits your needs.

So to recap the situation as I see it:

option A: while (*(unsigned*)(lpData + dwDataSize))

Advantages to option A:
- might run ever-so-slightly faster.



option B: while ( lpData[dwDataSize] || lpData[dwDataSize+1] )

Advantages to option B:
- Easier to read/understand. More clearly represents the logic of what you're actually trying to do (that is: look for 2 consecutive nulls)
- Does not make assumptions about the size of WCHAR or the size of unsigned and therefore is ever-so-slightly more portable.
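
For anyone who wants to experiment with option B outside the registry wrapper, here is a self-contained sketch. It assumes char16_t as a stand-in for WCHAR so it builds on non-Windows compilers too, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// char16_t is used here as a portable stand-in for WCHAR (an assumption).

// Byte size of a single null-terminated string, terminator included.
std::size_t sz_byte_size(const char16_t* data)
{
    assert(data != nullptr);
    std::size_t n = 0;
    while (data[n])
        ++n;
    return (n + 1) * sizeof(char16_t);
}

// Byte size of a double-null-terminated string list, both terminators included.
std::size_t multi_sz_byte_size(const char16_t* data)
{
    assert(data != nullptr);
    std::size_t n = 0;
    while (data[n] || data[n + 1])   // stop at two consecutive nulls
        ++n;
    return (n + 2) * sizeof(char16_t);
}
```

For example, multi_sz_byte_size(u"ab\0cd\0") counts seven characters (the literal's implicit terminator supplies the second null), so it returns 7 * sizeof(char16_t).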

> Does that matter? If you prefer control over the assembly more than readable code, then why are you using C++? Why not use assembly? ;P


Once I got used to it, Microsoft's 32-bit macro assembler was, for me, a great dev tool, but the 64-bit version of MASM (and the 64-bit architecture as a whole) has made low-level assembly development less easy than it was. C++ is a powerful language and I'm making the switch. Maybe I should go easy with C++, but I can't erase many years of habit with a snap of the fingers.


I made a little change for readability, and so as not to assume anything about unsigned:
while (*(DWORD*)(lpData + dwDataSize))

Recasting is not that bad, but one should know what one is doing with it. In the end, there's a limit to writing idiot-proof code :-)

When I posted that question, I thought that maybe a reply would appear saying DON'T EVER DO THAT!!! BECAUSE...
I'm learning C++ slowly in my spare time, and there's a lot more learning to come.

By the way Disch, thanks for your comments.


> I made a little change for readability, and so as not to assume anything about unsigned:
> while (*(DWORD*)(lpData + dwDataSize))


You're still making the assumption about WCHAR (assuming it is 2 bytes).

Again, that is a reasonably "safe" assumption on Windows, but it is certainly not true on other systems (for example, wchar_t is typically 4 bytes wide on *nix).
> Recasting is not that bad

If lpData is not aligned on an alignof(DWORD) boundary, the result of the cast is unspecified; we would hopefully get a pointer to a mis-aligned object which would work, albeit with a performance penalty.
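
One way to keep a single 32-bit comparison without casting the pointer is to copy the bytes into an integer with std::memcpy. The sketch below assumes char16_t as a stand-in for WCHAR and std::uint32_t as a stand-in for DWORD; load_pair and at_double_null are hypothetical names. Because memcpy has no alignment requirement on its source, this also sidesteps the aliasing problem of reading character data through a DWORD lvalue. Compilers typically turn the copy into a plain load, though that is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// char16_t stands in for WCHAR and std::uint32_t for DWORD (assumptions).
static_assert(sizeof(std::uint32_t) == 2 * sizeof(char16_t),
              "a 32-bit value must cover exactly two 16-bit characters");

// Hypothetical helper: reads two consecutive characters as one 32-bit value.
// memcpy works for any source alignment, so no pointer cast is needed.
std::uint32_t load_pair(const char16_t* p)
{
    assert(p != nullptr);
    std::uint32_t pair = 0;
    std::memcpy(&pair, p, sizeof pair);
    return pair;
}

// Hypothetical helper: true when p points at two consecutive null characters.
bool at_double_null(const char16_t* p)
{
    return load_pair(p) == 0;
}
```

The caller must still guarantee that two characters are readable at p, exactly as with the cast.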

In the spirit of C (and assembly):
// compute the byte size including the string null terminator(s) of lpData
auto p = lpData ;
switch(dwType)
{
    case REG_SZ:
    case REG_EXPAND_SZ:
        while( *p++ ) ;
        break ;

    case REG_MULTI_SZ:
        while( *p++ || *p++ ) ;
        break ;

    default: return false ;
}

const std::ptrdiff_t dwDataSize = ( p - lpData ) * sizeof(*lpData) ;

http://coliru.stacked-crooked.com/a/8f686e7da5e07731

That is clever.

Read and increment the pointer at the same time... great.

Samples like that give me ideas. I'll take some time to digest that and play with it.

Thanks JLBorges.
> I'll take some time to digest that and play with it.

It would help if you turn your attention to these three points of interest:

1. Precedence: http://en.cppreference.com/w/cpp/language/operator_precedence

The postfix ++ operator has a higher precedence than the unary * (dereference) operator (inferred from grammar)


2. Order of evaluation http://en.cppreference.com/w/cpp/language/eval_order

a. The value computation of the built-in postincrement and postdecrement operators is sequenced before its side-effect.

b. Every value computation and side effect of the first (left) argument of the built-in logical AND operator && and the built-in logical OR operator || is sequenced before every value computation and side effect of the second (right) argument.
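
The two points can be tried out in isolation with a small sketch like the following, where char16_t is assumed as a stand-in for WCHAR and the function names are made up. *p++ parses as *(p++): the old character is tested, then the pointer advances. In *p++ || *p++, the right-hand side only runs after the left-hand side has been fully evaluated and found to be zero.

```cpp
#include <cassert>
#include <cstddef>

// char16_t is used here as a portable stand-in for WCHAR (an assumption).

// Character count up to and including the single terminator.
std::size_t count_with_terminator(const char16_t* s)
{
    assert(s != nullptr);
    const char16_t* p = s;
    while (*p++) { }            // *(p++): test the old character, then advance
    return static_cast<std::size_t>(p - s);
}

// Character count up to and including both terminators of a string list.
std::size_t count_with_double_terminator(const char16_t* s)
{
    assert(s != nullptr);
    const char16_t* p = s;
    // The left operand (and its increment) completes before the right operand
    // is touched, and the right operand is skipped when the left is non-zero.
    while (*p++ || *p++) { }
    return static_cast<std::size_t>(p - s);
}
```

For u"abc" the first function returns 4, and for u"ab\0cd\0" the second returns 7, since the literal's implicit terminator acts as the second null.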

( p - lpData ) * sizeof(*lpData)

When subtracting the first address from the last+1 address, I expected to get the byte count, but here we get the character count. When they say C++ is strongly typed, now I understand :-)
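
A small sketch of the difference, assuming char16_t as a stand-in for WCHAR (the Distances struct and the measure function are hypothetical): subtracting two pointers of the same type gives a count of elements, and the byte count only appears after multiplying by the element size or after viewing the same addresses as char pointers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type holding both ways of measuring a range.
struct Distances
{
    std::ptrdiff_t elements;   // what (last - first) yields for typed pointers
    std::ptrdiff_t bytes;      // what the same range measures in bytes
};

// Measures the range [first, first + count); char16_t stands in for WCHAR.
Distances measure(const char16_t* first, std::size_t count)
{
    assert(first != nullptr);
    const char16_t* last = first + count;

    Distances d;
    d.elements = last - first;
    d.bytes = reinterpret_cast<const char*>(last)
            - reinterpret_cast<const char*>(first);
    return d;
}
```

So for three characters, elements is 3 while bytes is 3 * sizeof(char16_t), which is why the code above multiplies by sizeof(*lpData).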
