Performance when using TCHAR

Pages: 12
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
#Compile Exe                 ' 1)Create 2,000,000 dashes (2,000,000 bytes);
#Dim All                     ' 2)Change every 7th dash to a "P";
%NUMBER      = 2000000       ' 3)replace every "P" with a "PU" (hehehe);
%LINE_LENGTH = 90            ' 4)replace every dash with an "8";
%RIGHT_BLOCK = 4000          ' 5)Put in a CrLf every 90 characters;
                             ' 6)Output last 4K Characters to Message Box;
Declare Function GetTickCount Lib "Kernel32.dll" Alias "GetTickCount" () As Dword

Function PBMain() As Long
  Local NUM_PS, PU_EXT_LENGTH, NUM_FULL_LINES, MAX_MEM As Long
  Register i As Long, iCount As Long
  Local s1, s2 As String
  Local currPos As Long
  Local tick As Dword

  tick = getTickCount
  NUM_PS         = %NUMBER/7                       'Think 'number Ps'
  PU_EXT_LENGTH  = %NUMBER+NUM_PS                  'How much longer will adding Us make it?
  NUM_FULL_LINES = PU_EXT_LENGTH/%LINE_LENGTH      'How many full lines after $CrLf every 90 bytes?
  MAX_MEM        = PU_EXT_LENGTH+NUM_FULL_LINES*2  'Maximum bytes needed.
  s1 = String$(%NUMBER, "-")                       '1) Create a 2000000 byte string of dashes;
  For i = 1 To %NUMBER
    Incr iCount
    If iCount = 7 Then
       iCount = 0
       Asc(s1, i) = &h50                           '2) Change every 7th dash to a "P";
    End If
  Next i
  Replace "P" With "PU" In s1                      '3) Replace every "P" with a "PU" (hehehe);
  Replace Any "-" With "8" In s1                   '4) Replace every null with an "8";
  s2 = String$(MAX_MEM, "-"$)
  currPos = 1  'for s2 position tracking
  For i = 1 To PU_EXT_LENGTH Step 90               '5) Put in a carriage return & line feed every 90
    Mid$(s2, currPos) = Mid$(s1, i, 90) & $CRLF    '   characters;
    currPos = currPos + 92
  Next i
  s2=Right$(s2,%RIGHT_BLOCK)                       '6) Output last 4K to Message Box.
  tick = getTickCount - tick
  MsgBox s2, %MB_OK, "Here's Your String John In " & Str$(tick) & " Ticks!"

  PBMain=0
End Function


That’s not exactly John’s original code – but close. Anyway, I was usually getting around 160 to 200 ticks on that with the same laptop I’m using here for comparison purposes. John was hoping someone could post a C or C++ version for comparison purposes, because that’s where this all started, i.e., a comparison of C verses PowerBASIC processing speed, as I had mentioned.

Now there aren’t a lot of C++ coders that hang around the PowerBASIC Forums. Quite a few asembler folks and a few C coders – but not too many C++ coders. Other than me that is. And I couldn’t offer much at the time even though I was interested and wanted too, because the big hang up I was having with John’s algorithm from a C++ standpoint was my String Class didn’t have anything that remotely did this …

 
Replace "P" With "PU" In s1


What that’s saying is for every P, and the Ps are every seventh character in the 2,000,000 byte string, replace it with PU, which will cause everything to the right of the P to shift over one character. So the string is going to grow. So just showing a few bytes, if s1 is originally this…

------P------P------P


which is 21 bytes, after that replace operation it should be this…


------PU------PU------PU


which is 24 bytes. Actually, the insertion of CrLfs every 90 characters will do the same. John is an expert assembler coder and knows how to inflick maximal pain on hardware and software! And he went right for the throat, knowing full well that the Achillies heel of C/C++ was their lack of a string data type and high level string handling.

So I had to tuck the whole thing away until I had the time to work on it which I eventually did. The first thing I did wasn’t to develop that program above that came in with 273,000 ticks to get the job done. That came later. The first thing I did was add a String::Replace() member to my String Class which acted not exactly but close to the PowerBASIC Replace Statement. Note in the PowerBASIC Replace Statement the input string is s1, and after the replacement s1 holds the new string. In implementing this within my string class I decided my Replace() member function would leave the original string, i.e., s1, alone, and return a new string with the replacements done, i.e., something like so…

s2 = s1.Replace(…..);

In terms of the parameters, I would only need two, i.e., the ‘match’ string, which in our case here is ‘P’, and the ‘new’ string that would be substituted where an occurance of the ‘match’ string was found – here ‘PU’.

QUESTION: How does one actually implement a C++ Class member function?

ANSWER: That’s a no brainer, dummy! With C code!

This is what I eventually worked out …

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
String String::Replace(TCHAR* pMatch, TCHAR* pNew)  //strncmp = _tcsncmp
{
 SIZE_T i,iLenMatch,iLenNew,iCountMatches,iExtra,iExtraLengthNeeded,iAllocation,iCtr;
 iLenMatch=(int)_tcslen(pMatch);           // how long is match string?
 iCountMatches=0, iAllocation=0, iCtr=0;
 iLenNew=(int)_tcslen(pNew);               // how long new string?
 if(iLenNew==0)                            // if pNew is a “” string, this resolves 
 {                                         // to a String::Remove() call.
    String sr=this->Remove(pMatch,true);   // return
    return sr;
 }
 else
 {
    iExtra=iLenNew-iLenMatch;              // how many extra characters needed for 
    for(i=0; i<this->iLen; i++)            // each match?  Then count matches in this
    {
        if(_tcsncmp(lpBuffer+i,pMatch,iLenMatch)==0)
           iCountMatches++;  //Count how many match strings
    }
    iExtraLengthNeeded=iCountMatches*iExtra;     // this will be hom much bigger the
    iAllocation=this->iLen+iExtraLengthNeeded;   // new string will be.  So call
    String sr(iAllocation,false);                // constructor for new string sr that
    for(i=0; i<this->iLen; i++)                  // will be returned.
    {
        if(_tcsncmp(this->lpBuffer+i,pMatch,iLenMatch)==0)  // work through both buffers
        {                                                   // one character at a time
           _tcscpy(sr.lpBuffer+iCtr,pNew);                  // with two counters running.
           iCtr+=iLenNew;                                   // increment the 2nd buffer
           i+=iLenMatch-1;                                  // counter to reflect length
        }                                                   // of replacement string.
        else
        {
           sr.lpBuffer[iCtr]=this->lpBuffer[i];
           iCtr++;
        }
        sr.lpBuffer[iCtr]=_T('\0');
    }
    sr.iLen=iCtr;
    return sr;
 }
}


Never really knew if it was all that good. But it worked. So when I finally got it to run on John’s algorithm I ended up with these timings …

1
2
3
4
5
6
7
8
Processing...
After Allocating s1(MAX_MEM)...         : 0
After Making 2,000,000 dashes...        : 16
After Assigning Ps Every 7th Char...    : 31
After Doing John's PU Thing...          : 328     <<< My String::Replace()
After Replacing Dashes With 8s...       : 531
After Big Concatenation With CrLfs...   : 16516
Finished!                               : 16516 


Not so good compared to the 160 – 200 ticks of the PowerBASIC program, but certainly way better than the 273,000 ticks of the other C++ program using the Std Lib’s String Class. However, at the time I didn’t know that; I had as of yet not developed that other program. After seeing output such as above my conclusion was that either C++ didn’t rate very well compared to PowerBASIC or my particular String Class sucked or both of the above. I mean were talking 16,000 ticks compared to 180 or so!

Continued …

By the way, here is the C++ program using my String Class that returned the 16000 ticks. Unlike the others I posted (except the PowerBASIC ones) you won’t be able to run this because you won’t have my String Class, but this is what the program looked like. I’m not planning to post my String Class unless anyone wants it. Its kind of long and needs about three of four posts to fit it all. If I would post it you would be able to run this then …

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
#include "Windows.h"
#include <stdio.h>
#include "Strings.h"

enum                                              // Exercise
{                                                 // =======================================
 NUMBER         = 2000000,                        // 1)Create a 2MB string of dashes
 LINE_LENGTH    = 90,                             // 2)Change every 7th dash to a "P"
 RIGHT_BLOCK    = 4000,                           // 3)replace every "P" with a "PU" (hehehe)
 NUM_PS         = NUMBER/7+1,                     // 4)replace every dash with an "8"
 PU_EXT_LENGTH  = NUMBER+NUM_PS,                  // 5)Put in a CrLf every 90 characters
 NUM_FULL_LINES = PU_EXT_LENGTH/LINE_LENGTH,      // 6)Output last 4K to Message Box
 MAX_MEM        = PU_EXT_LENGTH+NUM_FULL_LINES*2
};

int main()
{
 String  CrLf("\r\n");
 register int iCtr=0;
 register int i=0;
 char szTitle[64];
 DWORD t1,t2;

 t1=GetTickCount(), t2=t1;
 puts("Processing...");
 String s1(MAX_MEM);
 t1=GetTickCount()-t1;
 printf("After Allocating s1(MAX_MEM)...         : %u\n",(unsigned)t1);
 t1=t2;
 s1.Make('-',NUMBER);
 t1=GetTickCount()-t1;
 printf("After Making 2,000,000 dashes...        : %u\n",(unsigned)t1);
 t1=t2;
 String s2(MAX_MEM);
 for(i=0; i<NUMBER; i++, iCtr++)
 {
     if(iCtr==7)
     {
        s1.SetChar(i,'P');
        iCtr=0;
     }
 }
 t1=GetTickCount()-t1;
 printf("After Assigning Ps Every 7th Char...    : %u\n",(unsigned)t1);
 t1=t2;
 s2=s1.Replace((char*)"P",(char*)"PU");
 t1=GetTickCount()-t1;
 printf("After Doing John's PU Thing...          : %u\n",(unsigned)t1);
 t1=t2;
 s1=s2.Replace((char*)"-",(char*)"8");
 t1=GetTickCount()-t1;
 printf("After Replacing Dashes With 8s...       : %u\n",(unsigned)t1);
 t1=t2;
 s2.SetChar(0,'\0');
 for(i=0; i<PU_EXT_LENGTH; i+=LINE_LENGTH)
     s2+=s1.Mid(i+1,LINE_LENGTH)+CrLf;
 t1=GetTickCount()-t1;
 printf("After Big Concatenation With CrLfs...   : %u\n",(unsigned)t1);
 t1=t2;
 s1=s2.Right(RIGHT_BLOCK);
 t1=GetTickCount()-t1;
 printf("Finished!                               : %u\n",(unsigned)t1);
 sprintf(szTitle,"Here's Your String In %d Ticks John!",(unsigned)t1);
 MessageBox(0,s1.lpStr(),szTitle,MB_OK);
 getchar();

 return 0;
}


So the next thing I decided to do was see if I could finally learn the C++ String Class from the C++ Std. Library to see how it should really be done. But I quickly ran into problems there as I couldn’t seem to find anything there that was a usable replacement for PowerBASIC’s Replace statement, or the one I finally coded for my String Class. To this day I still can’t believe that. I keep feeling in my ignorance I just missed it. After giving it a pretty fair amount of research and work I finally came across a function that seemed to do the trick in Bruce Eckel’s “Thinking In C++” book, Volume 2. That would be that ReplaceAll() function in my previous post with the 273,000 tick count. I was happy to find that function in a book by a famous author and C++ teacher. I figured it ought to be good. Afterall, he’s a friend of Charles Petzold, a former member of the C++ Standards Committee, and a well known C++ book author, for whatever that’s worth, and I figure that ought to be worth something.

So when I hit run on that program I expected to see instant results like with the PowerBASIC program. I figurred you dumb piker you’ll see how its really done now! I figured it would likely come in faster than the PowerBASIC program. I wasn’t prepared to wait four and a half minutes! If I recall, at the time I thought my computer locked up.

But eventually, I came to see that’s what the situation was. I still couldn’t believe it. So I took the whole issue over to www.daniweb.com and eventually Vijayan wrote me that C++ program using the C++ Std. Lib’s String Class which beat all my programs and the PowerBASIC program by a handy margin – like about twice as fast. That program – as modified by me, I posted previously. It was coming in around 90 ticks, which was about twice as fast as the PowerBASIC program.

If you take the time to look at the timings of my String Class as compared to the Std. Lib’s String Class using Bruce Eckel’s ReplaceAll() function, you’ll see something really, really strange. My String Class with its Replace() member is only taking like 300 ticks out of the 16,000 total to do the substitution of Ps with PUs. The ReplaceAll() function of Eckel’s is using almost its whole 273,000 ticks stuck in that one operation. Everything else its doing almost instantaneous. And my program is using almost its entire 16,000 ticks doing operator+ calls, i.e., string concatenation. The Std. Lib’s String Class is doing the whole concatenation at the end in an unbelievable 30 ticks!!!! So as you can see the behaviour of both these string classes is radically, radically different. To this day I haven’t been able to figure out how the Std. Lib’s String Class is doing that final operator+ concatenation at the end in 30 ticks. I’m in awe of it. The conclusion I finally came to is that its just magic, pure and simple and unexplainable.

It was only after all that thrashing around that I developed the C programs I posted two days ago with timings coming in around 30 to 50 ticks that kind of pushed the limits of GetTickCount()! So the way it finally ended up is that using pointers and low level memory access with the C Std. Lib’s ‘String Primitives’, i.e., strcpy(), strcat(), etc., I was able to get equivalent timings in the 30 – 40 ms range with both PowerBASIC and C.

All kidding aside (about the magic part), I put all kinds of efforts, for many weeks, trying to figure out how to get my operator+ overloads down to times approaching the Std. Lib’s string class, but I never could put any kind of dent in it. My code is taking 16,000 ticks to do what the Std. Lib’s string class is doing in 30 ticks. Of course, my code beat the Std Lib’s string class by a factor of 16 times in that total algorithm, but the reason it did this was because of Eckel’s horrendous ReplaceAll() function.

I tried inlining my operator+ members, various uses of the const parameter modifier here and there – everything I could think of. The only thing I didn’t try was deciphering the bstring and string open source code to see how it works. The only thing I’m pretty sure about is that to do what its doing its not actually making the operator+ overload member calls. By that I mean I think the compiler is actually optimizing them all away in the same way it optimized away the original loops of the test program I first described where it factored apart a numerical sequence to arive at a predetermined answer without actually doing any loop code. Recall where I described that where an asm guru disassembled the exe to see what it was doing?

Anyway, if I’m right about that, and I think I am, then that certainly says something of the calibre of coder who put that string class together in such a way that he/she knew the algorithm would tie into compiler optimization routines that would directly feed into C or asm string primitives and completely bypass member function calls. That’s why I essentially gave up trying to get my timings down any further on my operator+ member functions.
So do you see now why 15-16 ticks timer resolution isn’t a problem!?! In case you haven’t looked, that code took over 273,000 ticks to run the algorithm!
Err... My criticism was about how you benchmarked the code that does run in ~100 ms.

EDIT: This runs in 3.88 ms on my computer. The wide version takes only 107% as long.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
#include <iostream>
#include <string>
#include <ctime>

#if 1
typedef char char_t;
#else
typedef wchar_t char_t;
#endif

typedef std::basic_string<char_t> str_t;

const size_t size = 1 << 21;

int main(){
	clock_t t0 = clock();
	str_t string;
	size_t size2 = size * 8 / 7;
	size2 = size2 + size2 / 90 * 2;
	string.reserve(size2);
	const unsigned N = 1000;
	for (int n = 0; n < N; n++){
		string.resize(size, '-');
		char_t *p = &string[0];
		size_t length = string.size();
		for (size_t i = 6; i < length; i += 7)
			p[i] = 'P';
		length = length * 8 / 7;
		string.resize(length);
		p = &string[0];
		for (size_t i = length; --i;)
			p[i] = p[i - i / 8];
		for (size_t i = 7; i < length; i += 8)
			p[i] = 'U';
		
		length += length / 90 * 2;
		string.resize(length);
		p = &string[0];
		for (size_t i = length; --i;)
			p[i] = p[i - i / 92 * 2];
		
		for (size_t i = 90; i < length; i += 92)
			p[i] = 13;
		for (size_t i = 91; i < length; i += 92)
			p[i] = 10;
	}
	clock_t t1 = clock();
	std::cout <<double(t1 - t0) / (double)CLOCKS_PER_SEC * 1000.0 / N<<std::endl;
}


EDIT 2: Fixed a couple errors.
Last edited on
Oh No! Now I've got someone else hooked on this! Turn back before its too late! This is very addictive! You could even lose your job by spending all your time on it. And Helios, you're doing way too much work! Exactly ...

1<<21 - 2000000

:)
That's just how confident I am in my skillz. I'm giving you 97152 characters of handicap. :)

By the way, what's the deal with this PowerBASIC thing? Paying for languages? In 2014?
Last edited on

That's just how confident I am in my skillz.


I don't have time to work on this right now, but I did a quick test and you've got the fastest version so far. On the old but good XP laptop on which I did the other timings I'm getting around 21 ms with your original code above without your later changes. My fastest PowerBASIC version seems to be coming in around 45 ms. Actually, the fastest code I had so far was a PowerBASIC version.

You know, your posting got me to thinking about an aspect of this that I hadn't really dwelled upon until now, but when I was actually working on this stuff about four years ago (Oh, why did I dredge nit up?), I was really just looking for a high level C++ replacement for the PowerBASIC Replace statement. And since no one has provided one yet I'm assumming there just isn't one. In other words, one has to resort to low level manipulatios with pointers (like I've done), or essentially the equivalent with various methods of the basic_string or std::string classes like you and Vijayan did. Lots of different methods can be tried, and what you just provided may indeed be the fastest possible. But what got lost in the shuffle is that none of the stuff I did with the C string primitives, or you or Vijayan with basic_string or std::string stuff addresses the fact that the C++ Standard Library's string class doesn't seem to have a simple and easy to use high level implementation of a simple Replace() method.

Note one very simple and easy to grasp implication of this. In that algorithm we've been playing with, the Ps are every seventh character. And everything else is at predetermined places too. But what if the Ps were randomly scattered about, i.e.,

---P---------------------PP--P-------------------------------------P

All of a sudden it gets trickier, doesn't it?

And I just had one truely horrible thought. When I was writting the above about the necessity for a general purpose routine, I wondered if in my String Class Replace() method I had covered the base of where the replacement string is smaller than the match string, i.e. replace

I Love PowerBASIC I Love PowerBASIC I Love PowerBASIC.....

with

I Love C++ I Love C++ I Love C++ .....

So S***! Now I've got to check that out! One more thing to do.

About PowerBASIC. I spend about half my time coding in PowewrBASIC and the other half in C++. Lately I've been moving more to C++ because I'm interested in 64 bit development, and PowerBASIC doesn't have a 64 bit compiler. What's worse, the genius creator of the language (Bob Zale) passed away about two years ago, and the company is in turmoil. So I don't know if they ever will develop a 64 bit compiler. The compiler is awesome. They sell them and people buy them. They have no trial versions or anything like that. Its written in assembler and runs through like 20 million lines per minute (on my slow machines).

You can't touch the executable sizes it produces with anything done in C++. Might I point out your code is coming in 468 K? The original John Gleason implementation of that algorithm in PowerBASIC was 13 K, as was the high performance one I posted previously which you just beat.

Couple years ago I spent about 6 months building an ActiveX Grid Control using PowerBASIC. Its a full featured grid control with self-registration as COM components are supposed to do. I built it all using low level code, built the virtual function tables by hand using low level C type structures, and so on. It compiled to 49 K. Using UPX executable packer it packs to 22K.

Last winter for a project I decided to convert it to C++ and make an x86 and x64 version. I used VC9 C++. I thought of doing it in C but what held me back is that there is a lot of string parsing code that would have been horrible in C. And of course I had my own C++ String Class so that's what decided me on C++ instead of C. It ended up around 100K in both x86 and x64, but the nice folks at UPX have an x64 exe packer and that brought it doen to 43K. That's really better than I expected. And to get it that small I had to use a special build of my String Class where I eliminated everything I didn't need. So I have my reasons for using PowerBASIC. Actually, they have a rather humorous (if you aren't on the receiving end of the barb) meta-statement named...

#BLOAT

It goes like this...

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#BLOAT metastatement
Purpose
 Artificially inflate the disk image size of a compiled program.
 
Syntax
 #BLOAT size_expression
 
Remarks
 #BLOAT allows the creation of artificially bloated program files on disk, in order to match or exceed that generated by competing "BloatWare" compilers.  

#BLOAT does not affect the memory image size (running size) of a compiled program.
 
size_expression
 The size_expression parameter is a simple Long-integer expression that specifies the total desired size of the compiled programs disk image, but is ignored if it is smaller than the actual program size.  

#BLOAT uses sections of the actual compiled code to fill and obfuscate the portion added to the file.

While #BLOAT adds no true merit to the technical efficiency of the compiled code, there are a number of reasons for its use, including:

To allow "BloatWare" programmers to feel more comfortable when using PowerBASIC.

To impress project leaders/managers with the volume of executable code created.

To allay the fears of uninformed customers who may mistakenly infer that "such tiny programs couldn't possibly do everything that..."

To make certain versions of a program more readily identifiable simply by examining the size of the file on disk.

To improve convolution of the contents of the executable disk image, because the bloat region appears to contain executable code.
 
See also 
 #COMPILE EXE, #OPTIMIZE
 
Example
 #BLOAT 1024 * 1024 * 4  ' Create a 4 MB EXE file 

I thought this thread was about wide vs. narrow operations.

Might I point out your code is coming in 468 K?
Wrong. You're pointing out the size of the (static) executable your compiler is capable of producing. Link dynamically and then we'll talk.

The original John Gleason implementation of that algorithm in PowerBASIC was 13 K, as was the high performance one I posted previously which you just beat.
Does PowerBASIC install any runtime libraries in the system? If so, then the comparison isn't really fair.
Also, it doesn't look like PowerBASIC targets PICs or any sort of embedded systems. Who cares about executable size?

Couple years ago I spent about 6 months building an ActiveX Grid Control using PowerBASIC. Its a full featured grid control with self-registration as COM components are supposed to do. I built it all using low level code, built the virtual function tables by hand using low level C type structures, and so on. It compiled to 49 K. Using UPX executable packer it packs to 22K.
You should write a domain-specific language to automate generation of class definitions and object initialization code.
Oh, wait.

By the way, the .NET version of my code above compiles to 5 KiB and runs only 16.5x slower (although I should note that I don't know how to optimize .NET).
Topic archived. No new replies allowed.
Pages: 12