Why is this so slow? Is it because of virtual functions?

void Render(const Renderable& obj)
// An instance of Sprite gets passed in, which is derived from Renderable
{
  sprite->Draw(tcr.GetDX9Texture(2));
  sprite->Draw(tcr.GetDX9Texture(obj.GetTextureID())); // Three times slower
  sprite->Draw(tcr.GetDX9Texture(obj.GetTexture().GetID())); // Also three times slower
}

// virtual
int Sprite::GetTextureID() const
{
  return texture.GetID(); // const Texture& texture;
}

int Texture::GetID() const
{
  return id; // int id
}

IDirect3DTexture9* TextureCache::GetDX9Texture(int id)
{
  return textures[id]; // Array of IDirect3DTexture9*
}


Are virtual function calls really THAT expensive, or is it something else? I have no idea...

Oh btw, I am absolutely horrible when it comes to architecture/design, that's what I'm currently experimenting with, so yes, the design could probably be done better. I could save the pointer directly in obj, that has almost the same speed as putting in the value directly, but then it needs to know about DirectX. Is it not possible to do it efficiently in an abstract way like this, somehow?

I would post some more code, but somehow this forum is broken for me, can't preview my post and can't use any of the buttons for code tags etc...
This part of the code does not seem to be an issue.
Virtual function calls aren't that expensive (draw calls are certainly A LOT more expensive).
Try profiling or posting more code.
What's in your game loop?

EDIT: just noticed comments.
How are you calculating this "Three times slower"?
Virtual functions add a level of indirection, i.e. a pointer to a pointer.

You have to call a virtual function a lot to see a significant difference, and then the number of calls is the real problem.

Especially when it comes to drawing, there is a high risk of making unnecessary calls.
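
To make the indirection concrete, here is a minimal sketch of a call through a base-class interface. The `Renderable` and `Sprite` types below are simplified, hypothetical stand-ins for the classes in the thread, not the poster's actual code. In typical implementations each `GetTextureID()` call loads the object's vtable pointer, then the function pointer from that table, and then makes an indirect call; the C++ standard does not mandate this layout.

```cpp
#include <vector>

// Hypothetical stand-ins for the thread's Renderable/Sprite classes.
struct Renderable {
    virtual ~Renderable() = default;
    virtual int GetTextureID() const = 0;  // resolved at run time
};

struct Sprite : Renderable {
    explicit Sprite(int id) : textureId(id) {}
    int GetTextureID() const override { return textureId; }
    int textureId;
};

// Sums the IDs through the base-class interface: one virtual call per object.
long long SumTextureIDs(const std::vector<const Renderable*>& objects) {
    long long sum = 0;
    for (const Renderable* obj : objects) {
        sum += obj->GetTextureID();
    }
    return sum;
}
```

With 1000 objects per frame that is 1000 extra indirect calls, which is small next to the work a draw call does.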
The DirectX device interface uses virtual methods.
I just suspect he's incorrectly using the FPS as a benchmark (in case you are, consider using per-frame draw time instead, in milliseconds).
Quick googling: http://renderingpipeline.com/2013/02/fps-vs-msecframe/
I tried uploading my code to GitHub, but it's way too complicated, I will have to spend a few weeks trying to figure their system out...
So instead here's a zip file:
http://s000.tinyupload.com/?file_id=00806869705664998538
The code from above is in Renderer.h, Render().

By three times slower I mean that FPS drops from 1400 to about 500 when using indirection.
I measure it by literally just counting fps by increasing a variable after every render, and after a second display it and reset it to zero.

Maybe it's because of inlining, didn't bother splitting it up into cpp files yet and IIRC everything in header files gets inlined automatically, right?

I also tried profiling with the VerySleepy profiler, but I cannot interpret it correctly. There are a lot of function calls that I have no idea what they're doing. Like WaitForSingleObject seems to take up 40% of the time, which I don't use in my code, but seems to get called by Windows or DirectX, who knows...
I wish I could use the easier VisualStudio Profiler, but I use the free express version, which sadly does not have it.
WaitForSingleObject is used for mutexes and semaphores. If you are not using them, it's your drivers that do that, and you should not worry.
I repeat myself: measuring performance using FPS is wrong.
What's the average time a frame draw takes?

1400 fps: 0.71ms
500 fps: 2ms
60 fps: 16ms

As you can see, it only took about a millisecond more (debug information is probably one of the causes: try compiling in release mode).
There's still a lot of room left before you hit the 16 ms per frame required for 60 fps.
But isn't ms just the inverse of FPS?
FPS = 1000 / msPerFrame
ms = 1000 / FPS
If a frame takes 20ms to render, it means in one second I can render 50 frames, or 50 FPS.
If my program can output 1500 FPS, it takes 0.66ms to render one frame.
If with one method I have 500 FPS, and the other 1500 FPS, my program runs 3 times slower. How I measure it doesn't matter. If it takes 2ms with one method and then 0.66ms with the other, it is still 3 times slower.
But of course generally you are right, I should switch to ms. With FPS counting I can only see how fast the whole process of rendering one frame is, not individual parts inside.

Anyways, I now put measuring code (nanoseconds, not FPS) around the call to sprite->Draw(), and it seems to "only" run 40% slower, instead of 3 times slower. My guess is that it may be only because of the additional measuring...
But if that is not the problem, why does my program suddenly run so much slower, when I change this ONE line?

I tried it both in release and debug mode, by the way. Same slowdown.
I managed to upload the code to GitHub now, in case someone wants to take a quick look without downloading something:
https://github.com/TheHorscht/test/blob/master/Renderer.h#L113
One line below is the other method that's faster.
It has nothing to do with the extra calculations. You shouldn't develop a game with FPS in mind, but with ms per frame in mind, since it is more accurate. FPS is inaccurate precisely because it's an INVERSE. Read the article I linked above.
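
A small sketch of why the inverse is misleading: the same fixed cost per frame removes a very different number of FPS depending on the starting frame rate. The function names are made up for this illustration.

```cpp
// Frame time in milliseconds for a given frame rate.
constexpr double MsPerFrame(double fps) { return 1000.0 / fps; }

// Frame rate that results when a fixed extra cost (in ms) is added to every frame.
constexpr double FpsWithExtraCost(double baseFps, double extraMs) {
    return 1000.0 / (MsPerFrame(baseFps) + extraMs);
}
```

`FpsWithExtraCost(1000.0, 1.0)` gives 500, a loss of 500 FPS, while `FpsWithExtraCost(100.0, 1.0)` gives roughly 90.9, a loss of about 9 FPS, although the added work is the same 1 ms in both cases.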

What is in your main loop, how many objects are you drawing?
I did read the article, but I didn't really "get it" the first time, I'll read it again later.

The reason why I'm looking at the FPS is because of someone on YouTube, who is doing a tutorial on how to make a game engine and he is displaying the FPS and I want to compare my program, at least a little bit, just to know if I'm on the right track, which I seem to be not, since I get much much lower FPS. (Or inversely, each frame takes much longer to render).
I'm trying to render 1000 sprites at the moment, through the D3DXSprite interface, which if I understand correctly is a utility class for batch rendering. So the call to sprite->Draw shouldn't actually draw it, just queue it or something.

This is my main loop ( https://github.com/TheHorscht/test/blob/master/main.cpp#L68 ):
renderer.Begin();
for (unsigned int i = 0; i < vec.size(); i++)
{
	vec[i]->SetRotation(vec[i]->GetRotation() + 0.001f);
	renderer.Render(*vec[i]);
}
renderer.End();
fps++;
std::wstringstream ss;
ss << L"FPS: " << fps;
if (timerFPS.GetElapsedSeconds() >= 1)
{
	window.SetTitle(ss.str());
	fps = 0;
	timerFPS.Reset();
}



EDIT:
I've read the article again, I think I kind of understand it now, but in the article the post processing effect adds a constant time to the total time it takes to render one frame, while in my program the time is proportional to the number of sprites.
So in this simple case it should not matter, but I'm going to use ms(or nanoseconds) instead of FPS in future measurements.
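
A minimal sketch of measuring frame time directly instead of counting frames, assuming C++11 `std::chrono` is available. `FrameTimer` is a hypothetical helper, not the `Timer` class used in the thread.

```cpp
#include <chrono>

// Accumulates per-frame durations and reports the average in milliseconds.
class FrameTimer {
public:
    using Clock = std::chrono::steady_clock;

    void BeginFrame() { start_ = Clock::now(); }

    void EndFrame() {
        total_ += Clock::now() - start_;
        ++frames_;
    }

    // Average frame time in ms since the last Reset(); 0 if no frames were timed.
    double AverageMs() const {
        if (frames_ == 0) return 0.0;
        return std::chrono::duration<double, std::milli>(total_).count() / frames_;
    }

    void Reset() {
        total_ = Clock::duration::zero();
        frames_ = 0;
    }

    int Frames() const { return frames_; }

private:
    Clock::time_point start_{};
    Clock::duration total_{Clock::duration::zero()};
    int frames_ = 0;
};
```

In a loop like the one above, `BeginFrame()` would go before `renderer.Begin()` and `EndFrame()` after `renderer.End()`, with `AverageMs()` shown once per second before calling `Reset()`.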

I just tested some more and I think I found what caused the huge slowdown:
std::vector's accessors...
I changed it to a c-style array and now both methods are running at the same speed.
I knew vector has some overhead because of boundary checking, but THIS MUCH? Wow...
There is no bound checking in operator[] (there is in member function at() tho).
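
A minimal sketch of that difference: `at()` is required to throw `std::out_of_range` for an invalid index, while `operator[]` with an invalid index is undefined behavior and is not required to check anything. `AtThrows` is a hypothetical helper written for this illustration. Some standard library implementations add their own checks to `operator[]` in debug builds, so debug-mode timings may not reflect this.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if v.at(index) rejects the index by throwing std::out_of_range.
bool AtThrows(const std::vector<int>& v, std::size_t index) {
    try {
        (void)v.at(index);  // checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For a three-element vector, `AtThrows(v, 2)` is false and `AtThrows(v, 3)` is true.
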
What if you used iterators?

for (auto it = vec.begin(); it != vec.end(); ++it)
{
    (*it)->SetRotation((*it)->GetRotation() + 0.001f);
    renderer.Render(**it);
}

Please also note that using a vector of pointers for such operations is slower than a vector of objects due to CPU/RAM caching (this might even BE your issue. noncached memory access is incredibly slow)
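
A rough sketch of the two layouts being compared. `SpriteData` is a hypothetical plain struct used only for illustration. In the first overload the elements sit next to each other in one allocation; in the second, each element is a separate heap allocation that has to be reached through a pointer, so consecutive elements may be far apart in memory.

```cpp
#include <memory>
#include <vector>

// Hypothetical plain data for one sprite.
struct SpriteData {
    float rotation = 0.0f;
    int textureId = 0;
};

// Contiguous storage: the loop walks one block of memory front to back.
void RotateAll(std::vector<SpriteData>& sprites, float delta) {
    for (SpriteData& s : sprites) {
        s.rotation += delta;
    }
}

// Indirect storage: every element is dereferenced through its own pointer.
void RotateAll(std::vector<std::unique_ptr<SpriteData>>& sprites, float delta) {
    for (auto& s : sprites) {
        s->rotation += delta;
    }
}
```

Note that a vector of objects cannot hold different derived types through a base class (the derived part would be sliced off), so this layout only works when all elements have the same concrete type. Whether the difference is measurable for 1000 sprites would have to be profiled.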
When I "fixed" the problem I apparently did something wrong again, because now even with a c-style array it runs just as slow as with a vector.

Btw I just noticed that I lied in my original post... oops, sorry :(
I was being sloppy and thought "Ah whatever it's basically an array".
And to keep it simple I didn't mention I use std::pair too.
The container I'm getting my values from is actually a vector of pairs.
IDirect3DTexture9* TextureCache::GetDX9Texture(int id)
{
  return textures[id].second; // std::vector<std::pair<std::unique_ptr<Texture>, ComPtr<IDirect3DTexture9>>> textures
}


Oh man this is way too complicated hahaha...
I'm just trying to map an "id" or index to a texture pointer, so that the user of "Texture" does not have to know anything about DirectX or OpenGL or whatever. Instead of holding the actual IDirect3DTexture9* which could maybe change to something else, it instead should just say "I'm using Texture #1" and let the TextureCache class figure out which one is #1 (basically just textures[1]). Kind of like how indexed bitmaps work.
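
The index-to-pointer mapping described above could be sketched roughly like this. `BackendTexture` is a hypothetical stand-in for an API-specific type such as `IDirect3DTexture9`, and `TextureHandle`, `Add` and `Resolve` are names invented for this sketch; the cache here does not own the textures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a backend resource such as IDirect3DTexture9.
struct BackendTexture {
    int width;
    int height;
};

// Opaque handle the rest of the engine stores; it names a slot, not a pointer.
struct TextureHandle {
    std::size_t index;
};

class TextureCache {
public:
    // Registers a backend texture and returns the handle for it.
    TextureHandle Add(BackendTexture* texture) {
        textures_.push_back(texture);
        return TextureHandle{textures_.size() - 1};
    }

    // Resolves a handle to the backend pointer; nullptr if out of range.
    BackendTexture* Resolve(TextureHandle handle) const {
        return handle.index < textures_.size() ? textures_[handle.index] : nullptr;
    }

private:
    std::vector<BackendTexture*> textures_;  // non-owning in this sketch
};
```

Only the renderer calls `Resolve`, so code that holds a `TextureHandle` never needs to include the graphics API headers.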

For iterating over the renderables in my main loop, I first used iterators, but then tried if using direct access would be faster and for some reason I think it was, otherwise I would have gone back to iterators. I don't even know anymore... (Edit: I just tried it and it seems in debug mode, using iterators is much slower, but in release both are identical.)

Thanks for the tip with CPU/RAM caching, didn't know that before. I will have to experiment with that, even though it's super confusing because of all the copying and moving when inserting actual objects into STL containers. emplace_back doesn't seem to work like it should in Visual Studio. Instead of constructing the objects with a passed in initializer list itself it just uses the move constructor. At least that's how I understood it. (Edit: And I was wrong again, sorry! Seems to be from an older version of VS).
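
A small sketch for checking this behavior on a given compiler, assuming a standard library with full C++11 support. `Tracked` and `Counts` are hypothetical helper types that count which constructors run.

```cpp
#include <vector>

// Counters shared by all Tracked objects created in one experiment.
struct Counts {
    int constructed = 0;
    int moved = 0;
    int copied = 0;
};

// Records which of its constructors was used.
struct Tracked {
    Counts* counts;
    int value;
    Tracked(Counts* c, int v) : counts(c), value(v) { ++counts->constructed; }
    Tracked(const Tracked& o) : counts(o.counts), value(o.value) { ++counts->copied; }
    Tracked(Tracked&& o) noexcept : counts(o.counts), value(o.value) { ++counts->moved; }
};

Counts CountEmplace() {
    Counts c;
    std::vector<Tracked> v;
    v.reserve(1);            // rule out moves caused by reallocation
    v.emplace_back(&c, 42);  // forwards the arguments and constructs in place
    return c;
}

Counts CountPush() {
    Counts c;
    std::vector<Tracked> v;
    v.reserve(1);
    v.push_back(Tracked(&c, 42));  // builds a temporary, then moves it in
    return c;
}
```

On a conforming implementation `CountEmplace()` reports one construction and no moves, while `CountPush()` reports one construction plus one move.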
SGH wrote:
There's still a lot of room left before you hit the 16 ms per frame required for 60 fps.


This.

Horscht: You seem to be getting your panties in a bunch over nothing. 2 ms per frame is incredibly fast for all practical purposes. Your code can take 7x as long as it does now and still run at full speed. I'm not sure why you feel the need to optimize it.

On top of that, you are trying to micro-optimize, which is even worse. Things like how you access elements in a container ([] operator or iterators?), whether or not you use virtual functions -- these things never matter. This is not where your performance problems are coming from. You are completely wasting your time and energy focusing on the most efficient way to access this information. At best you'll shave a microsecond off your frame time so instead of 2 ms per frame, you'll run at 1.999 ms. Seriously, this isn't worth it.


Are you even running this with optimizations turned on? A lot of this might be debugger overhead.

Okay I read another post here and saw that you did try it with optimizations.
Of course it would be enough if I could run at a constant 200fps, but that's not the point of this thread. I'm trying to figure out why simply getting some value from somewhere else, instead of passing it in right on the spot, is about 2-3 times slower. That is hardly micro-optimizing. If my code just ran 1% slower I wouldn't care, but the reason why I started this thread is because I absolutely do not understand why changing one line slows down my program by a factor of THREE. I'm just trying to learn, not create the world's fastest render engine :) So finding out WHY is more important to me than actually getting faster code.

My main reason for doing this "Render Engine Thing" is because I am absolutely horrible at software design/architecture, separating everything into classes, keeping coupling low, high cohesion and all that fancy stuff :)
How to pass stuff around, which objects should belong where etc...
And that's why I need to know why this:
 
  sprite->Draw(tcr.GetDX9Texture(obj.GetTextureID()));

Is SOOOOO much slower than this:
 
  sprite->Draw(tcr.GetDX9Texture(2));


Oh and btw, I just tried messing around with actual objects instead of smart pointers, and well, it's going just as expected, it's an absolute nightmare thanks to error messages from STL being absolutely cryptic and my objects get moved around and copied etc without telling me why or where. It's just crashing in Render(const Renderable& obj) because obj is apparently not valid or something, I have no idea anymore hahaha, this is way over my head :D
Well also consider you're calling that function 1k times. This means it's taking one microsecond each call, on average.

Side note, what happens if it's not a virtual function but a regular one?
Doesn't seem to make much of a difference if it's virtual or not.
I'm not even sure if I'm measuring it right, it would probably be much easier with a good profiler, but I'm stuck with using free software.
I tried VerySleepy and AMD's CodeXL but don't know how to interpret the data correctly yet.

This is how I measure:
static Timer secondTimer;
static long long avg1 = 0;
static long long avg2 = 0;
static int count = 0;
int randnr = rand() % 2 == 0 ? 1 : 0;
Timer tmr1;
auto d = textureCache->GetDX9Texture(randnr);
avg1 += tmr1.GetElapsedNanoseconds();
tmr1.Reset();
auto c = textureCache->GetDX9Texture(obj.GetTextureID());
avg2 += tmr1.GetElapsedNanoseconds();
count++;

if (secondTimer.GetElapsedSeconds() >= 1)
{
	std::stringstream ss;
	ss << "GetDX9Texture(randnr): " << avg1 / count << '\n';
	ss << "GetDX9Texture(obj.GetTextureID()): " << avg2 / count << '\n';
	OutputDebugStringA(ss.str().c_str());
	secondTimer.Reset();
	avg1 = 0;
	avg2 = 0;
	count = 0;
}


When I change the code from:
int GetTextureID() const
{
	return 0;
}

to
int GetTextureID() const
{
	return texture.GetIndex();
}

It's definitely slower. Even though GetIndex() just returns a member.
The first takes about 380 nanoseconds, the second 440. About 15% slower.
Well, I have no idea anymore.
Maybe it has to do with CPU caching or something, like you said. Because it has to get 1000 values from maybe totally different areas in memory. Who knows, I don't.
I'll have to read some general guide on performance, what pitfalls there are etc.
I never paid attention to anything like that, until I tried to write this Render Class, and after just a small change suddenly it runs much slower, I was like WHAAAT? Whyyyy? I don't get it...
Anyways, thanks for your input, I'm gonna take a break from this problem for a while :)
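
On the measurement above: readings of a few hundred nanoseconds for a call that only returns a member suggest that much of the measured time may be the cost of reading the timer itself. A common alternative, sketched here under the assumption that `std::chrono` is available, is to time many calls together and divide. `AverageNanosPerCall` is a hypothetical helper, not part of the thread's code.

```cpp
#include <chrono>
#include <cstdint>

// Times `iterations` calls of `fn` as one batch and returns the average
// cost per call in nanoseconds. `fn` must return something convertible to int.
template <typename Fn>
double AverageNanosPerCall(Fn&& fn, std::int64_t iterations) {
    using Clock = std::chrono::steady_clock;
    volatile int sink = 0;  // the volatile store keeps each result observable
    const auto start = Clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        sink = fn();
    }
    const auto elapsed = Clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() /
           static_cast<double>(iterations);
}
```

The timer is read only twice per batch, so its overhead is spread over all iterations. An optimizing compiler may still inline or simplify a trivial `fn`, so results for very small functions should be treated with caution.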