What is the execution time / speed of cmath library functions?

cmath has several functions such as sqrt(), cbrt(), sin(), cos(), exp(), log(), log10(), etc.

I want to know the execution time of these functions.
It depends, but they almost always map to CPU instructions, so you are not likely to find a faster way to do them.

You would have to look at your PC's specs, then time them yourself.

Most, but not all, modern CPUs have basic math circuits in the FPU / CPU, and they run as fast as the hardware allows, probably under 10 CPU clock cycles for most of these; you can measure that value yourself. C++ compilers should use the direct CPU instructions for these calls with little to no overhead.
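To see that for yourself, here is a minimal sketch (the file name and flags are just an example, assuming g++ or clang++ on an x86-64 target); compile it and look at the generated assembly:

#include <cmath>

// compile: g++ -O2 -fno-math-errno -S sqrt_demo.cpp
// std::sqrt typically lowers to a single sqrtsd instruction on x86-64
// (without -fno-math-errno there may be an extra branch for the errno check)
double root(double x)
{
    return std::sqrt(x);
}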

One notable exception is pow(), which does too much work for small integer powers like squares and cubes. It's usually better to just write x*x inline instead of pow(x,2). There may be a couple more of these kinds of gotchas.
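A rough sketch of the kind of replacement meant here (whether the library pow() already special-cases small literal exponents depends on the compiler, so measure on your own setup):

#include <cmath>

inline double square(double x) { return x * x; }
inline double cube(double x)   { return x * x * x; }

double via_pow(double x) { return std::pow(x, 2); } // may go through the general pow() path
double via_mul(double x) { return square(x); }      // always just one multiply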
OK, I understand that execution time will depend on the specs of my CPU, but the number of clock cycles taken by code will always be the same. So please suggest some code or method to find the number of clock cycles taken by code (for either C or C++), so that I can determine the exact number of clock cycles taken by the math library functions.

How can I find the number of clock cycles taken by the math library functions?
the number of clock cycles taken by code will always be the same
Nope. Not even within the same architecture, necessarily. Intel doesn't even specify timings in the manuals for any instruction (that I can see).

Sorry, but you're not going to get what you're looking for. If you want to optimize a piece of code you'll need to time the whole thing (not instruction by instruction) and then refine it to shave time off. Nowadays there's no other way.
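For example, something along these lines (a minimal sketch; do_work() is just a stand-in for whatever code you actually want to optimize):

#include <chrono>
#include <cmath>
#include <iostream>

// stand-in for the code you actually care about
double do_work()
{
    double s = 0.0;
    for (int i = 0; i < 1000000; ++i)
        s += std::sin(i * 0.001);
    return s;
}

int main()
{
    auto start = std::chrono::steady_clock::now();

    volatile double r = do_work();   // volatile so the work isn't optimized away

    auto stop = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::cout << "took " << us << " microseconds (result " << r << ")\n";
    return 0;
}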
This gives me 20 clocks for sin() consistently on my machine. As noted above, you can't trust that as an absolutely exact value, but it gets you a data point. If you want it more exact, export the asm of your program and inject the timer code into it directly, avoiding all the overhead: just dump to 64-bit registers and subtract. At the end of the day, though, it's still not 100% assured, and it's going to vary across compilers, OSes, hardware, and so on.

#include <cstdint>
#include <cmath>
#include <iostream>
using namespace std;

uint64_t rdtsc()  //this is for x86 CPUs. for one call to a math function you can just use the low-order bits and ignore the high.
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (((uint64_t)hi << 32) | lo) - 20; //the rdtsc call itself takes about 20 clocks on my system.
}


int main()
{
	double r;   //note: build without optimization, or the compiler may fold sin() of a constant away entirely
	uint64_t ts, tf;
	ts = rdtsc();
	r = sin(0.1223334444);
	tf = rdtsc();
	cout << tf - ts << endl;
	return 0;
}


Honestly, I would just take it on faith that most of math is about as fast as it can be done on the machine. You can eke out some more speed in assembly, but most of that is culling overhead, not the actual computation part. The lower level you code at, the more overhead you can cut.
most of math is about as fast as it can be done on the machine.
...and then there is -ffast-math
Which, if I remember right (it's been a while), reorders the assembly to optimize register usage / overhead, sometimes better, sometimes not? It's always worth trying it both ways with that one.
https://gcc.gnu.org/wiki/FloatingPointMath
Interesting. It seems it's not merely instruction reordering or register allocation. It performs algebraic transformations (that are mathematically valid but not IEEE-compliant), more lax rounding, and makes more assumptions about certain values.
I thought it was mostly about using 64-bit operations rather than 80-bit, or somesuch.
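To give a flavour of it, here is a rough sketch based on that wiki page; each comment names the GCC sub-option (all pulled in by -ffast-math) that licenses the rewrite, and whether it actually fires depends on the compiler and version:

// compile e.g.: g++ -O2 -ffast-math -S fastmath_demo.cpp   (file name is just an example)
double scale(double x, double d)
{
    return x / d;               // -freciprocal-math: may become x * (1.0 / d)
}

bool is_nan(double x)
{
    return x != x;              // -ffinite-math-only: compiler may fold this to false
}

double sum4(double a, double b, double c, double d)
{
    return ((a + b) + c) + d;   // -fassociative-math: may be regrouped, e.g. (a + b) + (c + d)
}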
OP, another thing you can look at is writing the assembly to do the math yourself (these will be short little routines that just put the value in a register or on the FPU stack, invoke the circuit, and get the result back), and comparing it to cmath. Remember that you need to change base to do the logx functions, as the CPU probably only supports one log (probably lg?) and the others are obtained by change of base. Or change your math to use the other base yourself, or find another way (a lot of log10 and lg calls are not necessary, but rather 'shortcuts' to writing the code another way).
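In C++ terms the change of base is just a constant multiply; a small sketch, nothing x86-specific:

#include <cmath>
#include <iostream>

double log10_via_log2(double x)
{
    static const double k = 1.0 / std::log2(10.0);  // precomputed once
    return std::log2(x) * k;                        // log10(x) = log2(x) / log2(10)
}

double log_any_base(double x, double base)
{
    return std::log(x) / std::log(base);            // general change of base
}

int main()
{
    std::cout << log10_via_log2(1000.0) << ' '      // ~3
              << log_any_base(8.0, 2.0) << '\n';    // ~3
}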

As said in the other thread, what do you really want here? You want teraflops, we have CPUs that provide them. You want more, we have parallel computers that can do teraflops * N CPUs. C's tools are pretty darn good at keeping the basic stuff quick. If you are trying to find the language overhead vs the CPU capability, that has probably already been done, but it should be doable with the idea I just gave you.

The tricky part is really the time measurement: CPUs get interrupted, so wall clock is unreliable; clock ticks can be funky to work with; and pipelines can make back-to-back instructions appear to go faster than they did (like an assembly line, some CPUs can do the first part of the third computation while still doing the last part of the first one). Modern hardware is not as linear as it used to be, and that is good in that it's faster, but it's a real pain to get exact info. You have CPUs that change their speeds to save power, context switches from a bloated OS, caching, and so many other weird effects to sort through.
How can I find the number of clock cycles taken by the math library functions?

The reason you can't know the number of clock cycles is that it depends on so much. Is the code (and data) in the cache? What is the state of the instruction pipeline when you execute the code? How fast is the RAM? So the answer is always "it depends."

Why do you need to know the number of clock cycles? If you're trying to optimize the code then this isn't the right way to do it.
It also massively depends on the values and types of the arguments. Check out this implementation of pow() (technically, crpow(), recently standardized for C2x): https://bitbucket.org/MDukhan/crlibm/src/30bc7f5a18b8acacea71407709e039e701561bfa/pow.c
@dhayden

Yes, I'm trying to optimize the code. Can you tell me more about that? You said "this isn't the right way", so please tell me where I'm going wrong.
step one is research. Is the thing you are doing being done via the best possible algorithm? An optimized bubble sort still stinks, in other words :)

step two is research into the use case. how fast does it need to be to solve your problem?

step three is to profile your code to see where it spends time.

step four is where you finally start thinking about code. Look at the slow stuff from the profiler: is it doing dumb memory things? This is the #1 slowness I see in modern C++ code: memory abuse, ignorance of the cache, disregard for unnecessary built-in looping, … basically using the provided tools of the language to make slow code by not paying attention.

step five is to fix known gotchas. E.g. converting numbers to text is poorly implemented on most compilers, and pow() is slow for small integer powers, as we said. There are other things like that to watch for (see the sketch after this list).

step six is to re-profile and then consider where you can thread out slow spots to make better use of modern hardware. If you have existing threading, re-evaluate whether it is helping or not.

And that is just kind of the getting-started stuff. But most of the time you will stop at step two, because most of the time it's actually fast enough already.
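On the number-to-text gotcha from step five, here is a hedged sketch: std::to_chars (C++17, <charconv>) skips locales and allocation, so it is usually much faster than stringstream or to_string, but measure it on your own toolchain:

#include <charconv>
#include <system_error>   // std::errc
#include <cstdio>

int main()
{
    char buf[32];
    int value = 123456;

    auto [ptr, ec] = std::to_chars(buf, buf + sizeof buf, value); // no locale, no allocation
    if (ec == std::errc{})
        std::printf("%.*s\n", static_cast<int>(ptr - buf), buf);
    return 0;
}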
Just to add to jonnin's excellent list.

> step three is to profile your code to see where it spends time.
2.1 Your code under source control, such as git.
2.2 An automated test suite that you can run to verify any changes you make don't break stuff.
2.3 A real world data set to profile against. There's little point in profiling your test suite for example.

You profile the code with your real data.
Study the profile to identify where time is being spent.
Because your code is in git, you can branch and make changes without risk to your known good code.
You re-run your test suite to make sure you didn't break anything.
You re-run your profile, to see if you improved things.

Are you reading some large file, doing some calculation, then writing out another (perhaps similarly large) file? Unless your calculation is a monster, I/O will be your biggest time sink.

Will your users do anything different?
There's no point optimising a program which takes 60 minutes into one that takes 55 minutes. It's still "about an hour, I'm off to lunch". You need an order of magnitude improvement to turn "go to lunch" into "I'll get a coffee".

Or maybe you're at the other end, doing graphical scene rendering for a game. You want your 120 Hz refresh and you're a couple of ms off achieving that.
Let's add some perspective. Jonnin says his machine calculates sin(x) in 20 clock cycles. How fast is that? Stand up right now and look at your feet. In the time it takes light to bounce off your feet and enter your eye, your computer can compute sin(x). At 3 GHz it can compute sin(x) about 150 million times per second (3×10⁹ cycles per second ÷ 20 cycles per call).

Does your problem require that much computing power?

Most problems on the beginners forum can be solved lightning fast. Why don't you tell us more about the problem you're trying to solve. I think you'll find that a better approach will speed up the code far more than counting clock cycles.
Can anyone tell me in how many nanoseconds these functions will be executed?
Can anyone tell me in how many nanoseconds these functions will be executed?

You've already been given the answer to this: no, because it depends on the specs of the computer you're working on.
I feel shubham1355 is leading us towards this.
http://xyproblem.info/

Too focussed on the blade of grass to worry about the elephant standing on it.
I feel like they're just deliberately wasting our time. That last question indicates they haven't even bothered reading the explanations.