I was messing with this a few weeks ago as this question seems to come up a lot...
This was where i finally ended up. I was trying to see if I could beat a dumb loop for fun (nope, its about the same performance). I don't even know if it has an undefined activity in it, but it worked on g++. This can do up to 2 to the 64th which is the biggest you can fit in 64 bit int, so I stopped there. It may not give the correct answer for 0 to the 0 or other oddities.
//a cleanish version that wastes a ton of memory
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

long long ipow(long long b, unsigned long long e)
{
//13 multiplies and 3 jumps works for 64 bit ints.
//2^64 is the largest that will fit in a 64 bit int.
//1^any and 0^ any are handled.
long long pwrs[65]; //wastes a lot of space, improve?
if(e > 64) return 0; //won't fit!
if(e == 0) return 1; //shortcuts for trivial cases
if(e ==1  b == 1  b==0 ) return b;
long long result = 1;
pwrs[0] = 1;
pwrs[1] = b;
pwrs[2] = b*b;
pwrs[4] = pwrs[2]*pwrs[2];
pwrs[8] = pwrs[4]*pwrs[4];
pwrs[16] = pwrs[8]*pwrs[8];
pwrs[32] = pwrs[16]*pwrs[16];
pwrs[64] = pwrs[32]*pwrs[32];
result *= pwrs[1&e];
result *= pwrs[2&e];
result *= pwrs[4&e];
result *= pwrs[8&e];
result *= pwrs[16&e];
result *= pwrs[32&e];
result *= pwrs[64&e];
return result;
}
 
//an unclean version that does not.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

long long ipow4(long long b, unsigned long long e) //fast
{
long long result = 1;
long long p = b;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
result *= (int)(( e&1)*p  (e&1)^1); p *= p; e >>= 1;
return result;
}
 
give that this mess is a wee bit cryptic, let me make a few notes for you..
observe that binary and powers play off each other. That is, observer that something cubed is raised to the 3 which is 11 in binary. Note that x * x squared is x cubed. Note that in binary, the 0/1 position can represent x and the 2's position represents x squared and the 3s position represents x to the 4th.. so the binary representation of the POWER can be used to construct the value. That is what I am doing to reduce the # of multiplications. So instead of 2 to the 32 (32 multiplications) you only need lg(32) which is 6 multiplications. However rather than try to figure out when we were done, I just did enough multiplies to cover up to 64 power (or 127, actually) and when there is no work to do it multiples by 1. The bitlogic in the 2nd version is almost entirely dedicated to this multiply by 1. In hindsight that was a lot of trouble.
both of these are faster than pow, but you won't notice this until you do many millions of calls.
you can put the code above in loops, of course. I tried that and it ran slower, and I could not figure out what I did wrong and since I was just playing, I gave up and unrolled by hand.
finally, note that you can change the return type and base to double (or any other type that supports a multiplication, like complex) and it will work just fine for the top version, the key is the integer POWER, that cannot be changed to another type for this logic. The bottom version not so good for that.