Odd behaviour with own sqrt() using gcc __asm__

Hi, I wrote a quick sqrt()-function:

double sqrt(double n){
    __asm__(
            "fsqrt\n"
            :"+t" (n)
            );
    return n + 1; //I'll explain this shortly
}


I also included <cmath>, and called std::sqrt() in main():

int main(){
    double d = 2;
    double n = sqrt(d);
    double m = std::sqrt(d);
    printf("n: %f, m: %f\n", n, m);
    return 0;
}


Interestingly enough, the output is:

n: 2.414214, m: 2.414214


My own function's output is used in both cases, as you can see. sqrt(2) should be 1.4142..., but my function adds 1 to it. However, when I ditch the variable d and just use the constant 2, like this:

    double n = sqrt(2);
    double m = std::sqrt(2);


Now, my output is:

n: 2.414214, m: 1.414214


As I'd expect it to be.
Is this a weird optimization g++ tries to do or what's going on?

Edit: formatting
Is there a sqrt(int) overload? (EDIT: Apparently in C++11: https://en.cppreference.com/w/cpp/numeric/math/sqrt)

Try sqrt(2.0)
Even putting a literal 2 in the function calls wouldn't make it call the correct functions for me. In release mode, it won't even compile without changing the argument double n to int n.

This works:

#include <cmath>
#include <cstdio>

double sqrt(int n)
{
	return n + 1;
}


int main() 
{
	int d = 2;
	double n = sqrt(d);
	double m = std::sqrt(d);
	printf("n: %f, m: %f\n", n, m);
	return 0;
}


However, I don't know why it feels the need to call the defined function when using a double.
@dutch

Huh, now the output is
n: 2.414214, m: 2.414214

Interesting. I just don't get why calling std::sqrt() compiles down to a call to my function. All I can think of is that gcc/g++ somehow recognizes that they're both square root functions and for some reason decides to use mine?
It can't recognize that your function is a "square root" function. You could change the asm to just add 1 and it would still be called.

It's common for <cmath> to dump its functions into the global namespace (sadly, the standard allows this, and it seems to be common practice).

I guess doing that means there's basically no difference between sqrt and std::sqrt, so your definition replaces that symbol.

As an experiment you could find your cmath file and temporarily edit it, removing the symbol dump. (I might try that.)
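A quick way to check this is to compare what the two names resolve to. A minimal sketch, assuming libstdc++'s <cmath>, which brings the C library's ::sqrt into std with a using-declaration:

#include <cmath>
#include <cstdio>

int main() {
    double (*global_one)(double) = ::sqrt;     // the sqrt in the global namespace
    double (*std_one)(double)    = std::sqrt;  // the double overload of std::sqrt
    // With this <cmath> the two names refer to the same entity, which is why defining
    // your own ::sqrt(double) changes what is reached through both of them.
    printf("same function: %s\n", global_one == std_one ? "yes" : "no");
    return 0;
}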
Objdump to the rescue, sort of.

Where the output is the same:
0000000000001183 <main>:
    1183:	55                   	push   %rbp
    1184:	48 89 e5             	mov    %rsp,%rbp
    1187:	48 83 ec 20          	sub    $0x20,%rsp
    118b:	f2 0f 10 05 8d 0e 00 	movsd  0xe8d(%rip),%xmm0        # 2020 <_ZStL19piecewise_construct+0x18>
    1192:	00 
    1193:	f2 0f 11 45 f8       	movsd  %xmm0,-0x8(%rbp)
    1198:	f2 0f 10 05 80 0e 00 	movsd  0xe80(%rip),%xmm0        # 2020 <_ZStL19piecewise_construct+0x18>
    119f:	00 
    11a0:	e8 b0 ff ff ff       	callq  1155 <sqrt>
    11a5:	66 48 0f 7e c0       	movq   %xmm0,%rax
    11aa:	48 89 45 f0          	mov    %rax,-0x10(%rbp)
    11ae:	f2 0f 10 05 6a 0e 00 	movsd  0xe6a(%rip),%xmm0        # 2020 <_ZStL19piecewise_construct+0x18>
    11b5:	00 
    11b6:	e8 9a ff ff ff       	callq  1155 <sqrt>
    11bb:	66 48 0f 7e c0       	movq   %xmm0,%rax
    11c0:	48 89 45 e8          	mov    %rax,-0x18(%rbp)
    11c4:	f2 0f 10 4d e8       	movsd  -0x18(%rbp),%xmm1
    11c9:	f2 0f 10 45 f0       	movsd  -0x10(%rbp),%xmm0
    11ce:	48 8d 3d 34 0e 00 00 	lea    0xe34(%rip),%rdi        # 2009 <_ZStL19piecewise_construct+0x1>
    11d5:	b8 02 00 00 00       	mov    $0x2,%eax
    11da:	e8 51 fe ff ff       	callq  1030 <printf@plt>
    11df:	b8 00 00 00 00       	mov    $0x0,%eax
    11e4:	c9                   	leaveq 
    11e5:	c3                   	retq   


And where it is different:
0000000000001183 <main>:
    1183:	55                   	push   %rbp
    1184:	48 89 e5             	mov    %rsp,%rbp
    1187:	48 83 ec 20          	sub    $0x20,%rsp
    118b:	f2 0f 10 05 8d 0e 00 	movsd  0xe8d(%rip),%xmm0        # 2020 <_ZStL19piecewise_construct+0x18>
    1192:	00 
    1193:	f2 0f 11 45 f8       	movsd  %xmm0,-0x8(%rbp)
    1198:	f2 0f 10 05 80 0e 00 	movsd  0xe80(%rip),%xmm0        # 2020 <_ZStL19piecewise_construct+0x18>
    119f:	00 
    11a0:	e8 b0 ff ff ff       	callq  1155 <sqrt>
    11a5:	66 48 0f 7e c0       	movq   %xmm0,%rax
    11aa:	48 89 45 f0          	mov    %rax,-0x10(%rbp)
    11ae:	f2 0f 10 05 72 0e 00 	movsd  0xe72(%rip),%xmm0        # 2028 <_ZStL19piecewise_construct+0x20>
    11b5:	00 
    11b6:	f2 0f 11 45 e8       	movsd  %xmm0,-0x18(%rbp)
    11bb:	f2 0f 10 4d e8       	movsd  -0x18(%rbp),%xmm1
    11c0:	f2 0f 10 45 f0       	movsd  -0x10(%rbp),%xmm0
    11c5:	48 8d 3d 3d 0e 00 00 	lea    0xe3d(%rip),%rdi        # 2009 <_ZStL19piecewise_construct+0x1>
    11cc:	b8 02 00 00 00       	mov    $0x2,%eax
    11d1:	e8 5a fe ff ff       	callq  1030 <printf@plt>
    11d6:	b8 00 00 00 00       	mov    $0x0,%eax
    11db:	c9                   	leaveq 
    11dc:	c3                   	retq   


Seems like when the output is the same, we can see two calls to <sqrt>, meaning that std::sqrt() ends up calling my sqrt(). When it is different, on the other hand, there is only one call. When given a constant, std::sqrt() seems to be calculated ahead of time (constant-folded), so there is no need to call it separately, and because there is no call, it never goes through my function. That's probably the reason that std::sqrt(2) worked for me.
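One way to see the folding for yourself — a minimal sketch, assuming g++ and objdump are available; the file name fold.cpp and the function names are just placeholders:

// Compile and inspect with: g++ -c fold.cpp && objdump -d fold.o
#include <cmath>

// std::sqrt(2) picks the integral overload, which libstdc++ defines inline in terms of
// __builtin_sqrt, so GCC can fold it to the constant 1.4142... at compile time.
double from_literal() { return std::sqrt(2); }

// With a runtime value nothing can be folded, so a call to sqrt (or a sqrtsd instruction,
// depending on options) is emitted instead.
double from_variable(double d) { return std::sqrt(d); }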

I found this, and I'm assuming this has something to do with the pre-calculated square root:
0000000000001226 <_GLOBAL__sub_I_sqrt>:
    1226:	55                   	push   %rbp
    1227:	48 89 e5             	mov    %rsp,%rbp
    122a:	be ff ff 00 00       	mov    $0xffff,%esi
    122f:	bf 01 00 00 00       	mov    $0x1,%edi
    1234:	e8 a4 ff ff ff       	callq  11dd <_Z41__static_initialization_and_destruction_0ii>
    1239:	5d                   	pop    %rbp
    123a:	c3                   	retq   
    123b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
@dutch
That makes sense.
The declaration/definition of sqrt in cmath looks like this. If I remove the using ::sqrt; line that pulls the C library's sqrt into std, I get ambiguous overload errors (see the small stand-alone example after the excerpt).

...
#include <math.h>   //ME: this declares sqrt, sqrtf, ..., and a quick peek
                    //    looks like it's declaring them globally
...

#undef sqrt      //ME: get rid of sqrt macro defined in math.h

...

//ME: in the std namespace
namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION

...

  using ::sqrt;  //ME: pull the C library's ::sqrt into namespace std

//ME: apparently these aren't always defined, but note they are
//    inline and use a gcc builtin (presumably calling the assembly
//    instruction like your code).

#ifndef __CORRECT_ISO_CPP_MATH_H_PROTO
  inline _GLIBCXX_CONSTEXPR float
  sqrt(float __x)
  { return __builtin_sqrtf(__x); }

  inline _GLIBCXX_CONSTEXPR long double
  sqrt(long double __x)
  { return __builtin_sqrtl(__x); }
#endif

//ME: an overload for when the input is an integer
  template<typename _Tp>
    inline _GLIBCXX_CONSTEXPR
    typename __gnu_cxx::__enable_if<__is_integer<_Tp>::__value, 
                                    double>::__type
    sqrt(_Tp __x)
    { return __builtin_sqrt(__x); }

...
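As for the ambiguous overload errors: without that using-declaration, std only has the float and long double overloads (plus the integer template), and a double argument converts equally well to either. A minimal stand-alone illustration — the demo namespace is made up, not the real <cmath>:

#include <cstdio>

namespace demo {
    inline float       sqrt(float)       { return 0.0f; }
    inline long double sqrt(long double) { return 0.0L; }
    // no sqrt(double): that's the overload the using-declaration normally supplies
}

int main() {
    double d = 2.0;
    // demo::sqrt(d);   // error: ambiguous -- double converts equally well to float and long double
    (void)d;            // suppress the unused-variable warning while the call above is commented out
    return 0;
}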

Dang, why is it allowed to do that? I mean, this is a fairly niche problem and easily avoidable by the programmer with a class or namespace, but wouldn't it make more sense to contain the whole standard library under std::? Is it just for C-compatibility reasons?

Anyway, thanks for helping me out here.
Just create your own namespace, job done.
Then you don't have to worry about std or global namespace pollution.
$ cat baz.cpp
#include <iostream>
#include <cmath>
#include <cstdio>
using namespace std;

namespace bembel {
double sqrt(double n){
    __asm__(
            "fsqrt\n"
            :"+t" (n)
            );
    return n + 1; //I'll explain this shortly
}

} // namespace

int main(){
    double d = 2;
    double n = bembel::sqrt(d);
    double m = std::sqrt(d);
    printf("n: %f, m: %f\n", n, m);
    return 0;
}
$ ./a.out 
n: 2.414214, m: 1.414214
Another option would be to put your function into a namespace that you define, then specifically scope that function when you need it.



I would expect the math library's sqrt to become the assembly sqrt instruction anyway. Once this is all working, let us know whether you actually gained anything?
Varies a bit, but here goes. One billion calls to sqrt, with cmath, my fsqrt and, as a third, sqrtsd:

Without optimization:

cmath:  4.658 s
fsqrt:  9.838 s
sqrtsd: 8.314 s

-O3:

cmath:  3.037 s
fsqrt:  6.520 s
sqrtsd: 3.402 s


So all in all, completely useless :D

I read https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi
and thought I'd check it out myself. The author claimed that math.h's square root was too slow for him and tried out 14 different sqrt implementations, finding that some were up to five times as fast as math.h's. Someone also recommended using sqrtsd instead of fsqrt, which is why I added it to my tests (a sketch of what that can look like follows the test code below). In any case, the article doesn't seem to hold up anymore.

The whole test code:
#include <cmath>
#include <cstdio>

int main(){
    double n = 2.0;
    for(int i = 1000000000; i >= 0; i--){
        n = sqrt(n);
    }
    printf("n: %f\n", n);
    return 0;
}


Replace sqrt with the different implementations.
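For reference, a minimal sketch of what the sqrtsd variant can look like — this assumes x86-64 and GCC extended asm, and the function name is just a placeholder, not the exact code benchmarked:

double sqrt_sd(double n) {
    // sqrtsd takes the square root of the low double in an SSE register;
    // "+x" asks GCC for an XMM register holding n, used as both source and destination.
    __asm__("sqrtsd %0, %0" : "+x"(n));
    return n;
}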
I can't see that being useful since a long time ago (about 1994, when all major PC CPUs started having an FPU built in with sqrt on chip). Before that, with soft floating point (e.g. the 286 era), it may have been possible to beat it by rolling your own? Or on some integer-based embedded system it could be true.
Currently, all you are doing is adding overhead to the FPU's function.
If you have ENOUGH of them to do back to back, you could do them in multiple threads and get a flat multiplier speedup; say, a billion on each of 4 CPUs is roughly 4x faster than 4 billion on one CPU.
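A minimal sketch of that idea — my own illustration, assuming four cores and splitting the work into independent chains; compile with -pthread:

#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

// Each thread runs its own dependent sqrt chain over its share of the iterations.
void chain(long iters, double* out) {
    double n = 2.0;
    for (long i = 0; i < iters; ++i)
        n = std::sqrt(n) + 1.0;   // the +1 keeps the value from settling at exactly 1.0
    *out = n;
}

int main() {
    const int nthreads = 4;        // assumption: 4 cores available
    const long total = 1000000000L;
    std::vector<std::thread> pool;
    std::vector<double> results(nthreads);
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back(chain, total / nthreads, &results[t]);
    for (auto& th : pool)
        th.join();
    for (int t = 0; t < nthreads; ++t)
        printf("thread %d: %f\n", t, results[t]);
    return 0;
}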

**something interesting... you may be able to approximate the root to low precision faster, or even to integer precision, if that is good enough. Not every program NEEDS to be terribly accurate, just as sometimes you can say 'it's 3 miles thataway' and sometimes you need the info in nanometers.

But there are things in the standard tools that ARE very slow. You can write an integer power function that is 2-3 times faster than pow(), because pow does a lot of extra work to handle floating-point exponents, which slows it down if you are not using that feature. Or number-to-text conversion is not well implemented on all systems -- I just did a DIY on that and it more than doubled the speed of my code, and I repaired an MD5 implementation I got off the web that used sprintf to make hex strings; replacing that with my own was also a multiplier faster...
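For example, a minimal sketch of that kind of integer power function — exponentiation by squaring; my own illustration, and the name ipow is made up:

double ipow(double base, unsigned exp) {
    double result = 1.0;
    while (exp) {
        if (exp & 1)
            result *= base;   // fold in the current bit of the exponent
        base *= base;         // square the base for the next bit
        exp >>= 1;
    }
    return result;
}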

It's a case-by-case thing, or a tribal-knowledge thing... most of the tools are pretty efficient, and when they are not, it's usually due to special-case vs general-purpose trade-offs (your special case is faster than their general-purpose solution because you discard work you don't need to do). You will NOT beat anything that is on chip or in the FPU directly (hand-rolled trig, sqrt, log, etc. are going to lose).