Memory Accesses vs. FP Operations

Hello everyone!

It's been a shock to find out the hard way that complicated expressions with floating-point operands take as much time to evaluate as those with integer operands. Now, is memory access and retrieval faster than either of the two? I'm thinking some results are redundant, so I might as well cache them and reuse them.

What do you think?
Sounds like you're on the verge of inventing lookup tables.

http://en.wikipedia.org/wiki/Lookup_table
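For concreteness, here's a minimal sketch of the idea: precompute an expensive function (sin() in this hypothetical example) once, then replace each call with an array index. The table size and the nearest-entry rounding are arbitrary choices for illustration.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Precompute sin() at 256 evenly spaced angles in [0, 2*pi).
std::array<double, 256> make_sin_table()
{
    const double two_pi = 2.0 * std::acos(-1.0);
    std::array<double, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = std::sin(two_pi * i / table.size());
    return table;
}

// Replace the transcendental call with an array index
// (nearest entry, no interpolation; accuracy is limited by table size).
double table_sin(const std::array<double, 256>& table, double radians)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    std::size_t i =
        static_cast<std::size_t>(radians / two_pi * table.size()) % table.size();
    return table[i];
}
```

Whether this actually beats the FPU depends on the hardware and on how hot the table stays in cache; measure before committing to it.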
> It's been a shock to find out the hard way that complicated expressions with floating-point operands take as much time to evaluate as those with integer operands.
You should expect integer arithmetic to be faster. Your observation that the two run at the same speed on your kit just goes to show how much development has gone into FPUs and code generators.

> Now, is memory access and retrieval faster than either of the two?
It depends on the calculation you're comparing it to.

> I'm thinking some results are redundant, so I might as well cache them and reuse them.
Again it depends on what you're doing. Performance tuning always depends on what you're doing.
Generally speaking, it's normal to store the results of expensive computations. But if your kit can do what you need done on the fly, then nuf respect. The rest of us tend to store these things.
> Now, is memory access and retrieval faster than either of the two?


I'd think they are the same. Each is just stored as a series of bits.
> Sounds like you're on the verge of inventing lookup tables.


@Moschops,

Haha, no no, I know perfectly well what LUTs are, but I was avoiding them because I'm too lazy, or because I didn't think the application would run slowly.


@kbw,

> It depends on the calculation you're comparing it to.


It checks for integer equality. So either that's your answer, or I didn't get what you're saying.



> Now, is memory access and retrieval faster than either of the two?

Have you analysed the spatial/temporal/sequential locality of reference?
http://en.wikipedia.org/wiki/Locality_of_reference#Spatial_and_temporal_locality_example:_matrix_multiplication
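To make the matrix-multiplication example in that link concrete, here's a sketch of the loop-order trick, assuming square row-major matrices stored in flat arrays. The classic i-j-k order walks one matrix column-wise (stride N), hurting spatial locality; reordering the loops to i-k-j makes every inner-loop access sequential:

```cpp
#include <cstddef>
#include <vector>

// C += A * B for N x N row-major matrices in flat arrays.
// The i-k-j order streams both B and C row by row, which is
// cache-friendly; the arithmetic is identical to the naive order.
void multiply_ikj(const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C, std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            double a = A[i*N + k];            // reused across the inner loop
            for (std::size_t j = 0; j < N; ++j)
                C[i*N + j] += a * B[k*N + j]; // sequential accesses
        }
}
```

Same arithmetic, same results; only the traversal order, and therefore the cache behaviour, changes.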
> It's been a shocker to find out the hard way that complicated expressions that handle floating point operands take as much time to evaluate as those handling integer operands.
I find this hard to believe.
Could I see the code you used to make the measurement?

On the subject of lookup table efficiency: one time I was optimizing a function. I had basically reduced it to its near-minimal expression, save for one thing: there was an integer division that I had no way of eliminating. Since its domain was small (both of its operands were in the [0, 255] range), I tried building a lookup table such that
a/b == table[(a<<8)+b] (or something to that effect)
The speedup was rather small, but it was still large enough to be measurable.
Memory access is fast enough and some operations are slow enough to make lookup tables worth considering in many situations.
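A sketch of the table described above, assuming 8-bit operands packed into a 16-bit index; the b == 0 entries are left as 0 by convention, since there is no quotient to store:

```cpp
#include <cstdint>
#include <vector>

// Quotients a / b for all a, b in [0, 255], indexed by (a << 8) | b.
// Entries with b == 0 stay 0 (no meaningful quotient).
std::vector<std::uint8_t> make_div_table()
{
    std::vector<std::uint8_t> table(256 * 256);
    for (int a = 0; a < 256; ++a)
        for (int b = 1; b < 256; ++b)
            table[(a << 8) | b] = static_cast<std::uint8_t>(a / b);
    return table;
}
```

Note the table is 64 KiB, so it will typically live in L2 rather than L1; that alone can eat into the speedup.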

However, an equality check takes only a subtraction and a comparison to zero. Both of these operations are what modern processors are best at. There's no way a lookup table would make them faster.
You better read this: http://www.azulsystems.com/events/javaone_2009/session/2009_J1_HardwareCrashCourse.pdf

After that it should be clear that your question doesn't make sense and has no simple answer.

Typical memory latency in clock cycles:
HDD: ~10,000,000
SSD: ~200,000
RAM: ~200
L2: ~15
L1: ~3
register: < 1

So, basically, if your hot data set is bigger than L2 cache, it is only worth caching results of operations taking longer than several hundreds of CPU cycles each. Caching results of simple operations (comparisons, additions), either integer, or FP is pointless, because simple ones take less than 1 CPU cycle on most modern CPUs.
@helios,

I was working on a function that does convolution and some local image enhancement in the range [0, 1]. But it is called so often that I thought that if I worked in the range [0, 255] (i.e., the Byte data type) and rescaled at the end, it would definitely be better. The run time was almost the same.

Maybe I didn't make myself clear about how I'll be using LUTs. Almost 80% of the time, the function will be called twice with the same parameters, so I was thinking I'd cache its return value and find an appropriate way to retrieve it (not a very smart way, though). The index into the LUT would be the function's two parameters, which are integers.

Thanks all for the help, I'll definitely take a look at the links provided.
> I was working on a function that does convolution and some local image enhancement in the range [0, 1]. But it is called so often that I thought that if I worked in the range [0, 255] (i.e., the Byte data type) and rescaled at the end, it would definitely be better. The run time was almost the same.
In my scenario, the operands were already integers, so the lookup table made sense in that case.
You probably lost too much time in the float-to-integer conversion. That is an operation you should avoid at all costs in high-performance code.

> Maybe I didn't make myself clear about how I'll be using LUTs. Almost 80% of the time, the function will be called twice with the same parameters, so I was thinking I'd cache its return value and find an appropriate way to retrieve it (not a very smart way, though). The index into the LUT would be the function's two parameters, which are integers.
That's memoization, not lookup. Lookup involves storing in advance a large number of common results, maybe all possible results. Memoization only remembers results that have already been calculated.
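A minimal memoization sketch for a function of two integer parameters (the function body here is a hypothetical stand-in for the real computation):

```cpp
#include <map>
#include <utility>

// Hypothetical expensive function of two integer parameters.
double expensive(int i, int j)
{
    return static_cast<double>(i) * j + 0.5; // stand-in computation
}

// Memoized wrapper: remembers only the results already computed,
// rather than precomputing every possible input as a lookup table would.
double memoized(int i, int j)
{
    static std::map<std::pair<int, int>, double> cache;
    auto key = std::make_pair(i, j);
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;           // repeat call with same (i, j): a lookup
    double result = expensive(i, j);
    cache.emplace(key, result);
    return result;
}
```

If only the immediately previous call is ever repeated, a single cached (key, value) pair is cheaper than a map.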

There are many possible solutions for your problem. Which one is the best one depends on how exactly your code is structured. Could you post a simplified version?
Sure, this is an overly simplified version:

int main()
{
    char *in, *out;
    int width, height;
    // Load image
    
    int i, j;
    for (i = 0; i < width*height; i++)
    {
        for (j = 0; j < width*height; j++)
        {
            // ... some code here
            some_function(out, in, i, j);
        }
    }
    
    // Store out
    return 0;
}

Now, I can't just return an integer variant from some_function because it involves normalization and whatnot, but I am trying to restrict myself to calling some_function with integer operands. In other words, I only convert to double on one line of the code: not the line in main where some_function (which, by the way, is inlined) is called, but one line within some_function itself.

Is some_function() the function whose results you want to memoize? It doesn't fit in with your last statement.
> the function I'm using will be called twice almost 80% of the time with the same parameters
Yes, indeed, I could split it into two, and let the first of the two return a value, do something with the value, then pass the value (or something else) to the second function.
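Something like this sketch of the split, with hypothetical stand-ins for both stages:

```cpp
// First stage: the work shared by both call sites; compute once, reuse.
double first_stage(int i, int j)
{
    return i * 0.5 + j;  // stand-in for the shared computation
}

// Second stage: the remaining work, taking the cached value as input.
void second_stage(double value, char* out, int index)
{
    out[index] = static_cast<char>(value);  // stand-in for the rest
}
```

In main, call first_stage once per (i, j), keep the result in a local, and hand it to second_stage at both call sites.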
Topic archived. No new replies allowed.