x86 Assembly - Indirect addressing

In x86/-64, is there any point in doing, for example,

mov eax, [ebx+ecx*4]

instead of

imul ecx, 4
add ecx, ebx
mov eax, [ecx]

other than code size and having to modify a register?
I'm no expert on assembly, but it looks like one is just more convenient than the other.
I'd also expect a difference in speed. I'd suggest benchmarking it.
If you disassemble the result of the first, does it not end up exactly like your second? I would bet the resulting machine instructions would be almost identical.

If your assembler lets you do the first, there's no reason to write the longhand version besides readability, and I'd guess readability isn't your first concern if you're writing any significant portion of your code in assembly.
If you disassemble the result of the first, does it not end up exactly like your second?
No, they're different opcodes.

If you're doing dynamic code generation, writing three simple instructions is easier than writing a single instruction with indirect addressing.
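For reference, here is roughly how the two forms encode as a 32-bit NASM-style listing (the exact bytes for the add can vary, since assemblers may pick either of two equivalent encodings, so treat this as a sketch and check your own listing output):

8B 04 8B     mov eax, [ebx+ecx*4]    ; one 3-byte instruction: the ModRM + SIB bytes express the *4 and the add

6B C9 04     imul ecx, ecx, 4        ; versus roughly 7 bytes of three dependent instructions
01 D9        add ecx, ebx            ; (03 CB is an equivalent encoding an assembler might emit instead)
8B 01        mov eax, [ecx]

So the single instruction really is a different opcode pattern, not just shorthand that the assembler expands.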
other than code size and having to modify a register?

Those are two very good reasons.
I'm not saying they're not; I'm asking if there are other reasons.
I'm asking if there are other reasons.

The only other reason I can think of is speed. Sometimes a sequence of smaller instructions is actually faster than a big complex instruction, although I'm not sure if that's true in this case.
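For what it's worth, that trade-off is exactly why compilers tend to replace multiplies by small constants with shift/LEA sequences. A hedged sketch of the idea (the cycle figures are typical values for recent Intel cores, not measurements from this thread):

imul eax, eax, 10          ; one general-purpose multiply, typically ~3 cycles of latency

lea  eax, [eax+eax*4]      ; eax *= 5, about 1 cycle
add  eax, eax              ; eax *= 2, about 1 cycle -> *10 overall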
closed account (48T7M4Gy)
Count the number of clocks and check for any redundant transfers or storage operations between the two cases.

e.g. http://www.fermimn.gov.it/linux/quarta/x86/index.htm
That reference has uncertain applicability in a modern processor. I think the Intel manuals don't even include clock specs in the instruction reference.
closed account (48T7M4Gy)
And here's another one, keeping in mind that x86 is a family of processors that is quite 'old', and how 64-bit fits into the tables is not immediately clear. Intel and AMD don't appear to be all that forthcoming with the info.

How reliable these sites are is anybody's guess.

http://zsmith.co/intel_i.html#imul
other than code size and having to modify a register?

Two registers, actually: ecx and the flags register (imul writes CF/OF, and add rewrites all the arithmetic flags).
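There is a practical side to the flags point: the addressing-mode form can sit between a compare and the branch that consumes it, while the longhand cannot. A small illustrative sketch (the .in_range label is made up for the example):

cmp  edx, 100
mov  eax, [ebx+ecx*4]    ; mov never touches the flags, so the cmp result survives
jb   .in_range

; the longhand clobbers the flags the cmp just set:
; imul ecx, 4            ; writes CF/OF (the other flags are left undefined)
; add  ecx, ebx          ; writes all the arithmetic flags
; mov  eax, [ecx]
; jb   .in_range         ; now tests add's flags, not cmp's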

As for the relevant execution speeds, see http://www.agner.org/optimize/instruction_tables.pdf -
looking at the Intel Skylake table there, a memory mov has latency 2 for all addressing modes, while imul alone is latency 3 (and is restricted to a single execution port). Of course this only matters if your data are already in L1 cache (for example because you're using [ebx+ecx*4] in a loop!)
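To make that loop remark concrete, here is a minimal array-sum sketch in NASM syntax (edx is assumed to hold the element count): the indexed form leaves ecx free to remain the loop counter, whereas the longhand would have to clobber it or burn another register.

    xor  eax, eax            ; running sum
    xor  ecx, ecx            ; element index
.next:
    add  eax, [ebx+ecx*4]    ; address arithmetic folded into the load
    inc  ecx
    cmp  ecx, edx            ; edx = number of elements (assumed)
    jb   .next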
Very cool reference. Thanks for that!

So if SHL was used instead of IMUL, does that mean that, since each instruction needs the result of the previous one, doing SHL-ADD-MOV has a total latency of 3.5, while doing just MOV has a total latency of 2?
helios wrote:
doing SHL-ADD-MOV has a total latency of 3.5, while doing just MOV has a total latency of 2?

Possibly, depending on how the CPU optimizes that code.

In fact, I'm going to give it a spin, because I like pointless benchmarks.

Executing each piece of code 10'000'000'000 times on a Xeon L5520, compiled with clang++. I switched the registers to rdi/rsi to match the C calling convention.

This could probably be better, but I have to get back to real work. Full program: http://coliru.stacked-crooked.com/a/9dc5bf5abcc79780
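The linked program isn't reproduced here, but as a rough idea of what the mov-version loop could look like as a standalone program, here is a sketch (my reconstruction, not the actual code behind the coliru link) that assembles with nasm -f elf64, links with ld, and can be timed with the shell's time command on 64-bit Linux:

global _start
section .bss
buf:    resd 16                  ; small array to load from

section .text
_start:
        mov  rcx, 10000000000    ; iteration count, as in the post
.loop:
        lea  rdi, [rel buf]      ; base pointer, reloaded every iteration
        mov  esi, 2              ; index
        mov  eax, [rdi+rsi*4]    ; the indexed-addressing version under test
        dec  rcx
        jne  .loop
        mov  eax, 60             ; exit(0) via the Linux syscall interface
        xor  edi, edi
        syscall

Swapping the single indexed mov for the shl/add/mov or imul/add/mov sequence gives the other two variants.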

mov version:
  400520:       bf 6c 09 60 00          mov    $0x60096c,%edi
  400525:       be 02 00 00 00          mov    $0x2,%esi
  40052a:       8b 04 b7                mov    (%rdi,%rsi,4),%eax
  40052d:       48 ff c9                dec    %rcx
  400530:       75 ee                   jne    400520 <main+0x10>


CPU Time: 8.094s
Instructions Retired: 50,022,174,000
CPI Rate: 0.408


shl version:
  400520:       bf 6c 09 60 00          mov    $0x60096c,%edi
  400525:       be 02 00 00 00          mov    $0x2,%esi
  40052a:       48 c1 e6 02             shl    $0x2,%rsi
  40052e:       48 01 fe                add    %rdi,%rsi
  400531:       48 8b 06                mov    (%rsi),%rax
  400534:       48 ff c9                dec    %rcx
  400537:       75 e7                   jne    400520 <main+0x10>

CPU Time: 12.192s
Instructions Retired: 70,037,218,000
CPI Rate: 0.439


imul version:
  400520:       bf 6c 09 60 00          mov    $0x60096c,%edi
  400525:       be 02 00 00 00          mov    $0x2,%esi
  40052a:       48 6b f6 04             imul   $0x4,%rsi,%rsi
  40052e:       48 01 fe                add    %rdi,%rsi
  400531:       48 8b 06                mov    (%rsi),%rax
  400534:       48 ff c9                dec    %rcx
  400537:       75 e7                   jne    400520 <main+0x10>

CPU Time: 12.124s
Instructions Retired: 70,038,126,000
CPI Rate: 0.437


Looks like shl and imul took the same time (the difference is noise; it skewed the other way on another run).
Curious how for SHL the time increased by 50%, as predicted by the latency numbers, but the prediction didn't hold for IMUL.
I guess that settles that.
I am guessing the CPU saw an imul by a power of 2 and executed an shl instead.
I guess, but it's interesting that the CPU has time to perform those kinds of checks.