Please explain the code snippet

The following code snippet is from a C++ textbook.
The goal is to understand what is going on.

void send(int* to, int* from, int count)
// Duff's device. Helpful comment deliberately deleted.
{
    int n = (count+7)/8;
    switch (count%8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
Hint: if a case section doesn't have a break statement, then execution continues to the next line of code, i.e. the next case section.
Is the code correct? It looks messy to me.
I know that. I can't understand how the do-while and the switch are combined.
In addition, I'm not quite sure why the arithmetic is done at the beginning.
Another hint: case whatever: is just a label. It's not a statement, and it doesn't start a new block, or anything like that.

Here's the same code in what may be a more helpful format:

void send(int* to, int* from, int count)
// Duff's device. Helpful comment deliberately deleted.
{
   int n = (count+7)/8;
   switch (count%8)
   {
case 0:
      do
      {
         *to++ = *from++;
case 7:
         *to++ = *from++;
case 6:
         *to++ = *from++;
case 5:
         *to++ = *from++;
case 4:
         *to++ = *from++;
case 3:
         *to++ = *from++;
case 2:
         *to++ = *from++;
case 1:
         *to++ = *from++;
      } while (--n > 0);
   }
}

@MikeyBoy's point is key to understanding how this code even compiles.

IMNSHO, this is what Stroustrup would call a "cute" solution that should not be used.

It combines two lexical ideas in a way that does not seem clear at first, and even old hands like me look and ask just as you have, "does this even compile?"

While @MikeyBoy's point is key, it does bring to mind an obvious exception: these aren't STANDARD labels. You can't "goto" any of them, for example.

@aligh, this seems to me a contrived example for illustration (in class perhaps) that copies integer arrays. It is clearly in C, at least in style. I have seen things LIKE this (not this exact version) which violate most modern coding standards relative to clarity.

The idea is to offer an "unrolled loop" that copies 8 integers per loop, which naturally may not be exactly the size of the integer arrays involved. The math at the start deals with this "leftover" so as to complete the loop.

It appears to be an effort to avoid the more common cleanup code, where the "n" sized chunks are processed first, then code checks for leftovers and processes them.

It should work, but it is an example of a lack of clarity for which better, cleaner and possibly even faster solutions already exist. Personally, I don't like to feed this kind of nonsense to students unless they're sufficiently advanced and it is a "boot camp" style "hardening" of what you'll see from older libraries in the wild.

Though I've not seen this exact form, there are libraries with code something like this you'd wish was never written this way.

There are those who will argue this is a good example of correct code. Ok, fine, but many, many coding standards stress that clarity and obvious purpose is key to reduced maintenance costs and bug generation. If a snippet has to be read a few times to figure out what it does, it is likely to be more of a problem than code that is clear and obvious.

I'd have to test, but I think many processors are faster doing this with a repeated move/copy instruction (with a tail dealing with what's left over in some cases), which hints such an optimization written out in C could be quite a waste of effort because the compiler may have better ideas anyway.

I don't think this deals with alignment, which is important for performance especially in modern machines. It deals with the "leftovers" first, which could be anything, thus leaving the "8 at a time" possibly crossing an alignment boundary that could negatively impact performance. If allocation of the integer arrays is aligned, this could be a net loss. It is better to have the bulk of the array processed in aligned chunks, then deal with "cleanup" of the leftovers afterwards. This appears to be the kind of code from ages past, when 1Mbyte (not Gbyte) of RAM was over tens of thousands of dollars and CPU's didn't typically have cache.

This might have worked well on, say, an old '286 or even a '386 Intel CPU in the 80's, but post-Pentium it may not perform as well as alternatives.
#include <iostream>
#include <chrono>
#include <algorithm> // copy
#include <cstring>   // memcpy

const int Size = 1000000;

void send(int* to, int*from, int count) {
    int n = (count + 7) / 8;
    switch (count % 8) {         // Duff's device
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}

int a[Size], b[Size];

int main() {
    using namespace std::chrono;
    auto& now = high_resolution_clock::now;
    auto timing = [](auto start, auto stop) {
        return duration_cast<microseconds>(stop - start).count();
    };

    {
    auto start = now();
    send(b, a, Size);
    auto stop = now();
    std::cout << timing(start, stop) << '\n';
    }

    {
    auto start = now();
    std::memcpy(b, a, Size * sizeof *b);
    auto stop = now();
    std::cout << timing(start, stop) << '\n';
    }

    {
    auto start = now();
    std::copy(a, a + Size, b);
    auto stop = now();
    std::cout << timing(start, stop) << '\n';
    }
}

It turns out it's hard to test the timings like this since the first algorithm to be tested primes the cache for the next ones. Maybe they need to be run separately.
I tried an "aligned" size, an array whose size is divisible by 8, and then another that wasn't.

I chose 1073741828 and 1073741820

Simple run for the aligned was 2.72 seconds, non-aligned was 3.04 seconds (avg of 10 runs).

Changing the code to use memcpy showed about 2.75 seconds.

Alignment made a significant difference. Setting 16-byte alignment didn't impact memcpy much, but the non-aligned "send" hit 4 seconds, while the aligned one was about 3.06.

That seems to suggest my theory is applicable, but it's certainly not proof. Send can work well in some situations.


@aligh,
int main(){
	int ar[]{0,1,2,3,4,5,6,7,8,9};
	int *p1=&ar[0];
	int *p2=&ar[4];	
	
	*p1++ = *p2++;
}

here, *p1++ = *p2++; does two things:

a) *p1 = *p2;
i.e. ar[0] = ar[4]

b) then p1++ and p2++
so p1 points to ar[1] and p2 points to ar[5]
Thanks guys and sorry for the late reply. I was sick a few days.

Really helpful comments. Here's what I understand after your comments and trying to decipher the code (and BTW, it's from Bjarne Stroustrup book The C++ Programming...):

The code copies count number of integers from one array to another.
It does this by first dealing with the leftovers relative to packages of eight integers. This is done with count%8 and the switch statement.
There is no break, so every case after the entry point is executed. Because of the suffix ++, the pointers for both the source and destination arrays are incremented at each copy.
When the leftovers are done, a do-while condition (that is stuffed inside the switch) is checked against n, which was set at the beginning to the number of full eight-integer packages. Because of the prefix --, if there was only one pass the code exits the loop; otherwise, the rest of the full eight-integer packages are copied.

Hope I got it right.
Really appreciate your help and time.
#include <iostream>
using namespace std;


int main(){	
	int count=3;
	switch (count){
		case 0:
			do{
				cout<<"start";			
				case 7:cout<<"7";
				case 6:cout<<"6";
				case 5:cout<<"5";
				case 4:cout<<"4";
				case 3:cout<<"3";
				case 2:cout<<"2";
				case 1:cout<<"1";
				default: cout<<".\n";
				count--;
			}while (count>0);
	}	
return 0;
}

output:
321.
start7654321.
start7654321.


#include <iostream>
using namespace std;


int main(){	
	int count=3;
	switch (count){
		case 0:
			while (count>5){
				cout<<"start";			
				case 7:cout<<"7";
				case 6:cout<<"6";
				case 5:cout<<"5";
				case 4:cout<<"4";
				case 3:cout<<"3";
				case 2:cout<<"2";
				case 1:cout<<"1";
				default: cout<<".\n";
				count--;
			}
	}	
return 0;
}

output:
321.
n = (count+7)/8; obtains the ceiling of the division.
it would be equivalent to
n = count/8;
if(count%8)
   ++n;



> IMNSHO
DEA

> the first algorithm to be tested primes the cache for the next ones
then make some untimed runs first
also, don't use a wall clock.

> and BTW, it's from Bjarne Stroustrup book
what's he trying to teach there?

and BTW, it's from Bjarne Stroustrup book The C++ Programming...


I've searched 4th and 3rd edition texts (I have the ebooks of it)....what edition is this code from?

It's not in 4th or 3rd.

I searched every use of the word "case" in the text through 1100+ pages in 4th and about 1000 in 3rd, and nothing like this appears in those.

Is it 2nd or 1st edition?

What's the context?




Last edited on
It's the fourth edition, but the exercises. They've been moved out of the book itself.

http://www.stroustrup.com/4thExercises.pdf

X.10 Statements #3
@aligh,

Ah, now it makes a little more sense. @ne555 was wondering, as was I, why such an example (love the reference to the IC, by the way).

I just didn't see Stroustrup writing and suggesting this type of construction, and now I see this is in a section where he's showing how examples "from the wild" may be difficult to understand, with exercises on how to re-write such code for better clarity or modern usage.

Indeed, with this example he qualifies this isn't a recommended style. The example even asks "Why would anyone write something like that?"

I think the "edge case" or "leftover handling" is backwards in this example. It works as a short solution if optimization wasn't really the goal; it may have fit older hardware better and wasn't particularly harmful, just not clear or obvious.

There are libraries, in use for decades, usually in C with code along these puzzling lines. They've worked for years, have given no trouble, are even prerequisites to C++ libraries, and so much so that hardly anyone even looks at the code anymore. I don't have specific examples in mind, but these are along the lines of the old zip or jpeg file decompression libraries in C (though I'm not saying they're the ones with an example).

There are plenty of times when one comes across some code that just forces the question, not quite as politely as Stroustrup put it: a variation of WTF.


I understand now how the code works, but there is still something that nags me:

Why would anyone need to write code like that?
What is wrong with a simple for loop?
Is it the possibly smaller size of the machine code that is compiled?
Is it the possibly faster machine code that compiles due to less loops?
Why not write machine code in the first place, or manually optimize the compiler's assembly output while we're at it?
Or is it just the blown-up ego (and probably pure inexperience with maintenance) of some developer who felt good about writing stuff that worked but nobody understood, that's at play here? (genuinely asking; no offense meant)

Thank you for your insights!
https://en.wikipedia.org/wiki/Duff%27s_device

> Why would anyone need to write code like that?
Early 1980's, plus much slower machines without fancy hardware that could block-copy memory to memory without bothering the CPU.

Also, compilers of the time were little better than mechanical transformations from source to asm (you got what you wrote).

With modern optimising compilers, you get what you meant.

-funroll-loops
    Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop, -fweb and -frename-registers. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.

The compiler can unroll loops for you, no weird programming tricks involved.


> What is wrong with a simple for loop?
As you've surmised, an unrolled loop performs 8 copies for every branch (12% overhead).
A single copy+branch is a 50% overhead.

> Why not writing machine code in the first place or optimize the assembly
> output of the compiler manually while we're at it?
As soon as you step into assembler, you lose all portability.

It does say right there in the Wiki article what the motivation for writing it was, and that it was written over 35 years ago, when C++ existed as no more than a pre-processor for C and a bag of wild ideas.

Duff wasn't showing off or indulging his ego; he was a skilled programmer writing excellent C code.

Here are his notes on it, from 1988; https://www.lysator.liu.se/c/duffs-device.html

Ahh, I see now.

It never occurred to me to search for "Duff's Device", because "Duff" reminded me of "stupid" or "Simpsons" or something like that and I thought that was just a funny comment by Stroustrup. I don't know why. I'm not a native speaker.

Anyway, thank you all. Learned a lot.

Kudos for finding a reference to this. To think I was a young programmer finding my own way into industry when this was first written is kind of chilling in a way.

Those two links cover what's been covered here.

Knuth wrote about partial unrolled loops in "The Art of Computer Programming" as far back as the late 60's, and it was recognized by programmers and scientists in the 50's.

The links state "Duff's Device" was a curious cross of two language features in '83.

At that time the Cray 2 was still in development (it was released in '85), and it and the Cray 1, still top dog for single-CPU performance per cycle, were both relatively mediocre compared to modern machines. 2 Mbytes of RAM (extremely slow by today's standards) would be $600 or more.

A modern compiler would not fit in the RAM space of a computer costing over $1 million back then. Even if it could, the compilation time would be hours or days to produce what we get in just 1 second now.

Back then C was still often referred to as an assembler (usually by high-level language programmers, like COBOL programmers). It was used very much like an assembler (though CPU independent), and its origin was writing the UNIX operating system for portability (the first, primitive version of UNIX was in assembler).

So, as to why someone would write like this: on a machine clocked below 50 MHz, limited to maybe 1 Mbyte of RAM, priced well above personal means, this was an interpretation of a raw assembler concept in a small space.

The docs suggest raw performance wasn't the only angle: memcpy already existed, but in Duff's original the destination was a memory-mapped output register, so "to" wasn't incremented at all and memcpy didn't apply.

I suspect the language experts building the formal foundations of C's spec may not have realized this use of the switch was possible. In assembler it would have merely been a list of labels, the loop the result of conditional jumps.

