Please evaluate and propose a more efficient implementation based on bit operations on a 32-bit machine

I am working on a Raspberry Pi 3, whose ARM CPU prefers to do memory reads and writes at 32 bits. I have two processes, zma and zmc. Zmc processes video frames from an RTSP camera feed and does a bit of hardware H.264 encoding, only to produce side info consisting of motion vectors. The video frame is subdivided into tiles of 16x16 pixels, spanning left to right and then top to bottom. I need to check whether each macroblock's motion vector has had significant displacement and, if it has, communicate this to zma. Zma checks whether the coordinates of the macroblock fit in a polygon region of interest, counts the total number of such macroblocks that fit in the polygon, and comes up with a score for motion.

Zma and zmc communicate via mmap. Zmc writes to a buffer and zma loads the buffer from memory. They do this with each frame, and the buffer is part of a ring buffer. If zma is too slow in processing a buffer, zmc catches up with it on the next round and a buffer overrun happens. I suspected that the polygon inclusion test was too slow for this; I wanted to exploit the fact that the macroblocks are tiled and also to save some memory. So I decided to have zmc use a buffer that is 1024 bytes long (enough bits to store the state of each macroblock at a frame size of 1920x1080: (1920x1080)/256/8 = 1012.5 bytes). I have some spare bytes at the end (or the front) for some header/statistical info.

In zmc, each bit of this data buffer represents a macroblock, in order; each bit is set to 1 to indicate that the motion vector has significant displacement, 0 otherwise. Zma, as part of its initialization, creates its own 1024-byte buffer and checks whether the coordinates of each "tile" in the frame fall inside the polygon ROI. If a "tile" is inside the ROI, it marks the bit as 1, 0 otherwise.
This moves the polygon test into initialization, away from the realtime path, where I hope simple bit operations will be a faster implementation. So when zmc sends the info to zma, zma just needs to perform an AND operation on each word, count the total number of bits that are set after the AND, and derive a score from that.

On top of all this, the memory reads and writes should be 32-bit aligned for performance. I'm open to any suggestions, as my knowledge of this topic is limited.
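In outline, the AND-and-count step would look something like this (a sketch with illustrative names, not my actual code; `popcount32` is the classic SWAR bit-counting formula):

```cpp
#include <cstdint>
#include <cstddef>

// Classic SWAR popcount; GCC/Clang's __builtin_popcount is equivalent.
static unsigned popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

// Sketch of the AND-and-count pass: both buffers are treated as arrays of
// 32-bit words, ANDed word by word, and the surviving bits are counted.
// Assumes both buffers are 32-bit aligned and nwords covers the frame.
unsigned score_motion(const uint32_t *mask, const uint32_t *frame, size_t nwords) {
    unsigned total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += popcount32(mask[i] & frame[i]);
    return total;
}
```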

My solution was to load the bits into a 4-byte word and then do a memcpy every 4 bytes into a mvect_buffer that is allocated with malloc.

Q1. Would it be possible/better to allocate the entire 1024 bytes as a local function variable, so that it is on the stack (presumably avoiding multiple reads and writes to the heap), and then just do a single memcpy of 1024 bytes at the end?

Q2. Is there a fast way of accessing individual bits of the buffer on the heap, so I can operate one bit at a time? This would certainly give me a more flexible solution for the case when the macroblocks do not come in any particular order (which is the case for the second type of macroblocks, derived from software H.264 decode).

Q3. If my implementation as far as memory reads and writes is the most performant solution, is there a faster technique of manipulating the bits than what I have so far?

Q4. Is there a faster abs function implementation?

Right now the RTSP camera is sending 20 fps, but zmc and zma process at 15 fps. I think the extra computation is reducing the capture rate. I want to increase the capture rate to as close to the camera's fps as possible.

Here is the implementation for ZMC


uint8_t *mvect_buffer;
/* ... */
if (buffer->flags & MMAL_BUFFER_HEADER_FLAG_CODECSIDEINFO) {
    uint16_t t_offset = 0;

    mmal_buffer_header_mem_lock(buffer);
    uint16_t size = buffer->length / sizeof(mmal_motion_vector);
    struct mmal_motion_vector mvarray[size];
    uint32_t registers;

    // copy buffer->data to a temporary
    memcpy(mvarray, buffer->data, buffer->length);

    mmal_buffer_header_mem_unlock(buffer);

    // START CODE
    registers = 0;
    uint16_t count = 0;
    uint16_t wcount = 0;
    for (int i = 0; i < numblocks/4; i++) {
        mmal_motion_vector mvs;
        motion_vector mvt;

        memcpy(&mvs, mvarray + i, sizeof(mmal_motion_vector));

        if ((abs(mvs.x_vector) + abs(mvs.y_vector)) > 5)  // ignore if it did not move pixels
            registers = registers | (1 << count);

        count++;

        if ((count == 32) || (i == numblocks - 1)) {
            memcpy(mvect_buffer + t_offset, &registers, 4);
            count = 0;
            wcount += 1;
            t_offset += 4;
            registers = 0;
        }
    }
    // END CODE
}



Here is the code for ZMA initialization.
void Zone::SetVectorMask() {
  uint16_t numblocks = 0;
  uint16_t count = 0;
  uint16_t wcount = 0;
  uint32_t registers = 0;  // must start cleared; it was uninitialized before
  uint16_t offset = 0;

  numblocks = (monitor->Width() * monitor->Height()) / 256;
  Info("Setting up the motion vector mask with numblocks %d", numblocks);
  for (uint16_t i = 0; i < numblocks; i++) {
    // the blocks are tiled to cover the entire frame and are 16x16 in size
    uint16_t xcoord = (i * 16) % (monitor->Width() + 16);
    uint16_t ycoord = ((i * 16) / (monitor->Width() + 16)) * 16;

    if (polygon.isInside(Coord(xcoord, ycoord))) {  // coordinates inside polygon
      registers = registers | (1 << count);
    }

    count++;
    if ((count == 32) || (i == numblocks - 1)) {
      memcpy(zone_vector_mask + offset, &registers, 4);
      //std::cout << "Count in " << count << std::endl;
      count = 0;
      wcount += 1;
      offset += 4;
      registers = 0;
    }
  }
  Info("Done setting up zone vector mask");
}


Here is the code for ZMA as it handles the buffer in realtime from ZMC.

uint8_t *mvect_buffer;
uint16_t width;
uint16_t height;
uint16_t numblocks = (width * height) / 256;
/* ... */

if (mvect_buffer) {
    //Info("Analysing mvect buffer with numblocks %d", numblocks);
    uint16_t offset = 4;
    uint16_t wcount = 0;
    uint32_t mask = 0;
    uint32_t buff = 0;
    uint32_t res = 0;
    uint16_t c = 0;

    for (int i = 0; i < numblocks/4; i++) {
        memcpy(&mask, zone_vector_mask + offset, sizeof(mask));
        memcpy(&buff, mvect_buffer + offset, sizeof(mask));
        res = mask & buff;
        offset = offset + 4;

        // got this from the net: counts set bits 12 bits at a time,
        // via a 64-bit multiply, mask and modulus
        c  = ((res & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
        c += (((res & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
        c += ((res >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
        vec_count += c;
    }
}
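(As an aside, on GCC/Clang the modulus-based bit-counting trick above could presumably be replaced with the `__builtin_popcount` intrinsic, which lets the compiler choose a good sequence for the target CPU; a sketch:)

```cpp
#include <cstdint>

// __builtin_popcount counts the set bits of its argument. The compiler
// emits the best sequence for the target: a handful of ALU ops on ARMv7
// (which has no scalar popcount instruction), or a single instruction
// where one exists.
inline unsigned count_bits(uint32_t v) {
    return (unsigned)__builtin_popcount(v);
}
```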
    



Thanks,
Chris
Q1 ... what about a static local variable? Then whether it's a pointer or a stack variable doesn't matter. This is often a performance gain for a frequently called function. It is NOT thread safe, though.

Q2: yes, there are lots of ways to get at bits efficiently. The type should be uint32_t, from the sound of it. Then you can just access them: variable & 1 (first bit), variable & 2 (second bit), variable & 4 (third bit), etc. It doesn't get much better than this, unless you want to store it as a vector of boolean, which is supposed to be internally optimized (doing the same &-bit stuff I just described under the hood) to make bit access cleaner. You can also use a bitset or a bitfield member of something.
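For instance, the usual single-bit operations on a word look like this (sketch; helper names are illustrative):

```cpp
#include <cstdint>

// Basic single-bit operations on a uint32_t word, n = 0 for the lowest bit.
inline bool     test_bit(uint32_t w, unsigned n)  { return (w >> n) & 1u; }
inline uint32_t set_bit(uint32_t w, unsigned n)   { return w | (1u << n); }
inline uint32_t clear_bit(uint32_t w, unsigned n) { return w & ~(1u << n); }
```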

Q3: Not sure. But this is the right question. The fastest, most uber-tweaked bubble sort in the world is still slow!

Q4: maybe? Long, long ago I worked on a system where doing the bit logic yourself was faster than the abs function, and on that same era's hardware it was MUCH faster to hack the sign bit on a double. But the only way to know is to DO it: write it, test it, see if it's faster. sqrt(x*x) is not faster, or junk like that, but flipping the bit is very fast. It may also depend on whether your CPU has an 'abs' instruction or has to do it by hand. abs is also one of those things that you can often 'factor out' with some obscure logic; that is, you can sometimes set up your code so it doesn't need the abs at all.
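The bit-logic abs alluded to above is usually written like this on two's-complement machines (a sketch; note it is not valid for INT_MIN, and right-shifting a negative int is technically implementation-defined before C++20, though GCC on ARM does an arithmetic shift):

```cpp
#include <cstdint>

// Branchless absolute value for 32-bit two's-complement integers.
// The arithmetic shift replicates the sign bit: mask is 0 for x >= 0 and
// all-ones for x < 0, so (x ^ mask) - mask computes -x for negative x.
inline int32_t abs32(int32_t x) {
    int32_t mask = x >> 31;
    return (x ^ mask) - mask;
}
```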

Little stuff matters, but you really need to profile the code to find the slowness. Guessing is not good after the first pass, maybe the second.

E.g., I think your x and y coords can be cooked up with a += instead of all that math, or looked up from a lookup table for each monitor width you support.
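The += idea might look like this (sketch; the function and parameter names are illustrative, not from the code above):

```cpp
#include <cstdint>
#include <cstddef>

// Track the tile's x/y coordinates incrementally (one add and one compare
// per tile) instead of computing a multiply/divide/modulo for every block.
// 'cols' is the number of 16x16 tiles per row; xs/ys receive one coordinate
// pair per tile, in raster order.
void tile_coords(uint16_t cols, uint16_t rows, uint16_t *xs, uint16_t *ys) {
    uint16_t x = 0, y = 0;
    size_t n = (size_t)cols * rows;
    for (size_t i = 0; i < n; i++) {
        xs[i] = x;
        ys[i] = y;
        x += 16;                 // advance one 16-pixel tile
        if (x == cols * 16u) {   // wrapped past the row: start the next one
            x = 0;
            y += 16;
        }
    }
}
```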

You have small memcpys in a loop; can that be rewritten to do one big memcpy instead? This block of code concerns me. Forget abs; see what you can do about reducing all the data copying/movement, and move bigger blocks instead of small ones. And for sure, on a 32-bit system a memcpy of 4 bytes is slower than letting the compiler copy a 32-bit item directly; that is, int x = y is faster than a memcpy into x from y, unless you are on byte-wide hardware (unheard of on anything but the smallest embedded systems, stuff that lacks an OS, like a wristwatch).
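For what it's worth, modern compilers usually fold a fixed-size 4-byte memcpy into a single 32-bit store anyway, so the two forms often compile identically; the memcpy form has the advantage of being legal at any alignment. A sketch (function names are illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Two ways to write a 32-bit word into a byte buffer. The memcpy form is
// valid at any alignment and is typically folded into one store; the
// pointer-cast form requires buf to be 4-byte aligned.
void store_word_memcpy(uint8_t *buf, uint32_t w) {
    std::memcpy(buf, &w, sizeof w);
}

void store_word_direct(uint8_t *buf, uint32_t w) {
    *reinterpret_cast<uint32_t *>(buf) = w;  // buf must be 32-bit aligned
}
```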

It looks like you may need to multi-thread this thing. How many cores do you have on that device?

Finally, let me remind (or inform) you of a rule of thumb called the time-space tradeoff. It is not always true, but in general, saving space costs time, and spending space (with a purpose) saves time.

Q1: RAM is RAM. It makes no speed difference if the program allocates it to the heap or the stack.

Q2: I suspect you'd get a big speed boost by reading a 32-bit unsigned from the buffer into a local variable, doing the bit manipulations on that variable and then writing it back out. The reason is that the compiler should allocate the local variable to a CPU register and do all the bit manipulations on the register. But that's basically what you've done.

Looking at your code, why are you making the copy into mvs with that memcpy inside the loop? Why not get rid of it and change the test to
if ((abs(mvarray[i].x_vector) + abs(mvarray[i].y_vector)) > 5)

Also, since the loop runs with i < numblocks/4, i will never be equal to numblocks-1, so that half of the flush condition never fires.
Q1: RAM is RAM. It makes no speed difference if the program allocates it to the heap or the stack.

I will argue that if it's a function that is called a lot in a tight loop, allocation and deallocation on the heap is a substantial cost. That is why I said static might be useful. If it is only called infrequently, there's no difference.
Thanks guys for your replies.

there are lots of ways to get at bits efficiently. The type should be unsigned int32, from the sound of it. then you can just access them: variable &1 (first bit), variable &2 (second bit) variable & 4 (third bit) ... etc.


That looks like an & operation; I'll have to look into that. To check whether the nth bit is on, I thought you had to do:
 
variable & (1<<n)


Is there a faster way to set a bit than:

registers =registers | (1 << count);

Regarding Q1, do you guys think it would be better to just allocate one big array of 4-byte words? This array would be 1024 bytes. If the array is stack allocated, then I wouldn't have to resort to using the uint32_t register variable and memcpying to and from it. I think arrays declared inside functions are stack allocated; that is to say, I wouldn't have to make a trip to the heap when accessing the individual elements. I could just operate on the array elements directly and, after I'm done with all the blocks, do a single memcpy of the entire array to mvect_buffer. I wouldn't need the register variable at all.

The program is multi-threaded. There will be as many zma-zmc pairs running concurrently as there are cameras, resources allowing.

I think the lookup table is a good idea.

Looking at your code, why are you making the copy into mvs with that memcpy inside the loop? Why not get rid of it and change the test to
if ((abs(mvarray[i].x_vector) + abs(mvarray[i].y_vector)) > 5)


Made the change. I'll be removing some more unnecessary variables and copies as I make further optimizations.

Since the loop runs with i < numblocks/4, i will never be equal to numblocks-1.


That was a mistake from copying and pasting from test code into the actual code; I should be processing all the blocks, not just numblocks/4. The numblocks-1 check was there so I could process the last chunk of blocks, which does not necessarily total 32. However, some fuzziness can be tolerated in the data, so I will just remove it and ignore the last partial 32-bit block for the sake of performance.

Thanks,
Chris
Regarding Q1, do you guys think it would be better to just allocate one big array of 4 byte words? This array will be 1024 bytes. If the array is stack allocated, then I wouldn't have to resort to using the uint32_t register variable and memcpying to and from it.
What makes you think you need to use the uint32_t variable to set the array if it's on the heap? You can still access the individual bytes of arrays on the heap.

I could just operate on the array elements directly, then after I get done with all the blocks, just do a single memcpy of the entire array to mvect_buffer.
Why don't you just operate on mvect_buffer directly?

void random_bits(uint8_t *mvect_buffer, size_t size){
    for (size_t i = 0; i < size; i++){
        auto &byte = mvect_buffer[i];
        byte = 0;
        for (int j = 0; j < 8; j++)
            byte |= ((rand() & 1) << j);
    }
}
I may have erroneously thought that it would be slow to operate on mvect_buffer directly, since it was allocated with malloc and must therefore be on the heap (I don't think I posted that information earlier). I assumed that working on it directly would mean reading/writing heap memory each time I accessed any part of it, and that if I had a copy on the stack and accessed that instead, it would be faster. Then the only heap accesses would be when mvect_buffer was copied to the stack-allocated array (in the analysing process) and when the stack-allocated array was copied back into mvect_buffer (in the capturing process). This would reduce the number of memcpys to one per process when operating on mvect_buffer's data.

Thanks,
Chris

The only architecture I know of where accessing the stack or the heap may take different amounts of time is in NUMA. The way to deal with that problem is not by not using the heap, but by ensuring that a socket only rarely needs to use the other socket's RAM. In other words, by keeping track of where a thread is running and on which memory it's operating.
On single socket architectures RAM has uniform access times.

You should not write your functions assuming that the stack is faster than the heap, because that's generally just not true, and you're making your code more complicated, and probably slower, than it needs to be.

But either way, in this case one way or another you're going to have to write 'size' bytes to mvect_buffer. Whether you do it via a single memcpy() or by setting each byte makes little difference. I recommend you try rewriting your function similar to how I did it above and compare the times of your current implementation and your new implementation. I'm pretty sure the new implementation will be at least as fast.
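A minimal timing harness for that comparison might look like this (a sketch using std::chrono; `time_ms` is an illustrative name and the callable stands in for the function under test):

```cpp
#include <chrono>

// Run the candidate function many times and return the elapsed wall time in
// milliseconds. steady_clock is monotonic, so it is safe for measuring
// intervals. Compare the two implementations with the same iteration count.
template <class F>
double time_ms(F work, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
        work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```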