memcpy to a data struct not working

Pages: 12
Above code crashed ONLY in ARMv7 (an embedded system from Critical-link SoM).
It works well in both Windows and normal Linux(Fedora, Ubuntu, CenterOS...).
The normal output is:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
sizeof(Data):132, sizeof(Data1):144
td.time1 (seconds)                    134904.063520017
td.time2 (seconds)                    134904.063520017
td.distance (meters)                  0
td.timeType                           17
td.distType                           00000001
lat (-90, 90 degree]                  40.47028323706526
lng ((-180, 180 degree]               -86.99479230239133
alt (meters)                          185.1183599689319
velocity.North (m/s)                  -0.09746722131967545
velocity.East (m/s)                   0.08936696499586105
velocity.Down (m/s)                   0.02516745217144489
vehicle.roll (-180, 180]              -0.005625263853925954
vehicle.pitch (-90, 90]               -0.9466927912314657
vehicle.yaw [0, 360)                  358.045879831607
vehicle.WanderAngle (-180, 180]       -7.73908797043965e-05
vehicle.TrackAngle [0, 360])          137.4825286865234
vehicle.Speed (m/s)                   0.1322358250617981
vehicle.angularRateAbout.Longitudinal -0.001131730852648616
vehicle.angularRateAbout.Transverse   0.2598911225795746
vehicle.angularRateAbout.Down         -0.03723606094717979
vehicle.accelation.Longitudinal       -0.2083656936883926
vehicle.accelation.Transverse         0.122549943625927
vehicle.accelation.Down (m/s^2)       0.1446369141340256

alt:185.1183599689319
yes, d1 == d as expected.
solution 3 is correct.
solution 4 is correct.
> The 4th solution crashed on "Bus error" -- does anyone know why?

The bus error in this case is almost certainly due to unaligned access to data.

The line below in bold probably violates the alignment requirements for the type double on ARM processors.
1
2
3
4
 d.td.distance                           = *reinterpret_cast<double *>(source+i); i += 8;\
 d.td.timeType                           =                             source[i]; i += 1;\
 d.td.distType                           =                             source[i]; i += 1;\
 d.lat                                   = *reinterpret_cast<double *>(source+i); i += 8;\ 
Last edited on
As I posted previously:
 
char *source = new char [size + 1];

where size=132, and in your pointed place (bold-highlighted) i=26, which is in range. How can the cast "violates the alignment requirements for the type double on ARM processors." ? Please explain in detail. Thanks.
I don't understand which field you said "Using that junk as a pointer explains the bus error." ?

Sorry, I'm talking nonsense. I was under the impression the buffer source overlaid a Data1 struct.
https://www.heyrick.co.uk/armwiki/Unaligned_data_access#ARM_v7
Perhaps the compiler is generating an LDM instruction to load the double, maybe? Playing around with Godbolt, GCC seems to use LDR, not LDM.
The alignment requirements of a type place restrictions on the locations in storage (the address) at which objects of that type can be (optimally) placed.
https://msdn.microsoft.com/en-us/library/ms253949.aspx
https://en.wikipedia.org/wiki/Data_structure_alignment

Certain hardware platforms (eg. x86 and x64) support unaligned access to memory (at a performance cost),
while others do not.

Some operating system kernels (eg. windows) automatically trap and handle misaligned data access at the cost of severe run-time overhead.
"Floating-point loads and stores should be aligned. The kernel handles unaligned loads and stores transparently, but with significant overhead."
https://docs.microsoft.com/en-us/cpp/build/overview-of-arm-abi-conventions

Others do not, and when access to misaligned data occurs, the bus error is propagated to the offending program.
https://en.wikipedia.org/wiki/Bus_error#Unaligned_access


Note: If the GNU compiler (ARM) is being used, the option -mfloat-abi=soft may get rid of this error
(at a significant performance cost).
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#ARM-Options
Last edited on
Huh. I guess #2 is the only viable solution, then. Otherwise most double members of the struct will be inaccessible (short of memcpy()ing into a temporary).
A temporary buffer is not required as the types involved are TriviallyCopyable types.

Something along these lines:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#include <cstring>
#include <type_traits>
#include <memory>

// copy sizeof(dest) bytes from srce to dest;
// update srce to point to the byte after the last byte that was copied
// invariant: the bytes being copied do not overlap
template < typename T > void copy_bytes( T& dest, const char*& srce )
{
    static_assert( std::is_trivially_copyable<T>::value,
                   "objects must of TriviallyCopyable types for a byte by byte copy" ) ;

    std::memcpy( std::addressof(dest), srce, sizeof(dest) ) ;
    srce += sizeof(dest) ;
}


And then:
1
2
3
4
5
6
7
8
copy_bytes( d.td.time1, source ) ;
copy_bytes( d.td.time2, source ) ;
copy_bytes( d.td.distance, source ) ;
copy_bytes( d.td.timeType, source ) ;
copy_bytes( d.td.distType, source ) ;
copy_bytes( d.td.lat, source ) ;
copy_bytes( d.td.lng, source ) ;
// etc. 
Thanks a lot to JLBorges, helios and mbozzi providing links to learn.

The code targets MitySOM-5CSx (Altera Cyclone V FPGA with an ARMv7 Cortex-A9 CPU Dual Core)

root@mitysom-5csx:~# uname -a
Linux mitysom-5csx 3.16.0+ #6 SMP Thu Aug 24 11:59:16 EDT 2017 armv7l GNU/Linux

The binary is built from Ubuntu host using:
arm-linux-gnueabihf-g++ -DHAVE_CONFIG_H -I. -g -O2 -MT main.o -MD -MP -MF $depbase.Tpo -c -o main.o main.cpp &&\
mv -f $depbase.Tpo $depbase.Po
/bin/bash ./libtool --tag=CXX --mode=link arm-linux-gnueabihf-g++ -g -O2 -o poslv5 main.o
libtool: link: arm-linux-gnueabihf-g++ -g -O2 -o poslv5 main.o

These parameters are copied from an existing project, which I am not sure correct or not. I sent out an email to the board manufacture ....
If it is not C++11, the copy bytes function would look like:
1
2
3
4
5
6
7
8
9
10
11
#include <cstring>

// copy sizeof(dest) bytes from srce to dest;
// update srce to point to the byte after the last byte that was copied
// invariant: the type T is a POD type
// invariant: the bytes being copied do not overlap
template < typename T > void copy_bytes( T& dest, const char*& srce )
{
    std::memcpy( &dest, srce, sizeof(dest) ) ;
    srce += sizeof(dest) ;
}
Thank you, JLBorges. Your suggestion is tested in solution 5 and 6.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
inline void cpy(unsigned short &target, char *source) {memcpy((char *)(&target), source, 2);}
inline void cpy(unsigned long  &target, char *source) {memcpy((char *)(&target), source, 4);}
inline void cpy(double         &target, char *source) {memcpy((char *)(&target), source, 8);}
inline void cpy(float          &target, char *source) {memcpy((char *)(&target), source, 4);}

#define FIELDS_CPY                                        int i = 0;\
    cpy(d.td.time1                              , source+i); i += 8;\
    cpy(d.td.time2                              , source+i); i += 8;\
    cpy(d.td.distance                           , source+i); i += 8;\
        d.td.timeType                           = source[i]; i += 1;\
        d.td.distType                           = source[i]; i += 1;\
    cpy(d.lat                                   , source+i); i += 8;\
    cpy(d.lng                                   , source+i); i += 8;\
    cpy(d.alt                                   , source+i); i += 8;\
    cpy(d.velocity.North                        , source+i); i += 4;\
    cpy(d.velocity.East                         , source+i); i += 4;\
    cpy(d.velocity.Down                         , source+i); i += 4;\
    cpy(d.vehicle.roll                          , source+i); i += 8;\
    cpy(d.vehicle.pitch                         , source+i); i += 8;\
    cpy(d.vehicle.yaw                           , source+i); i += 8;\
    cpy(d.vehicle.WanderAngle                   , source+i); i += 8;\
    cpy(d.vehicle.TrackAngle                    , source+i); i += 4;\
    cpy(d.vehicle.Speed                         , source+i); i += 4;\
    cpy(d.vehicle.angularRateAbout.Longitudinal , source+i); i += 4;\
    cpy(d.vehicle.angularRateAbout.Transverse   , source+i); i += 4;\
    cpy(d.vehicle.angularRateAbout.Down         , source+i); i += 4;\
    cpy(d.vehicle.acceleration.Longitudinal     , source+i); i += 4;\
    cpy(d.vehicle.acceleration.Transverse       , source+i); i += 4;\
    cpy(d.vehicle.acceleration.Down             , source+i); i += 4;\
        d.AlignmentStatus                       = source[i];

void solution5(char *source, Data &d) {
    FIELDS_CPY
}

void solution6(char *source, Data1 &d) {
    FIELDS_CPY
}

Here is output from ARMv7 processors:
1
2
3
4
5
6
7
8
9
10
11
12
alt:185.1183599689319
solution 1 is correct.
solution 2 is correct.
solution 3 is correct.
solution 5 is correct.
solution 6 is correct.
looping N = 99999999	
1: alt: 185.118360044982	Time elapsed 10264
2: alt: 185.118360044982	Time elapsed 13894
3: alt: 185.118360044982	Time elapsed 29666
5: alt: 185.118360044982	Time elapsed 29413
6: alt: 185.118360044982	Time elapsed 18087

where solution5 is 29413/18087=1.6 time slower than solution6, which reflects penalty from #pragma pack on Data struct.
Last edited on
On Ubuntu, solution 2 is the fastest.
1
2
3
4
5
6
7
looping N = 999999990 (10 times more than previous N)
1: alt: 185.1183579319254306	Time elapsed 7834
2: alt: 185.1183579319254306	Time elapsed 5213
3: alt: 185.1183579319254306	Time elapsed 7587
4: alt: 185.1183579319254306	Time elapsed 6501
5: alt: 185.1183579319254306	Time elapsed 7634
6: alt: 185.1183579319254306	Time elapsed 6408

You could simplify your code a bit:
1
2
3
4
5
template <typename T>
void cpy(T &target, const char *&source) {
    memcpy(&target, source, sizeof(T));
    source += sizeof(T);
}
Now you don't need the increments in the macro.
Thanks helios. That's better.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
// improved by helios
template <typename T>
void cpy(T &target, const char *&source) {
    memcpy(&target, source, sizeof(T));
    source += sizeof(T);
}
#define FIELDS_CPY                                     \
    cpy(d.td.time1                             , src); \
    cpy(d.td.time2                             , src); \
    cpy(d.td.distance                          , src); \
    cpy(d.td.timeType                          , src); \
    cpy(d.td.distType                          , src); \
    cpy(d.lat                                  , src); \
    cpy(d.lng                                  , src); \
    cpy(d.alt                                  , src); \
    cpy(d.velocity.North                       , src); \
    cpy(d.velocity.East                        , src); \
    cpy(d.velocity.Down                        , src); \
    cpy(d.vehicle.roll                         , src); \
    cpy(d.vehicle.pitch                        , src); \
    cpy(d.vehicle.yaw                          , src); \
    cpy(d.vehicle.WanderAngle                  , src); \
    cpy(d.vehicle.TrackAngle                   , src); \
    cpy(d.vehicle.Speed                        , src); \
    cpy(d.vehicle.angularRateAbout.Longitudinal, src); \
    cpy(d.vehicle.angularRateAbout.Transverse  , src); \
    cpy(d.vehicle.angularRateAbout.Down        , src); \
    cpy(d.vehicle.acceleration.Longitudinal    , src); \
    cpy(d.vehicle.acceleration.Transverse      , src); \
    cpy(d.vehicle.acceleration.Down            , src); \
    cpy(d.AlignmentStatus                      , src)

void solution5(char *source, Data &d) {
    const char *src = source;
    FIELDS_CPY;
}

void solution6(char *source, Data1 &d) {
    const char *src = source;
    FIELDS_CPY;
}
Benchmark on Windows:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
alt:185.1183599689319
solution 1 is correct.
solution 2 is correct.
solution 3 is correct.
solution 4 is correct.
solution 5 is correct.
solution 6 is correct.
looping N = 9999999900
1: alt: 185.1183911535855	Time elapsed 7120
2: alt: 185.1183911535855	Time elapsed 7094
3: alt: 185.1183911535855	Time elapsed 7127
4: alt: 185.1183911535855	Time elapsed 7064
5: alt: 185.1183911535855	Time elapsed 7290
6: alt: 185.1183911535855	Time elapsed 7114

Two conclusions:
1) All solutions are equally efficient on Windows.
2) Intel i7-4790K CPU is 140 times faster than ARMv7 processor. (hard to believe this embedded unit is sooooo slow)
Last edited on
I wonder what the instructions/Joule and MIPS/USD ratings are for each one.
The clock difference is: Intel i7-4790K @4GHz, while ARMv7 @40MHz.
I was told "The ARM does not have a double precision floating point hard unit, only single precision (float). So double precision math is going to be emulated in software on the ARM and will be quite slow. "
1
2
3
4
5
                                          Typical I7   Cortex-A9
Double Precision hardware acceleration :  yes          no
Power Consumption                       : ~60 Watts    1 Watt
DDR memory bandwidth                    : ~16 GB/sec   0.80 GB/sec
Data Cache                              : ~8192 KB     32 KB 
this is normal lol. Many embedded processors are on par with 40 year old hardware. At least yours HAS a FPU.

you don't need a FPU at all to store doubles. If you do math on them, you will need to handle the problem, but just shuffling the bytes, you should not. 40 MHz is actually still quite powerful, that is still 40 million instructions per second. You can do an awful lot with that.

your conclusions may be a little faulty. In practice they may be all the same on windows, but it may also be that the task is so trivial to the better hardware that any inefficiency is still taking so little time as to be masked. Its likely one of the solutions is better, but the difference is so small it may even be difficult to even measure. Its not worth worrying about, its not your target system. Its also probably MORE than 140x faster due to other things that happen in modern hardware.

Lets ask the critical questions now.
- is it fast enough on the real hardware?
- if not, how fast does it need to be??

calling memcpy multiple times like this seems wrong to me. Can it not be done with a single call?
Last edited on
My initial thought of this testing is to find an efficient way of process GPS data, because it comes at high frequency (200 Hz). I treat them as black boxes, and wish to find which solution runs faster.

On Windows, six solutions give no difference, I perfer the simplest solution 1: one memcpy from receive data to a packed struct.
On Ubuntu, solution 2 has advantage: 1.5 times faster than solution 1.
On ARMv7, solution 1 is the simplest, the best. (If solution 4 crashes on reinterpret_cast, why solution 3 work?)

My real harware is MitySOM-5CSx (Altera Cyclone V FPGA with an ARMv7 Cortex-A9 CPU Dual Core) which is, actually, fast enough.
The GPS sends UDP packets at 200 Hz, and the hardware can process all of them without losing any.
Problem solved. Thanks a lot.
Last edited on
Topic archived. No new replies allowed.
Pages: 12