SIMD: implement _mm256_max_epu64_

Hi.
I have a simple question about SIMD.
My CPU doesn't support AVX-512, but I would still like to use something like _mm256_max_epu64.
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_max_epu64&expand=730,4929,3837,624,776,3705,776,3705,1011,931,3705,3616,3582,3591

How can we implement one?
inline __m256i __my_mm256_min_epu64_(__m256i a, __m256i b) {
  uint64_t *val_a = (uint64_t*) &a;
  uint64_t *val_b = (uint64_t*) &b;
  uint64_t e[4];
  for (size_t i = 0; i < 4; ++i) e[i] = (*(val_a + i) < *(val_b + i)) ? *(val_a + i) : *(val_b + i);
  return _mm256_set_epi64x(e[3], e[2], e[1], e[0]);
}

I am the OP, and this trivial version is what I came up with.
Do you have any ideas for improving it?
It's unrelated to the original question, but I urge you to use array syntax at line 5:
for (size_t i = 0; i < 4; ++i) e[i] = (val_a[i] < val_b[i]) ? val_a[i] : val_b[i];
or even better:
for (size_t i = 0; i < 4; ++i) e[i] = std::min(val_a[i], val_b[i]);
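For completeness, here is a minimal sketch of the whole function with that change applied. It also swaps the pointer casts for explicit _mm256_storeu_si256 / _mm256_loadu_si256 calls, which sidesteps any aliasing questions. The names my_mm256_min_epu64_scalar and min_epu64_arrays are made up for this example, and the MY_AVX_TARGET macro is only a GCC/Clang convenience (an assumption on my part) so the snippet builds without -mavx.

```cpp
#include <immintrin.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper macro: lets GCC/Clang build this without -mavx.
#if defined(__GNUC__) && !defined(__AVX__)
#define MY_AVX_TARGET __attribute__((target("avx")))
#else
#define MY_AVX_TARGET
#endif

// Scalar fallback: per-lane unsigned 64-bit minimum of a and b.
MY_AVX_TARGET inline __m256i my_mm256_min_epu64_scalar(__m256i a, __m256i b) {
  std::uint64_t val_a[4];
  std::uint64_t val_b[4];
  std::uint64_t e[4];
  // Copy the lanes out to plain arrays instead of casting pointers.
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(val_a), a);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(val_b), b);
  for (std::size_t i = 0; i < 4; ++i) e[i] = std::min(val_a[i], val_b[i]);
  return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(e));
}

// Hypothetical array wrapper, handy for testing: out[i] = min(a[i], b[i]).
MY_AVX_TARGET inline void min_epu64_arrays(const std::uint64_t* a,
                                           const std::uint64_t* b,
                                           std::uint64_t* out) {
  const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
  const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                      my_mm256_min_epu64_scalar(va, vb));
}
```

Swapping std::min for std::max gives the max variant asked about in the title.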
Hi dhayden.
Thanks for your reply. Do these versions perform differently?
Do these versions perform differently?
Probably not. And if they do, I doubt the difference is worth optimizing.

1. Write code that is clear.
2. If it's too slow, profile it to find the actual bottlenecks.
3. Optimize the bottlenecks.
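
If profiling does show that the scalar loop is a bottleneck, one commonly used AVX2-only approach is sketched below. AVX2 has a signed 64-bit compare (_mm256_cmpgt_epi64) but no unsigned one, so the sketch flips the top bit of every lane before comparing and then selects lanes with _mm256_blendv_epi8. The names my_mm256_max_epu64 and max_epu64_arrays are hypothetical, and MY_AVX2_TARGET is just a GCC/Clang convenience (my assumption) so the snippet builds without -mavx2. Treat it as a starting point, not a tuned implementation.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper macro: lets GCC/Clang build this without -mavx2.
#if defined(__GNUC__) && !defined(__AVX2__)
#define MY_AVX2_TARGET __attribute__((target("avx2")))
#else
#define MY_AVX2_TARGET
#endif

// Per-lane unsigned 64-bit maximum using only AVX2 instructions.
MY_AVX2_TARGET inline __m256i my_mm256_max_epu64(__m256i a, __m256i b) {
  // XOR with 0x8000000000000000 maps unsigned order onto signed order.
  const __m256i sign = _mm256_set1_epi64x(INT64_MIN);
  const __m256i a_gt_b = _mm256_cmpgt_epi64(_mm256_xor_si256(a, sign),
                                            _mm256_xor_si256(b, sign));
  // blendv takes the second operand where the mask is set: a if a > b, else b.
  return _mm256_blendv_epi8(b, a, a_gt_b);
}

// Hypothetical array wrapper, handy for testing: out[i] = max(a[i], b[i]).
MY_AVX2_TARGET inline void max_epu64_arrays(const std::uint64_t* a,
                                            const std::uint64_t* b,
                                            std::uint64_t* out) {
  const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
  const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                      my_mm256_max_epu64(va, vb));
}
```

For the min variant, swap the first two arguments of the blend, i.e. _mm256_blendv_epi8(a, b, a_gt_b). Whether this actually beats the scalar loop depends on the surrounding code, so measure before committing to it.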