
I have a datastream that outputs what are physically 64-bit values to a buffer. When the buffer reaches a certain level, it needs to be reformatted into consecutive 16-bit values. The real values never occupy more than 24 of the 64 bits of each value the datastream produces, so this amounts to truncating a 24-bit value to 16 bits and rearranging the buffer so the values are consecutive. I believe I have found the fastest way to do this, but I am not sure whether there are optimizations I could be missing or faster approaches provided by C++ standard utilities. Below is an MRE showing my reformatting function, along with a test harness that produces data like what I am encountering and times the reformatting.

#include <iostream>
#include <chrono>
#include <cstdint>  // uint8_t, uint16_t, uint64_t
#include <cstdlib>  // malloc, rand

int num_samples = 160000;

void fill_buffer(uint8_t** buffer){
  *buffer = (uint8_t*)malloc(num_samples * sizeof(uint64_t));
  for (int i = 0; i < num_samples; i += 8){
    (*buffer)[i] = rand() % 0xFF;
    (*buffer)[i + 1] = rand() % 0xFF;
    (*buffer)[i + 2] = rand() % 0xFF;
  }
}

void reformat_1(uint8_t* buf){
  uint64_t* p_8byte = (uint64_t*)buf;
  uint16_t* p_2byte = (uint16_t*)buf;

  for (int i = 0; i < num_samples; i++){
    p_2byte[i] = p_8byte[i] >> 8;
  }
}

int main(int argc, char const* argv[]){
  uint8_t* buffer = NULL;

  fill_buffer(&buffer);
  auto start = std::chrono::high_resolution_clock::now();
  reformat_1(buffer);
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << "Time taken by function one: " << duration.count() << " microseconds" << std::endl;

  return 0;
}

I'm also willing to hear feedback on my benchmarking setup. I find it interesting that with -O3 I get ~130 µs on my actual sample data read from a file, while with randomly generated data I see closer to 1800 µs, so this is apparently not a perfectly representative example.

One other thing I will note, which I would expect to work against my actual times (vs. synthetic) but apparently does not: while num_samples is a magic number here, in practice it is calculated and usually (but not always) constant, and not something the compiler could replace with a constant in order to unroll loops, etc. (I think).

Douglas B
  • Your `fill_buffer` function will leave a lot of elements in the array you create uninitialized. Those uninitialized elements will have indeterminate values. Using indeterminate values in any way in C++ leads to undefined behavior. – Some programmer dude Jun 28 '23 at 20:29
  • Besides that, your code is more C than C++. You shouldn't be using `malloc` in C++. You shouldn't need C-style casting in C++ (doing that is usually a sign that you're doing something wrong). You shouldn't be using pointers to emulate references, which C++ has natively. Why aren't you using standard C++ classes and algorithms? – Some programmer dude Jun 28 '23 at 20:30
  • This feels more like a codereview than trying to solve a specific problem – Ted Lyngmo Jun 28 '23 at 20:33
  • `malloc`? `rand`? C-style casts? `NULL` rather than `nullptr`? Double pointers? This code could use a serious "modern C++" update. – Jesper Juhl Jun 29 '23 at 01:28
  • @Someprogrammerdude good catch, I didn't know that was undefined, since technically I'm never accessing those elements directly, I would think? It has worked perfectly so far with no warnings or bugs, so I'll thank the compiler. I wonder if that's why the performance is slower with synthetic data, though; I could see the cast causing it to cache more data than is needed. I'm more worried overall about the function which reformats the buffer, but I'll edit it to zero out the buffer initially when I'm at a PC. To be clear, the buffer is never uninitialized like that in practice, just in this attempted MRE. – Douglas B Jun 29 '23 at 02:45
  • @Someprogrammerdude and to address the rest: partly because I'm only worried about the one function, which currently takes a pointer, and I can't change that easily. I can use more C++-idiomatic things in the module I'm working on if I can justify it, but the input and output aren't really negotiable, and unless it improves performance, it's not likely to be adopted. I appreciate the input, however. – Douglas B Jun 29 '23 at 02:52
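To illustrate the "standard algorithms" suggestion from the comments, the same truncation loop can be expressed as a `std::transform` over a 64-bit view of the buffer. This is only a sketch of the idiom (`reformat_transform` is a hypothetical name, not from the question); it keeps the pointer casts from the original, so it carries the same strict-aliasing and endianness caveats, and compilers typically generate comparable code for it:

```cpp
#include <algorithm>
#include <cstdint>

// Same compaction as reformat_1, written with std::transform.
// Source and destination overlap, but each 16-bit write lands at a lower
// address than the 64-bit element it came from, so sequential order is safe
// (the same property the original hand-written loop relies on).
void reformat_transform(uint8_t* buf, int num_samples) {
  const uint64_t* src = reinterpret_cast<const uint64_t*>(buf);
  uint16_t* dst = reinterpret_cast<uint16_t*>(buf);
  std::transform(src, src + num_samples, dst,
                 [](uint64_t v) { return static_cast<uint16_t>(v >> 8); });
}
```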

1 Answer


This micro-improvement increases performance by ~10%:

void reformat_2(uint8_t* buf){
  uint32_t* p_4byte = (uint32_t*)buf;  // read only the low 32 bits of each 64-bit sample
  uint16_t* p_2byte = (uint16_t*)buf;
  uint16_t* p_2end  = p_2byte + num_samples;

  while(p_2byte < p_2end){
    *p_2byte++ = *p_4byte >> 8;
    p_4byte += 2;                      // advance 8 bytes to the next sample
  }
}

To see clearer numbers, I increased the buffer size 100x, to 16M entries.
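A note on portability (not part of the original answer): the 32-bit load picks up the correct bytes only on a little-endian target, and the casts technically violate strict aliasing. A sketch of the same idea using `memcpy`, which compilers lower to the same single load/store, looks like this (`reformat_portable` is a hypothetical name; it still assumes little-endian byte order, as the original does):

```cpp
#include <cstdint>
#include <cstring>

// Same compaction as reformat_2, but with memcpy loads/stores so it does not
// rely on pointer-cast aliasing. Reads the low 32 bits of each 64-bit sample
// and keeps bytes 1..2 (little-endian assumed, as in the original).
void reformat_portable(uint8_t* buf, int num_samples) {
  for (int i = 0; i < num_samples; i++) {
    uint32_t lo;
    std::memcpy(&lo, buf + i * 8, sizeof lo);       // low half of sample i
    uint16_t out = static_cast<uint16_t>(lo >> 8);  // bytes 1..2
    std::memcpy(buf + i * 2, &out, sizeof out);     // write compacted value
  }
}
```

With optimizations on, the `memcpy` calls disappear and the generated loop is essentially identical to the cast-based version, so the safety is typically free.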

olegarch
  • Interesting, can you explain how you came to this/what reasoning would make you expect this to be faster? I will have to look at the assembly output but would appreciate some insight – Douglas B Jun 29 '23 at 15:07
  • 1. Merging together the index `p_2byte` and the loop counter. 2. Accessing RAM by 32-bit words, not 64-bit ones, which requires fewer bus cycles. – olegarch Jun 29 '23 at 17:18