When I try to load 256 bits into an AVX2 256-bit vector and store them back out, I don't get the expected output in release mode.
use std::arch::x86_64::*;

fn main() {
    let key = [1u64, 2, 3, 4];
    let avxreg = unsafe { _mm256_load_si256(key.as_ptr() as *const __m256i) };
    let mut back_key = [0u64; 4];
    unsafe { _mm256_storeu_si256(back_key.as_mut_ptr() as *mut __m256i, avxreg) };
    println!("back_key: {:?}", back_key);
}
In debug mode:
back_key: [1, 2, 3, 4]
In release mode:
back_key: [1, 2, 0, 0]
Either the upper half isn't being loaded or it isn't being stored, and I can't figure out which.
What's weird is that targeting the native CPU works. In release mode with RUSTFLAGS="-C target-cpu=native":
back_key: [1, 2, 3, 4]
I've even tried to get rid of the Clippy errors by forcing alignment, to no avail (I'm not sure whether the code below is even considered more correct):
use std::arch::x86_64::*;

#[repr(align(256))]
#[derive(Debug)]
struct Key([u64; 4]);

fn main() {
    let key = Key([1u64, 2, 3, 4]);
    let avxreg = unsafe { _mm256_load_si256(&key as *const _ as *const __m256i) };
    let mut back_key = Key([0u64; 4]);
    unsafe { _mm256_storeu_si256((&mut back_key) as *mut _ as *mut __m256i, avxreg) };
    println!("back_key: {:?}", back_key);
}
- Why is this happening?
- Is there a fix for this specific use case?
- Can this fix be generalized to user input (e.g., if I wanted to take a byte slice as user input and do the same procedure)?
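
For reference, here is a minimal sketch of the direction I'm considering for the byte-slice case. It rests on an assumption I haven't verified against this exact build: that the problem comes from calling AVX2 intrinsics in a function compiled without the `avx2` target feature, which would explain why `-C target-cpu=native` makes it go away. The names `roundtrip_avx2` and `roundtrip` are hypothetical helpers, not part of any library. The sketch puts the intrinsics in a function marked `#[target_feature(enable = "avx2")]`, checks for AVX2 at runtime before calling it, and uses the unaligned load so a plain byte slice needs no special alignment.

```rust
use std::convert::TryInto;

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical helper: round-trips 32 bytes through a 256-bit register.
/// Safety: the caller must ensure the CPU supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn roundtrip_avx2(input: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    // The `u` variants have no alignment requirement, unlike
    // `_mm256_load_si256`, which needs a 32-byte-aligned pointer.
    let reg = unsafe { _mm256_loadu_si256(input.as_ptr() as *const __m256i) };
    unsafe { _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, reg) };
    out
}

/// Hypothetical safe wrapper: returns `None` if fewer than 32 bytes are given.
fn roundtrip(input: &[u8]) -> Option<[u8; 32]> {
    let block: &[u8; 32] = input.get(..32)?.try_into().ok()?;

    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check, so the binary still runs on CPUs without AVX2.
        if is_x86_feature_detected!("avx2") {
            return Some(unsafe { roundtrip_avx2(block) });
        }
    }

    // Scalar fallback: a plain copy.
    Some(*block)
}

fn main() {
    let bytes: Vec<u8> = (0u8..32).collect();
    println!("back_key: {:?}", roundtrip(&bytes));
}
```

If this is the right direction, the same pattern should apply to the `[u64; 4]` version above: move the load and store into a `#[target_feature(enable = "avx2")]` function and guard the call with `is_x86_feature_detected!("avx2")`.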