13

When I try to store and load 256bits to and from an AVX2 256bit vector, I'm not receiving expected output in release mode.

use std::arch::x86_64::*;

fn main() {
    let key = [1u64, 2, 3, 4];
    let avxreg = unsafe { _mm256_load_si256(key.as_ptr() as *const __m256i) };
    let mut back_key = [0u64; 4];
    unsafe { _mm256_storeu_si256(back_key.as_mut_ptr() as *mut __m256i, avxreg) };
    println!("back_key: {:?}", back_key);
}

playground

In debug mode:

back_key: [1, 2, 3, 4]

In release mode:

back_key: [1, 2, 0, 0]

The back half either isn't being loaded or stored and I can't figure out which.

What's weird is targeting a native CPU works. In release mode + RUSTFLAGS="-C target-cpu=native"

back_key: [1, 2, 3, 4]

I've even tried to rid myself of Clippy errors by forcing alignment to no avail (I'm not sure if the code below is even considered more correct).

use std::arch::x86_64::*;

#[repr(align(256))]
#[derive(Debug)]
struct Key([u64; 4]);

fn main() {
    let key = Key([1u64, 2, 3, 4]);
    let avxreg = unsafe { _mm256_load_si256(&key as *const _ as *const __m256i) };
    let mut back_key = Key([0u64; 4]);
    unsafe { _mm256_storeu_si256((&mut back_key) as *mut _ as *mut __m256i, avxreg) };
    println!("back_key: {:?}", back_key);
}
  1. Why is this happening?
  2. Is there a fix for this specific use case?
  3. Can this fix be generalized for user input (e.g.: if I wanted to take a byte slice as user input and do the same procedure)
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Nick Babcock
  • 6,111
  • 3
  • 27
  • 43
  • Are you sure an array of 4 i64 can be considered as "256-bit aligned memory location pointed to by *a" as far as I know 4 i64 are i64 aligned – Stargateur Sep 20 '18 at 21:45
  • Even if alignment was a problem, the symptom of that would be a crash not incorrect output (`vmovaps` with an unaligned address generates a fault) – harold Sep 20 '18 at 21:48
  • look like a bug in LLVM – Stargateur Sep 20 '18 at 22:00
  • 1
    `println!("{:?}", avxreg);` allow to say that `load` is already the problem and use `_mm256_loadu_si256` fix it, but still the store still make wrong output – Stargateur Sep 20 '18 at 22:11
  • 2
    Reading the [docs more closely](https://doc.rust-lang.org/nightly/std/arch/), it looks like I should extract this into another function and use `#[target_feature(enable = "avx2")]`, [which works](https://play.rust-lang.org/?gist=a441cd23181627272af120b51e23e308&version=stable&mode=release&edition=2015). So I think that answers questions 2, 3, but idk about 1 – Nick Babcock Sep 20 '18 at 22:41
  • Oh nice one, well you can auto answer you ;), and for the 1, ask about undefined behavior is useless as anything can (and should) happen. – Stargateur Sep 20 '18 at 23:05

1 Answers1

3

After more thoroughly reading the docs, it became clear that I had to extract the body into another function and force that function to be compiled with AVX2 by annotating it with

#[target_feature(enable = "avx2")]

Or compile the entire program with

RUSTFLAGS="-C target-feature=+avx2" cargo run --release

The first option is better because it guarantees that the SIMD instructions used in a function are compiled appropriately, it's just on the caller to check their CPU has those capabilities before calling with is_x86_feature_detected!("avx2"). All this is documented, but it would be amazing if the compiler could warn with "hey, this function uses AVX2 instructions, but was not annotated with #[target_feature(enable = "avx2")] and the program was not compiled with AVX2 enabled globally, so calling this function is undefined behavior". It would have saved me a lot of headache!

Since relying on undefined behavior is bad, our initial program on the playground should be written as:

use std::arch::x86_64::*;

fn main() {
    unsafe { run() }
}

#[target_feature(enable = "avx2")]
unsafe fn run() {
    let key = [1u64, 2, 3, 4];
    let avxreg = _mm256_load_si256(key.as_ptr() as *const __m256i);
    let mut back_key = [0u64; 4];
    _mm256_storeu_si256(back_key.as_mut_ptr() as *mut __m256i, avxreg);
    println!("back_key: {:?}", back_key);
}

Some notes:

  1. main cannot be unsafe and thus can't be annotated with target_feature, so it is necessary to extract into another function
  2. This still assumes the x86_64 CPU running the code has avx capabilities, so make sure you check before calling
  3. It's not worth looking into why the debug version gives correct results, as running it under release on my home computer also gives correct results (under certain incantations). Looking at assembly shows that LLVM optimized one way or the other, but it is not particularly insightful.
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Nick Babcock
  • 6,111
  • 3
  • 27
  • 43
  • 1
    My suspicion here is that you might be able to be a bit more specific. In particular, something like, "you cannot call a function whose ABI depends on AVX registers from a function that is itself not compiled for AVX." So in your case, `main` is not compiled with AVX, but you're calling a routine where a `__m256i` appears in the function signature. You're modified code no longer does this, since a AVX vector does not appear in the type of `run`. – BurntSushi5 Sep 21 '18 at 14:48