You cannot have your cake and eat it too. Decide whether you want great parallel performance or whether it is important to see the output of the algorithm while the parallel loop is running.
The obvious offline solution is to store the plaintexts, keys, and ciphertexts in arrays. In your case that requires 119 MiB (= 650000*(3*4*16) bytes) in the original case and only 12 MiB in the case with 65000 trials. That is nothing a modern machine with GiBs of RAM cannot handle. The latter case even fits in the last-level cache of some server-class CPUs.
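The memory estimate above can be checked with a small helper. This is only a sketch: `storage_bytes` and `to_mib` are hypothetical names introduced here, and the factor of 4 in the formula corresponds to a 4-byte `int`, which is an assumption about your platform.

```c
#include <stddef.h>

/* Bytes needed for `arrays` buffers, each holding `trials` blocks of
   `elems` elements that are `elem_size` bytes wide. */
size_t storage_bytes(size_t trials, size_t arrays, size_t elems, size_t elem_size)
{
    return trials * arrays * elems * elem_size;
}

/* Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes). */
double to_mib(size_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

With three arrays of 16 `int` elements per trial, `storage_bytes(650000, 3, 16, sizeof(int))` comes to 124,800,000 bytes (about 119 MiB) when `int` is 4 bytes, and 12,480,000 bytes (about 11.9 MiB) for 65000 trials.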
// Needs <stdio.h>, <stdlib.h> and <omp.h>
#define TRIALS 65000

int (*key)[16];   // one 16-element block per trial
int (*pt)[16];
int (*ct)[16];
double timer;

key = malloc(TRIALS * sizeof(*key));
pt = malloc(TRIALS * sizeof(*pt));
ct = malloc(TRIALS * sizeof(*ct));

// Time only the parallel region
timer = -omp_get_wtime();
#pragma omp parallel for private(rnd,j)
for(i = 0; i < TRIALS; i++)
{
    ...
    for(j = 0; j < 4; j++)
    {
        // Split each random word into four bytes; plaintext mirrors the key
        key[i][4*j] = (rnd[j] & 0xff);
        pt[i][4*j] = key[i][4*j];
        key[i][4*j+1] = ((rnd[j] >> 8) & 0xff);
        pt[i][4*j+1] = key[i][4*j+1];
        key[i][4*j+2] = ((rnd[j] >> 16) & 0xff);
        pt[i][4*j+2] = key[i][4*j+2];
        key[i][4*j+3] = ((rnd[j] >> 24) & 0xff);
        pt[i][4*j+3] = key[i][4*j+3];
    }
    encrypt(key[i],pt[i],ct[i]);
}
timer += omp_get_wtime();
printf("Encryption took %.6f seconds\n", timer);

// Now display the results serially
for (i = 0; i < TRIALS; i++)
{
    // display pt[i], key[i] -> ct[i]
}

free(key); free(pt); free(ct);
To see the speed-up, you have to measure only the time spent in the parallel region. If you also measure the time it takes to display the results, you will be back to where you started.
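As a rough, self-contained illustration of timing only the parallel region, here is a sketch that can be compiled with or without OpenMP. Everything in it is an assumption made for the example: `toy_encrypt` is a hypothetical XOR stand-in for your real `encrypt`, the fill pattern replaces your random numbers, and `now`, `encrypt_all` and `print_first` are names invented here.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

#define BLOCK 16

/* Seconds from omp_get_wtime(); without OpenMP this falls back to
   clock(), which reports CPU time and is only a coarse substitute. */
double now(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical stand-in for the real cipher: XOR each byte with the key. */
void toy_encrypt(const int *key, const int *pt, int *ct)
{
    for (int k = 0; k < BLOCK; k++)
        ct[k] = (pt[k] ^ key[k]) & 0xff;
}

/* Fill and encrypt every trial in parallel.
   Returns the time spent in this region only; no output happens here. */
double encrypt_all(int trials, int (*key)[BLOCK], int (*pt)[BLOCK], int (*ct)[BLOCK])
{
    double t = -now();
    #pragma omp parallel for
    for (int i = 0; i < trials; i++) {
        for (int k = 0; k < BLOCK; k++) {
            key[i][k] = (i + k) & 0xff;      /* deterministic test pattern */
            pt[i][k]  = (i * 7 + k) & 0xff;
        }
        toy_encrypt(key[i], pt[i], ct[i]);
    }
    t += now();
    return t;
}

/* Serial display of the first `count` ciphertexts, kept outside the timed region. */
void print_first(int count, int (*ct)[BLOCK])
{
    for (int i = 0; i < count; i++) {
        for (int k = 0; k < BLOCK; k++)
            printf("%02x", (unsigned)ct[i][k]);
        printf("\n");
    }
}
```

Call `encrypt_all` on buffers allocated as in the snippet above, print the returned time, and only then call `print_first`. Building with `-fopenmp` (GCC or Clang) enables the pragma; without it the loop simply runs serially.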