
The parallel computation of the LCS follows the wavefront pattern. Here is the parallel function, which runs slower than the serial implementation. (I suspect the number of diagonals traversed in the parallel version versus the number of rows in the serial version has something to do with it.)

void parallelLCS(char * sequence_a, char * sequence_b, size_t size_a, size_t size_b) {
double start, end;

int ** dp_table = new int*[size_a + 1];

for (int i = 0; i <= size_a; i++)
    dp_table[i] = new int[size_b + 1];

for (int i = 1; i <= size_a; i++)
    dp_table[i][0] = 0;
for (int j = 0; j <= size_b; j++)
    dp_table[0][j] = 0;

int p_threads = 2;
int diagonals = size_a + size_b;

start = omp_get_wtime();
#pragma omp parallel num_threads(p_threads) default(none) firstprivate(p_threads,size_a,size_b,sequence_a,sequence_b) shared(dp_table,diagonals)
{
    for (int curr_diagonal = 1; curr_diagonal <= (diagonals - 1);) {   //process one anti-diagonal per iteration
        int j = omp_get_thread_num() + 1;   //column index
        int i = curr_diagonal - j + 1;      //row index
        for (; j <= curr_diagonal; j += p_threads, i = i - p_threads) {
            if (i <= size_a && j <= size_b) {
                if (sequence_a[i] == sequence_b[j]) {
                    dp_table[i][j] = dp_table[i - 1][j - 1] + 1;
                } else if (dp_table[i - 1][j] >= dp_table[i][j - 1]) {
                    dp_table[i][j] = dp_table[i - 1][j];
                } else {
                    dp_table[i][j] = dp_table[i][j - 1];
                }
            }
        }
        curr_diagonal++;
#pragma omp barrier   //all threads must finish this diagonal before the next one starts
    }
}
end = omp_get_wtime();

printf("\nParallel - Final answer: %d\n", dp_table[size_a][size_b]);
printf("Time: %f\n", end - start);

//Delete dp_table
for (int i = 0; i <= size_a; i++)
    delete [] dp_table[i];
delete [] dp_table;
}

And here is the serial function:

void serialLCS(char * sequence_a, char * sequence_b, size_t size_a, size_t size_b) {
double start, end;
int ** dp_table = new int*[size_a + 1];
for (int i = 0; i <= size_a; i++)
    dp_table[i] = new int[size_b + 1];

for (int i = 1; i <= size_a; i++)
    dp_table[i][0] = 0;
for (int j = 0; j <= size_b; j++)
    dp_table[0][j] = 0;

start = omp_get_wtime();
for (int i = 1; i <= size_a; i++) {
    for (int j = 1; j <= size_b; j++) {
        if (sequence_a[i] == sequence_b[j]) {
            dp_table[i][j] = dp_table[i - 1][j - 1] + 1;
        } else if (dp_table[i - 1][j] >= dp_table[i][j - 1]) {
            dp_table[i][j] = dp_table[i - 1][j];
        } else {
            dp_table[i][j] = dp_table[i][j - 1];
        }
    }
}
end = omp_get_wtime();
printf("\nSerial - Final answer: %d\n", dp_table[size_a][size_b]);
printf("Time: %f\n", end - start);

//Delete dp_table
for (int i = 0; i <= size_a; i++)
    delete [] dp_table[i];
delete [] dp_table;
}

...I thought I'd add the test function as well:

#include <cstdlib>
#include <stdio.h>

#include <omp.h>

void serialLCS(char * sequence_a, char * sequence_b, size_t size_a, size_t size_b);
void parallelLCS(char * sequence_a, char * sequence_b, size_t size_a, size_t size_b);

int main() {

size_t size_a;
size_t size_b;

printf("Enter size of sequence A: ");
scanf("%zd",&size_a);
printf("Enter size of sequence B: ");
scanf("%zd",&size_b);

//keep larger sequence in sequence_a
if (size_b > size_a) { size_t tmp = size_a; size_a = size_b; size_b = tmp; }

char * sequence_a = new char[size_a + 1];
char * sequence_b = new char[size_b + 1];
sequence_a[0] = sequence_b[0] = '0';

const size_t alphabet_size = 12;
char A[alphabet_size] = {'A', 'T', 'G', 'C', 'Q', 'W', 'E', 'R', 'Y', 'U', 'I', 'O'};
char AA[alphabet_size] = {'T', 'C', 'A', 'G', 'R', 'E', 'W', 'Q', 'O', 'I', 'U', 'Y'};

for (size_t i = 1; i <= size_a; i++) {   //indices 1..size_a are read by the LCS loops
    sequence_a[i] = A[rand() % alphabet_size];
}
for (size_t i = 1; i <= size_b; i++) {   //indices 1..size_b are read by the LCS loops
    sequence_b[i] = AA[rand() % alphabet_size];
}

serialLCS(sequence_a, sequence_b, size_a, size_b);
parallelLCS(sequence_a, sequence_b, size_a, size_b);

delete [] sequence_a;
delete [] sequence_b;

return 0;
}
frkvr
  • How much slower? It could be that if your input data is too small, the time taken to create the threads exceeds the time saved by computing in parallel. Try running with very large sequences and see if it is still slower. – Eamonn McEvoy Nov 14 '12 at 00:27
  • I have tried running it with sequences of 20000 characters, but the runtime is about the same. – frkvr Nov 14 '12 at 00:39
  • The parallel function seems to scale well, although I only have access to a quad core. – frkvr Nov 14 '12 at 00:45
  • @dreamcrash I am interested in actual runtimes: for two 10000-character sequences the serial version takes 1.8 s and the parallel version 7.3 s using 2 cores. – frkvr Nov 14 '12 at 01:46
  • @EamonnMcEvoy Sorry, I didn't tag you after the update. – frkvr Nov 14 '12 at 01:50
  • I just copy-pasted your code and it hasn't stopped executing yet (20 minutes and counting, lol). – dreamcrash Nov 14 '12 at 01:56
  • @dreamcrash Enter size of sequence A: 10000 Enter size of sequence B: 10000 Serial - Final answer: 4405 Time: 1.847498 Parallel - Final answer: 4405 Time: 7.258458 RUN SUCCESSFUL (total time: 23s) – frkvr Nov 14 '12 at 03:02

1 Answer


The problem is not with OpenMP but rather with the way you access the data in the parallel implementation. Even if you run the parallel version with one thread only, it is still about twice as slow.

Well, welcome to the world of non-cache-friendly data structures. Because of the diagonal dependence, you walk the matrix by diagonals, but you still store it the usual way. The data access pattern is then strongly non-linear and hence very cache-unfriendly. Observe the number of L1 and L2 cache misses when running your code in single-threaded mode on an old 16-core Xeon X7350 system:

[figure: L1 and L2 cache misses]

The green portion of the process timeline represents the serial part of your code. The orange part is the (inactive, because of the single-threaded execution) OpenMP parallel region. You can clearly see that the serial code is very cache friendly: not only is the number of L2 cache misses relatively low, but so is the number of L1 cache misses. In the parallel section of the code, however, because of the very large stride when walking the matrix diagonally, the cache is constantly being thrashed and the number of misses is sky-high.
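
To put rough numbers on that stride for the 10000-character test case, here is a back-of-the-envelope sketch. It assumes, for simplicity, a flat row-major table, 64-byte cache lines and a 32 KB L1 data cache; your table is actually an array of separately allocated rows, which is even less cache friendly:

// Back-of-the-envelope stride arithmetic (assumed sizes, not measured values).
#include <cstdio>

int main() {
    const long cols = 10001;        // size_b + 1 for 10000-character sequences
    const long elem = sizeof(int);  // bytes per dp_table cell
    const long line = 64;           // assumed cache-line size in bytes
    const long l1   = 32 * 1024;    // assumed L1 data cache size in bytes

    // Serial inner loop: (i, j) -> (i, j + 1), one element to the right.
    long serial_stride = 1 * elem;
    // Wavefront loop, one thread with p_threads = 2: (i, j) -> (i - 2, j + 2).
    long diagonal_stride = 2 * (cols - 1) * elem;
    // Each cell of a 10000-cell diagonal sits in its own cache line.
    long bytes_per_diagonal = 10000 * line;

    printf("serial stride: %ld B, diagonal stride: %ld B\n",
           serial_stride, diagonal_stride);
    printf("cache lines touched per diagonal: ~%ld KB (assumed L1: %ld KB)\n",
           bytes_per_diagonal / 1024, l1 / 1024);
    return 0;
}

With roughly 78 KB between a thread's consecutive writes and about 625 KB of distinct cache lines touched per diagonal, essentially every access misses L1.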

With two threads things get even worse. Elements from two adjacent diagonals that belong to the same matrix row are likely to fall into the same cache line. But one of the diagonals is processed by one thread and the other one by another thread. Hence your code runs into a huge amount of false sharing. Not to mention NUMA issues on modern multi-socket AMD64 or (post-)Nehalem systems.
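
False sharing on its own is easy to reproduce with a standalone toy like the one below. It is not part of the LCS code, and the 64-byte line size is an assumption; it simply contrasts two threads hammering counters that share one cache line with counters padded onto separate lines:

// Minimal false-sharing demonstration: shared cache line vs. padded counters.
#include <cstdio>
#include <omp.h>

struct Padded {
    volatile long value;
    char pad[64 - sizeof(long)];    // keep each counter on its own (assumed 64-byte) line
};

int main() {
    const long iters = 100000000L;
    volatile long same_line[2] = {0, 0};   // both counters fall into one cache line
    Padded separate[2] = {};               // one counter per cache line

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        for (long k = 0; k < iters; k++)
            same_line[tid] = same_line[tid] + 1;       // line ping-pongs between cores
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        for (long k = 0; k < iters; k++)
            separate[tid].value = separate[tid].value + 1;   // no line sharing
    }
    double t2 = omp_get_wtime();

    printf("shared line: %f s, padded: %f s\n", t1 - t0, t2 - t1);
    return 0;
}

The padded version usually runs several times faster; the same ping-pong of cache lines between cores happens when two threads write neighbouring dp_table cells of the same row.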

The solution is not simply to walk the matrix by its diagonals, but also to store the matrix in a skewed format, so that each diagonal occupies a contiguous section of memory.
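
A minimal sketch of such a skewed, diagonal-major layout could look like the following. The `SkewedTable` struct, its index arithmetic and the `lcsSkewed` driver are only illustrative names, and the sequences are assumed to be 1-indexed as in your code:

// Illustrative sketch of a diagonal-major ("skewed") DP table: every
// anti-diagonal i + j == d is stored as one contiguous block.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <omp.h>

struct SkewedTable {
    int rows, cols;
    std::vector<int> data;
    std::vector<int> diag_start;                 // offset of each diagonal in data

    SkewedTable(int r, int c)
        : rows(r), cols(c), data((size_t)r * c, 0), diag_start(r + c, 0) {
        for (int d = 1; d < rows + cols - 1; d++)
            diag_start[d] = diag_start[d - 1] + diagLen(d - 1);
    }
    int diagLen(int d) const {                   // number of cells on diagonal d
        return std::min(d, rows - 1) - std::max(0, d - (cols - 1)) + 1;
    }
    int& at(int i, int j) {                      // cell (i, j) lives on diagonal i + j
        int d = i + j;
        return data[diag_start[d] + i - std::max(0, d - (cols - 1))];
    }
};

void lcsSkewed(const char* a, const char* b, int size_a, int size_b) {
    SkewedTable dp(size_a + 1, size_b + 1);      // row 0 and column 0 stay 0

    for (int d = 2; d <= size_a + size_b; d++) { // wavefront over anti-diagonals
        int i_lo = std::max(1, d - size_b);
        int i_hi = std::min(size_a, d - 1);
        // Cells of one diagonal are adjacent in memory, so each thread now
        // works on a contiguous, cache-friendly chunk of the table.
        #pragma omp parallel for schedule(static)
        for (int i = i_lo; i <= i_hi; i++) {
            int j = d - i;
            if (a[i] == b[j])
                dp.at(i, j) = dp.at(i - 1, j - 1) + 1;
            else
                dp.at(i, j) = std::max(dp.at(i - 1, j), dp.at(i, j - 1));
        }
    }
    printf("Skewed - Final answer: %d\n", dp.at(size_a, size_b));
}

Consecutive cells of one diagonal are now adjacent in memory, so each thread reads and writes a few contiguous chunks per diagonal instead of one cache line per cell.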

Hristo Iliev
  • That makes a lot of sense, thanks... I guess it is a problem inherent to the traditional wavefront pattern, although I can now see ways it can easily be improved. – frkvr Nov 14 '12 at 11:46
  • Just out of curiosity, would you know why assigning processors to rows instead of columns doesn't improve the run-time? E.g. P1 always computes the cells of the diagonals that belong to the same row... – frkvr Nov 17 '12 at 18:14
  • Could you show some code? I can't imagine the case that you have described. – Hristo Iliev Nov 17 '12 at 19:21
  • `int i = curr_diagonal; int j = 1; //column index for (; j <= curr_diagonal; j++, i--) { d = i % p_threads; d = d == 0 ? p_threads : d; if (i <= size_a && j <= size_b && (d == tid)) { if (sequence_a[i] == sequence_b[j]) { dp_table[i][j] = dp_table[i - 1][j - 1] + 1; } else if (dp_table[i - 1][j] >= dp_table[i][j - 1]) { dp_table[i][j] = dp_table[i - 1][j]; } else { dp_table[i][j] = dp_table[i][j - 1]; } } }` – frkvr Nov 18 '12 at 06:37
  • Thread_ids are mapped to diagonals, and `&& (d == tid)` restricts threads to cells in the same row. The code runs; just replace the corresponding segment in the parallel section. – frkvr Nov 18 '12 at 06:44
  • Your inner loop still reads `j++, i--` so you are not accessing array elements in their storage order in memory, hence lots of cache misses. – Hristo Iliev Nov 18 '12 at 12:49
  • Okay, I see that, but I am still trying to figure out why there isn't a difference in L1 cache performance with the first code, since the cells processed by a core are located on the same row... BTW, what profiler are you using? I like the visual output. (I am using cachegrind.) – frkvr Nov 18 '12 at 17:06
  • VampirTrace + PAPI for tracing with hardware performance counters and then Vampir to display and analyse the trace. Note that you start with ~10000 elements in the diagonal and each element falls in a different cache line. This most certainly exceeds the size of the L1 cache. The original serial code walks cache lines sequentially and is thus much more cache friendly. – Hristo Iliev Nov 18 '12 at 17:19
  • Thanks a lot, it makes sense. I am getting VampirTrace and PAPI now! BTW, I've taken a different approach in the final program: dividing the table into diagonal tiles and processing the tiles with the serial code (a sketch of this tiling is included after these comments). I got very good speedups. – frkvr Nov 18 '12 at 17:26
  • Great! But keep in mind that Vampir (the GUI shown) is a commercially available tool, while the tracing library and PAPI are open-source projects. I guess you could also use [likwid](http://code.google.com/p/likwid/) or some other open-source tool to get the same information. – Hristo Iliev Nov 18 '12 at 21:00
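
The diagonal-tile approach mentioned in the last comment could look roughly like this: the table is split into square blocks, blocks on the same block-anti-diagonal are processed in parallel, and inside each block the plain serial loop runs over contiguous rows. The tile size, the flat row-major storage and the names (`TILE`, `tiledLCS`) are illustrative assumptions.

// Illustrative sketch of a tiled wavefront LCS. Sequences are assumed to be
// 1-indexed as in the question (index 0 unused).
#include <algorithm>
#include <cstdio>
#include <vector>
#include <omp.h>

const int TILE = 256;                            // tile edge length (tuning knob)

void tiledLCS(const char* a, const char* b, int size_a, int size_b) {
    int cols = size_b + 1;
    std::vector<int> dp((size_t)(size_a + 1) * cols, 0);   // row-major, zeroed
    auto at = [&](int i, int j) -> int& { return dp[(size_t)i * cols + j]; };

    int tiles_i = (size_a + TILE - 1) / TILE;
    int tiles_j = (size_b + TILE - 1) / TILE;

    // Wavefront over block-anti-diagonals: tiles on the same block diagonal
    // only depend on tiles from earlier diagonals, so they can run in parallel.
    for (int d = 0; d < tiles_i + tiles_j - 1; d++) {
        int bi_lo = std::max(0, d - (tiles_j - 1));
        int bi_hi = std::min(tiles_i - 1, d);
        #pragma omp parallel for schedule(dynamic)
        for (int bi = bi_lo; bi <= bi_hi; bi++) {
            int bj = d - bi;
            int i_hi = std::min(size_a, (bi + 1) * TILE);
            int j_hi = std::min(size_b, (bj + 1) * TILE);
            // Inside a tile: the plain, cache-friendly serial loop.
            for (int i = bi * TILE + 1; i <= i_hi; i++) {
                for (int j = bj * TILE + 1; j <= j_hi; j++) {
                    if (a[i] == b[j])
                        at(i, j) = at(i - 1, j - 1) + 1;
                    else
                        at(i, j) = std::max(at(i - 1, j), at(i, j - 1));
                }
            }
        }
    }
    printf("Tiled - Final answer: %d\n", at(size_a, size_b));
}

The tile size trades off parallelism (more tiles per block diagonal) against per-tile locality and synchronization overhead, so it is worth tuning per machine.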