OK given that you have to do pixel munging, let's look at your overall problem.
A medium image that is 30000x4000 pixels is 120M of image data for 8 bit gray and 240M of image data for 16 bit. So if you're looking at the data this way, you need to ask "is 30 minutes reasonable?" In order to do a 90 degree rotate, you are inducing a worst-case problem, memory-wise: you are touching every pixel in a single column in order to fill one row. If you work row-wise, at least you're not going to double the memory footprint.
So - 120M of pixels means that you're doing 120M reads and 120M writes, or 240M data accesses. This means that you are processing roughly 66,667 pixels per second, which I think is too slow. I think you should be processing at least half a million pixels per second, probably way more.
If this were me, I'd run my profiling tools and see where the bottlenecks are and cut them out.
Without knowing your exact structure and having to guess, I would do the following:
Attempt to use one contiguous block of memory for the source image
I'd prefer to see a rotate function like this:
void RotateColumn(int column, char *sourceImage, int bytesPerRow, int bytesPerPixel, int height, char *destRow)
{
    char *src = sourceImage + (bytesPerPixel * column);
    if (bytesPerPixel == 1) {
        for (int y = 0; y < height; y++) {
            *destRow++ = *src;
            src += bytesPerRow;
        }
    }
    else if (bytesPerPixel == 2) {
        for (int y = 0; y < height; y++) {
            *destRow++ = *src;
            *destRow++ = *(src + 1);
            src += bytesPerRow;
            // although I doubt it would be faster, you could try this:
            // *destRow++ = *src++;
            // *destRow++ = *src;
            // src += bytesPerRow - 1;
        }
    }
    else { /* error out */ }
}
I'm guessing that the inside of the loop will turn into maybe 8 instructions. On a 2GHz processor (let's say nominally 4 cycles per instruction, which is just a guess), that's about 32 cycles per pixel, so you should be able to rotate on the order of 60 million pixels in a second. Roughly.
If you can't do contiguous, work on multiple dest scanlines at once.
If the source image is broken into blocks or you have a scanline abstraction of memory, what you do is get a scanline from the source image and rotate, say, a few dozen columns at once into a buffer of dest scanlines.
Let's assume that you have a mechanism for accessing scanlines abstractly, wherein you can acquire and release and write to scanlines.
Then what you're going to do is figure out how many source columns you're willing to process at once, because your code will look something like this:
void RotateNColumns(Pixels &source, Pixels &dest, int startColumn, int nCols)
{
    std::vector<PixelRow *> rows(nCols);
    for (int i = 0; i < nCols; i++)
        rows[i] = &dest.AcquireRow(i + startColumn);

    for (int y = 0; y < source.Height(); y++) {
        PixelRow &srcRow = source.AcquireRow(y);
        for (int i = 0; i < nCols; i++) {
            // CopyPixel(int srcX, PixelRow &destRow, int dstX, int nPixels);
            srcRow.CopyPixel(startColumn + i, *rows[i], y, 1);
        }
        source.ReleaseRow(srcRow);
    }

    for (int i = 0; i < nCols; i++)
        dest.ReleaseAndWrite(*rows[i]);
}
In this case, if you buffer up your source pixels in large-ish blocks of scanlines, you're not necessarily fragmenting your heap and you have the choice of possibly flushing decoded rows out to disk. You process n columns at a time and your memory locality should improve by a factor of n. Then it becomes a question of how expensive your caching is.
Can the problem be solved with parallel processing?
Honestly, I think your problem should be IO bound, not CPU bound. I'd think that your decoding time will dominate, but let's pretend it doesn't, for grins.
Think about it this way - if you read the source image a whole row at a time, you could toss that decoded row to a thread that will write it into the appropriate column of the destination image. So write your decoder so that it has a method like OnRowDecoded(byte *row, int y, int width, int bytesPerPixel). Then you're rotating while you're decoding: OnRowDecoded() packs up the information and hands it to a thread that owns the dest image and writes the entire decoded row into the correct dest column. That thread does all the writing to the dest while the main thread is busy decoding the next row. Likely the worker thread will finish first, but maybe not.
You will need to make your SetPixel() on the dest thread safe, but other than that, there's no reason this should be a serial task. In fact, if your source images use the TIFF feature of being divided up into strips or tiles, you can and should be decoding them in parallel.