
I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The easiest way to do this would be to simply read in the whole file as a list of arrays and use np.median(arrays, axis=0). However, I don't have much RAM to work with, so this is not a good option.

So I tried to stack them pixel by pixel: I focus on one pixel position (i, j) at a time, read in each array one by one, and append the value at that position to a list. Once all the values for a given position across all arrays are collected, I take np.median of them and save that single value in a list -- which in the end holds the median for every pixel position. Finally I can just reshape this to 600x600, and I'll be done. The code for this is below.

import pickle
import time
import numpy as np

filename = 'images.dat' #contains my 50,000 2D numpy arrays

def stack_by_pixel(i, j):
    pixels_at_position = []
    with open(filename, 'rb') as f:
        while True:
            try:
                # Gather pixels at a given position
                array = pickle.load(f)
                pixels_at_position.append(array[i][j])
            except EOFError:
                break
    # Stacking at position (median)
    stacked_at_position = np.median(np.array(pixels_at_position))
    return stacked_at_position

# Form whole stacked image
stacked = []
for i in range(600):
    for j in range(600):
        t1 = time.time()
        stacked.append(stack_by_pixel(i, j))
        t2 = time.time()
        print('Done with element %d, %d: %f seconds' % (i, j, (t2-t1)))

stacked_image = np.reshape(stacked, (600,600))

After seeing some of the time printouts, I realize that this is wildly inefficient. Each (i, j) position takes about 150 seconds to complete, which is not surprising since it reads about 50,000 arrays one by one. And given that there are 360,000 (i, j) positions in my large arrays, this is projected to take 22 months to finish! Obviously this isn't feasible. But I'm sort of at a loss, because there's not enough RAM available to read in the whole file. Or maybe I could collect all the pixel positions at once (a separate list for each position) as the arrays are read one by one, but wouldn't keeping 360,000 lists (each about 50,000 elements long) in Python use a lot of RAM as well?

Any suggestions are welcome for how I could make this run significantly faster without using a lot of RAM. Thanks!

curious_cosmo
  • Maybe I don't properly understand the problem, but isn't the median simply the middle value? So if you read all arrays one by one and use one array to keep the minimum for each pixel and one array to keep the maximum for each pixel, and then in the end compute the average of the two, you should have a good approximation of the median. So at any time only three arrays are in memory (the currently scanned array, the array of minimum values, and the array of maximum values). In the end you replace the currently scanned array by the array of median values. Median = (minimum + maximum) / 2 (roughly) – Jacques de Hooge Aug 30 '18 at 17:11
  • That is an interesting idea, but I do need to calculate the median pixels exactly in my case, which won't necessarily be the average of the minimum and maximum values at a given position. – curious_cosmo Aug 30 '18 at 17:15
  • In that case the above won't work. You may have enough memory to use subarrays of say 1000 pixels, instead of working per pixel. Some other strategies are discussed at https://stackoverflow.com/questions/3372006/incremental-median-computation-with-max-memory-efficiency – Jacques de Hooge Aug 30 '18 at 17:23
  • Do some preprocessing first to turn it into something you can easily load and then efficiently process using numpy. I'd try a scheme where each file contains the nth row from each image -- i.e. each file becomes an array with 600 columns and 50000 rows. You should be able to generate those iteratively, so memory shouldn't be much of an issue. Now process those files one at a time, and then stack the resulting median rows back into a full image. | Your main problem right now is Python interpreter overhead. – Dan Mašek Aug 30 '18 at 18:34

2 Answers


This is a perfect use case for numpy's memory-mapped arrays. Memory-mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as

arr = np.load('filename', mmap_mode='r')

For the most part you can treat this as any other array. Array elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*; it still seems to load a substantial portion of the data into memory at once. So median(arr, 0) may not work.

However, you can still loop over each index and calculate the median without running into memory issues.

[[np.median([arr[k][i][j] for k in range(50000)]) for j in range(600)] for i in range(600)]

where 50,000 reflects the total number of arrays.

Without the overhead of unpickling all 50,000 arrays just to extract a single pixel, the run time should be much quicker (by a factor of roughly 360,000).
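
If the per-pixel Python loop is still too slow, a possible middle ground (my own sketch, not part of the original answer) is to pull one image row at a time from the memory-mapped array: each arr[:, i, :] slice holds one pixel row from all 50,000 images (roughly 115 MB as float32), so memory stays modest while numpy does the median work. The file name 'stack.npy' is a placeholder for whatever path you pass to open_memmap below.

import numpy as np

arr = np.load('stack.npy', mmap_mode='r')   # the memory-mapped .npy created below

rows = []
for i in range(600):
    chunk = np.array(arr[:, i, :])          # copy row i of every image into RAM, shape (50000, 600)
    rows.append(np.median(chunk, axis=0))   # per-column median -> one row of the result

stacked_image = np.vstack(rows)             # final (600, 600) median image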

Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows:

arr = np.lib.format.open_memmap(
    'filename',              # File to store in
    mode='w+',               # Specify to create the file and write to it
    dtype=np.float32,        # Change this to your data's type
    shape=(50000, 600, 600)  # Shape of resulting array
)

Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).

idx = 0
with open(filename, 'rb') as f:
    while True:
        try:
            # Each assignment writes one unpickled array straight through to
            # disk, so RAM usage stays low despite the 50,000 x 600 x 600 size.
            arr[idx] = pickle.load(f)
            idx += 1
        except EOFError:
            break
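
One sanity check worth adding (my suggestion, not part of the original answer): if the pickle file holds fewer arrays than the shape you preallocated, the unwritten tail of the memmap stays zero-filled and would drag the medians down, so confirm the count afterwards.

# idx and arr are the variables from the loop above
print(idx)                   # should equal the first dimension passed to open_memmap (50,000 here)
assert idx == arr.shape[0]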

Give it a couple of hours to run, then head back to the start of this answer to see how to load the file and take the median. Can't be any simpler**.

*I just tested it on a 7GB file, taking the median of 1,500 samples of 5,000,000 elements and memory usage was around 7GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first though. If anyone else has experience with median on memmapped arrays feel free to comment.

** If you believe strangers on the internet.

user2699
  • Thank you -- it's too bad that `np.median` doesn't seem to work efficiently with memmapped arrays. Unfortunately I shouldn't test it on my own data if there's any risk of it using too much RAM, out of respect for others' processes that are running on the server. – curious_cosmo Aug 30 '18 at 20:54
  • @curious_cosmo, no, this solves the ram issue. `arr` is memmapped so assignments to it will be written to disk. – user2699 Aug 30 '18 at 20:55
  • If you don't want to risk it, once you have the data in this format it's quite easy to go get the median for each pixel in a manual loop (i.e. `[[median(arr[:, i,j]) for i in range(600)] for j in range(600)]`). – user2699 Aug 30 '18 at 20:58
  • Wouldn't that probably take about as long to run as my current script (with 360,000 loops)? Also, this seems to explain why RAM is still used with `memmap`: https://stackoverflow.com/questions/20713063/writing-into-a-numpy-memmap-still-loads-into-ram-memory – curious_cosmo Aug 30 '18 at 21:01
  • Nope, it will do 360,000 loops accessing 50,000 elements directly from disk. Your current code does 360,000 loops where each loop reads and unpacks 50,000 pickle files. – user2699 Aug 30 '18 at 21:05
  • That's a good point with memory usage. Caching can cause mmap to appear to be using quite a bit of ram. `median` is a special case, however. Unlike most numpy functions it's not implemented as a ufunc, and handles some things differently. – user2699 Aug 30 '18 at 21:08
  • I see, but I'm unsure as to why accessing 50,000 elements (don't these still count as loops too?) would be much quicker than reading 50,000 "lines" from my pickle file. I guess that makes sense if the bulk of the computation time per loop just comes from unpacking the pickle data. However, would your method not read in all 50,000 elements to memory, or does the `:` index ensure they are handled one by one? – curious_cosmo Aug 30 '18 at 21:12
  • `arr[:, i, j]` reads in pixel i, j for all images, resulting in a 50,000 element vector. So it handles only a single pixel value, but it's from every image. `stack_by_pixel` loads all pixel values for all 50,000 images (plus processing time to unpickle the files), to get a single pixel. – user2699 Aug 30 '18 at 21:20
  • If you want to test it, compare a single run of `stack_by_pixel` to loading a `npy` file with 50,000 elements from disk. That should give you an approximate improvement. – user2699 Aug 30 '18 at 21:21
  • Ok, that makes sense. I just tested your code for calculating the median on 3 arrays of shape (3,3) so it would be easy to analyze. One array had all ones, the other had all twos, and the last had all threes -- so the median should be a shape (3,3) array with all twos. But when I ran your code snippet on them replacing the `range(600)` instances with `range(3)`, it gave me: `[[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]`. – curious_cosmo Aug 30 '18 at 21:32
  • I just submitted my edit to your question, but this corrected code worked for my 3 arrays of shape (3,3) example: `[[np.median([arr[k][i][j] for k in range(3)]) for i in range(3)] for j in range(3)]`, where k goes up to the number of arrays, and i and j span the dimensions. I do have another concern though -- would the entire 3D array be loaded into memory, in order to access the `[k][i][j]` element in the first place? – curious_cosmo Aug 30 '18 at 21:51

Note: I use Python 2.x; porting this to 3.x shouldn't be difficult.
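
For anyone on 3.x: the only 2.x-specific parts of the scripts below are the print statements. A trivial, standalone illustration of the equivalent 3.x calls (not part of the original answer):

import time

t1 = time.time()
for i in range(10):
    print(i, end=' ')                             # 3.x form of the 2.x `print i,`
t2 = time.time()
print('\nDone in %0.3f seconds' % (t2 - t1))      # the 2.x print statement becomes a function call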


My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.

Preparation

In order to test this, I wrote a small script that generates a pickle file resembling yours. I assumed your input images are grayscale and have 8-bit depth, and generated 10000 random images using numpy.random.randint.

This script will act as a benchmark that we can compare the preprocessing and processing stages against.

import numpy as np
import pickle
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

with open('data/raw_data.pickle', 'wb') as f:
    for i in range(FILE_COUNT):
        data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
        data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
        pickle.dump(data, f)
        print i,

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run this script completed in 372 seconds, generating a ~10 GB file.

Preprocessing

Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).

import numpy as np
import pickle
import time

# Increase open file limit
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

outfiles = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    outfiles.append(open(outfilename, 'wb'))


with open('data/raw_data.pickle', 'rb') as f:
    for i in range(FILE_COUNT):
        data = pickle.load(f)
        for j in range(IMAGE_HEIGHT):
            data[j].tofile(outfiles[j])
        print i,

for i in range(IMAGE_HEIGHT):
    outfiles[i].close()

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30 MB of RAM.

Processing

The processing is simple: load each row file using numpy.fromfile, use numpy.median to get per-column medians (reducing each file back to a single row), and accumulate those rows in a list.

Finally, use numpy.vstack to reassemble a median image.

import numpy as np
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

t1 = time.time()

result_rows = []

for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    median_row = np.median(data, axis=0)
    result_rows.append(median_row)
    print i,

result = np.vstack(result_rows)
print result

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)

In a test run, this script completed in 74 seconds. You could even parallelize it quite easily, but it doesn't seem to be worth it. The script used ~40 MB of RAM.


Given that both of those scripts are linear, the time used should scale linearly as well. For 50000 images, this works out to about 11 minutes for preprocessing and 6 minutes for the final processing. This is on an i7-4930K @ 3.4 GHz, using 32-bit Python on purpose.
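
As noted above, the processing stage parallelizes easily, since each row file is independent. Purely as an illustration (not part of the original answer, and keeping the code 2.x/3.x compatible), one way to spread the row files across worker processes; the dtype matches the uint8 test data used above, so adjust it for real data as discussed in the comments:

import numpy as np
from multiprocessing import Pool

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

def median_of_row(i):
    # Each worker loads one row file and reduces it to a single median row.
    data = np.fromfile('data/row_%03d.dat' % i, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    return np.median(data, axis=0)

if __name__ == '__main__':
    pool = Pool(4)                               # worker count is a guess; tune to your machine
    result_rows = pool.map(median_of_row, range(IMAGE_HEIGHT))
    pool.close()
    pool.join()
    result = np.vstack(result_rows)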

Dan Mašek
  • Thank you, this is very detailed! I'm running 64-bit Python, and I just created a test list containing 600 floats (representative of one row of one of my arrays), and used `sys.getsizeof` to see that 50,000 of such lists (or one of the files you mentioned for preprocessing) would use about 0.2 GB. This is fine if I'm just working with one at a time and then closing it, but in your preprocessing section you are opening all of them and then only closing them after you've written to them all. In my case, wouldn't that use 0.2*600 = 120 GB of RAM, the same usage as loading in my original file? – curious_cosmo Aug 30 '18 at 22:11
  • @curious_cosmo The preprocessing opens 600 file handles, but writes to them a row at a time. There will be some small buffer allocated for each (on the order of kilobytes or tens of), but anything else gets flushed to the disk once the buffers fill up, so overall the memory usage will still be small. – Dan Mašek Aug 30 '18 at 22:38
  • I thought this would work, but I'm having issues. I'm testing your code on three simple 2d arrays of shape (3,3): all ones, all twos, and all threes -- so the final result should be a (3,3) shape array with all twos. However, I think something is messing up at the `reshape` part in the processing section. If I print `data` right after that line in the loop, I'm getting arrays of shape (24, 3). And then my final `result` is just a (3,3) array of all zeros. Can you explain further what the `reshape(-1, width)` is supposed to do? – curious_cosmo Aug 31 '18 at 00:57
  • @curious_cosmo `fromfile` loads the array as single row. Reshape changes the shape so that it has `width` columns, and as many rows as possible (given the number of elements). | Given that you have 8x as many data elements, I suspect you're using 64bit floats, and forgot to change the `dtype` parameter of `fromfile`. | I've [updated the script](https://pastebin.com/MkNKavpk) (and combined the 3 bits together) to work with `np.float64` -- the only things really necessary were changing the `dtype` parameter. – Dan Mašek Aug 31 '18 at 14:06
  • Thank you. Why did changing the `dtype` change the numbers in the final result array all together? I was under the impression that it simply changed how many bytes are used by each element. In my 3x3 example, I was getting all zeros with `np.float64` and `np.float32` as well as your `int` example, but once I checked the type of one of my elements and changed it to `np.int32` I got the correct answer of all twos. – curious_cosmo Aug 31 '18 at 16:54
  • Not just how many bytes are used, but also their meaning. For example, take `int32` and `float32` -- both are 4 bytes, but the bit pattern that represents value of 1 when treated as an integer means a [completely different value](https://www.h-schmidt.net/FloatConverter/IEEE754.html) when treated as a floating point value. Since we're storing the preprocessed data in raw format (basically just dumping the memory contents to disk), we're responsible to make sure that we write and read exactly the same datatype. – Dan Mašek Aug 31 '18 at 17:53
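
To make that last point concrete, here is a tiny standalone illustration (hypothetical file name, not part of the original discussion) of why the dtype passed to numpy.fromfile has to match what numpy.ndarray.tofile wrote:

import numpy as np

a = np.full((3, 3), 2.0, dtype=np.float64)              # 9 elements, 72 bytes on disk
a.tofile('row_test.dat')

# Reading the raw bytes back with the wrong dtype silently reinterprets them:
wrong = np.fromfile('row_test.dat', dtype=np.uint8)     # 72 "values", none of them equal to 2
# Reading with the dtype that was written recovers the original data:
right = np.fromfile('row_test.dat', dtype=np.float64).reshape(-1, 3)
print(right)                                            # a (3, 3) array of 2.0s, as written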