5

I am trying to load a data set of 1,000,000 images into memory. As standard numpy arrays (uint8), all images combined take around 100 GB of RAM, but I need to get this down to < 50 GB while still being able to quickly read the images back into numpy (that's the whole point of keeping everything in memory). Lossless compression like blosc only reduces the size by around 10%, so I moved to JPEG compression. Minimal example:

import io
import numpy as np
from PIL import Image

numpy_array = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
image = Image.fromarray(numpy_array)
output = io.BytesIO()
image.save(output, format='JPEG')

At runtime I am reading the images with:

[np.array(Image.open(output)) for _ in range(1000)]

JPEG compression is very effective (< 10 GB), but the time it takes to read 1,000 images back into numpy arrays is around 2.3 seconds, which seriously hurts the performance of my experiments. I am looking for suggestions that give a better trade-off between compression and read speed.
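
For context, the full setup looks roughly like this (just a sketch, with random arrays standing in for the real natural images; in the real experiment every image gets its own compressed buffer):

import io
import time
import numpy as np
from PIL import Image

# One compressed buffer per image; random arrays stand in for the real data
compressed = []
for _ in range(1000):
    arr = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format='JPEG')
    compressed.append(buf.getvalue())

# Decode everything back into numpy arrays and time it
start = time.perf_counter()
images = [np.array(Image.open(io.BytesIO(b))) for b in compressed]
print(time.perf_counter() - start)  # around 2.3 s for 1,000 real images in my experiments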

user45893
  • Could you clarify what you are trying to do please? You say you want to read 1,000,000 images - presumably from disk - yet your code generates random images so it doesn't seem to be representative? You say it takes 2.3s to read 1000 images, but I thought you had a million? You don't seem to mention any form of threading or `joblib`, yet that is generally one of the best ways of improving performance in these multi-core CPU times. Sorry, I just don't get it at the moment... – Mark Setchell Aug 03 '18 at 08:52
  • Dear @MarkSetchell, please excuse the confusion! Yes, the toy example I give is only for random images (to keep the example as short as possible), but in my experiment each numpy array is a natural image. Also, I am just reading 1000 images (instead of 1,000,000) just to simplify the timing. – user45893 Aug 03 '18 at 09:01
  • You are right about the multithreading, and I started to play around with asynchronous list comprehensions. That can definitely help on top, but it is kind of orthogonal to finding the right compression/speed trade-off. – user45893 Aug 03 '18 at 09:05
  • On my machine, at least, it is 5 times faster to make the array as uint8 up-front than to make it as float64 and scale down. I mean `image=np.random.randint(256,size=(256,256,3),dtype=np.uint8)` is 5 times faster than `image=(255*np.random.rand(256,256,3)).astype(np.uint8)` – Mark Setchell Aug 03 '18 at 09:35
  • I have done some tests on this, and one thing that strikes me is that decoding JPEGs is rather slow. I then tried using colour reduction instead of DCT as the method for reducing your data size. I have no idea what your images look like, but I found I can compress iPhone images very well if I first reduce the colours to, say, 32 colours without dithering; then every pixel is one of 32 numbers, and that compresses very well using `blosc`. So maybe have a try that way - it should save you just as much space but hopefully decompress faster... I did my experiments using other tools. – Mark Setchell Aug 05 '18 at 21:28
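
A rough Python sketch of that colour-reduction idea, assuming the python-blosc package (`pack_array` / `unpack_array`) and that the palette is kept next to each compressed index array so the RGB values can be recovered:

import blosc
import numpy as np
from PIL import Image

# A random array stands in for a real image here
im = Image.fromarray(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))

# Reduce to 32 colours; every pixel becomes a palette index in 0..31
# (newer Pillow versions also accept a dither argument to switch dithering off)
pal_im = im.quantize(colors=32, method=2)
palette = pal_im.getpalette()               # flat [r, g, b, r, g, b, ...] list
indices = np.array(pal_im, dtype=np.uint8)  # shape (256, 256), only 32 distinct values

# Index arrays with few distinct values compress very well with blosc
packed = blosc.pack_array(indices)

# Decompress and rebuild an RGB numpy array via the stored palette
restored = Image.fromarray(blosc.unpack_array(packed))  # greyscale image of indices
restored.putpalette(palette)                            # attaching the palette makes it mode 'P'
rgb = np.array(restored.convert('RGB'))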

1 Answer

6

I am still not certain I understand what you are trying to do, but I created some dummy images and did some tests as follows. I'll show how I did that in case other folks feel like trying other methods and want a data set.

First, I created 1,000 images using GNU Parallel and ImageMagick like this:

parallel convert -depth 8 -size 256x256 xc:red +noise random -fill white -gravity center -pointsize 72 -annotate 0 "{}" -alpha off s_{}.png ::: {0..999}

That gives me 1,000 images called s_0.png through s_999.png and image 663 looks like this:

[example image s_663.png: random noise with the number 663 overlaid in large white text]
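
If you don't have ImageMagick and GNU Parallel to hand, something roughly similar can be generated with PIL - this is only an approximation of the images above (plain random noise, small default font), not identical output:

import numpy as np
from PIL import Image, ImageDraw

# Roughly similar test data: random noise with the image number drawn in white
for i in range(1000):
    arr = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    im = Image.fromarray(arr)
    ImageDraw.Draw(im).text((120, 120), str(i), fill='white')
    im.save('s_{}.png'.format(i))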

Then I did what I think you are trying to do - though it is hard to tell from your code:

#!/usr/local/bin/python3

import io
import time
import numpy as np
from PIL import Image

# Create BytesIO object
output = io.BytesIO()

# Load all 1,000 images and write them into the BytesIO object as JPEGs
for i in range(1000):
    name = "s_{}.png".format(i)
    print("Opening image: {}".format(name))
    im = Image.open(name)
    im.save(output, format='JPEG', quality=50)
    nbytes = output.getbuffer().nbytes
    print("BytesIO size: {}".format(nbytes))

# Read back images from the BytesIO object into a list
start = time.perf_counter()
l = [np.array(Image.open(output)) for _ in range(1000)]
diff = time.perf_counter() - start
print("Time: {}".format(diff))

And that takes 2.4 seconds to read all 1,000 images from the BytesIO object and turn them into numpy arrays.

Then, I palettised the images by reducing them to 256 colours (which I agree is lossy - just like your method) and kept a list of the palettised image objects, which I can later convert back to numpy arrays simply by calling:

np.array(ImageList[i].convert('RGB'))

Storing the data as a palettised image saves 66% of the space because you only store one byte of palette index per pixel rather than 3 bytes of RGB, so it is better than the 50% compression you seek.
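
As a rough back-of-the-envelope check, assuming 1,000,000 images of 256x256 like the test data (real natural images will differ, and this ignores the per-object overhead of the list of PIL images):

n_images = 1_000_000
h, w = 256, 256

rgb_bytes        = n_images * h * w * 3          # 3 bytes of RGB per pixel
palettised_bytes = n_images * (h * w + 256 * 3)  # 1 byte of palette index per pixel + a 256-entry RGB palette

print(rgb_bytes / 1e9)         # ~197 GB
print(palettised_bytes / 1e9)  # ~66 GB, roughly a third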

#!/usr/local/bin/python3

import time
import numpy as np
from PIL import Image

# Empty list of palettised images
ImageList = []

# Load all 1,000 images and palettise each one
for i in range(1000):
    name = "s_{}.png".format(i)
    print("Opening image: {}".format(name))
    im = Image.open(name)
    # Add palettised image to list
    ImageList.append(im.quantize(colors=256, method=2))

# Read back images into numpy arrays
start = time.perf_counter()
l = [np.array(ImageList[i].convert('RGB')) for i in range(1000)]
diff = time.perf_counter() - start
print("Time: {}".format(diff))

# Quick test
# Image.fromarray(l[999]).save("result.png")

That now takes 0.2s instead of 2.4s - let's hope the loss of colour accuracy is acceptable to your unstated application :-)
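
If you want to put a number on the colour loss, a quick check on one of the test images could look something like this (just a sketch, using the mean absolute per-channel difference):

import numpy as np
from PIL import Image

im = Image.open('s_663.png').convert('RGB')
orig = np.array(im)
quant = np.array(im.quantize(colors=256, method=2).convert('RGB'))

# Mean absolute per-channel error introduced by the 256-colour palette
print(np.abs(orig.astype(np.int16) - quant.astype(np.int16)).mean())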

Mark Setchell
  • Dear @MarkSetchell, that's a great way to combine speed and memory efficiency! The loss of colour accuracy is absolutely acceptable in my application (which is basically machine learning training of a computer vision model). – user45893 Aug 15 '18 at 13:33
  • Excellent - glad to be of help. Good luck with your project! – Mark Setchell Aug 15 '18 at 20:17