
I have a folder with 1000 numpy compressed files (npz) representing the results of a data simulation. Each file contains two arrays, a and b, with the same shape and data type. The final output I want is the element-wise mean and standard deviation arrays of a, b, and c (which I create in the example below), taken across all the simulations, i.e.:

mean_a = np.mean([a1, a2, a3, ..., a1000], axis=0)

std_a = np.std([a1, a2, a3, ..., a1000], axis=0), etc.

I've managed to get the mean values, though not with a direct element-wise operation. What I'm struggling with most is getting the std. I've tried appending all the arrays to lists, but I run into a MemoryError. Any idea how I should proceed? See below what I've achieved so far. Thanks in advance!!

import glob
import numpy as np
import os 

simulation_runs = 10
simulation_range = np.arange(simulation_runs)

npFiles = list(glob.iglob(os.path.join(outDir, "sc0*.npz")))

# zeros, not empty: np.empty leaves uninitialized memory behind
a_accum = np.zeros([885, 854], dtype=np.float32)
b_accum = np.zeros([885, 854], dtype=np.float32)
c_accum = np.zeros([885, 854], dtype=np.float32)

for npFile in npFiles:
    npData = np.load(npFile)
    a = npData['scc']
    b = npData['bcc']
    c = a + b
    a_accum += a
    b_accum += b
    c_accum += c

aMean = a_accum / len(npFiles)
bMean = b_accum / len(npFiles)
cMean = c_accum / len(npFiles)
renan-brso
  • Since you're using `numpy`, have you investigated `np.mean()` and `np.std()`? – kcw78 Jan 03 '19 at 20:37
  • Yes, I guess that's the way it should be done, but I'm having trouble putting all 1000 arrays together to use both np.std() and np.mean(). I tried appending each array to a list, but that gives a memory error. So maybe there is a way to do it inside the loop, or another way entirely. I'm very new to programming, so I can't imagine another alternative =/ – renan-brso Jan 03 '19 at 20:42

1 Answer


Firstly, if you have (ssh) access to a machine with more memory, that's the easiest route. You may even manage without one: 885 × 854 × 1000 simulations × 4 bytes per float32 ≈ 2.8 GiB, so if you process a, b, and c one at a time, a reasonable machine should have enough memory. In that case, just stack them into a single array and use np.mean and np.std:

a = np.zeros((1000, 885, 854), dtype=np.float32)
for run, npFile in enumerate(npFiles):
    a[run] = np.load(npFile)['scc']
amean = a.mean(axis=0)
astd = a.std(axis=0)

And similarly for b and c.

Otherwise, the most elegant option is to save the data in a format that can easily be loaded lazily. dask was specifically designed for this, but can take some time to learn (it might be worth it in the long run, though). You can also store the data in netCDF files and use xarray as a sort-of frontend for dask, which may be even more convenient.
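As a rough sketch of the dask route (assuming dask is installed; the helper name `lazy_mean_std` and the `key`/`shape` defaults are my own, matching the arrays above):

```python
import numpy as np
import dask
import dask.array as da

def lazy_mean_std(npFiles, key='scc', shape=(885, 854)):
    """Stack the npz files into one lazy dask array and reduce over it."""
    # Each load is delayed: nothing is read from disk until compute() runs.
    lazy_loads = [dask.delayed(lambda f: np.load(f)[key])(f) for f in npFiles]
    stack = da.stack(
        [da.from_delayed(d, shape=shape, dtype=np.float32) for d in lazy_loads],
        axis=0,
    )
    # dask streams chunks through the reductions instead of holding them all.
    return stack.mean(axis=0).compute(), stack.std(axis=0).compute()
```

Because each file becomes its own chunk, only a handful of arrays are resident at any moment, which is exactly what sidesteps the MemoryError.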

If you only need the mean and std, you can compute them manually. The formula for the std is

std = sqrt(mean(abs(x - x.mean())**2))

So since you already have the means, the procedure works very similarly to what you already did (untested):

import glob
import os
import numpy as np

npFiles = list(glob.iglob(os.path.join(outDir, "sc0*.npz")))

# zeros, not empty: np.empty leaves uninitialized memory behind,
# which can contain NaN/inf values that poison the accumulation
a_accum = np.zeros([885, 854], dtype=np.float32)
b_accum = np.zeros([885, 854], dtype=np.float32)
c_accum = np.zeros([885, 854], dtype=np.float32)

# First pass: accumulate the sums for the means
for npFile in npFiles:
    npData = np.load(npFile)
    a = npData['scc']
    b = npData['bcc']
    c = a + b
    a_accum += a
    b_accum += b
    c_accum += c

aMean = a_accum / len(npFiles)
bMean = b_accum / len(npFiles)
cMean = c_accum / len(npFiles)


a_sumsq = np.zeros([885, 854], dtype=np.float32)
b_sumsq = np.zeros([885, 854], dtype=np.float32)
c_sumsq = np.zeros([885, 854], dtype=np.float32)

# Second pass: accumulate the squared deviations from the means
for npFile in npFiles:
    npData = np.load(npFile)
    a = npData['scc']
    b = npData['bcc']
    c = a + b
    a_sumsq += (a - aMean)**2
    b_sumsq += (b - bMean)**2
    c_sumsq += (c - cMean)**2

a_std = np.sqrt(a_sumsq / (len(npFiles) - 1))  # the -1 gives an unbiased estimator
b_std = np.sqrt(b_sumsq / (len(npFiles) - 1))
c_std = np.sqrt(c_sumsq / (len(npFiles) - 1))
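If a second pass over all the files is too slow, Welford's online algorithm (an alternative technique, not used above) gets the same element-wise mean and std in a single sweep while keeping only a couple of accumulator arrays in memory. A minimal sketch, assuming all arrays share one shape (the function name is mine):

```python
import numpy as np

def running_mean_std(arrays, ddof=1):
    """Single-pass (Welford) element-wise mean and std over an iterable of
    equally shaped arrays, holding only the running state in memory."""
    count = 0
    mean = None
    m2 = None  # running sum of squared deviations from the current mean
    for x in arrays:
        x = np.asarray(x, dtype=np.float64)  # accumulate in float64 for accuracy
        if mean is None:
            mean = np.zeros_like(x)
            m2 = np.zeros_like(x)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the updated mean
    std = np.sqrt(m2 / (count - ddof))
    return mean, std
```

Feed it a generator such as `(np.load(f)['scc'] for f in npFiles)` so only one file's array is loaded at a time.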
Ewoud
  • Thanks! I'm trying to make it work, but I'm still only getting the mean. When the interpreter reaches the line a_sumsq += (a-aMean)**2 it returns: "FloatingPointError: invalid value encountered in add". Any idea how to solve it? – renan-brso Jan 08 '19 at 19:21
  • Are there any NaNs in your data? What if you print all the variables just before the error, is there something wrong there? – Ewoud Jan 15 '19 at 02:49
  • The NaN values show up when I start computing the std for each run; there are no NaNs when computing the mean. About the 1st method... I had to change the runs from 1000 to 10000, so I tried it and ran into the memory error, unfortunately – renan-brso Jan 15 '19 at 11:35
  • I can't help you much in figuring out where the NaNs come from without looking at the data. But if the first method worked, you can just take 10 chunks of 1000 files at a time, fill them into some array, and calculate the root-mean-square of the stds of your chunks. That will be the same as the actual std. – Ewoud Jan 15 '19 at 17:41
  • Just an update: I've managed to get rid of the NaNs by replacing numpy.empty with numpy.zeros (no idea why)... So I successfully used your 2nd solution! Thanks again – renan-brso Jan 18 '19 at 08:23