I am working on filling in missing data in a large (4GB) netcdf datafile (3 dimensions: time, longitude and latitude). The method is to fill in the masked values in data1 either with:
1) previous values from data1 or
2) with data from another (also masked dataset, data2) if the found value from data1 < the found value from data2.
So fare I have tried a couple of things, one is to make a very complex script with long for loops which never finished running after 24 hours. I have tried to reduce it, but i think it is still very much to complicated. I believe there is a much more simple procedure to do it than the way I am doing it now I just can't see how.
I have made a script where masked data is first replaced with zeroes in order to use the function np.where
to get the index of my masked data (i did not find a function that returns the coordinates of masked data, so this is my work arround it). My problem is that my code is very long and i think time consuming for large datasets to run through. I believe there is a more simple way of doing it, but I haven't found another work arround it.
Here is what I have so fare: : (the first part is just to generate some matrices that are easy to work with):
if __name__ == '__main__':
import numpy as np
import numpy.ma as ma
from sortdata_helpers import decision_tree
# Generating some (easy) test data to try the algorithm on:
# data1
rand1 = np.random.randint(10, size=(10, 10, 10))
rand1 = ma.masked_where(rand1 > 5, rand1)
rand1 = ma.filled(rand1, fill_value=0)
rand1[0,:,:] = 1
#data2
rand2 = np.random.randint(10, size=(10, 10, 10))
rand2[0, :, :] = 1
coordinates1 = np.asarray(np.where(rand1 == 0)) # gives the locations of where in the data there are zeros
filled_data = decision_tree(rand1, rand2, coordinates1)
print(filled_data)
The functions that I defined to be called in the main script are these, in the same order as they are used:
def decision_tree(data1, data2, coordinates):
# This is the main function,
# where the decision between data1 or data2 is chosen.
import numpy as np
from sortdata_helpers import generate_vector
from sortdata_helpers import find_value
for i in range(coordinates.shape[1]):
coordinate = [coordinates[0, i], coordinates[1,i], coordinates[2,i]]
AET_vec = generate_vector(data1, coordinate) # makes vector to go back in time
AET_value = find_value(AET_vec) # Takes the vector and find closest day with data
PET_vec = generate_vector(data2, coordinate)
PET_value = find_value(PET_vec)
if PET_value > AET_value:
data1[coordinate[0], coordinate[1], coordinate[2]] = AET_value
else:
data1[coordinate[0], coordinate[1], coordinate[2]] = PET_value
return(data1)
def generate_vector(data, coordinate):
# This one generates the vector to go back in time.
vector = data[0:coordinate[0], coordinate[1], coordinate[2]]
return(vector)
def find_value(vector):
# Here the fist value in vector that is not zero is chosen as "value"
from itertools import dropwhile
value = list(dropwhile(lambda x: x == 0, reversed(vector)))[0]
return(value)
Hope someone has a good idea or suggestions on how to improve my code. I am still struggling with understanding indexing in python, and I think this can definately be done in a more smooth way than I have done here. Thanks for any suggestions or comments,