1

I am working on filling in missing data in a large (4GB) netcdf datafile (3 dimensions: time, longitude and latitude). The method is to fill in the masked values in data1 either with:

1) previous values from data1 or

2) with data from another (also masked dataset, data2) if the found value from data1 < the found value from data2.

So fare I have tried a couple of things, one is to make a very complex script with long for loops which never finished running after 24 hours. I have tried to reduce it, but i think it is still very much to complicated. I believe there is a much more simple procedure to do it than the way I am doing it now I just can't see how.

I have made a script where masked data is first replaced with zeroes in order to use the function np.where to get the index of my masked data (i did not find a function that returns the coordinates of masked data, so this is my work arround it). My problem is that my code is very long and i think time consuming for large datasets to run through. I believe there is a more simple way of doing it, but I haven't found another work arround it. Here is what I have so fare: : (the first part is just to generate some matrices that are easy to work with):

if __name__ == '__main__':
    import numpy as np
    import numpy.ma as ma
    from sortdata_helpers import decision_tree

    # Generating some (easy) test data to try the algorithm on:
    # data1
    rand1 = np.random.randint(10, size=(10, 10, 10))
    rand1 = ma.masked_where(rand1 > 5, rand1)
    rand1 = ma.filled(rand1, fill_value=0)
    rand1[0,:,:] = 1

    #data2
    rand2 = np.random.randint(10, size=(10, 10, 10))
    rand2[0, :, :] = 1
    coordinates1 = np.asarray(np.where(rand1 == 0)) # gives the locations of where in the data there are zeros

    filled_data = decision_tree(rand1, rand2, coordinates1)
    print(filled_data)

The functions that I defined to be called in the main script are these, in the same order as they are used:

def decision_tree(data1, data2, coordinates):
    # This is the main function,
    # where the decision between data1 or data2 is chosen.
    import numpy as np
    from sortdata_helpers import generate_vector
    from sortdata_helpers import find_value

    for i in range(coordinates.shape[1]):
        coordinate = [coordinates[0, i], coordinates[1,i], coordinates[2,i]]
        AET_vec = generate_vector(data1, coordinate) # makes vector to go back in time
        AET_value = find_value(AET_vec) # Takes the vector and find closest day with data
        PET_vec =  generate_vector(data2, coordinate)
        PET_value = find_value(PET_vec)
        if PET_value > AET_value:
            data1[coordinate[0], coordinate[1], coordinate[2]] = AET_value
        else:
            data1[coordinate[0], coordinate[1], coordinate[2]] = PET_value
    return(data1)


def generate_vector(data, coordinate):
    # This one generates the vector to go back in time.
    vector = data[0:coordinate[0], coordinate[1], coordinate[2]]
    return(vector)

def find_value(vector):
    # Here the fist value in vector that is not zero is chosen as "value"
    from itertools import dropwhile
    value = list(dropwhile(lambda x: x == 0, reversed(vector)))[0]
    return(value)

Hope someone has a good idea or suggestions on how to improve my code. I am still struggling with understanding indexing in python, and I think this can definately be done in a more smooth way than I have done here. Thanks for any suggestions or comments,

Idalh
  • 31
  • 3

0 Answers0