
I am struggling to improve the speed of interpolating a large scattered dataset, which I am currently interpolating with griddata. I have already posted a question on Stack Overflow but haven't received a response.

So I am thinking of trying an alternative approach. My idea is that, if I have a huge dataset, as shown by the Python code snippet below

import numpy as np

arr_len = 932826
xi = np.random.uniform(low=0, high=4496, size=arr_len)
yi = np.random.uniform(low=-74, high=492, size=arr_len)
zi = np.random.uniform(low=-30, high=97, size=arr_len)

I have to interpolate and get the values at defined points, say (x, y). What would be the quickest way to find the 4 neighbouring points in the scattered data xi, yi and zi, so that a bilinear interpolation could be performed using interp2d (see image below)? I don't know if this would give me faster results than using griddata, but it would be nice to try it out.

[Figure: bilinear interpolation scheme, with the probe point (x, y) surrounded by four neighbouring data points Q11, Q12, Q21, Q22]
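
For the neighbour lookup itself, a k-d tree seems like the natural tool; here is a minimal sketch using scipy.spatial.cKDTree. My assumption here is a plain k=4 query, which returns the 4 closest points overall, not necessarily one per quadrant around (x, y), as the comments below point out.

import numpy as np
from scipy.spatial import cKDTree

arr_len = 932826
xi = np.random.uniform(low=0, high=4496, size=arr_len)
yi = np.random.uniform(low=-74, high=492, size=arr_len)
zi = np.random.uniform(low=-30, high=97, size=arr_len)

# build the tree once over the scattered (x, y) locations
tree = cKDTree(np.column_stack([xi, yi]))

# query the 4 nearest data points for some probe points
probes = np.array([[100.0, 400.0], [500.0, 300.0], [2000.0, 100.0]])
dist, idx = tree.query(probes, k=4)

# zi[idx] gives the z-values of the 4 neighbours of each probe point
print(zi[idx].shape)  # (3, 4)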

Masoom Kumar
  • I think you could just pass those `xi`, `yi`, `zi` to `interp2d`? "x, y and z are arrays of values used to approximate some function f: z = f(x, y). This class returns a function whose call method uses spline interpolation to find the value of new points." – AKX May 25 '21 at 10:56
  • @AKX: I tried that, but interp2d cannot handle such a big dataset, so I wanted to think of alternatives – Masoom Kumar May 25 '21 at 11:05
  • How do you define "the 4 neighbouring points"? Specifically, in your picture, if x1 is the maximum of all xi that are smaller than x, and y1 is the maximum of all yi that are smaller than y, x1 and y1 don't necessarily belong to the same point in your dataset, right? So you may not know the z-value of Q11. Or am I missing something? – Arne May 25 '21 at 11:07
  • Maybe what you want to do is essentially a nearest neighbors regression? If so, you could use scikit-learn: https://scikit-learn.org/stable/modules/neighbors.html#neighbors – Arne May 25 '21 at 11:18
  • @Arne: Yes, you are right; if we do the search as you have defined it, we might end up with discrete points that don't give us Q11. What I was thinking was to divide the area around (x, y) into 4 quadrants, search all the points in each quadrant, and then pick the closest one by distance from (x, y) to get a unique point per quadrant. But that also felt computationally intensive! – Masoom Kumar May 25 '21 at 11:20
  • @Arne: Thank you for the suggestion with scikit-learn, but I am unable to use it in the correct way for my purpose. Could you show, if possible, how I would use it on my example dataset xi, yi, zi to interpolate the value at an arbitrary point (x, y)? – Masoom Kumar May 25 '21 at 11:30

1 Answer


I think what you have in mind is essentially nearest neighbors regression. Here's how you could do it with scikit-learn. Note that using 4 neighbors is an arbitrary choice, so you could also try other values.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

arr_len = 932826
np.random.seed(42)
xi = np.random.uniform(low=0, high=4496, size=arr_len)
yi = np.random.uniform(low=-74, high=492, size=arr_len)
zi = np.random.uniform(low=-30, high=97, size=arr_len)

# points to get z-values for (e.g.):
x_new = [100, 500, 2000]
y_new = [400, 300, 100]

# in machine learning notation:
X_train = np.vstack([xi, yi]).T
y_train = zi
X_predict = np.vstack([x_new, y_new]).T

# fit 4-nearest neighbors regressor to the training data
neigh = KNeighborsRegressor(n_neighbors=4)
neigh.fit(X_train, y_train)

# get "interpolated" z-values
print(neigh.predict(X_predict))
# output: [39.37712018  4.36600728 47.00192216]
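
By default, `KNeighborsRegressor` averages the z-values of the 4 neighbors with uniform weights. If you want closer points to count more, you can pass `weights='distance'` for inverse-distance weighting, which behaves a bit more like interpolation. A sketch of that variant, reusing the arrays above:

# inverse-distance weighting instead of a plain average
neigh_idw = KNeighborsRegressor(n_neighbors=4, weights="distance")
neigh_idw.fit(X_train, y_train)
print(neigh_idw.predict(X_predict))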
Arne
  • Thank you very much for the elegant approach; really nice to learn a new way. One question I have: in some areas there doesn't exist any data, but since it uses the nearest points, it fills those areas with values anyway. How can I avoid that? For example, if there is no data within x ± x1 and y ± y1, is there a way to reject calculating values at those points and use NaN instead? I know I can do it manually, but I was wondering if it can be handled directly in the code – Masoom Kumar May 25 '21 at 15:26
  • I also found that, since my datasets are so dense, using griddata with the "nearest" option gives almost the same results as "linear" interpolation, and it is also super fast, since it just takes the nearest value – Masoom Kumar May 25 '21 at 15:28
  • Also, can you please let me know whether the nearest-neighbour search looks at the data all around the probe point? For example, if we define data in four quadrants around a given point, will it look in all four quadrants before predicting a value, or does it just take the 4 nearest points (or however many neighbours we specify)? – Masoom Kumar May 25 '21 at 15:36
  • To avoid predictions for empty regions, you could use the `RadiusNeighborsRegressor` instead of the `KNeighborsRegressor`. That will use all points within a specified radius for the prediction, and if there are none, it will output `NaN` (a short sketch follows below these comments). See: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html – Arne May 25 '21 at 19:40
  • As I wrote the code above, the 4 nearest neighbors are used for the prediction, where "nearest" means according to Euclidean distance. But you can also choose other metrics by specifying the `metric` argument. See: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html – Arne May 25 '21 at 19:44