I have two vectors, 'xp' and 'yp', which hold the x and y values of the data, and a third vector 'x' containing the x coordinates at which I would like to evaluate the interpolated values. My results in Python using NumPy's interp function were as expected:
import numpy as np
xp = np.array([1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 7.0, 8.0, 9.0, 10.0, 14.0, 17.0, 18.0, 20.0])
yp = xp**2
x = np.array([3, 5])
np.interp(x, xp, yp)  # array([ 9.25, 26. ])
My question is how do I replicate this algorithm inside a cuda kernel?
Here is my attempt:
Here, size --> len(xp) == len(yp) and x_size --> len(x).
__global__ void lerp_kernel(double* xp, double* yp, double* output_result,
                            double* x, unsigned int size, unsigned int x_size)
{
    for (unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x; idx < size;
         idx += blockDim.x*gridDim.x)
    {
        if (idx > size - 2) {
            idx = size - 2;
        }
        double dx = xp[idx + 1] - xp[idx];
        double dy = yp[idx + 1] - yp[idx];
        double dydx = dy/dx;
        double lerp_result = yp[idx] + (dydx * (x[idx] - xp[idx]));
        output_result[idx] = lerp_result;
    }
}
I think one of the mistakes I am making is that I am not searching for the interval in xp that contains each element of x (using something like numpy.searchsorted in Python). I am not sure how to implement this part in CUDA. If someone knows a better way to do lerp in CUDA, please let me know.
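To make the searchsorted idea concrete, here is my sketch of what I think the fix could look like: one thread per element of x (a grid-stride loop over x_size, not size), with a binary search over xp to find the bracketing interval. The names lerp_point and the demo_* arrays are my own, and the __CUDACC__ guard is only there so the helper can be compiled and checked as plain C++ without nvcc:

```cpp
#include <cmath>
#include <cassert>

#ifndef __CUDACC__  // allow host-only compilation of the helper for checking
#define __host__
#define __device__
#endif

// Find the interval xp[lo] <= xi < xp[lo+1] (like numpy.searchsorted) and
// linearly interpolate. Query points outside [xp[0], xp[size-1]] are clamped
// to the end values, which matches np.interp's default behavior.
__host__ __device__ inline double lerp_point(const double* xp, const double* yp,
                                             unsigned int size, double xi)
{
    if (xi <= xp[0])        return yp[0];
    if (xi >= xp[size - 1]) return yp[size - 1];

    unsigned int lo = 0, hi = size - 1;
    while (hi - lo > 1) {                  // binary search over xp
        unsigned int mid = (lo + hi) / 2;
        if (xp[mid] <= xi) lo = mid; else hi = mid;
    }
    double t = (xi - xp[lo]) / (xp[lo + 1] - xp[lo]);  // weight in [0, 1]
    return yp[lo] + t * (yp[lo + 1] - yp[lo]);
}

#ifdef __CUDACC__
// One output per query point: grid-stride loop over x_size, not size.
__global__ void lerp_kernel(const double* xp, const double* yp,
                            double* output_result, const double* x,
                            unsigned int size, unsigned int x_size)
{
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < x_size;
         i += blockDim.x * gridDim.x)
    {
        output_result[i] = lerp_point(xp, yp, size, x[i]);
    }
}
#endif

// Data from the NumPy example above, for checking against np.interp.
static const double demo_xp[15] = {1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 7.0,
                                   8.0, 9.0, 10.0, 14.0, 17.0, 18.0, 20.0};
static const double demo_yp[15] = {1.0, 2.25, 4.0, 6.25, 12.25, 16.0, 20.25,
                                   49.0, 64.0, 81.0, 100.0, 196.0, 289.0,
                                   324.0, 400.0};
```

Note this assumes xp is sorted ascending, which np.interp also requires; the binary search keeps each thread's work at O(log size) instead of a linear scan.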
I have seen lerp functions in the Cg documentation (https://developer.download.nvidia.com/cg/lerp.html), but these need a weight (a fraction between 0 and 1) for each x value. I am not sure how to rescale x so that I can use a weight vector to solve this problem.
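If I understand the Cg-style lerp correctly, the weight for a query point is just its normalized position inside the bracketing interval, so the rescaling reduces to one division once the interval is known. A minimal sketch (helper names lerp_weight and lerp are my own):

```cpp
#include <cassert>

// Weight for xi inside the interval [x0, x1]: 0 at x0, 1 at x1.
inline double lerp_weight(double x0, double x1, double xi)
{
    return (xi - x0) / (x1 - x0);
}

// Cg's lerp(a, b, w) = a + w*(b - a).
inline double lerp(double a, double b, double w)
{
    return a + w * (b - a);
}
```

For example, x = 3 falls in the interval [2.5, 3.5] of xp, so lerp(6.25, 12.25, lerp_weight(2.5, 3.5, 3.0)) reproduces the 9.25 from the NumPy example.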