
I am looking for a vectorized way to create a NumPy 2D array where each row contains 64 days of data, extracted with a sliding window over a pandas Series holding data for more than 6000 days.

The window size is 64 and the stride is 1.

Here are two solutions: a straightforward loop, and a list comprehension based on Ingrid's answer:

# Set up a dataframe with 6000 random samples
df = pd.DataFrame(np.random.rand(6000),columns=['d_ret'])
days_of_data = df['d_ret'].count()

n_D = 64   # Window size

# The dataset will have m = (days_of_data - n_D + 1) rows
m = days_of_data - n_D + 1

# Build the dataset with a loop
t = time.time()                                 # Start timing
X = np.zeros((m,n_D))                           # Initialize np array
for day in range(m):                            # Loop over the m window start positions
    X[day][:] = df['d_ret'][day:day+n_D].values # Copy content of sliding window into array  
elapsed = time.time() - t                       # Stop timing

print("X.shape\t: {}".format(X.shape))
print("Elapsed time\t: {}".format(elapsed))

t = time.time()                                 # Start timing
X1 = [df.loc[ind: ind+n_D-1, 'd_ret'].values for ind, _ in df.iterrows()]
X2 = [lst for lst in X1 if len(lst) == n_D]
X_np = np.array(X2)                             # Get np array as output
elapsed = time.time() - t                       # Stop timing

print("X_np.shape\t: {}".format(X_np.shape))
print("Elapsed time\t: {}".format(elapsed))

Output

X.shape : (5937, 64)
Elapsed time    : 0.37702155113220215
X_np.shape  : (5937, 64)
Elapsed time    : 0.7020401954650879

How do I vectorize this?

Example Input/Output

# Input
Input = pd.Series(range(128))

# Output
array([[  0.,   1.,   2., ...,  61.,  62.,  63.],
   [  1.,   2.,   3., ...,  62.,  63.,  64.],
   [  2.,   3.,   4., ...,  63.,  64.,  65.],
   ...,
   [ 62.,  63.,  64., ..., 123., 124., 125.],
   [ 63.,  64.,  65., ..., 124., 125., 126.],
   [ 64.,  65.,  66., ..., 125., 126., 127.]])
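For reference, NumPy 1.20+ (released after this thread) ships a built-in for exactly this stride-1 sliding-window extraction; a minimal sketch on the example input above:

```python
import numpy as np
import pandas as pd

s = pd.Series(range(128))
n_D = 64  # Window size

# Requires NumPy >= 1.20; each row is one stride-1 window over the data
X = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), n_D)
print(X.shape)  # (65, 64), i.e. (128 - 64 + 1, 64)
```

The result is a read-only view, so no data is copied until you write to or copy the array.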

3 Answers


You can use `reshape`:

df.d_ret.values.reshape(-1, 64)
muzzyq
  • 'DataFrame' object has no attribute 'a' and 'Series' object has no attribute 'a' – Fabio Regis Jun 27 '18 at 14:57
  • @FabioRegis Have updated my answer to include the column name `d_ret` you provided in your question. – muzzyq Jun 27 '18 at 23:28
  • This works out to be a stride of 64. E.g. if the original series had 128 values this would reshape (1,128) to (2,64). My loop reshapes to (65,128) duplicating data. – Fabio Regis Jun 28 '18 at 12:48
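To illustrate the comment above: `reshape` chops the array into consecutive non-overlapping blocks, which is not the stride-1 sliding window the question asks for. A small sketch (using `np.lib.stride_tricks.sliding_window_view`, available in NumPy 1.20+, for the comparison):

```python
import numpy as np

a = np.arange(128)

# reshape produces non-overlapping blocks: rows [0..63] and [64..127]
blocks = a.reshape(-1, 64)
print(blocks.shape)   # (2, 64)

# A stride-1 sliding window instead yields 128 - 64 + 1 = 65 rows
windows = np.lib.stride_tricks.sliding_window_view(a, 64)
print(windows.shape)  # (65, 64)
```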

Maybe not complete vectorisation, but a list comprehension in Python is really efficient compared to a for loop.

Let say df is in the format

>>> df.head()
   d_ret
0      0
1      1
2      2
3      3
4      4

Couldn't you just do:

X = [df.d_ret[df.loc[ind: ind+n_D-1, 'd_ret']].values for ind, _ in df.iterrows()]

Then drop the trailing lists in X whose length is less than n_D:

X1 = [lst for lst in X if len(lst) == n_D]

Then I get for example:

>>> print(X1[2])
[ 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65]

And np.array(X1).shape

>>> np.array(X1).shape
(937, 64)

(937, 64) = (1000 - 64 + 1, 64) = (df.count() - n_D + 1, n_D)

Let me know if this is what you're after :)

Ingrid
  • Your method produces a shape of `np.array(X).shape = (df.count(),)` when it should instead give the shape `(df.count()-n_D+1,n_D)`. Also it seems to me it is still using a loop with `for ind, _ in df.iterrows()`. – Fabio Regis Jun 29 '18 at 12:16
  • Thank you @FabioRegis for pointing that out. See updated answer, please :) Also, this is called list-concatenation and is much, much faster than a regular for-loop. :) – Ingrid Jun 29 '18 at 14:39
  • This method works BUT: I timed my loop and your method and got 0.4s for my loop and 11s for your solution. – Fabio Regis Jun 29 '18 at 17:38
  • Your code was giving me all nan. Correcting it by removing a couple of square brackets gives the correct result and 0.7s. See my updated question. So I think we are still after the fastest solution. – Fabio Regis Jun 29 '18 at 18:08

The fastest vectorised solution, from Numpy Vectorization of sliding-window operation, uses these key lines:

idx = np.arange(m)[:,None] + np.arange(n_D) 
out = df.values[idx].squeeze()

Applied here to my example:

# Set up a dataframe with 6000 random samples
df = pd.DataFrame(np.random.rand(6000),columns=['d_ret'])
days_of_data = df['d_ret'].count()

n_D = 64   # Window size

# The dataset will have m = (days_of_data - n_D + 1) rows
m = days_of_data - n_D + 1

t = time.time()                                 # Start timing
# Build an (m, n_D) array of indices via broadcasting, then use it to
# fancy-index the underlying df.values NumPy array
idx = np.arange(m)[:,None] + np.arange(n_D)     # Window starts + in-window offsets
out = df.values[idx].squeeze()                  # Drop the trailing singleton dimension
elapsed = time.time() - t                       # Stop timing

print("out.shape\t: {}".format(out.shape))
print("Elapsed time\t: {}".format(elapsed))

Output

out.shape   : (5937, 64)
Elapsed time    : 0.003000020980834961
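For readers puzzled by the index trick: `np.arange(m)[:,None]` is a column vector of window start positions, `np.arange(n_D)` a row vector of in-window offsets, and broadcasting their sum produces every window's indices at once. A small sketch with illustrative sizes (`m = 4`, `n_D = 3`, not the question's values):

```python
import numpy as np

m, n_D = 4, 3

starts = np.arange(m)[:, None]   # Column of window starts, shape (4, 1)
offsets = np.arange(n_D)         # Row of in-window offsets, shape (3,)

# Broadcasting adds every offset to every start, shape (4, 3)
idx = starts + offsets
print(idx)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]
```

Row `i` of `idx` holds the indices `i, i+1, ..., i+n_D-1`, so fancy-indexing with it pulls out all windows in one vectorised operation.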