how to extract overlapping sub-arrays with a window size and flatten them

Question

I am trying to get better at using numpy functions and methods to make my Python programs run faster.

I want to do the following:

I create an array 'a' as:

import numpy as np
a = np.random.randint(-10, 11, 10000).reshape(-1, 10)

a.shape: (1000,10)

I create another array which takes only the first two columns in array a

b=a[:,0:2]

b.shape: (1000,2)

Now I want to create an array 'c' with 990 rows, each containing a flattened slice of 10 rows of array 'b'. The first row of 'c' will have 20 columns: the flattened rows 0 to 10 (end excluded) of 'b'. The next row of 'c' will have 20 columns holding the flattened rows 1 to 11 of 'b', and so on.

I can do this with a for loop, but I want to know if there is a much faster way to do this using numpy functions and methods, such as strides or something else.

Thanks for your time and your help.

Ehsan · Accepted Answer · 2020-05-12T00:49:00.230


This loops over shifts rather than rows (loop of size 10):

N = 10
c = np.hstack([b[i:i-N] for i in range(N)])

Explanation: b[i:i-N] is b's rows from i to m-(N-i) (excluding row m-(N-i) itself), where m is the number of rows in b. np.hstack then stacks those selected sub-arrays horizontally (b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), so each row of c holds 10 consecutive rows of b, flattened, as the question describes.

c.shape: (990, 20)

Also I think you may be looking for a shape of (991, 20) if you want to include all windows.
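To make the slicing logic concrete, here is a minimal sketch on a tiny stand-in array (the names b_small, c_short, c_full and c_loop are made up for illustration). It compares the shifted-slice version with a plain loop, and shows one way to keep the last window as well by writing the end index explicitly:

```python
import numpy as np

# Hypothetical small stand-in for b: 5 rows, 2 columns, window of N = 3 rows.
b_small = np.arange(10).reshape(5, 2)
N = 3
m = b_small.shape[0]

# Shifted slices as in the answer: gives m - N rows (the last window is dropped).
c_short = np.hstack([b_small[i:i - N] for i in range(N)])

# An explicit end index m - N + 1 + i keeps the last window: m - N + 1 rows.
c_full = np.hstack([b_small[i:m - N + 1 + i] for i in range(N)])

# Reference result built with a plain loop over window start positions.
c_loop = np.array([b_small[k:k + N].ravel() for k in range(m - N + 1)])

print(c_short.shape, c_full.shape)       # (2, 6) (3, 6)
print(np.array_equal(c_full, c_loop))    # True
```

Each slice b_small[i:...] contributes columns 2*i and 2*i + 1 of the result, which is why stacking N shifted slices side by side reproduces the flattened windows.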

You can also use strides, but if you want to do operations on the result, I would advise against that, since memory handling with strided views is tricky. Here is a strides solution if you insist:

from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)

c.shape: (991, 20)

If you don't want the last row, simply drop it with c[:-1].
A similar solution works with numpy's as_strided function (np.lib.stride_tricks.as_strided); the two operate similarly, though I am not sure of their internals.

UPDATE: if you want to find unique values and their frequencies in each row of c you can do:

unique_values = []
unique_counts = []
for row in c:
  unique, unique_c = np.unique(row, return_counts=True)
  unique_values.append(unique)
  unique_counts.append(unique_c)

Note that numpy arrays have to be rectangular, meaning every row must have the same number of elements. Since different rows in c can have different numbers of unique values, you cannot create a regular numpy array holding the unique values of each row (an alternative would be to make a structured numpy array). Therefore, a solution is to make a list of arrays, one per row of c. unique_values is a list of arrays of unique values, and unique_counts holds their frequencies in the same order.
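If the values are known to lie in a fixed integer range, a rectangular result is possible by recording a frequency of 0 for values missing from a row. The sketch below assumes the range -10 to 10 from the question; lo, hi, K, shifted and counts are illustrative names, and c is a random stand-in so the snippet runs on its own:

```python
import numpy as np

# Stand-in for c: 991 rows of 20 integers in [-10, 10].
rng = np.random.default_rng(0)
c = rng.integers(-10, 11, size=(991, 20))

lo, hi = -10, 10          # assumed known value range
K = hi - lo + 1           # number of distinct possible values (21)
rows = c.shape[0]

# Map each value to 0..K-1, then offset every row into its own block of K bins.
shifted = (c - lo) + K * np.arange(rows)[:, None]

# One bincount over all rows; counts[r, j] is how often value lo + j occurs in row r.
counts = np.bincount(shifted.ravel(), minlength=rows * K).reshape(rows, K)
values = np.arange(lo, hi + 1)

print(counts.shape)  # (991, 21)
```

For a single row r, values[counts[r] > 0] and counts[r][counts[r] > 0] should match what np.unique(c[r], return_counts=True) returns.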

edited May 12 '20 at 00:49

answered May 11 '20 at 00:50

Ehsan


If I now wanted to find unique values and their frequency in each row of array c, how will you do it without for loops or do it much faster than for standard for loops that would iterate over the entire length of the array. Thanks for your help – Ghanshyam Bhat May 11 '20 at 21:05
@GhanshyamBhat you are welcome. If it solved your issue, please kindly go ahead and accept it so others find it helpful too. Your next question is a different question. I will add the answer to that under this solution. How would you like to have unique values and frequencies stored per each row? since it could not be rectangular(unique values per rows could be different). One way is to include all numbers and have frequency 0 for ones missing in each row. – Ehsan May 11 '20 at 21:08
I added a for loop version in the post, unless you want to define a way to convert your output into structured array, it is hard to do it without loops. Even doing it any other way should not be significantly faster than loop because different rows has to find uniques separately. – Ehsan May 11 '20 at 21:43
Thanks for your update. I would like to have a new array with unique values and their frequency one after the other in the same row for each row in c. I really appreciate you taking the time to answer my question. Thank you very much. BTW I would like to understand the logic of your answer for creating array c. It is not clear how the function hstack works on the entire array with only np.hstack([b[i:i-N] for i in range(N)]) applied with N=10. Thanks again – Ghanshyam Bhat May 11 '20 at 23:33
when I tried your answer for row in c: unique_values, unique_counts = np.unique(row, return_counts=True) it did not generate an array with rows containing the unique values and their frequencies. It just gave two arrays with unique values and frequencies only for one of the rows in c - not sure which row - perhaps the first or the last row – Ghanshyam Bhat May 11 '20 at 23:36
@GhanshyamBhat My pleasure. I will update the post with extra explanation. In the mean time https://stackoverflow.com/help/accepted-answer helps you get more familiar with SO and how to accept an answer. Welcome to SO. – Ehsan May 12 '20 at 00:38
