3

I have 4D data in a data frame. I need to convert it to 3D Numpy array. I can do it with for-loops, but is there more efficient way?

# Data:
df = pd.DataFrame()
df['variable'] = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'A',
       'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D']
df['date'] = [101,102,103]*8
df['itemID'] = ['item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item2',
       'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2']
df['value1'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12]
df['value2'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12]
df['value3'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12]
df['value4'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12]

# Pivoting:
pivoted = df.pivot(index=['itemID', 'date'], columns='variable', values=[*df.columns[df.columns.str.startswith('value')]])
pivoted.index.levshape

Level shape is: (2, 3)

And it looks like this:

data after pivoting

# To Numpy:
pivoted2array = pivoted.to_numpy()
pivoted2array.shape

Shape is now: (6, 16)

# Reshaping to 3D:
pivoted2array3d = pivoted2array.reshape(*pivoted.index.levshape,-1)
pivoted2array3d.shape

Shape is now: (2, 3, 16)

And it looks like this:

array([[[ 1,  2,  3,  4,  1,  2,  3,  4,  1,  2,  3,  4,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  5,  6,  7,  8,  5,  6,  7,  8,  5,  6,  7,  8],
        [ 9, 10, 11, 12,  9, 10, 11, 12,  9, 10, 11, 12,  9, 10, 11, 12]],

       [[ 1,  2,  3,  4,  1,  2,  3,  4,  1,  2,  3,  4,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  5,  6,  7,  8,  5,  6,  7,  8,  5,  6,  7,  8],
        [ 9, 10, 11, 12,  9, 10, 11, 12,  9, 10, 11, 12,  9, 10, 11, 12]]])

And this is the troublesome part where I convert (reorder) the values with for-loops:

dimension3 = []
for k in range(pivoted2array3d.shape[0]): # unique items
    for j in range(pivoted2array3d.shape[1]):  # unique dates
        for i in range(pivoted2array3d.shape[2])[0:pivoted2array3d.shape[2]:4]: 
            element = pivoted2array3d[k][j][i] 
            dimension3.append(element)
        for l in range(pivoted2array3d.shape[2])[0+1:pivoted2array3d.shape[2]:4]: 
            element = pivoted2array3d[k][j][l] 
            dimension3.append(element)
        for m in range(pivoted2array3d.shape[2])[0+2:pivoted2array3d.shape[2]:4]: 
            element = pivoted2array3d[k][j][m] 
            dimension3.append(element)
        for n in range(pivoted2array3d.shape[2])[0+3:pivoted2array3d.shape[2]:4]: 
            element = pivoted2array3d[k][j][n] 
            dimension3.append(element)
len(dimension3)

As a result I have a list of length 96.

Then I reshape it back to 3D Numpy array:

final = np.array(dimension3).reshape(*pivoted2array3d.shape)
final.shape

It has again shape: (2, 3, 16)

And the final result looks like this as desired:

array([[[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8],
        [ 9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12]],

       [[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8],
        [ 9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12]]])

Is there computationally more elegant way to reorder my array? And is there a way to do less reshaping steps? I would really like to learn how to use Numpy operations!

My real data include thousands of items, hundreds of dates, tens of variables and value-variables.

Testing the proposed solutions

Thank you Shubham Sharma, Quang Hoang, and mathfux for your solutions. I made the initial data more complicated by adding one more date only for item1 and needed to fill the missing date for item2. The proposed solutions still work.

New data:

df = pd.DataFrame()
df['variable'] = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'A',
       'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'A', 'B', 'C', 'D']
df['date'] = [101,102,103]*8 + [104,104,104,104]
df['itemID'] = ['item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item1', 'item2',
       'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2', 'item2']
df['value1'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12,13,13,13,13]
df['value2'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12,13,13,13,13]
df['value3'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12,13,13,13,13]
df['value4'] = [1,5,9,2,6,10,3,7,11,4,8,12,1,5,9,2,6,10,3,7,11,4,8,12,13,13,13,13]

Pivoting and reindexing:

pivoted = df.pivot(index=['itemID', 'date'], columns='variable', values=[*df.columns[df.columns.str.startswith('value')]])
m = pd.MultiIndex.from_product([df['itemID'].unique(),df['date'].unique()], names=pivoted.index.names)
pt = pivoted.reindex(m, fill_value = 0)

Solution 1:

%%time
pt.sort_index(level=1, axis=1)\
       .values.reshape(*pivoted.index.levshape[:2], -1)

CPU times: user 895 µs, sys: 135 µs, total: 1.03 ms Wall time: 930 µs

Solution 2:

%%time
pt.stack(level=0).unstack().to_numpy().reshape(-1, df.date.nunique(), pt.shape[1])

CPU times: user 6.53 ms, sys: 1.62 ms, total: 8.15 ms Wall time: 6.58 ms

Solution 3:

%%time
pt.to_numpy().reshape(2,df.date.nunique(),4,4).swapaxes(2,3).reshape(2,df.date.nunique(),16)

CPU times: user 387 µs, sys: 24 µs, total: 411 µs Wall time: 397 µs

hermo
  • 95
  • 4

3 Answers3

2

It seems like np.swapaxes does the trick you need: arr.reshape(2,3,4,4).swapaxes(2,3).reshape(2,3,16)

The main idea is to swap the axes in the most inner data:

[ 1,  2,  3,  4,  1,  2,  3,  4,  1,  2,  3,  4,  1,  2,  3,  4] ->
[[ 1,  2,  3,  4],  [1,  2,  3,  4],  [1,  2,  3,  4],  [1,  2,  3,  4]] ->
[ 1,  1,  1,  1], [2,  2,  2,  2], [3,  3,  3,  3],  [4,  4,  4,  4]] ->
[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4]
mathfux
  • 5,759
  • 1
  • 14
  • 34
2

It looks like you only need to swap the column level of pivoted:

a = df.pivot(index=['itemID','date'], columns=['variable']).stack(level=0).unstack()
a.to_numpy().reshape(-1, df.date.nunique(), a.shape[1])

Output:

array([[[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8],
        [ 9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12]],

       [[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8],
        [ 9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12]]])
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
2

We can try sorting the columns then reshape using the index levels

pivoted.sort_index(level=1, axis=1)\
       .values.reshape(*pivoted.index.levshape[:2], -1)

array([[[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8],
        [ 9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12]],

       [[ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8],
        [ 9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12]]])
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53