
I need to create an array of arrays from the dataframe:

HR    sBP   dBP  T     ID
101   51    81   37.1  P1.1
102   52    82   37.2  P1.1
103   53    83   37.3  P1.1
104   54    84   37.4  P1.1
105   55    85   37.5  P1.1
210   65    90   36.1  P1.2
210   65    90   36.2  P1.2
210   65    90   36.3  P1.2
210   65    90   36.4  P1.2
210   65    90   36.5  P1.2
...
100   50    75   37    Pm.n
100   50    75   37    Pm.n
...
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6

where each chunk is a multivariate time series with HR, sBP, dBP and T° as variables, and the ID variable is the label for each subseries of data from each patient. The chunks for each patient are of variable length. I need to end up with an array like this:

array([[[101,    51,    81,    37.1],
        [102,    52,    82,    37.2],
        [103,    53,    83,    37.3],
        [104,    54,    84,    37.4],
        [105,    55,    85,    37.5]],

       [[210,    65,    90,    36.1],
        [210,    65,    90,    36.2],
        [210,    65,    90,    36.3],
        [210,    65,    90,    36.4],
        [210,    65,    90,    36.5]],

      ...

       [[100,    50,    60,    37.0], 
        [100,    50,    60,    37.0],
        [100,    50,    60,    37.0],  
        [100,    50,    60,    37.0],
        [100,    50,    60,    37.0]]])

With array.shape = (number of unique IDs, length of arrays, number of dimensions)

My code looks like this:

import numpy as np

df_grp = df.groupby('ID')

for name, gp in df_grp:
    if name == 'P1.1':
        arr = gp.drop(columns = ['ID']).to_numpy().reshape(-1,4)  

    else:
        temp_arr = gp.drop(columns = ['ID']).to_numpy().reshape(-1,4)  
        arr = np.append(arr, temp_arr, axis=0)

But it gives me an array like this:

array([[101,    51,    81,    37.1],
       [102,    52,    82,    37.2],
       [103,    53,    83,    37.3],
       [104,    54,    84,    37.4],
       [105,    55,    85,    37.5],
       [210,    65,    90,    36.1],
       [210,    65,    90,    36.2],
       [210,    65,    90,    36.3],
       [210,    65,    90,    36.4],
       [210,    65,    90,    36.5],

      ...

       [100,    50,    60,    37.0],
       [100,    50,    60,    37.0],
       [100,    50,    60,    37.0],
       [100,    50,    60,    37.0],
       [100,    50,    60,    37.0]])

With array.shape = (number of rows in df, number of dimensions). The result is the same with or without reshape, and likewise with squeeze. I need the array in the aforementioned format so I can use it with the tslearn package for multivariate time series clustering. Any help is greatly appreciated.
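For context, the flattening described above is exactly what `np.append(..., axis=0)` does: it concatenates along the row axis, so the per-patient grouping is lost. A minimal sketch with hypothetical 4×2 chunks:

```python
import numpy as np

# Two 2-D chunks, standing in for two patients' subseries.
a = np.arange(8).reshape(4, 2)
b = np.arange(8, 16).reshape(4, 2)

# np.append with axis=0 concatenates rows: the result is one
# long 2-D array, and the chunk boundaries disappear.
flat = np.append(a, b, axis=0)
print(flat.shape)       # (8, 2)

# To keep each chunk as its own leading-axis slice, collect the
# chunks and stack them along a new axis instead.
stacked = np.stack([a, b])
print(stacked.shape)    # (2, 4, 2)
```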

1 Answer


I think you are looking for this:

arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()

Similar to your solution, this first groups by ID and then uses to_numpy to convert each group to an array. Please note that you cannot have a rectangular NumPy array if your per-ID arrays have different shapes (i.e. different ID lengths). Therefore, this code returns the array of arrays you are looking for.
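A self-contained sketch of the one-liner on a small stand-in dataframe (hypothetical values, with unequal group lengths to show the ragged case):

```python
import numpy as np
import pandas as pd

# Small stand-in for the dataframe in the question.
df = pd.DataFrame({
    'HR':  [101, 102, 210, 210, 100],
    'sBP': [51, 52, 65, 65, 50],
    'dBP': [81, 82, 90, 90, 75],
    'T':   [37.1, 37.2, 36.1, 36.2, 37.0],
    'ID':  ['P1.1', 'P1.1', 'P1.2', 'P1.2', 'P2.1'],
})

# Move ID into the index so it is excluded from to_numpy, group by it,
# convert each group to a 2-D array, and collect the groups into an
# object array of arrays.
arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()

print(arr.shape)     # (3,)  -- one 2-D array per unique ID
print(arr[0].shape)  # (2, 4)
```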

output:

[array([[101. ,  51. ,  81. ,  37.1],
        [102. ,  52. ,  82. ,  37.2],
        [103. ,  53. ,  83. ,  37.3],
        [104. ,  54. ,  84. ,  37.4],
        [105. ,  55. ,  85. ,  37.5]])
  array([[210. ,  65. ,  90. ,  36.1],
        [210. ,  65. ,  90. ,  36.2],
        [210. ,  65. ,  90. ,  36.3],
        [210. ,  65. ,  90. ,  36.4],
        [210. ,  65. ,  90. ,  36.5]])
 ...
  array([[100.,  50.,  75.,  37.],
        [100.,  50.,  75.,  37.]])
 ...
  array([[100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.]])]

If all IDs have the same number of rows, you can stack the arrays in arr above to get a single 3-D array:

np.stack(arr)

[[[101.   51.   81.   37.1]
  [102.   52.   82.   37.2]
  [103.   53.   83.   37.3]
  [104.   54.   84.   37.4]
  [105.   55.   85.   37.5]]

 [[210.   65.   90.   36.1]
  [210.   65.   90.   36.2]
  [210.   65.   90.   36.3]
  [210.   65.   90.   36.4]
  [210.   65.   90.   36.5]]
...
 [[100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]]]
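If the IDs have different lengths, one common workaround (not from the answer above, just a sketch) is to pad every chunk with NaN rows up to the longest length before stacking, which yields the 3-D shape (number of IDs, max length, number of dimensions). I believe tslearn's `to_time_series_dataset` uses the same NaN-padding convention for variable-length series:

```python
import numpy as np

# Hypothetical ragged result: per-ID arrays of different lengths.
chunks = [
    np.array([[101., 51., 81., 37.1],
              [102., 52., 82., 37.2]]),
    np.array([[210., 65., 90., 36.1],
              [210., 65., 90., 36.2],
              [210., 65., 90., 36.3]]),
]

# Pad each chunk with NaN rows to the longest length, then stack.
max_len = max(c.shape[0] for c in chunks)
padded = np.stack([
    np.vstack([c, np.full((max_len - c.shape[0], c.shape[1]), np.nan)])
    for c in chunks
])
print(padded.shape)  # (2, 3, 4)
```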
  • Thanks but I get the error `TypeError: 'deep' is an invalid keyword argument for copy()` from the `.apply(pd.DataFrame.to_numpy)` section of your code. And please bear in mind that the length of IDs can vary. – Juan Weber Jun 10 '20 at 12:06
  • @JuanWeber As I posted the output, it works on the data frame example you provided in the question. Are your pandas and numpy versions the most up-to-date? If not, please try updating them. If did not work, please provide more info of error so we can help better. Thank you. As for different length of IDs, the code handles it. That is why the line provides you with an array of arrays as explained. – Ehsan Jun 10 '20 at 12:09
  • Thanks, updating both pandas and numpy fixed the error. The resulting array has `shape = (number of unique IDs, )` and it is accepted by tslearn, but I'm not sure if that is the expected behaviour. Do you think it's possible to get it in the shape I posted at the beginning, `shape = (n° unique IDs, ??, n° of dimensions)`? I don't know what size would go in the length dimension. – Juan Weber Jun 10 '20 at 14:08