I am working with a 4D NumPy array (an array of arrays). Each nested array has shape (1, 100, 4):

trainset.shape
(159984, 1, 100, 4)

However, some of the nested arrays contain nan values that I would like to handle. For example, the first nested array in trainset looks like this:

trainset[0]
array([[[ 7.10669020e-02,  4.91383899e-03, -1.43700407e-02,
          1.52228864e-04],
        [ 7.59807410e-02, -9.45620170e-03,             nan,
          1.35892100e-04],
        [ 6.65245393e-02,             nan,             nan,
          8.98521456e-05],
        [            nan,             nan,             nan,
          1.41090006e-05],
        [            nan,             nan,             nan,
          6.68319391e-06],
        [            nan,             nan,             nan,
         -3.27272689e+01],
        [            nan,             nan,             nan,
         -1.09090911e+01],
        [            nan,             nan,             nan,
          8.25973981e+01],
        [            nan,             nan,             nan,
          1.12207785e+02],
        [            nan,             nan,             nan,
          1.65194797e+02],
        [            nan,             nan,             nan,
          2.25974015e+02],
        [            nan,             nan,             nan,
          2.78961026e+02],
        [ 3.87926649e-03,  1.81274134e-04, -1.08764481e-03,
          3.41298685e+02],
        ...
        [ 4.06054062e-03, -9.06370679e-04,  1.30517379e-03,
          3.10129855e+02]]])

How do I check every array in trainset for nan values and, where any are found, replace them with the column's median value?

EDIT

Using:

import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')

# Each `data` has shape (1, 100, 4), i.e. 3 dimensions
for data in trainset:
  transformed_data = imp_mean.fit(data)

ValueError: Found array with dim 3. Estimator expected <= 2.

This raises the ValueError shown above.

  • [`numpy.isnan`](https://numpy.org/doc/stable/reference/generated/numpy.isnan.html), not sure how to replace with median of row. – Tadhg McDonald-Jensen Jul 20 '20 at 12:34
  • maybe `trainset[np.isnan(trainset)] = np.median(trainset,axis=1)` but that would probably give axis error? I should probably learn how numpy works. – Tadhg McDonald-Jensen Jul 20 '20 at 12:38
  • @TadhgMcDonald-Jensen median of column. As suggested, I can use SimpleImputer, but I'm not sure how to reshape a 4D to 2D then later reshape back, preserving the order. –  Jul 20 '20 at 12:52
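
The indexing idea from the comments can be made to work with NumPy alone. Note that `np.median` returns nan for any slice that contains a nan, so `np.nanmedian` is used instead. This is a minimal sketch, assuming "column" means the last axis (size 4) and that one median per column over the whole set is wanted; the random array is only a stand-in for `trainset`, and `col_median` and `filled` are illustrative names.

```python
import numpy as np

# Stand-in data with the same layout as trainset: (samples, 1, rows, columns)
rng = np.random.default_rng(0)
trainset = rng.random((6, 1, 100, 4))
trainset[0, 0, 3:12, :3] = np.nan  # inject some missing values

# Median of each of the 4 columns, ignoring nan; keepdims gives shape (1, 1, 1, 4)
col_median = np.nanmedian(trainset, axis=(0, 1, 2), keepdims=True)

# Broadcast the medians into the nan positions, leaving other values untouched
filled = np.where(np.isnan(trainset), col_median, trainset)

print(np.isnan(filled).any())
```

If a separate median per nested array is wanted instead, `axis=2` with `keepdims=True` should give one median per sample and column, although a column that is entirely nan within a sample would then stay nan.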

1 Answer

The simplest way would be to use SimpleImputer with the median imputation strategy. SimpleImputer works on 2D input and computes the statistic per column (feature), so you have to reshape your array to 2D before passing it through SimpleImputer(), and then reshape it back.
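
To see how the median strategy behaves on 2D input, here is a small sketch on a toy array. The array `X` and its values are made up for illustration; `statistics_` is the fitted attribute that holds the per-column fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy 2D array: 4 rows, 2 columns, one nan in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_filled = imputer.fit_transform(X)

# One median per column, computed from the non-missing entries
print(imputer.statistics_)
print(X_filled)
```

Each nan is replaced by the median of its own column, which is why the last axis has to stay as the column axis when the 4D array is flattened.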

To your edit: reshape the array into 2D, keeping the last axis (the columns) intact, and then reshape the result back to the original shape. Also, use fit_transform instead of fit, so that the imputer is fitted and the imputed array is returned in one call. The reshape will look something like this:

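import numpy as np

# Stand-in data with the same layout as trainset: (samples, 1, rows, columns)
A = np.random.rand(15, 1, 100, 4)
print(A.shape)  # (15, 1, 100, 4)

init_shape = A.shape

# Collapse every axis except the last, so the 4 columns stay columns
B = A.reshape(-1, init_shape[-1])
print(B.shape)  # (1500, 4)

# SimpleImputer goes here (call fit_transform on B)

# Restore the original 4D layout; the element order is preserved
B = B.reshape(init_shape)
print(B.shape)  # (15, 1, 100, 4)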
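
Putting the reshape and the imputation together, a minimal end-to-end sketch could look like the following. The random array stands in for `trainset`, and the names `flat`, `flat_filled` and `trainset_filled` are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Stand-in for trainset, with some nan values injected
rng = np.random.default_rng(0)
trainset = rng.random((15, 1, 100, 4))
trainset[0, 0, 3:12, :3] = np.nan

init_shape = trainset.shape

# Flatten to 2D: one row per (sample, row) pair, 4 columns
flat = trainset.reshape(-1, init_shape[-1])

# Fill each nan with the median of its column over the whole set
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
flat_filled = imputer.fit_transform(flat)

# Back to the original 4D shape, same element order
trainset_filled = flat_filled.reshape(init_shape)
print(trainset_filled.shape)
```

One caveat, as far as I know: with default settings SimpleImputer drops a column that contains only nan, in which case the final reshape would fail, so it is worth checking `flat_filled.shape` first. To fill the test set with the same medians, calling `imputer.transform` on the flattened test array should be enough.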
Aramakus
  • can you please add a code snippet showing how to reshape to 2D? Probably I do not need to navigate `trainset` like `for data in trainset:` I can go ahead to reshape `trainset` perform imputation and reshape it back. –  Jul 20 '20 at 12:47
  • Added some reshape example, hope it helps. – Aramakus Jul 20 '20 at 12:58