I want to select columns based on dtype. Example:

import numpy as np
import pandas as pd

a = np.random.randn(10, 10).astype('float')
b = np.random.randn(10, 10).astype('uint8')
t = np.hstack((a, b))
t = pd.DataFrame(t)
uints = t.select_dtypes(include=['uint8']).columns.tolist()

The expected output from uints is [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. The problem is that when I join my original NumPy arrays (a and b) using hstack, the dtype is not detected correctly: the code above returns [].

akilat90
Allan Tanaka
  • Did you notice that `t=np.hstack((a,b))` creates an array with dtype float64? The values in `b` have been cast to float to create a numpy array with a single data type. – Warren Weckesser Nov 06 '17 at 04:42
  • An array, even when created with `hstack`, has a uniform `dtype`; in this case, float. A dataframe can have a different dtype for each column. Since your source arrays are 10 x 10, I don't think you want to go the structured array route. Or why not stick with a list? – hpaulj Nov 06 '17 at 07:24
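
To make the point in the comments above concrete, here is a minimal sketch that checks the dtypes before and after `hstack`. It assumes `b` is built with `np.random.randint` rather than by casting `randn` output, only to avoid casting negative floats to `uint8`; that substitution is not from the question.

```python
import numpy as np

a = np.random.randn(10, 10).astype('float')
# Assumption: integer data drawn directly, instead of casting randn output to uint8
b = np.random.randint(0, 256, size=(10, 10)).astype('uint8')

t = np.hstack((a, b))

# hstack must produce one dtype for the whole array, so uint8 is promoted to float64
print(a.dtype, b.dtype, t.dtype)
print(np.result_type(a, b))  # the common dtype NumPy promotes to
```

Once `t` is float64, wrapping it in `pd.DataFrame` gives 20 float64 columns, which is why `select_dtypes(include=['uint8'])` finds nothing.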

1 Answer

pandas keeps a separate dtype for each column, so I think it handles mixed data types better than a single NumPy array. Try this:

import numpy as np
import pandas as pd

# Converting your arrays to dataframes
a = pd.DataFrame(np.random.randn(10, 10).astype('float'))
b = pd.DataFrame(np.random.randn(10, 10).astype('uint8'))

df = pd.concat([a, b], axis=1)  # horizontally concatenating a and b
df.columns = range(20)  # relabel the columns 0..19 (both frames start at 0)
print(df.head())

             0         1         2         3         4         5         6   \
0  0.931404  0.612939 -0.369925 -0.777209  0.776831  1.923639  0.714632   
1  1.002620  0.612617 -0.184530 -0.279565 -0.021436  1.079653  0.299139   
2  0.938141  0.621674  1.723074  0.298568 -0.892739 -1.154118 -2.623486   
3 -1.050390 -1.058590  1.319297 -1.052302 -0.633126 -1.089275  0.796025   
4 -0.312114 -0.045124 -0.094495  0.296262  0.518496  0.068003 -1.247959   

         7         8         9   10   11   12   13   14  15   16   17   18  19  
0  0.710094 -1.465146 -0.009591   0  255    0  255    0   0    0    0    0   1  
1  1.645174 -0.491199  0.961290   0  253    0    1  254   1    0  255    0   0  
2  0.633076 -1.366998 -0.450123   0    1  255  255    0   0  255    0  254   0  
3 -0.650617  1.226741  1.884750   0  255    0    0    0   0  255    0    1   0  
4 -0.774224  0.780239 -1.072834   0  254    3    2    0   0    0    0    0   0  

df.select_dtypes(include=['uint8']).columns.tolist()

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
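
A possible variation on the code above, as a sketch: passing `ignore_index=True` to `pd.concat` with `axis=1` relabels the columns 0..n-1, so the manual column assignment is not needed. The `randint` data for `b` is an assumption used here to avoid casting negative floats to `uint8`.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randn(10, 10))
# Assumption: integer data drawn directly, instead of casting randn output to uint8
b = pd.DataFrame(np.random.randint(0, 256, size=(10, 10)).astype('uint8'))

# ignore_index=True discards the original column labels along the concat axis
df = pd.concat([a, b], axis=1, ignore_index=True)

uints = df.select_dtypes(include=['uint8']).columns.tolist()
print(uints)
```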
akilat90
  • Is there any other way to preserve the dtype after hstack, without using pd.DataFrame, in order to select columns based on dtype? – Allan Tanaka Nov 06 '17 at 05:31
  • I guess then you have to use record arrays, as suggested in this [answer](https://stackoverflow.com/a/11310158/5864582). There would be some manual work after the concatenation, though. – akilat90 Nov 06 '17 at 05:39
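
For the NumPy-only route mentioned in the last comment, a rough sketch with a structured array might look like the following. The field names `f0`..`f19` and the `randint` data for `b` are assumptions made for illustration; this is one way to do it, not necessarily the approach in the linked answer.

```python
import numpy as np

a = np.random.randn(10, 10)
# Assumption: integer data drawn directly, instead of casting randn output to uint8
b = np.random.randint(0, 256, size=(10, 10)).astype('uint8')

# Collect every column of a and b, keeping each column's own dtype
cols = [a[:, i] for i in range(a.shape[1])] + [b[:, j] for j in range(b.shape[1])]

# One field per column; the names f0..f19 are hypothetical
dtype = [(f'f{k}', c.dtype) for k, c in enumerate(cols)]
rec = np.empty(a.shape[0], dtype=dtype)
for k, c in enumerate(cols):
    rec[f'f{k}'] = c

# Positions of the fields whose dtype is uint8
uints = [k for k, name in enumerate(rec.dtype.names)
         if rec.dtype[name] == np.uint8]
print(uints)
```

The trade-off is that `rec` is a 1-D array of 10 records rather than a 10 x 20 matrix, so ordinary column slicing such as `rec[:, 10:]` no longer works; fields are accessed by name instead.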