I'm writing a data pre-processor for machine learning that needs to treat boolean data as categories rather than treating 1 as bigger than 0. After importing a CSV table into a pandas DataFrame, I want to find the columns that hold only boolean values and cast them to the boolean type, without iterating through all numeric columns to do so. Pandas reads columns of 0s and 1s as 'int64', and I haven't found an existing method that solves this problem.

I've tried NumPy's safe casting, but it fails: instead of checking whether the array holds any values that don't fit into a boolean, it refuses to downcast from any other type:

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1]})
numpy_array = df.values
# The next line raises the TypeError shown below
safe_booleans = numpy_array.astype(bool, casting='safe')

Cannot cast array from dtype('int64') to dtype('bool') according to the rule 'safe'

If I remove 'safe' casting, then it works, but I need 'safe' because there are non-boolean columns too which astype would otherwise turn into booleans with loss of data.
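To make the behaviour concrete, here is a minimal sketch (not part of the original post) showing that NumPy decides 'safe' casts from the dtypes alone, followed by one possible value-based check. The helper name `holds_only_booleans` is hypothetical, and treating "boolean" as "only 0/1 or True/False" is an assumption taken from the discussion here.

```python
import numpy as np

# 'safe' casting is decided from the dtypes alone; no array values are inspected.
type_level_ok = np.can_cast(np.int64, np.bool_, casting="safe")


def holds_only_booleans(values):
    """Hypothetical helper: True if every value is 0/1 or True/False."""
    arr = np.asarray(values)
    if arr.dtype == bool:
        return True
    # Strings and other non-numeric data are never treated as boolean here.
    if not np.issubdtype(arr.dtype, np.number):
        return False
    return bool(np.isin(arr, [0, 1]).all())


zero_one = np.array([1, 0, 1])
counts = np.array([1, 2, 3])

# Cast only when the value-based check passes, so no information is lost.
as_bool = zero_one.astype(bool) if holds_only_booleans(zero_one) else zero_one
```

With this check in front of it, the default (unsafe) `astype(bool)` is only ever applied to arrays where it cannot lose data.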

Much obliged if you could point me to my mistake or suggest other methods which would turn numeric columns/arrays with only boolean values into boolean type.

Robert Peetsalu
  • *" I want to downcast safely in order to not turn any non-boolean features into booleans."* I'm confused. Are the values that you are trying to convert to boolean always 0 and 1? If not, what do you expect the result of casting to boolean to be? – Warren Weckesser Apr 30 '17 at 23:20
  • The note in the documentation for astype says: "Starting in NumPy 1.9, astype method now returns an error if the string dtype to cast to is not long enough in ‘safe’ casting mode to hold the max value of integer/float array that is being casted". That's pretty clear to me: bool can't properly hold values larger than 1, let alone the max value of an int64. –  Apr 30 '17 at 23:23
  • Let us assume that by boolean you mean 1 or 0, what do you want the behavior to be if they are not thus? – Stephen Rauch Apr 30 '17 at 23:28
  • Adding to my previous comment... If you know that the values in the array are always 0 and 1, you can drop the argument `casting='safe'` and use the default (`casting='unsafe'`), because there is no lost information in that case (i.e. there is nothing unsafe about it). – Warren Weckesser Apr 30 '17 at 23:44
  • @WarrenWeckesser If I remove 'safe' casting, then it works, but I need 'safe' because there are non-boolean columns too which astype would otherwise turn into booleans with loss of data. I want the code to accept both 0/1, Yes/No and True/False, but in this example even 0/1 doesn't work if casting safely. – Robert Peetsalu May 01 '17 at 07:58
  • @Evert but that's the problem - there aren't any values other than 0 and 1 which in python tongue are the boolean values. – Robert Peetsalu May 01 '17 at 08:28
  • The note specifically talks about the *maximum* value of the original *type*. Not the maximum (or minimum) value in the array itself, but what maximum value the array potentially can hold. Which is a lot larger than 1. That is why the cast fails. –  May 01 '17 at 10:55
  • If you are concerned about specific columns, can't you just loop over the columns that contain the 0's and 1's only, cast those, re-assign them to the dataframe (or perhaps a per-column cast can be safely done in Pandas), and you're done? I'd say it's even clearer, because the code becomes very explicit which columns it is changing to boolean. –  May 01 '17 at 10:57
  • @Evert Although astype() doc note speaks of "**string** dtype to cast to is not long enough", I guess they meant any dtype argument which can be given as a string, including bool dtype, which can't contain all possible int64-s. If so, then that explains why downcasting isn't compatible with safe casting, but the question remains - how to downcast from int to bool only those columns where there are only boolean values? – Robert Peetsalu May 01 '17 at 11:15
  • @Evert I corrected the question accordingly – Robert Peetsalu May 01 '17 at 11:36
  • @Evert P.S: Ignore Yes/No, just standard boolean values are enough – Robert Peetsalu May 01 '17 at 12:03

1 Answer

For now I wrote an iteration to solve the problem:

import pandas as pd

table = pd.DataFrame({'A': [1, 0, 1],
                      'B': [1, 2, 3],
                      'C': [True, True, False],
                      'D': ['a', 'b', 'c']})
# Cast a column to bool only if every value in it is 0 or 1
for column in table.columns:
    if table[column].isin([0, 1]).all():
        # Assign by label so the column is replaced and its dtype changes
        table[column] = table[column].astype(bool)
table.info()  # info() prints its report itself and returns None

But I believe there shouldn't be a need for this every time someone needs booleans to be of their own data type.
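The same per-column check can also be written without an explicit loop. The following is only a sketch of one way to do it, using `DataFrame.isin`, `DataFrame.all` and `DataFrame.astype` with a column-to-dtype mapping. The variable names are illustrative, and the rule "a column is boolean if it contains only 0/1" is the same assumption as in the loop above.

```python
import pandas as pd

table = pd.DataFrame({'A': [1, 0, 1],
                      'B': [1, 2, 3],
                      'C': [True, True, False],
                      'D': ['a', 'b', 'c']})

# One True/False flag per column: does it contain nothing but 0 and 1?
only_zero_one = table.isin([0, 1]).all()

# Keep the labels of the columns whose flag is True.
bool_columns = only_zero_one[only_zero_one].index

# Cast just those columns; every other column keeps its dtype and values.
converted = table.astype({name: bool for name in bool_columns})
```

Note that this rule also converts a column that happens to contain only 1s (or only 0s), and it matches floats such as 0.0 and 1.0, so it may be worth restricting the check to integer columns if that matters for the data at hand.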

Robert Peetsalu