
I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.

As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's instead of True and False. It appears to have something to do with the number of empty rows and the types of the other data. For simplicity's sake, I'm linking a two-sheet Excel file that replicates the issue: Boolean_1.xlsx

Here's the code:

import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet2')
print(df1, '\n' * 2, df2)

Here's the output. Note the ZBA row in particular, which has the same values in both sheets but different values in the two DataFrames:

  Name stuff  Unnamed: 1 Unnamed: 2 Unnamed: 3
0         AFD          a        dsf        ads
1         DFA          1          2          3
2         DFD      123.3       41.1       13.7
3        IIOP        why        why        why
4         NaN        NaN        NaN        NaN
5         ZBA      False      False       True 

   Name adslfa  Unnamed: 1  Unnamed: 2  Unnamed: 3
0        asdf         6.0         3.0         6.0
1         NaN         NaN         NaN         NaN
2         NaN         NaN         NaN         NaN
3         NaN         NaN         NaN         NaN
4         NaN         NaN         NaN         NaN
5         ZBA         0.0         0.0         1.0

I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.

What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?

Colin
  • Do you know the columns that should be Boolean beforehand? You can specify the data type for a column on load with the `dtype` keyword (see the sketch after these comments) – dan_g Aug 02 '18 at 20:53
  • Do the Boolean columns contain missing or blank cells? – ALollz Aug 02 '18 at 20:54
  • I know which rows are Boolean. In theory I could swap rows and columns, but I don't know how to do that before the DataFrame is loaded (by which point the data is already wrong). Unfortunately, due to the nature of the project, it needs to be able to handle empty cells, rows, and columns (although I don't understand why that should affect anything) – Colin Aug 02 '18 at 21:05
  • You might be able to dodge the issue by [transposing the data](https://stackoverflow.com/q/4869189/80610090) – Will Aug 02 '18 at 21:36
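
Here is a minimal sketch of the `dtype` idea from the comments above, assuming the file, sheet, and column names shown in the question; whether `dtype=object` preserves the Boolean cells exactly as `True`/`False` is worth verifying on your pandas version:

import pandas as pd

# dtype=object asks pandas to keep each cell as the value the Excel engine
# parsed, instead of coercing whole columns to one numeric dtype. The Boolean
# cells in the ZBA row should then come through as True/False objects.
df2 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet2', dtype=object)

# A per-column mapping also works if only some columns need protecting, e.g.
# dtype={'Unnamed: 1': object, 'Unnamed: 2': object, 'Unnamed: 3': object}

The trade-off is that every column read this way comes back as object dtype, so genuinely numeric columns lose their float/int dtypes and need converting afterwards (e.g. with `pd.to_numeric`).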

2 Answers


Pandas type-casting is applied per column / series. In general, Pandas doesn't work well with mixed types or object dtype. You should expect its internal logic to determine the most efficient dtype for each series. In this case, Pandas has chosen float dtype for a series containing float and bool values. In my opinion, this is efficient and neat.

However, as you noted, this doesn't suit a transposed input dataset, where the Boolean data runs along a row rather than down a column. Let's set up an example from scratch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [True, False, True, True],
                   'B': [np.nan, np.nan, np.nan, False],
                   'C': [True, 'hello', np.nan, True]})

df = df.astype({'A': bool, 'B': float, 'C': object})

print(df)

       A    B      C
0   True  NaN   True
1  False  NaN  hello
2   True  NaN    NaN
3   True  0.0   True

Option 1: change "row dtype"

You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:

df.iloc[3] = df.iloc[3].astype(bool)

print(df)

       A      B      C
0   True    NaN   True
1  False    NaN  hello
2   True    NaN    NaN
3   True  False   True

print(df.dtypes)

A      bool
B    object
C    object
dtype: object

Option 2: transpose and cast to Boolean

In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.

df = df.T                   # transpose dataframe
df[3] = df[3].astype(bool)  # convert series to Boolean

print(df)

      0      1     2      3
A  True  False  True   True
B   NaN    NaN   NaN  False
C  True  hello   NaN   True

print(df.dtypes)

0    object
1    object
2    object
3      bool
dtype: object
jpp

`read_excel` will determine the dtype for each column based on the first row in that column that has a value. If the first row of the column is empty, `read_excel` will continue to the next row until a value is found.

In Sheet1, the first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows are treated as strings for these columns. In this case, FALSE = False.

In Sheet2, the first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows are treated as integers for these columns. In this case, FALSE = 0.
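
If that per-column coercion is the cause, a minimal post-read fix, assuming the row can be identified by its 'ZBA' label and using the column names from the question's printed output, could look like this:

import pandas as pd

df2 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet2')

# Switch the value columns to object dtype so they can hold mixed types,
# then rebuild the ZBA row, which came back as 0.0/1.0, as proper Booleans.
value_cols = df2.columns[1:]
df2[value_cols] = df2[value_cols].astype(object)

zba = df2['Name adslfa'] == 'ZBA'
df2.loc[zba, value_cols] = df2.loc[zba, value_cols].astype(bool)

This leaves the rest of the sheet untouched; the downside is that those columns end up as object dtype, much like Option 1 in the other answer.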

Padraig
  • `determine the dtype for each column based on the first row in the column`. This is, in general, incorrect. Say in column A the first row has number 1 and the second row has number 1.5. Then column A will be read in as float instead of int. – jpp Sep 29 '20 at 17:10