I'm trying to read in a massive dataframe and I keep getting this warning: DtypeWarning: Columns (3,4,5,12,13,14,17,18,19,22,23,24) have mixed types. Specify dtype option on import or set low_memory=False.
Say that this is a VERY simplified version of what my dataframe looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['alice', 'bob', np.nan, '--', 'jeff', np.nan],
                   'B': ['JFK', np.nan, 'JFK', 'JFK', 'JFK', 'JFK'],
                   'C': [.25, 0.5, np.nan, 4, 12.2, 14.4]})
       A    B      C
0  alice  JFK   0.25
1    bob  NaN   0.50
2    NaN  JFK    NaN
3     --  JFK   4.00
4   jeff  JFK  12.20
5    NaN  JFK  14.40
It is my understanding that NaN is a float.
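A minimal sketch of how that understanding can be checked, assuming only the sample values from column A above. The names `col_a` and `element_types` are illustrative, and `dtype=object` is passed explicitly so the result does not depend on how a given pandas version infers string columns:

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float
nan_type = type(np.nan)

# Sample values from column A, stored explicitly with the generic object dtype
col_a = pd.Series(['alice', 'bob', np.nan, '--', 'jeff', np.nan], dtype=object)

# Collect the Python type of every element in the column
element_types = {type(value).__name__ for value in col_a}

print(nan_type)       # the float class
print(element_types)  # contains both 'str' and 'float'
```

So the column itself has a single dtype (object), but the individual elements are a mix of str and float.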
In column A, what is the best way to represent blank values while maintaining the same dtype within the column? I would like to change '--' to be blank values as well. Similarly in column B, the NaN (float) mixes with JFK (string). What's the best way to solve this?
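One possible approach for the '--' placeholder, sketched on the sample frame above, is to replace it with NaN after loading so that every blank in column A uses the same missing-value marker. The name `cleaned_a` is illustrative, and this is only one option, not necessarily the best one for a very large file:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['alice', 'bob', np.nan, '--', 'jeff', np.nan],
                   'B': ['JFK', np.nan, 'JFK', 'JFK', 'JFK', 'JFK'],
                   'C': [.25, 0.5, np.nan, 4, 12.2, 14.4]})

# Turn the '--' placeholder into NaN; existing NaN values are left untouched
cleaned_a = df['A'].replace('--', np.nan)

# Boolean mask of the rows that are now missing in column A
missing_mask = cleaned_a.isna()
print(missing_mask.tolist())
```

When reading from a file, passing `na_values='--'` to `pd.read_csv` should achieve the same thing at parse time.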
Eventually I want to do pd.read_csv( , dtype = {'A':str, 'B': str, 'C':np.int32}) or something like that. Correct me if I am wrong here as well.
Edit:
test = pd.read_csv('test.csv', na_values='--', dtype = {'A':str, 'B': str, 'C':np.float64})
in: test
out:
       A    B      C
0  alice  JFK   0.25
1    bob  NaN   0.50
2    NaN  JFK    NaN
3    NaN  JFK   4.00
4   jeff  JFK  12.20
5    NaN  JFK  14.40
type(test.iloc[2]['A']) # float (the NaN)
type(test.iloc[1]['A']) # str
Is it ok that these are different types? Is there a way to make both a string? Or is that not even recommended?
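For comparison, a hedged sketch of one alternative: pandas (1.0 and later) has a dedicated `'string'` extension dtype, where missing entries are represented by `pd.NA` rather than a float NaN. The CSV text below is a made-up stand-in for test.csv, read through `io.StringIO` so the example is self-contained. Note that missing values still do not become strings; they just use a marker that belongs to the string dtype:

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of test.csv
csv_text = (
    "A,B,C\n"
    "alice,JFK,0.25\n"
    "bob,,0.5\n"
    ",JFK,\n"
    "--,JFK,4\n"
    "jeff,JFK,12.2\n"
    ",JFK,14.4\n"
)

# Treat '--' as missing and ask for the nullable string dtype on A and B
test = pd.read_csv(io.StringIO(csv_text),
                   na_values='--',
                   dtype={'A': 'string', 'B': 'string', 'C': 'float64'})

missing_value = test['A'].iloc[2]   # pd.NA instead of a float NaN
present_value = test['A'].iloc[1]   # a regular Python str

print(test.dtypes)
print(test['A'].isna().sum())
```

With this dtype, `isna()`, `fillna()` and the `.str` accessor all work as usual, so the column can be treated as text throughout while still marking blanks as missing.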