
I'm trying to read in a massive dataframe and I keep getting the error: DtypeWarning: Columns (3,4,5,12,13,14,17,18,19,22,23,24) have mixed types. Specify dtype option on import or set low_memory=False.

Say that this is a VERY simplified version of what my dataframe looks like:

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['alice', 'bob', np.nan, '--', 'jeff', np.nan],
                   'B': ['JFK', np.nan, 'JFK', 'JFK', 'JFK', 'JFK'],
                   'C': [.25, 0.5, np.nan, 4, 12.2, 14.4]})


    A         B        C
0   alice   JFK     0.25
1   bob     NaN     0.50
2   NaN     JFK     NaN
3   --      JFK     4.00
4   jeff    JFK     12.20
5   NaN     JFK     14.40

It is my understanding that NaN is a float dtype.

In column A, what is the best way to represent blank values while maintaining the same dtype within the column? I would like to change '--' to be blank values as well. Similarly in column B, the NaN (float) mixes with JFK (string). What's the best way to solve this?
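One thing I tried (not sure if this is the idiomatic way) is replacing the '--' sentinel with NaN after loading, so column A only contains strings and NaN:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['alice', 'bob', np.nan, '--', 'jeff', np.nan],
                   'B': ['JFK', np.nan, 'JFK', 'JFK', 'JFK', 'JFK'],
                   'C': [.25, 0.5, np.nan, 4, 12.2, 14.4]})

# Treat '--' as another marker for a missing value
df['A'] = df['A'].replace('--', np.nan)
```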

Eventually I want to do pd.read_csv( , dtype = {'A':str, 'B': str, 'C':np.int32}) or something like that. Correct me if I am wrong here as well.

Edit:

test = pd.read_csv('test.csv', na_values='--', dtype = {'A':str, 'B': str, 'C':np.float64})

in: test
out:    
    A        B       C
0   alice   JFK     0.25
1   bob     NaN     0.50
2   NaN     JFK     NaN
3   NaN     JFK     4.00
4   jeff    JFK     12.20
5   NaN     JFK     14.40


type(test.iloc[2]['A'])    # float
type(test.iloc[1]['A'])    # string

Is it ok that these are different types? Is there a way to make both a string? Or is that not even recommended?
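From what I can tell, converting the column to pandas' nullable string dtype gives every value a single dtype (missing entries become `pd.NA` instead of a float NaN), but I don't know if that's the recommended approach here:

```python
import pandas as pd
import numpy as np

# Mimicking column A after the read_csv above
test = pd.DataFrame({'A': ['alice', 'bob', np.nan, np.nan, 'jeff', np.nan]})

# Convert to the nullable string dtype (pandas >= 1.0);
# missing values are represented by pd.NA rather than float NaN
test['A'] = test['A'].astype('string')

print(test['A'].dtype)            # string
print(type(test.iloc[1]['A']))    # <class 'str'>
print(test.iloc[2]['A'] is pd.NA) # True
```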

Danlo9
  • Any symbol can represent a missing value in any column, as long as you tell `read_csv` which values are used for which columns. The choice is yours. – DYZ Jul 09 '20 at 02:50
  • Yes, simply specify dtypes. That should remove the errors and is generally good practice. For example, for my job, I deal a lot with phone numbers. It’s critical for me to specify those as dtype `str`, or I run into various issues in my code with mixed dtypes. It depends, but I think it’s usually better to specify dtypes upfront, rather than use `low_memory=False`. – David Erickson Jul 09 '20 at 02:51
  • https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options – a11 Jul 09 '20 at 02:52
  • Technically it's not an *error*, it's a *warning*. The API is just warning you that since columns have mixed data types, more memory will be required to load the data. If memory is no concern then you can just add the suggested flag `low_memory=False` and the warning should go away. – NotAName Jul 09 '20 at 02:55
  • @pavel “massive” is subjective, but if someone told me they have a massive data frame, then I wouldn’t recommend the low_memory=False option. – David Erickson Jul 09 '20 at 02:56
  • @DavidErickson, from the opening post I assume that Danlo9 has already tried loading the dataframe, so if he/she didn't run into a MemoryError while doing it, setting the flag wouldn't be too bad. Although I agree that it shouldn't be used as a general rule. – NotAName Jul 09 '20 at 03:04
  • @pavel It might be a warning, but I cannot access the df when I try to do df.head(). I think it just didn't work at all. – Danlo9 Jul 09 '20 at 03:25
  • @DavidErickson You're right. I will stop using `low_memory=False`. This df will eventually be ~99GB once I combine all the CSVs together. – Danlo9 Jul 09 '20 at 03:26
  • @DYZ How can I tell `read_csv` to recognize NaN and '--' to represent missing values? – Danlo9 Jul 09 '20 at 03:31
  • Pass the `na_values` option. – DYZ Jul 09 '20 at 03:32

0 Answers