
I'm trying to downcast the columns of a CSV while reading it, because doing it after reading the file is too time-consuming. So far so good. The problem occurs, of course, if a column has NA values. Is there any possibility to ignore those, or to filter them out while reading, maybe with the converters argument of pandas read_csv? And what does the 'verbose' argument do? The documentation only says it indicates the number of NA values placed in non-numeric columns.

My approach to downcasting so far is to read the first two rows and guess the dtypes. From those I create a mapping dict for the dtype argument used when reading the whole CSV. Of course, NaN values can occur in later rows, and that is where the mixed dtypes come from:

import pandas as pd

# read only the first two rows to infer the dtypes cheaply
df = pd.read_csv(filePath, delimiter=delimiter, nrows=2,
                 low_memory=True, memory_map=True, engine='c')

if downcast:
    # map the inferred dtypes to smaller ones (int8 will fail once NaN appears)
    mapdtypes = {'int64': 'int8', 'float64': 'float32'}
    dtypes = list(df.dtypes.apply(str).replace(mapdtypes))
    # the dtype argument accepts column positions as keys
    dtype = {key: value for (key, value) in enumerate(dtypes)}
    # re-read the whole file with the downcast dtypes
    df = pd.read_csv(filePath, delimiter=delimiter, memory_map=True,
                     engine='c', low_memory=True, dtype=dtype)
Varlor

2 Answers


I'm not sure I've properly understood your question, but you are probably looking for the na_values argument, which lets you specify one or more strings to be recognized as NaN values.
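For example, a minimal sketch (the file name and NA marker strings here are assumptions, adjust them to your data):

import pandas as pd

# treat these placeholder strings as NaN while parsing
df = pd.read_csv('data.csv', na_values=['NA', 'n/a', '-', ''])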

EDIT: Get the dtype of each individual column and save them to a dictionary for the down-casting. Again, you can limit the number of rows read into df if you need to.

import csv
import pandas as pd

# get only the column headers from the csv
with open(filePath, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames

# read each column on its own and record its inferred dtype
dtypes = {}
for f in fieldnames:
    df = pd.read_csv(filePath, usecols=[f], nrows=1000)
    dtypes.update({f: str(df.iloc[:, 0].dtype)})
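The resulting dictionary can then be passed to read_csv when reading the full file; a minimal sketch, assuming the dtypes inferred from the 1000-row sample hold for the remaining rows (map them to smaller types first, as in the question, if you want to downcast):

# re-read the whole file using the per-column dtypes collected above
df = pd.read_csv(filePath, dtype=dtypes)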
Runkles
  • Unfortunately not. The problem is that there are NaN values in my dataframe. When checking only the first two rows to get the dtypes, later rows can of course contain NaN values, so the columns end up with mixed types, like string columns with NaN values, and the whole column cannot be cast to string. If the column's first row has a NaN value, the dtype is inferred as float, but later on the column may contain only strings, and strings cannot be converted to float, and so on. – Varlor Feb 27 '19 at 14:44
  • Can you increase the number of rows used to get the dtypes? If not, then I suggest reading the csv column by column, getting the `dtype` of each individual column and saving them to a dictionary, which you can use later for the down-casting - please have a look at the edited answer. – Runkles Feb 27 '19 at 15:43

The original question relates to this one, so I am answering with similar info. The Pandas v1.0+ "Integer Array" data types enable what you ask: use the capitalized versions of the types, such as 'Int16'. Missing values are recognized by Pandas .isnull(). Here is an example; note the capital 'I' in the Pandas-specific Int16 data type (Pandas Documentation).

import pandas as pd
import numpy as np

dftemp = pd.DataFrame({'int_col': [4, np.nan, 3, 1],
                       'float_col': [0.0, 1.0, np.nan, 4.5]})

#Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)

lst_cols = ['int_col','float_col']
lst_dtypes = ['Int16','float']
dict_types = dict(zip(lst_cols,lst_dtypes))

#Unoptimized DataFrame    
df = pd.read_csv('MixedTypes.csv')
df

Result:

   int_col  float_col
0      4.0        0.0
1      NaN        1.0
2      3.0        NaN
3      1.0        4.5

Repeat with explicit assignment of the dtypes, including Int16 for int_col:

df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types)
print(df2)

Result:
   int_col  float_col
0        4        0.0
1     <NA>        1.0
2        3        NaN
3        1        4.5
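As noted above, missing values in the nullable column are recognized by .isnull(); a quick check on the frame just read (expected output shown in comments):

# NaN in the Int16 column is stored as pd.NA and reported by isnull()
print(df2['int_col'].isnull())
# 0    False
# 1     True
# 2    False
# 3    False
# Name: int_col, dtype: bool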
jdland