Defining dtypes at time of import of a tab-delimited file into a dataframe

Asked Sep 17 '18 at 12:07

Active Sep 17 '18 at 13:14


As some data are ambiguous (e.g., customer numbers that should be interpreted as strings and not integers), I am using the dtype option: `pd.read_table('BSC.csv', dtype=str)`.

It works fine, as pandas no longer complains about ambiguous types. Nevertheless, when I stored the dataframe in an HDFStore, I got a warning that using untyped objects will result in a performance loss. I looked at my dataframe using `.dtypes`, and I saw that all the types had reverted to 'object'.

I looked at the `pandas.read_table` documentation, but I did not find any setting that would freeze the type to string after the import. Does this mean that the only option is to add a `.apply(to_string)` step just before storing the dataframe?
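A minimal sketch of the behaviour described above, assuming a small in-memory sample in place of 'BSC.csv' (the column names `customer` and `amount` are hypothetical). It shows what `dtype=str` changes at import time. On the pandas versions current when this was asked, `.dtypes` reports such columns as `object`; newer releases may report a dedicated string dtype instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'BSC.csv': tab-delimited, customer IDs with leading zeros.
raw = "customer\tamount\n00123\t10.5\n00456\t7.25\n"

# Default type inference: the IDs are parsed as integers and lose their leading zeros.
as_inferred = pd.read_table(io.StringIO(raw))

# dtype=str: every column is read as text, so the IDs survive unchanged.
as_text = pd.read_table(io.StringIO(raw), dtype=str)

print(as_inferred["customer"].tolist())
print(as_text["customer"].tolist())
print(as_text.dtypes)  # how the text columns are labelled depends on the pandas version
```

The values are ordinary Python strings in either case; the `object` label is simply how older pandas reports text columns.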

edited Sep 17 '18 at 13:14

JohnE


asked Sep 17 '18 at 12:07

JCF

Loosely speaking, pandas has floats, integers, and categoricals, and everything else (including strings) is an object. You can't do anything with strings (except possibly convert them to categoricals), but you might want to run `pd.to_numeric` on the numerical columns before storing. – JohnE Sep 17 '18 at 12:28
Hi, John, this is what I did, but I still get the warning from HDFStore that untyped data will result in performance losses. How should the DataFrame be processed before being sent to the data store? – JCF Sep 17 '18 at 12:58
I dunno, I'm not an HDF expert, but you can probably just ignore the message. I think it is in effect just reminding you not to store numbers as strings. – JohnE Sep 17 '18 at 13:14
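The `pd.to_numeric` suggestion from the comments could look roughly like the sketch below. The sample data and the column names (`customer`, `amount`, `qty`) are assumptions for illustration, not taken from the original file. Only the columns that really hold numbers are converted, while the ID column stays as text.

```python
import io

import pandas as pd

# Hypothetical sample; the real file and its column names are not shown in the question.
raw = "customer\tamount\tqty\n00123\t10.5\t2\n00456\t7.25\t1\n"

# Read everything as text first so that ambiguous IDs are not reinterpreted.
df = pd.read_table(io.StringIO(raw), dtype=str)

# Convert only the columns that are genuinely numeric.
numeric_cols = ["amount", "qty"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```

An alternative with the same effect is to pass a per-column mapping at import time, for example `dtype={'customer': str}`, and let pandas infer the remaining columns. Whether this silences the HDFStore message is not confirmed here, since the string column is still stored as text.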
