
I have a program that aggregates several .csv files into one. When I run it to join 3 files with the same structure (same number and names of columns), it reports success, saying that 3 files were joined into a total of 1154341 lines. When I add a fourth file with the same structure, the message updates to four files and 1446553 lines. So far so good. However, when I read the two output files with pandas (`pd.read_csv(file.csv)`), both DataFrames end up the same size, the size of the smaller file. When I inspect a single column, note the difference in the indices of the two DataFrames:

 #Union of 3 .csv files
 >>>df_reembolsos_1['ideCadastro']
 0               NaN
 1               NaN
 2               NaN
 ...................
 1154338    195997.0
 1154339    195997.0
 Name: ideCadastro, Length: 1154339, dtype: float64



 # Union of 4 .csv file
 >>> df_reembolsos_2['ideCadastro']
 0               NaN
 1               NaN
 2               NaN
 ...................
 1446550    195997
 1446551    195997
 Name: ideCadastro, Length: 1154339, dtype: object

It strikes me that for the first file the highest index matches the reported Length, while for the second file the highest index is larger than the reported Length. I have looked at the two files and they really are different, with the size expected for the respective number of joined files. One difference I notice is the following warning when reading the larger file:

 DtypeWarning: Columns (1,2,3,4,5,8,10,11,12,13,15,22,23,28) have mixed types.

When I read the smaller file, the same warning mentions only column 1. So I wonder whether this problem is a limitation of pandas or a problem with the data, and how I can solve it.

Costa.Gustavo

1 Answer


This excellent answer covers the DtypeWarning pretty thoroughly: specify your dtypes on read.
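A minimal sketch of what "specify your dtypes on read" looks like. The data here is made up (only `ideCadastro` comes from your question); the point is that passing `dtype=` to `read_csv` prevents the per-chunk type inference that produces the DtypeWarning and the `float64`-vs-`object` mismatch you saw:

```python
import io
import pandas as pd

# Hypothetical sample standing in for one of the merged CSV files;
# only the ideCadastro column name is taken from the question.
csv_data = io.StringIO(
    "ideCadastro,other_col\n"
    ",a\n"
    "195997,b\n"
    "195997,c\n"
)

# Forcing the dtype up front means pandas never has to guess per chunk.
# float64 is used here because the column contains empty cells (NaN).
df = pd.read_csv(csv_data, dtype={"ideCadastro": "float64"})
print(df["ideCadastro"].dtype)  # float64
```

With the dtype fixed, both of your merged files should come back with the same dtype for that column, regardless of how many source files went into them.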

As for the indices being of unexpected size:

  1. When you append or concatenate the files, pass ignore_index=True so pandas rebuilds a clean 0..n-1 index instead of keeping each file's own.
  2. On read, you can use the .read_csv kwarg index_col. If pandas is misinterpreting a data column as the index, that could be the culprit.
  3. Check that read_csv is really reading the correct number of rows and that there aren't a bunch of empty rows, et cetera.
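Point 1 can be sketched like this, with two tiny made-up frames standing in for your CSV files; `pd.concat(..., ignore_index=True)` gives the combined frame a fresh RangeIndex, so the highest index always matches the length:

```python
import io
import pandas as pd

# Two small frames standing in for individual CSV files (hypothetical data).
df_a = pd.read_csv(io.StringIO("ideCadastro\n1\n2\n"))
df_b = pd.read_csv(io.StringIO("ideCadastro\n3\n4\n"))

# ignore_index=True discards each file's own index and numbers the
# combined rows 0..n-1, so len(combined) and the last index agree.
combined = pd.concat([df_a, df_b], ignore_index=True)

# Sanity check for point 3: the total should equal the sum of the parts.
assert len(combined) == len(df_a) + len(df_b)
print(combined.index.tolist())  # [0, 1, 2, 3]
```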
Charles Landau