I have a program that aggregates several .csv files into one. When I run it to join 3 files with the same structure (same number and names of columns), it reports success, saying that 3 files were joined for a total of 1154341 lines. When I add a fourth file with the same structure, the message updates to 4 files and 1446553 lines. So far so good. However, when I read the two outputs with pandas (pd.read_csv(file.csv)), both dataframes report the same size, the size of the smaller file. When I analyze a single column, note the difference between the indices of the two dataframes:
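For reference, this is roughly how I read the two aggregated outputs (the file names below are placeholders, not the real ones):

import pandas as pd

# Placeholder paths for the two aggregated files
df_reembolsos_1 = pd.read_csv('reembolsos_3_arquivos.csv')  # union of 3 files
df_reembolsos_2 = pd.read_csv('reembolsos_4_arquivos.csv')  # union of 4 files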
# Union of 3 .csv files
>>> df_reembolsos_1['ideCadastro']
0 NaN
1 NaN
2 NaN
...
1154338 195997.0
1154339 195997.0
Name: ideCadastro, Length: 1154339, dtype: float64
# Union of 4 .csv files
>>> df_reembolsos_2['ideCadastro']
0 NaN
1 NaN
2 NaN
...
1446550 195997
1446551 195997
Name: ideCadastro, Length: 1154339, dtype: object
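For clarity, the reported Length can be compared against the last index label directly; this is the minimal check I would use (with the dataframes above):

# Compare the number of rows pandas reports with the last index label
print(len(df_reembolsos_1), df_reembolsos_1.index[-1])
print(len(df_reembolsos_2), df_reembolsos_2.index[-1])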
It strikes me that for the first file the number of index labels matches the reported length, while for the second the last index label is larger than the reported length. I have inspected the two files and they really are different, each with the size expected for the number of files joined. One difference I notice is the following warning when reading the larger file:
DtypeWarning: Columns (1,2,3,4,5,8,10,11,12,13,15,22,23,28) have mixed types.
When I read the smaller file, the same warning refers only to column 1. So I wonder whether this is a limitation of pandas or a problem with the data, and how I can solve it.
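In case it is useful, the full text of this pandas warning suggests specifying the dtype option on import or setting low_memory=False. A sketch of both options (same placeholder file name as above; I have not confirmed that either resolves the index mismatch):

import pandas as pd

# Option 1: read every column as string and convert types explicitly later
df_reembolsos_2 = pd.read_csv('reembolsos_4_arquivos.csv', dtype=str)

# Option 2: infer each column's dtype from the whole file at once instead of
# chunk by chunk, which is what triggers the DtypeWarning
df_reembolsos_2 = pd.read_csv('reembolsos_4_arquivos.csv', low_memory=False)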