We are trying to outer join several csv files, each containing two columns. The column we're joining on (NMI, which we make the index) is a unique identifier and mostly overlaps between each file (each file represents a day), but sometimes certain days will have more or less than others.
When we do the first join, everything looks normal, and each of the two daily columns have the same number of values as their .csv files.
file1_pri.count()
HCT 477358
NMI 477358
dtype: int64
file2_pri.count()
HCT 1548030
NMI 1548030
dtype: int64
file1_pri.set_index('NMI', inplace = True)
file2_pri.set_index('NMI', inplace = True)
file1_pri.count()
HCT 477358
dtype: int64
file2_pri.count()
HCT 1548030
dtype: int64
combined_file = file1_pri.join(file2_pri,
how = 'outer',
on = 'NMI',
lsuffix = '_day1',
rsuffix = '_day2')
combined_file.count()
NMI 1605303
HCT_day1 477594
HCT_day2 1548391
dtype: int64
combined_file.set_index('NMI', inplace = True)
combined_file.count()
HCT_day1 477594
HCT_day2 1548391
dtype: int64
However, after joining all of the csv files, two of them (the files with the least # of records, Day 1 & Day 8) cause memory issues because they have somehow grown to ~100x their actual size, and we're not sure why.
counter = 2
for file in csv_files.csv_filename[2:]:
file_df = pd.read_csv(file,
usecols=['NMI',
'HCT'])
file_pri.set_index('NMI', inplace = True)
print(file_pri.count())
combined_file = combined_file.join(file_pri,
how = 'outer',
on = 'NMI',
lsuffix = '',
rsuffix = '_day'+str(dates3[counter]))
counter += 1
combined_file.set_index('NMI', inplace = True)
combined_file.count()
HCT_day1 474331493
HCT_day2 1556243
HCT_day3 1583076
HCT_day4 1587760
HCT_day5 1488439
HCT_day6 1594631
HCT_day7 1466769
HCT_day8 474313590
HCT_day9 1570717
dtype: int64
We expected Day 1 to contain 477594 values instead of 474331493 and Day 8 to contain 462960 values instead of 474313590. The memory consumption every time the script joins these two csv files is near max of our PC's RAM at ~31.7 GB (99% of total RAM). We had originally been using merge instead of join, but that kept failing due to memory errors, so that's why we switched to join (to be more efficient).