
We are trying to outer-join several CSV files, each containing two columns. The column we join on (NMI, which we make the index) is a unique identifier that mostly overlaps between files (each file represents a day), but some days have more or fewer NMIs than others.
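Schematically, this is the operation for a pair of days; the frames below are toy stand-ins with made-up NMIs and values:

import pandas as pd

# Toy stand-ins for two daily files; the real ones have ~1.5M rows each.
day1 = pd.DataFrame({'NMI': ['A1', 'B2', 'C3'], 'HCT': [5, 7, 9]}).set_index('NMI')
day2 = pd.DataFrame({'NMI': ['B2', 'C3', 'D4'], 'HCT': [6, 8, 1]}).set_index('NMI')

# Outer join keeps the union of NMIs; a day missing an NMI gets NaN there.
print(day1.join(day2, how='outer', lsuffix='_day1', rsuffix='_day2'))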

When we do the first join, everything looks normal, and each of the two daily columns has the same number of values as its source .csv file.

file1_pri.count()

HCT     477358
NMI     477358
dtype: int64

file2_pri.count()

HCT     1548030
NMI     1548030
dtype: int64

file1_pri.set_index('NMI', inplace=True)
file2_pri.set_index('NMI', inplace=True)

file1_pri.count()

HCT     477358
dtype: int64

file2_pri.count()

HCT     1548030
dtype: int64

combined_file = file1_pri.join(file2_pri,
                               how='outer',
                               on='NMI',
                               lsuffix='_day1',
                               rsuffix='_day2')

combined_file.count()

NMI          1605303
HCT_day1     477594
HCT_day2     1548391
dtype: int64

combined_file.set_index('NMI', inplace=True)

combined_file.count()

HCT_day1     477594
HCT_day2     1548391
dtype: int64

However, after joining all of the csv files, the two with the fewest records (Day 1 and Day 8) cause memory issues: they have somehow grown to roughly 100x their actual size, and we're not sure why.

counter = 2

for file in csv_files.csv_filename[2:]:
    # read only the key and value columns for this day's file
    file_pri = pd.read_csv(file,
                           usecols=['NMI',
                                    'HCT'])
    file_pri.set_index('NMI', inplace=True)
    print(file_pri.count())
    combined_file = combined_file.join(file_pri,
                                       how='outer',
                                       on='NMI',
                                       lsuffix='',
                                       rsuffix='_day' + str(dates3[counter]))
    counter += 1
    combined_file.set_index('NMI', inplace=True)

combined_file.count()

HCT_day1            474331493
HCT_day2              1556243
HCT_day3              1583076
HCT_day4              1587760
HCT_day5              1488439
HCT_day6              1594631
HCT_day7              1466769
HCT_day8            474313590
HCT_day9              1570717
dtype: int64

We expected Day 1 to contain 477594 values instead of 474331493, and Day 8 to contain 462960 values instead of 474313590. Every time the script joins one of these two csv files, memory consumption nears our PC's maximum at ~31.7 GB (99% of total RAM). We had originally been using merge instead of join, but it kept failing with memory errors, which is why we switched to join to be more efficient.
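For reference, here is a quick sanity check on the raw files, comparing total rows to unique NMIs (the paths below are placeholders for our real filenames):

import pandas as pd

# Placeholder paths; if rows > unique NMIs, the key column has duplicates.
for path in ['day1.csv', 'day8.csv']:
    df = pd.read_csv(path, usecols=['NMI', 'HCT'])
    print(path, 'rows:', len(df), 'unique NMIs:', df['NMI'].nunique())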

ExeGM

1 Answer


You are using an outer join, which will "use union of keys from both frames". Any key value present in one DataFrame but not the other still produces a row, with NaN filled in on the missing side, so the joined df can have more rows than either original df. On top of that, when a key value is duplicated, the join emits one row for every pairing of the matching rows (a per-key Cartesian product), which is how the result can grow far beyond the sum of the two inputs. The behavior is shown in the docs; you will need to scroll down to the join section.
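Here is a small sketch with made-up frames showing both effects, the NaN fill and the per-key multiplication:

import pandas as pd

# Made-up frames; the key 'B' is duplicated on both sides.
left = pd.DataFrame({'HCT': [1, 2, 3]},
                    index=pd.Index(['A', 'B', 'B'], name='NMI'))
right = pd.DataFrame({'HCT': [10, 20, 30, 40]},
                     index=pd.Index(['B', 'B', 'B', 'C'], name='NMI'))

joined = left.join(right, how='outer', lsuffix='_day1', rsuffix='_day2')
print(joined)       # 'A' and 'C' each add one row with NaN on the missing side
print(len(joined))  # 8: the duplicated 'B' alone yields 2 x 3 = 6 rows

Chained across nine daily joins, even a modest number of repeated keys compounds multiplicatively, which would be consistent with counts in the hundreds of millions.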

Here is a picture of what is happening; notice the NaN values that get filled in:

[image: example outer-join result with NaN filled in for non-matching keys]

RK Replogle
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) [Why are images of text, code and mathematical expressions discouraged?](https://meta.stackexchange.com/q/320052/266284) – philipxy Apr 09 '23 at 01:48
  • Yes, we are aware of that, and that is how we are expecting it to work. The df rows increase as expected up until a point. The results we're seeing don't make sense because even if you add up all of the columns/files, you shouldn't see lengths as great as 474,331,493 because they aren't that long. Could you dive deeper into why two of our columns/files are seeing that response, but the others aren't? Could it be a memory leak for some reason that we're not thinking of? If so, how can we remediate the issue? – ExeGM Apr 11 '23 at 16:12