pandas.DataFrame.duplicated works great for finding duplicate rows across specified columns within a single DataFrame.
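For example, within one in-memory DataFrame I can do something like this (column names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 1], "name": ["a", "b", "a"], "value": [10, 20, 30]})

# Boolean Series marking rows whose ("id", "name") combination was seen earlier
dupes = df.duplicated(subset=["id", "name"])
print(df[dupes])
```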
However, my dataset is larger than what fits in memory, and even larger than what would fit after upgrading the machine within a reasonable budget.
For most of the analyses I have to run, this is fine: I can loop over my dataset (CSV and DBF files), load each file into memory on its own, and process everything in sequence (roughly as in the sketch below). For duplicate analysis, however, this approach only finds duplicates within individual files, not across the whole dataset.
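To make the setup concrete, here is roughly what my per-file pass looks like (the paths and key columns are placeholders; the DBF files go through a separate reader but are handled the same way once loaded):

```python
import glob
import pandas as pd

KEY_COLUMNS = ["id", "name"]  # placeholder key columns that define a duplicate

for path in glob.glob("data/*.csv"):
    df = pd.read_csv(path)
    # This only flags duplicates inside the current file,
    # not rows that also appear in other files.
    within_file_dupes = df[df.duplicated(subset=KEY_COLUMNS)]
    print(path, len(within_file_dupes))
```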
Is there an algorithm or approach for finding duplicates across multiple DataFrames without having to load them all into memory at the same time?