
So I have a couple of Excel files totaling 1.8 GB for now, and the data keeps growing. All the files have the same columns and may have some rows that overlap with rows in other files. Currently I have to read all the files into memory (which is slow, and soon I won't be able to at all because of my PC's RAM limit). I am using the following two methods, but both are equally memory-inefficient and give almost the same result:

  1. all_data = pd.concat(data_dict.values(), ignore_index=True)
  2. for df in data_dict.values(): all_data = pd.concat([all_data, df]).drop_duplicates().reset_index(drop=True)

So I was wondering: is there a way to avoid reading all the data into memory for the comparison, and ideally to limit how much memory pandas uses? Speed is not a big concern for me, but memory is. I want my code to stay usable as the data keeps growing. Any suggestions?

  • Have you considered moving to a DBMS? Something like PostgreSQL / SQL Server / Oracle is designed for this. – Code Different Aug 18 '20 at 18:21
  • ^^^^^ Excel files totaling 1.8GB is a huge red flag ^^^^ – BigBen Aug 18 '20 at 18:24
  • Use [Spark/Pyspark](https://spark.apache.org/), which will partition your dataset for you when creating an RDD or DataFrame. If you want to keep your pandas code, use [Koalas](https://github.com/databricks/koalas), the distributed version of pandas built on Spark. – user2314737 Aug 18 '20 at 20:45

2 Answers


This seems like a big-data problem. You'll probably have to set up a Spark cluster on your favorite cloud provider, for example Azure Databricks, and do the deduplication there using PySpark.

Helpful: Removing duplicates using Pyspark
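
For illustration, here is a minimal PySpark sketch of that deduplication step. It assumes the workbooks have first been exported to CSV; the data/*.csv input glob and the output folder below are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

# Read every exported CSV; Spark processes the data partition by partition
# instead of loading everything into a single process's RAM.
df = spark.read.csv("data/*.csv", header=True, inferSchema=True)

# dropDuplicates() with no arguments compares all columns.
unique = df.dropDuplicates()

unique.write.mode("overwrite").csv("output/deduplicated", header=True)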

– Priyankar Kumar

Here is a Python- and Excel-based approach that will run on your current machine. (I'm assuming that purchasing additional RAM, running in the cloud, or using a database is not feasible.)

First, create a couple of sample DataFrames for illustration. (If necessary, you could use Excel itself to convert the .xlsx files to .csv, and then read each .csv file in chunks; a sketch of that appears after the example output below.)

import pandas as pd
df1 = pd.DataFrame({'a': [11, 12, 13], 'b': [14, 15, 16], 'c': [17, 18, 19]})
df2 = pd.DataFrame({'a': [13, 14, 15], 'b': [16, 17, 18], 'c': [19, 20, 21]})
print(pd.concat([df1, df2]))

    a   b   c
0  11  14  17
1  12  15  18
2  13  16  19  # copy 1
0  13  16  19  # copy 2 (duplicate)
1  14  17  20
2  15  18  21

Now iterate over every row of every DataFrame. Compute a hash value for each row. If you've seen the hash before, you've seen the row itself before, so just continue. If it's a new hash value, export the row and add the hash to your set.

cache = set()

for d in [df1, df2]:
    for row in d.itertuples():
        h = hash(row[1:])    # slice from 1 to exclude the index
        if h in cache:
            continue
        print(row, h)
        cache.add(h)   

# you'll get this output, printed for illustration.
# it's easy to parse, for a variety of downstream uses
# note that the duplicate row was NOT exported

Pandas(Index=0, a=11, b=14, c=17) -8507403049323048181
Pandas(Index=1, a=12, b=15, c=18) -2181970144325339791
Pandas(Index=2, a=13, b=16, c=19) -6414432197316917400
Pandas(Index=1, a=14, b=17, c=20) -7535143174237721170
Pandas(Index=2, a=15, b=18, c=21) 4031466940829764140
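
To take this beyond the toy DataFrames, the same hash filter can be combined with pandas' chunked CSV reading, so only one chunk is held in memory at a time while unique rows are streamed to an output file. A minimal sketch, assuming the workbooks were exported to hypothetical files part1.csv and part2.csv with identical columns:

import csv
import pandas as pd

files = ["part1.csv", "part2.csv"]   # hypothetical CSV exports of the workbooks
seen = set()                         # hashes of rows already written

with open("unique_rows.csv", "w", newline="") as out:
    writer = csv.writer(out)
    header_written = False
    for path in files:
        # chunksize turns read_csv into an iterator, so only ~100,000 rows
        # are in memory at any moment
        for chunk in pd.read_csv(path, chunksize=100_000):
            if not header_written:
                writer.writerow(chunk.columns)
                header_written = True
            for row in chunk.itertuples(index=False):
                h = hash(row)
                if h in seen:          # hash seen before: skip the row
                    continue
                seen.add(h)
                writer.writerow(row)

The set of hashes still grows with the number of unique rows, but it holds one integer per row rather than the row data itself.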
– jsmart
  • This worked best for me. It's efficient and doesn't require changing my code or switching to Koalas, which I had been struggling with. – Sep 19 '20 at 19:52
  • I'm not sure this is entirely correct (although it seems like a good, practical approach for most cases). A typical hash function produces a result that is probably unique for each input, but not necessarily. You could therefore remove rows that aren't actually duplicates. A more correct solution might save a pointer to the n >= 1 prior rows that produce the same hash, and then compare the actual values before removal. (n > 1 only if you've seen two different rows that produce the same hash.) – L. Blanc Mar 26 '22 at 15:19
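
One simple way to sidestep the collision concern raised in that comment, at the cost of storing the row values rather than just their hashes, is to put the row tuples themselves in the set: Python's set membership test hashes first and then compares the actual values, so two distinct rows can never be dropped as duplicates. A minimal sketch of that variant, reusing df1 and df2 from the answer above:

cache = set()

for d in [df1, df2]:
    for row in d.itertuples(index=False):   # index=False: compare values only
        if row in cache:    # membership check hashes, then compares values
            continue
        print(row)
        cache.add(row)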