from collections import defaultdict
from io import StringIO

import pandas as pd

ddict = defaultdict(set)

file_str = query_csv_s3(s3, BUCKET_NAME, filename, sql_exp, use_header)

# read the CSV returned by S3 Select into a dataframe
df = pd.read_csv(StringIO(file_str))
fdf = df.drop_duplicates(subset='cleverTapId', keep='first')
fdf.dropna(inplace=True)

# map each identity (IM) to the set of CleverTap IDs (CTIDs) seen for it
col_one_list = fdf['identity'].tolist()
col_two_list = fdf['cleverTapId'].tolist()
for k, v in zip(col_one_list, col_two_list):
    ddict[k].add(v)

for imkey in ddict:
    im_length = len(str(imkey))
    if im_length == 9:
        if len(ddict[imkey]) == 1:
            # one-to-one: a single CTID for this IM
            for value in ddict[imkey]:
                tdict = {imkey: value}
            write_to_csv(FILE_NAME, tdict)
        else:
            # one-to-many: several CTIDs for this IM
            ctlist = list(ddict[imkey])
            snp_dict = {imkey: '|'.join(ctlist)}
            write_to_csv(SNAP_FILE_NAME, snp_dict)

    elif im_length > 0:
        if len(ddict[imkey]) == 1:
            for value in ddict[imkey]:
                fdict = {imkey: value}
            write_to_csv(FRAUD_FILE_NAME, fdict)
        else:
            pass
            # mult_ct = list(ddict[imkey])
            # mydict = {imkey:','.join(mult_ct)}
            # write_to_csv(MY_FILENAME,mydict)
    else:
        pass

Here is write_to_csv:

def write_to_csv(filename,mdict):
    file_exists = os.path.isfile(filename)
    with open(filename,'a',newline='') as csvfile:
        headers = ['IM No', 'CT ID']
        writer = csv.DictWriter(
            csvfile,
            delimiter=',',
            lineterminator='\n',
            fieldnames=headers
        )
        if not file_exists:
            writer.writeheader()
        for key in mdict:
            writer.writerow({'IM No': key, 'CT ID': mdict[key]})

I'm reading a CSV file containing 2 columns using S3 Select.
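
For reference, query_csv_s3 wraps boto3's select_object_content and stitches the streamed result chunks back into one CSV string; a simplified sketch of such a helper (not the exact code) looks like this:

def query_csv_s3(s3, bucket, key, sql_exp, use_header):
    # run an S3 Select query and return the result as a single CSV string
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=sql_exp,
        ExpressionType='SQL',
        InputSerialization={'CSV': {'FileHeaderInfo': 'USE' if use_header else 'NONE'}},
        OutputSerialization={'CSV': {}},
    )
    chunks = []
    for event in resp['Payload']:
        # the response is an event stream; 'Records' events carry the data
        if 'Records' in event:
            chunks.append(event['Records']['Payload'].decode('utf-8'))
    return ''.join(chunks)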

I'm generating a one-to-one (1 IM : 1 CTID) file, a one-to-many file, and a many-to-many file, and uploading them back to an S3 bucket.

How can I optimize this further? It's taking 18 hours to process a 530 MB file read from S3 and upload the results back.

  • You could start with printing out some intermediate timestamps to find out which part of your code takes the bulk of the time. – Maurice Feb 17 '21 at 11:10
  • What is `ddict`? What does `write_to_csv` look like? What kind of compute platform are you running on? – Maurice Feb 17 '21 at 11:54
  • @Maurice I have to read the CSV file line by line to map each IM to the CTID in the next column. ddict is the final dictionary with each unique IM as key and its CTIDs as value. I'm running on Windows 10 with 8 GB RAM – Dharmendra Yadav Feb 17 '21 at 12:03
  • You're using these abbreviations as if they were common knowledge, yet there is no context for them. Please tell us more about what's going on and add the code to the question; there is an edit link below it. (The initialization/definition of ddict is missing; that's why I'm asking) – Maurice Feb 17 '21 at 12:07
  • The ddict initialization is now defined in the code @Maurice – Dharmendra Yadav Feb 17 '21 at 13:22

1 Answer


This is essentially a guess, because I can't run your code, but the way you write data to your CSV files is extremely inefficient.

I/O operations against SSDs or disks are among the most expensive operations in computing. Right now you open a file for every single row you want to append, write the row, and close the file again. For a 530 MB input file, that probably adds up to millions of these expensive open/write/close cycles.

If you check the Performance tab in Task Manager, you'll probably see very high disk usage.

It's much more efficient to buffer the rows in memory (all of them if RAM allows, or in sizable batches) and flush them to disk at the end.

Roughly like this:

import csv

FRAUD_FILE_CONTENTS = []

# Computation stuff: inside the loop, append rows to the buffer
# instead of writing them straight to disk
FRAUD_FILE_CONTENTS.append({"IM No": imkey, "CT ID": value})

# After the loop: write everything out in a single pass
with open(FRAUD_FILE_NAME, "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["IM No", "CT ID"])
    writer.writeheader()
    writer.writerows(FRAUD_FILE_CONTENTS)
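
Applied to your code, that means collecting the rows for each output file in a list during the classification loop and doing one bulk write per file at the end, e.g. with a hypothetical write_rows_to_csv helper replacing write_to_csv:

import csv

def write_rows_to_csv(filename, rows):
    # hypothetical bulk variant of write_to_csv:
    # rows is a list of {'IM No': ..., 'CT ID': ...} dicts written in one pass
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['IM No', 'CT ID'])
        writer.writeheader()
        writer.writerows(rows)

That turns millions of open/append/close cycles into a single one per output file.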

Maurice