So I have this code that uses nested iterrows. I've read that iterrows is much slower than .apply or vectorization.

import openpyxl

workbook = openpyxl.load_workbook(output)
worksheet = workbook['Sheet1']

for indexA, rowA in dfA.iterrows():
    nameA = rowA[0]
    for indexB, rowB in dfB.iterrows():
        nameB = rowB[14]
        if nameB.startswith(nameA):
            print(f"Found match: {nameB} starts with {nameA}")
            # openpyxl rows are 1-based, while the default pandas index starts at 0
            worksheet[f"A{indexA + 1}"] = rowB[1]
            worksheet[f"B{indexA + 1}"] = rowB[2]
            worksheet[f"C{indexA + 1}"] = rowB[3]

workbook.save(output)
workbook.close()

I can't figure out how to use vectorization or .apply on this part. Currently, with 500,000 rows in dfA, it takes over 3 hours. I'm looking for any way to speed this up. Thanks for your help!

  • This looks like it is slow because you have nested loops, giving O(N × M) complexity. This should never be required when comparing two data frames. – Charlie Clark Jul 16 '21 at 12:50
  • Writing to cells in a loop is [slow in Excel VBA](https://stackoverflow.com/questions/30871703/slow-vba-macro-writing-in-cells), so it is undoubtedly also slow using Python. The remedy would be to capture the values from the for loops in an array and write out the entire array upon completion of the looping. – DarrylG Jul 17 '21 at 11:49
  • Thanks for the help, you guys. I actually got it to work faster: I used the pandarallel library and changed the code to use .apply instead of nested iterrows (sketched just below these comments). Went from 5 hours to under 2 minutes. – user9539627 Jul 18 '21 at 12:08
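
A rough, untested sketch of the approach from these two comments, assuming the column positions from the question (dfA[0] holds the prefixes, dfB[14] the names) and that one match per row is enough; first_match and the "match" column are illustrative names, not from the original code:

# dfB's names as a plain list so the scan can stop at the first hit
namesB = dfB[14].astype(str).tolist()

def first_match(nameA):
    for nameB in namesB:
        if nameB.startswith(nameA):
            return nameB
    return None

# Build all results in memory first, then write them out once
# instead of one cell at a time
dfA["match"] = dfA[0].astype(str).apply(first_match)
dfA.to_excel(output, sheet_name="Sheet1", index=False)

With pandarallel installed, .parallel_apply is a drop-in replacement for .apply after calling pandarallel.initialize().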

1 Answer

I think it would be more efficient to first find the rows where dfB[14] starts with dfA[0], make a DataFrame of the matches, and then write it out in one go with DataFrame.to_excel; see the sketch below.

But I don't understand why you loop over all of dfB's rows even after you have already found a match (do you just need the last one?).
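
A minimal sketch of that idea, again assuming the question's column positions (dfA[0] = prefix, dfB[14] = name, values to copy in dfB columns 1 to 3), pandas 1.2+ for how="cross", and that dfB is small enough for a cross join against 500,000 dfA rows to fit in memory:

import numpy as np

# Pair every dfA prefix with every dfB row, then keep only the pairs
# where the dfB name actually starts with the dfA prefix
pairs = dfA[[0]].rename(columns={0: "prefix"}).merge(
    dfB[[14, 1, 2, 3]].rename(columns={14: "name"}), how="cross"
)
mask = np.char.startswith(
    pairs["name"].to_numpy(dtype=str),
    pairs["prefix"].to_numpy(dtype=str),
)

# Keep the first match per prefix and write everything out in one go
matches = pairs[mask].groupby("prefix", as_index=False).first()
matches.to_excel(output, sheet_name="Sheet1", index=False)

If the cross join is too large, chunk dfA into pieces or fall back to the .apply scan sketched under the question's comments.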

Jack
  • Thanks for your answer. That was dumb on my part. I fixed my code to break as soon as the first match is found. – user9539627 Jul 18 '21 at 12:12