How to speed up this search script?

Question

IHello, I have created a python script which aims to complete an excel file (wb) thanks to the first column of this file composed of many references (about 4000). To complete this excel, my script must search each reference (so use a for loop of list references from reading wb file) in two other excel files transformed into dataframe (df_mbom and df_ebom) and fill the specific cells of wb according to the presence or not of the references in df_mbom and df_ebom. If the reference is found, it is necessary to compare the level of the reference and the following line and fill wb accordingly. The created script works very well and it does the job very well.

But the only problem I have is that it takes more than 6 hours to search and fill wb for 1000 references so to process the 4000 references, it would take almost 24 hours! Do you have any suggestions to speed up this program?

Here is the code used:

from multiprocessing.dummy import Pool

def finding_complete(elt):
    elt = str(elt)
    pos = mylist_ref.index(elt)
    print(pos)
    item = r'^' + elt + '$'

    df_findings = df_mbom[df_mbom['Article'].str.contains(item, case=True, regex=True)]
    if df_findings.shape[0] == 0 :
        active_sheet.cell(row = 4+pos, column = 19).value = "NOK"
        active_sheet.cell(row = 4+pos, column = 18).value = "NOK"
    else :
        active_sheet.cell(row = 4+pos, column = 19).value = "OK"
        boolean_f = df_findings.drop_duplicates(subset = ['Article'],keep = 'first')
        ind = boolean_f.index.to_list()
        idx = ind[0]
        item1 = df_mbom['Niveau'][idx]
        item2 = df_mbom['Niveau'][idx + 1]
        if item2 > item1 :
            active_sheet.cell(row = 4+pos, column = 18).value = "OK"
        else :
            active_sheet.cell(row = 4+pos, column = 18).value = "NOK"
        

    df_findings2 = df_ebom[df_ebom['Article'].str.contains(item, case=True, regex=True)]
    pos  = mylist_ref.index(elt)
    if df_findings2.shape[0] == 0 :
        active_sheet.cell(row = 4+pos, column = 17).value = "NOK"
    else :
        boolean_f = df_findings2.drop_duplicates(subset = ['Article'],keep = 'first')
        ind = boolean_f.index.to_list()
        idx = ind[0]
        item1 = df_ebom['Niveau'][idx]
        item2 = df_ebom['Niveau'][idx + 1]
        if item2 > item1 :
            active_sheet.cell(row = 4+pos, column = 17).value = "OK"
        else :
            active_sheet.cell(row = 4+pos, column = 17).value = "NOK"
        

if __name__ == '__main__':
    start = time.time()
    path = '100446099_mbom.xlsx'
    df_mbom = pd.read_excel(path, sheet_name=0, header=0)

    path = '100446099_ebom.xlsx'
    df_ebom = pd.read_excel(path, sheet_name=0, header=0)

    location = 'DOC#6TERNORrev0.xlsx'
    wb = openpyxl.load_workbook(filename=location)  #, data_only=True"
    active_sheet = wb["DOC#6 toutes regions"]
            
    #Get cell value and put it in a list
    mylist_ref = []
    for row in active_sheet.iter_rows(min_row=4, max_row=active_sheet.max_row, min_col=2, max_col=2):
        for cell in row:
            if cell.value == None :
                pass
            else:
                mylist_ref.append(cell.value)
    print("Number of references :")
    print(len(mylist_ref))
    print(" ")

    with Pool() as pool: #os.cpu_count())
        pool.map(finding_complete,mylist_ref)    # correspond à for elt in mylist_ref: do finding_complete

    wb.save(location)
    wb.close()
    final = time.time()
    timer = final - start
    print(round(timer, 1))

Thanks in advance for your time.

score -1 · Answer 1 · answered Apr 19 '22 at 21:03

-1

convert the Excel file to json, procces the json, then write it to Excel.

answered Apr 19 '22 at 21:03

uingtea

6,002
2
26
40

How to speed up this search script?

1 Answers1