The speed of processing each chunk slows down with every subsequent chunk.
I also tried processing the chunks with numpy.vectorize, but that wasn't successful either (a sketch of that attempt is at the end of this post).
import re

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # enables .progress_apply on Series

def f(s):
    try:
        # normalize the string into a JSON-like form
        s = s.replace('\\', ' ')
        s = s.replace('=', ':')
        s = s.replace('true', '1')
        s = s.replace('false', '0')
        s = s.replace('}"', '}')
        s = s.replace('"{', '{')
        # keep the first match of the pattern and strip its 6-char prefix
        s = re.findall(r'my_reg', s)[0]
        s = s[6:]
    except Exception:
        s = 'error'
    return s
df = pd.read_csv('my_data', chunksize=700000)
columns = my_columns
for chunk in df:
    chunk.columns = columns
    # assign the result back; otherwise the transformed values are discarded
    chunk['my_col'] = chunk['my_col'].progress_apply(f)
    chunk.to_csv('my_name', mode='a')
tqdm's progress_apply shows me:
100%|███████████████████████████████████████████████| 700000/700000 [15:38<00:00, 745.61it/s]
100%|███████████████████████████████████████████████| 700000/700000 [42:13<00:00, 276.32it/s]
100%|███████████████████████████████████████████████| 700000/700000 [41:33<00:00, 280.75it/s]
100%|███████████████████████████████████████████████| 700000/700000 [46:43<00:00, 249.73it/s]
100%|███████████████████████████████████████████████| 700000/700000 [51:04<00:00, 216.10it/s]
and after a few more chunks:
100%|██████████████████████████████████████████████| 700000/700000 [2:42:07<00:00, 53.75it/s]
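For reference, the numpy.vectorize attempt looked roughly like this (a sketch, since I no longer have the exact code; it reuses the same f, my_data, and columns as above). Note that np.vectorize only wraps f in a broadcasting loop, so each element still goes through a Python-level call:

import numpy as np

f_vec = np.vectorize(f)  # elementwise wrapper around f, not true vectorization

df = pd.read_csv('my_data', chunksize=700000)
for chunk in df:
    chunk.columns = columns
    chunk['my_col'] = f_vec(chunk['my_col'].to_numpy())
    chunk.to_csv('my_name', mode='a')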