The speed of processing each chunk slows down with every subsequent chunk.
I also tried processing the chunks with numpy.vectorize, but that wasn't successful either (a sketch of that attempt is at the end of this post).
import re

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # enables .progress_apply on Series

def f(s):
    try:
        # normalize the string into a JSON-like form
        s = s.replace('\\', ' ')
        s = s.replace('=', ':')
        s = s.replace('true', '1')
        s = s.replace('false', '0')
        s = s.replace('}"', '}')
        s = s.replace('"{', '{')
        # keep the first match of the pattern and strip its 6-char prefix
        s = re.findall(r'my_reg', s)[0]
        s = s[6:]
    except Exception:
        s = 'error'
    return s
df = pd.read_csv('my_data', chunksize=700000)
columns = my_columns
for chunk in df:
    chunk.columns = columns
    # assign the result back; otherwise the transformed values are discarded
    chunk['my_col'] = chunk['my_col'].progress_apply(f)
    chunk.to_csv('my_name', mode='a')
tqdm's progress_apply shows me:
100%|███████████████████████████████████████████████| 700000/700000 [15:38<00:00, 745.61it/s]
100%|███████████████████████████████████████████████| 700000/700000 [42:13<00:00, 276.32it/s]
100%|███████████████████████████████████████████████| 700000/700000 [41:33<00:00, 280.75it/s]
100%|███████████████████████████████████████████████| 700000/700000 [46:43<00:00, 249.73it/s]
100%|███████████████████████████████████████████████| 700000/700000 [51:04<00:00, 216.10it/s]
and after a few more chunks:
100%|██████████████████████████████████████████████| 700000/700000 [2:42:07<00:00, 53.75it/s]
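For reference, the numpy.vectorize attempt looked roughly like this (a sketch, since I no longer have the exact code; it reuses the same f, my_data, and columns as above). Note that np.vectorize only wraps f in a broadcasting loop, so each element still goes through a Python-level call:

import numpy as np

f_vec = np.vectorize(f)  # elementwise wrapper around f, not true vectorization

df = pd.read_csv('my_data', chunksize=700000)
for chunk in df:
    chunk.columns = columns
    chunk['my_col'] = f_vec(chunk['my_col'].to_numpy())
    chunk.to_csv('my_name', mode='a')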