I have some large files (more than 30 GB) containing pieces of information on which I need to do some calculations, like averaging. The pieces I mention are slices of the file, and I know the beginning line number and the count of following lines for each slice.
So I have a dictionary whose keys are beginning line numbers and whose values are the counts of following rows, and I use this dictionary to loop through the file and extract slices from it. For each slice, I create a table, do some conversions and averaging, create a new table and convert it into a dictionary. I use islice for slicing and a pandas DataFrame to create a table from each slice.
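To make it a bit more concrete, the dictionary (seg_ids in the code below) and the per-slice work look roughly like this; the numbers are made up and the real conversions are more involved, so treat process_slice as a simplified illustration only:

import pandas as pd

# made-up example of the lookup dictionary:
# {beginning line number: count of following rows}
seg_ids = {0: 120, 545: 97, 1210: 130}

def process_slice(lines):
    # simplified stand-in for my real per-slice work
    df = pd.DataFrame([l.split() for l in lines])  # build a table from the raw lines
    df = df.apply(pd.to_numeric, errors='coerce')  # conversions
    return df.mean().to_dict()                     # averaging, then into a dictionary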
However, over time the process gets slower and slower, even though the sizes of the slices are more or less the same:

First 1k slices: processed in 1 h
Second 1k slices: processed in 4 h
Third 1k slices: processed in 8 h
Fourth 1k slices: processed in 17 h

And I am waiting days for the whole thing to complete.
Right now I am doing this on a Windows 10 machine with a 1 TB SSD and 32 GB of RAM. Previously I also tried it on a Linux machine (Ubuntu 18.04) with a 250 GB SSD and 8 GB of RAM + 8 GB of virtual memory. Both gave more or less the same result.
What I noticed on Windows is that 17% of the CPU and 11% of the memory are being used, but disk usage is at 100%. I do not fully understand what disk usage means here or how I can improve it.
As part of the code, I was also importing the data into MongoDB while working on Linux, and I thought the slowdown might be caused by indexing in MongoDB. But when I printed the processing time and the import time, I noticed that almost all the time is spent on processing; the import takes only a few seconds.
Also, to save time, I am now doing the processing part on a stronger Windows machine and writing the documents as txt files. I expect that writing to disk slows down the process a bit, but the txt files are no more than 600 KB each.
Below is the piece of code showing how I read the file:
from itertools import islice

with open(infile) as inp:
    for i in range(len(seg_ids)):
        inp.seek(0)  # rewind to the start of the file for every slice
        start = list(seg_ids.keys())[i]    # beginning line number of this slice
        count = list(seg_ids.values())[i]  # count of following rows
        segment_slice = islice(inp, start, start + count + 1)
        segment = list(segment_slice)
        for line in segment[1:]:
            # create dataframe and perform calculations
            ...
So I want to know if there is a way to improve the processing time. I suspect my code reads the whole file from the beginning for each slice, so as the slices move toward the end of the file, reading takes longer and longer.
As a note, because of time constraints, I started with the most important slices first. The remaining slices will be scattered more randomly across the files, so any solution should also work for random slices, if there is one (I hope).
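One idea I had, though I am not sure it is the right approach, is to scan the file once, remember the byte offset where each needed slice begins, and then seek() straight to those offsets instead of rewinding and skipping lines every time. A rough, untested sketch, reusing the infile and seg_ids names from above and assuming the file is plain UTF-8 text:

from itertools import islice

wanted = set(seg_ids)  # beginning line numbers I will need later
offsets = {}           # beginning line number -> byte offset in the file

# pass 1: scan the file once and record where each needed line starts
with open(infile, 'rb') as inp:  # binary mode so offsets are exact byte positions
    pos = 0
    for line_no, raw in enumerate(inp):
        if line_no in wanted:
            offsets[line_no] = pos
        pos += len(raw)

# pass 2: jump straight to each slice instead of re-reading from the top
with open(infile, 'rb') as inp:
    for start, count in seg_ids.items():
        inp.seek(offsets[start])
        segment = [raw.decode('utf-8') for raw in islice(inp, count + 1)]
        for line in segment[1:]:
            # create dataframe and perform calculations, same as before
            ...

Would something like this work for random slices, or is there a better or more standard way to do it?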
I am not experienced in scripting, so please forgive me if this is a silly question, but I really could not find an answer.