I have data files that contain data for many timesteps, with each timestep formatted in a block like this:
TIMESTEP PARTICLES
0.00500103 1262
ID GROUP VOLUME MASS PX PY PZ VX VY VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....
Each block consists of 3 header lines followed by the data lines for that timestep. The number of data lines is given by the integer on the second header line and can vary from 0 to 10 million. Blocks may be separated by a blank line, but sometimes it is missing.
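To make the header layout concrete, here is a minimal sketch of how the three header lines could be interpreted. The helper name `parse_header` is hypothetical and only for illustration; it assumes the fields are whitespace-separated as in the sample above.

```python
def parse_header(lines):
    """Parse the three header lines of one block (hypothetical helper)."""
    label_line, value_line, column_line = lines
    if not label_line.startswith('TIMESTEP'):
        raise ValueError('not a block header: %r' % label_line)
    # Second line holds "<timestep> <number of data lines>"
    timestep_str, count_str = value_line.split()
    # Third line holds the column names for the data lines
    return float(timestep_str), int(count_str), column_line.split()


# Header lines copied from the sample block above
header = [
    "TIMESTEP PARTICLES",
    "0.00500103 1262",
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ",
]
timestep, num_particles, columns = parse_header(header)
```

Here `num_particles` tells the reader how many data lines to consume before the next block starts.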
I want to be able to read the file block by block, processing the data after reading the block - the files are large (often over 200GB) and one timestep is about all that can be comfortably loaded into memory.
Because of the file format, I thought it would be quite easy to write a function that reads the 3 header lines, reads the actual data, and then returns a nice numpy array for data processing. I'm used to MATLAB, where you can simply read in blocks while not at the end of the file. I'm not quite sure how to do this with Python.
I created the following function to read the block of data:
import numpy as np

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0
    line = f.readline().strip()
    if line.startswith('TIMESTEP'):
        # Three header lines: labels, "<timestep> <count>", column names
        timestepHeaders = line
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        # Read exactly numParticles data lines
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1
    # One named field per column of the data lines
    mydt = np.dtype([('ID', int),
                     ('GROUP', int),
                     ('Vol', float),
                     ('Mass', float),
                     ('Px', float),
                     ('Py', float),
                     ('Pz', float),
                     ('Vx', float),
                     ('Vy', float),
                     ('Vz', float),
                     ])
    particleData = np.array(particleData, dtype=mydt)
    return Timestep, numParticles, particleData
I try to run the function like this:
import time

with open(fileOpenPath, 'r') as file:
    # time.clock() was removed in Python 3.8; perf_counter() measures wall time
    startWallTime = time.perf_counter()
    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)
    ## Do processing stuff here
    print("Timestep Processed")
    endWallTime = time.perf_counter()
The problem is this only reads the first block of data from the file and stops there - I don't know how to make it loop through the file until it hits the end and stops.
Any suggestions on how to make this work would be great. I think I could do it with single-line processing and lots of if checks to see whether I'm at the end of the timestep, but the simple function seemed easier and clearer.
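To make the goal concrete, the kind of loop I'm after might look like the minimal sketch below. It is not a tested solution for the real files: the generator name `iter_blocks` is hypothetical, the sample text is a made-up three-block file built from the rows shown above, and the data rows are left as tuples of strings so they can still be converted to a numpy array afterwards. It relies on `readline()` returning an empty string only at the end of the file.

```python
import io


def iter_blocks(f):
    """Yield (timestep, num_particles, rows) for each block in an open text file."""
    while True:
        line = f.readline()
        if not line:            # '' means end of file (a blank line is '\n')
            return
        if not line.strip():    # skip an optional blank line between blocks
            continue
        if not line.startswith('TIMESTEP'):
            raise ValueError('unexpected line: %r' % line)
        timestep_str, count_str = f.readline().split()
        f.readline()            # column-name line, not needed here
        num_particles = int(count_str)
        # Read exactly num_particles data lines; fields stay as strings
        rows = [tuple(f.readline().split()) for _ in range(num_particles)]
        yield float(timestep_str), num_particles, rows


# Made-up sample: a blank line after block 1, none after the empty block 2
sample = (
    "TIMESTEP PARTICLES\n"
    "0.005 2\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903\n"
    "430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903\n"
    "\n"
    "TIMESTEP PARTICLES\n"
    "0.010 0\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "TIMESTEP PARTICLES\n"
    "0.015 1\n"
    "ID GROUP VOLUME MASS PX PY PZ VX VY VZ\n"
    "384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903\n"
)

blocks = list(iter_blocks(io.StringIO(sample)))
```

With a real file, the loop would be `for timestep, num_particles, rows in iter_blocks(file):` inside the `with open(...)` block, so only one timestep is held in memory at a time.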