2

I have data files that contain data for many timesteps, with each timestep formatted in a block like this:

TIMESTEP  PARTICLES
0.00500103 1262
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
....

Each block consists of the 3 header lines and a number of lines of data related to the timestep (int on line 2). The number of lines of data associated with the block can vary from 0 to 10 Million. Each block may have a blank line between them, but sometimes this is missing.

I want to be able to read the file block by block, processing the data after reading the block - the files are large (often over 200GB) and one timestep is about all that can be comfortably loaded into memory.

Because of the file format I thought it would be quite easy to write a function that reads the 3 header lines, reads the actual data and then return a nice numpy array for data processing. I'm used to MATLAB where you can simply read in blocks while not at the end of file. I'm not quite sure how to do this with python.

I created the following function to read the block of data:

def readBlock(f):
    particleData = []
    Timestep = []
    numParticles = []
    linesProcessed = 0

    line = f.readline().strip()
    if line.startswith('TIMESTEP'): 

        timestepHeaders = line.strip()
        varData = f.readline().strip()
        headerStrings = f.readline().strip().split(' ')
        parts = varData.strip().split(' ')
        Timestep = float(parts[0])
        numParticles = int(parts[1])
        while linesProcessed < numParticles:
            particleData.append(tuple(f.readline().strip().split(' ')))
            linesProcessed += 1

        mydt = np.dtype([ ('ID',int), 
                     ('GROUP', int),
                     ('Vol', float),
                     ('Mass', float),
                     ('Px', float),
                     ('Py', float),
                     ('Pz', float),
                     ('Vx', float),
                     ('Vy', float),
                     ('Vz', float),
                     ] )

        particleData = np.array(particleData, dtype=mydt)

    return Timestep, numParticles, particleData

I try to run the function like this:

with open(fileOpenPath, 'r') as file:
    startWallTime = time.clock()

    Timestep, numParticles, particleData = readBlock(file)
    print(Timestep)

    ## Do processing stuff here 
    print("Timestep Processed")

    endWallTime = time.clock()

The problem is this only reads the first block of data from the file and stops there - I don't know how to make it loop through the file until it hits the end and stops.

Any suggestions on how to make this work would be great. I think I can write a way of doing it using single line processing with lots of if checks to see if i'm at the end of the timestep, but the simple function seemed easier and clearer.

jpmorr
  • 500
  • 5
  • 25
  • 1
    I still don't get what's your problem look like. In this code you only read one block. What happens when you try to read the next one? Also, I think that `pandas.read_csv(f, num_rows=X, sep=' ')` would make this function much better – Marat Dec 11 '16 at 22:20
  • That's the problem - it only reads one block - there are hundreds of timesteps in the file and I want it to keep returning one block until it reaches the end of the file. – jpmorr Dec 11 '16 at 22:36
  • what happens if you call readBlock on the same file again? – Marat Dec 11 '16 at 22:41
  • Looks like you are using Python 3. Is that correct? – Warren Weckesser Dec 11 '16 at 22:43
  • @marat as long as I account for the possible empty line, it reads the next timestep - as long as I didn't close the file. – jpmorr Dec 11 '16 at 22:47
  • @WarrenWeckesser I'm using python 3 – jpmorr Dec 11 '16 at 22:48
  • "Each block may have a blank line between them, *but sometimes this is missing.*" Grrrr... – Warren Weckesser Dec 11 '16 at 22:50
  • So sometimes there is a blank line before the line "TIMESTEP PARTICLES"? – Warren Weckesser Dec 11 '16 at 22:51
  • @WarrenWeckesser Yes, sometimes a blank line gets added before that line depending on the simulation solver used. Very annoying as you don't know whether it's included or not. – jpmorr Dec 11 '16 at 22:55

3 Answers3

2

You can use the max_rows argument of numpy.genfromtxt:

with open("timesteps.dat", "rb") as f:
    while True:
        line = f.readline()
        if len(line) == 0:
            # End of file
            break
        # Skip blank lines
        while len(line.strip()) == 0:
            line = f.readline()
        line2_fields = f.readline().split()
        timestep = float(line2_fields[0])
        particles = int(line2_fields[1])
        data = np.genfromtxt(f, names=True, dtype=None, max_rows=particles)

        print("Timestep:", timestep)
        print("Particles:", particles)
        print("Data:")
        print(data)
        print()

Here's a sample file:

TIMESTEP  PARTICLES
0.00500103    4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103    5
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
652 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
431 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
385 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
972 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

TIMESTEP  PARTICLES
0.00500103    3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
222 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
333 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
444 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

And here is the output:

Timestep: 0.00500103
Particles: 4
Data:
[ (651, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (430, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (384, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 5
Data:
[ (971, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)
 (652, 0, 5.23599e-07, 0.000397935, -0.084626, -0.0347849, 0.00188164, 0, 0, -1.04903)
 (431, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (385, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (972, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]

Timestep: 0.00500103
Particles: 3
Data:
[ (222, 0, 5.23599e-07, 0.000397935, -0.0837742, -0.0442293, 0.0121046, 0, 0, -1.04903)
 (333, 0, 5.23599e-07, 0.000397935, -0.0749234, -0.0395652, 0.0143401, 0, 0, -1.04903)
 (444, 0, 5.23599e-07, 0.000397935, -0.0954931, -0.0159607, 0.0100155, 0, 0, -1.04903)]
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
  • Thanks. This is working almost perfectly now..... It's not dealing with timesteps where the number of particles, and hence rows, is zero. genfromtxt doesn't like max particles=0 so I stuck it in a "if greater than zero" check but that doesn't seem to help - it may have broken the empty line check. – jpmorr Dec 12 '16 at 09:50
0

The with does not loop, it will just make sure the file is properly closed afterwards.

To loop you'll need to add a while just after the with statement (see the code below). But before you can do that you'll need to check in the readBlock(f) function for an end of file (EOF). Replace line = f.readline().strip() with this code:

line = f.readline()
if not line:
    # EOF: returning None's.
    return None, None, None
# We do the strip after the check.
# Otherwise a blank line "\n" might be interpreted as EOF.
line = line.strip()

So adding the while loop in the with block and checking if we get None back indicating an EOF and so we can break out of the while loop:

with open('file1') as file_handle:
    while True:
        startWallTime = time.clock()

        Timestep, numParticles, particleData = readBlock(file_handle)
        if Timestep == None:
            break
        print(Timestep)

        ## Do processing stuff here 
        print("Timestep Processed")

        endWallTime = time.clock()
  • This seems to works reasonably well except for two things - the `return None, None, None` make the output quite confusing when there is a blank line between every block, but my main problem is that the values returned from the function disappear once the 'while True' loop ends – jpmorr Dec 12 '16 at 10:07
0

Here'a quick-n-dirty test (it worked on the 2nd try!)

import numpy as np

with open('stack41091659.txt','rb') as f:
    while f.readline():    # read the 'TIMESTEP  PARTICLES' line
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        ablock = [f.readline()]  # block header line
        for i in range(n):
            ablock.append(f.readline())
        print(len(ablock))
        data = np.genfromtxt(ablock, dtype=None, names=True)
        print(data.shape, data.dtype)

test run:

1458:~/mypy$ python3 stack41091659.py 
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 3
4
(3,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 2
3
(2,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]
b'0.00500103' 4
5
(4,) [('ID', '<i4'), ('GROUP', '<i4'), ('VOLUME', '<f8'), ('MASS', '<f8'), ('PX', '<f8'), ('PY', '<f8'), ('PZ', '<f8'), ('VX', '<i4'), ('VY', '<i4'), ('VZ', '<f8')]

Sample file:

TIMESTEP  PARTICLES
0.00500103 4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 3
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 2
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903
TIMESTEP  PARTICLES
0.00500103 4
ID  GROUP  VOLUME  MASS  PX  PY  PZ  VX  VY  VZ
651 0 5.23599e-07 0.000397935 -0.084626 -0.0347849 0.00188164 0 0 -1.04903
430 0 5.23599e-07 0.000397935 -0.0837742 -0.0442293 0.0121046 0 0 -1.04903
384 0 5.23599e-07 0.000397935 -0.0749234 -0.0395652 0.0143401 0 0 -1.04903
971 0 5.23599e-07 0.000397935 -0.0954931 -0.0159607 0.0100155 0 0 -1.04903

I'm using the fact that genfromtxt is happy with anything that feeds it a block of lines. Here I collect the next block in a list, and pass it to genfromtxt.

And using the max_rows parameter of genfromtxt, I can tell it to read the next n rows directly:

with open('stack41091659.txt','rb') as f:
    while f.readline():
        time, n = f.readline().strip().split()
        n = int(n)
        print(time, n)
        data = np.genfromtxt(f, dtype=None, names=True, max_rows=n)
        print(data.shape, len(data.dtype.names))

I'm not taking into account that optional blank line. Probably could squeeze that in at the start of the block read. I.e. Readlines until I get one with the valid float int pair of strings.

hpaulj
  • 221,503
  • 14
  • 230
  • 353