Fast way to delete specific lines in a data file with Python?

Question

I'm working with dump files from simulations run with the software LAMMPS. The data files have nine lines of header information for each timestep, which contain no data, only metadata. I want to delete these lines, which are repeated for every timestep, so that I only have the data in a separate file. Below I have shown the start of each timestep in the data, including the header lines I want deleted.

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4200
ITEM: BOX BOUNDS pp pp pp
-2.0000000000000000e+01 2.0000000000000000e+01
-2.0000000000000000e+01 2.0000000000000000e+01
-2.0000000000000000e+01 2.0000000000000000e+01
ITEM: ATOMS id mol xu yu zu
533 26 -17.891 -16.7503 -18.8102
534 26 -17.7164 -17.5276 -18.7004
535 26 -17.3612 -17.7508 -19.2693
536 26 -17.0213 -17.8009 -18.5118
537 26 -17.8409 -18.5307 -18.8511
538 26 -17.7968 -19.5713 -18.6246
ITEM: TIMESTEP
1
ITEM: NUMBER OF ATOMS
4200
ITEM: BOX BOUNDS pp pp pp
-2.0000000000000000e+01 2.0000000000000000e+01
-2.0000000000000000e+01 2.0000000000000000e+01
-2.0000000000000000e+01 2.0000000000000000e+01
ITEM: ATOMS id mol xu yu zu
536 26 -17.0213 -17.8009 -18.5118
537 26 -17.8409 -18.5307 -18.8511
538 26 -17.7968 -19.5713 -18.6246

This continues for the number of timesteps I have run in the simulation, and each timestep has more data points than shown here.

Right now I have code that does what I want, which can be seen below. However, I want to ask if anybody has ideas or input on how to make it faster, since I am still a rather new Python user.


import pandas as pd

def data_process_func(filename, n_atoms, k):

    with open(filename, 'r') as f:
        lines = f.readlines()

    # The following loop deletes all the text, leaving only the data.
    # `timestep` is not defined in this snippet; it is assumed to be a
    # sequence with one entry per timestep, defined elsewhere.
    for i in range(len(timestep)):
        del lines[(n_atoms)*i:(n_atoms*i)+9]

    # Saves the data without the text to a txt file
    with open('data_{}.txt'.format(k), 'w') as f:
        f.writelines(lines)

    # Loads the data from the file into a dataframe
    data = pd.read_csv('data_{}.txt'.format(k), sep=" ", header=None, names=['id', 'mol', 'xu', 'yu', 'zu'])

    return data

I think some of the indentation didn't get copied over when you pasted your code. — B Remmelzwaal, Feb 11 '23 at 21:37
Hmm, I think I got all the code pasted over I wanted, but I might miss some code for others to better understand it? — Morten jørgensen, Feb 11 '23 at 21:47
I mean that the first `with open()` does nothing unless the code up to the dataframe is at the same indentation level. Also the `writelines()` call is indented incorrectly. I assume this is fine wherever you wrote this, but make sure it's also correct here :) — B Remmelzwaal, Feb 11 '23 at 21:49
Are you really sure that your code does what you want? Deleting items from a list in a loop can have weird side-effects leading to unexpected results. — Claudio, Feb 11 '23 at 23:39

Claudio · Answer 1 · 2023-02-12T18:29:52.443

The major speed bottleneck in your code is deleting items from the list of lines. Each deletion shifts all the items that follow it, which makes it a time-consuming operation.

A much better approach is to loop over the list of lines and write the lines you need to keep directly to the output file, like this:

# Saves the data without the text to a txt file
# Each timestep occupies n_atoms + 9 lines in the unmodified list
block = n_atoms + 9
with open('data_{}.txt'.format(k), 'w') as f:
    for i in range(len(timestep)):
        f.writelines(lines[block*i+9:block*(i+1)])

To speed up writing to the file, you can collect the lines you need in another list and then write them to the file in larger chunks, or write all of them in one step:

# Saves the data without the text to a txt file
# Each timestep occupies n_atoms + 9 lines in the unmodified list
block = n_atoms + 9
lines_to_keep = []
with open('data_{}.txt'.format(k), 'w') as f:
    for i in range(len(timestep)):
        lines_to_keep += lines[block*i+9:block*(i+1)]
    f.writelines(lines_to_keep)

You can also use a list comprehension instead of a loop:

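# Saves the data without the text to a txt file
# Each timestep occupies n_atoms + 9 lines; the nested loop flattens
# the slices, because writelines() expects strings, not lists of strings
block = n_atoms + 9
with open('data_{}.txt'.format(k), 'w') as f:
    f.writelines([line for i in range(len(timestep)) for line in lines[block*i+9:block*(i+1)]])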

To avoid building the intermediate list, you can pass a generator expression instead:

block = n_atoms + 9  # lines per timestep in the unmodified list
with open('data_{}.txt'.format(k), 'w') as f:
    f.writelines(line for i in range(len(timestep)) for line in lines[block*i+9:block*(i+1)])
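For reference, here is a minimal, self-contained sketch of the slice-based idea that can be run on a tiny made-up dump. The function name `data_lines`, the parameter `n_timesteps`, and the sample rows are illustrative assumptions, not part of the original code. The sketch also assumes every timestep has exactly nine header lines and the same number of atoms.

```python
from itertools import chain

N_HEADER = 9  # header lines per timestep, as described in the question


def data_lines(lines, n_atoms, n_timesteps, n_header=N_HEADER):
    """Return an iterator over the atom rows, skipping each timestep's header.

    Hypothetical helper: assumes a fixed header length and a fixed
    number of atoms per timestep.
    """
    block = n_header + n_atoms  # total lines per timestep in the raw dump
    return chain.from_iterable(
        lines[block * i + n_header: block * (i + 1)]
        for i in range(n_timesteps)
    )


# Tiny made-up dump: 2 timesteps with 2 atoms each (values are illustrative).
header = ["header line {}\n".format(j) for j in range(N_HEADER)]
sample = (
    header + ["1 26 0.0 0.0 0.0\n", "2 26 1.0 1.0 1.0\n"]
    + header + ["1 26 0.5 0.5 0.5\n", "2 26 1.5 1.5 1.5\n"]
)

kept = list(data_lines(sample, n_atoms=2, n_timesteps=2))
```

The iterator returned by `data_lines` can be passed straight to `f.writelines(...)`, so no second list has to be built.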

score 0 · Answer 2 · answered Feb 11 '23 at 22:04

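You can skip writing the intermediate file altogether using StringIO:

import io
import pandas as pd

# `lines` is the list of data lines left after the headers were removed;
# StringIO expects a single string, so join the lines first
buffer = io.StringIO("".join(lines))

data = pd.read_csv(filepath_or_buffer=buffer, sep=" ", header=None, names=['id', 'mol', 'xu', 'yu', 'zu'])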

Sourced from this answer.
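As a variation on this idea (a different technique from the list slicing above, named plainly here), the buffer could also be filled while streaming the dump line by line, so the full list of lines is never held in memory. This is only a sketch: the helper name `strip_headers` and the sample rows are hypothetical, and it assumes nine header lines per timestep and a constant number of atoms.

```python
import io


def strip_headers(source, n_atoms, n_header=9):
    """Copy only the atom rows from a dump-like stream into a StringIO.

    Hypothetical helper: a line is kept when its position inside the
    current timestep block falls after the n_header header lines.
    """
    block = n_header + n_atoms  # lines per timestep
    buffer = io.StringIO()
    for index, line in enumerate(source):
        if index % block >= n_header:
            buffer.write(line)
    buffer.seek(0)  # rewind so a reader starts at the first row
    return buffer


# An in-memory stand-in for an open dump file (values are illustrative).
fake_dump = io.StringIO(
    "".join(
        ["header\n"] * 9 + ["1 26 0.0 0.0 0.0\n", "2 26 1.0 1.0 1.0\n"]
        + ["header\n"] * 9 + ["1 26 0.5 0.5 0.5\n", "2 26 1.5 1.5 1.5\n"]
    )
)

buffer = strip_headers(fake_dump, n_atoms=2)
rows = buffer.getvalue().splitlines()
```

With a real file, `source` would be the object returned by `open(filename)`, and the returned buffer could be handed to the `pd.read_csv` call shown above.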

B Remmelzwaal
