2

I need to split a few CSV files based on a given time. In these files the time values are in seconds, given in a 'Time' column.

For example, if I want to split aaa.csv in 0.1-second intervals, then the first set of rows with time 0.0 to 0.1 (No. 1 to 8 in the attached file) needs to be written to aaa1.csv, the rows with time greater than 0.1 up to 0.2 (No. 9 to 21 in the attached file) to aaa2.csv, and so on (basically multiples of the given time).

The output files need to get the same name as the input file with a number appended at the end, and they need to be written to a different location/folder. The time value needs to be a variable, so at one time I can split in 0.1 sec and at another time I can split in 0.7 sec, and so on.

How can I write a Python script for this, please? The file looks like the following (the entire 119K file can be downloaded from https://fil.email/vnsZsp7b):

No.,Time,Length
1,0,146
2,0.006752,116
3,0.019767,156
4,0.039635,144
5,0.06009,147
6,0.069165,138
7,0.0797,133
8,0.099397,135
9,0.120142,135
10,0.139721,148
11,0.1401,126
12,0.1401,120
13,0.140101,123
14,0.140101,120
15,0.141294,118
16,0.141295,118
17,0.141295,114
18,0.144909,118
19,0.160639,119
20,0.161214,152
21,0.185625,143
... etc

After @Serafeim's answer, I tried this:

import pandas as pd
import numpy as np
import glob
import os

path = '/root/Desktop/TT1/'
mystep = 0.4


for filename in glob.glob(os.path.join(path, '*.csv')):
    df = pd.read_csv(filename)
    def data_splitter(df):
        max_time = df['Time'].max() # get max value of Time for the current csv file (df)
        myrange= np.arange(0, max_time, mystep) # build the threshold range
        for k in range(len(myrange)):
            # build the upper values
            temp = df[(df['Time'] >= myrange[k]) & (df['Time'] < myrange[k] + mystep)]
            #temp.to_csv("/root/Desktop/T1/xx_{}.csv".format(k))
            temp.to_csv("/root/Desktop/T1/{}_{}.csv".format(filename, k))

data_splitter(df)
    What have you tried so far? – marsnebulasoup Jul 18 '19 at 15:16
  • @MarsNebulaSoup I know how to do this using the editcap tool in Wireshark, but it doesn't support values less than 1 second. I am new to programming, so I am not sure how to do this. I'm really sorry – Hasa Jul 18 '19 at 15:39
  • @Hasa but have you tried passing `0.1` to editcap? – Boris Verkhovskiy Jul 18 '19 at 15:43
  • yes. It doesn't accept decimal values. – Hasa Jul 18 '19 at 15:47
  • @Hasa it does since 4 days ago, if you want to try compiling a more recent version from git. https://github.com/wireshark/wireshark/commit/8147af211d2eadbc32b3039baabc2f6b74308217 – Boris Verkhovskiy Jul 18 '19 at 15:50
  • Possible duplicate of [Split wireshark to miliseconds](https://stackoverflow.com/questions/57004719/split-wireshark-to-miliseconds) – Boris Verkhovskiy Jul 18 '19 at 15:52
  • Thanks @Boris, but when the files get split, 'time delta from previous displayed frame' gets set to 0 in each split file's first row, even if the original file has different values. That's why I thought of splitting csv files instead of .pcap files. – Hasa Jul 18 '19 at 15:57
  • @Hasa let me know if my answer helps – seralouk Jul 19 '19 at 10:32
  • @serafeim Just saw your comment here. Thanks for your code, but that's not the output I expected; I posted in the comments below the answer. I'd greatly appreciate it if you could help. Much thanks – Hasa Jul 22 '19 at 13:14
  • I just updated my answer. This should work fine – seralouk Jul 22 '19 at 13:54

2 Answers

3

You just need to apply a logical operation on the dataframe using pandas. ✔️

At the end of this answer I have a "script idea" to do this automatically, but first let's go step by step:

# Load the files using pandas
import pandas as pd

df = pd.read_csv("/Users/serafeim/Downloads/Testfile.csv")

# Get the desired elements based on 'Time' column
mask = df['Time'] < 0.1

# Write the new file
df_1 = df[mask] # or directly use: df_1 = df[df['Time'] < 0.1]

# save it 
df_1.to_csv("Testfile1.csv")

print(df_1)
    No.      Time  Length
0    1  0.000000     146
1    2  0.006752     116
2    3  0.019767     156
3    4  0.039635     144
4    5  0.060090     147
5    6  0.069165     138
6    7  0.079700     133
7    8  0.099397     135

# For 0.1 to 0.2, apply 2 logical conditions
df_2 = df[(df['Time'] > 0.1) & (df['Time'] < 0.2)]
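A small optional side note (not part of the original steps): by default to_csv also writes the DataFrame's row index as an extra unnamed first column. If you don't want that column in the split files, you can pass index=False:

df_1.to_csv("Testfile1.csv", index=False)  # omit the pandas row index column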

The script idea:

import pandas as pd
import numpy as np

mystep = 0.2 # the step e.g. 0.2, 0.4, 0.6 

#define the function
def data_splitter(df):
    max_time = df['Time'].max() # get max value of Time for the current csv file (df)
    myrange= np.arange(0, max_time, mystep) # build the threshold range
    for k in range(len(myrange)):
        # build the upper values 
        temp = df[(df['Time'] >= myrange[k]) & (df['Time'] < myrange[k] + mystep)]
        temp.to_csv("/Users/serafeim/Downloads/aaa_{}.csv".format(k))

Now, call the function:

df = pd.read_csv("/Users/serafeim/Downloads/Testfile.csv")
data_splitter(df) # pass the df to the function and call the function

Finally, you can create a loop and pass each df one by one to the data_splitter() function.

To make it clearer what the function does, look at this:

for k in range(len(myrange)):
    print(myrange[k], myrange[k] + mystep)

This prints:

0.0 0.2
0.2 0.4
0.4 0.6000000000000001
0.6000000000000001 0.8
0.8 1.0

So it creates the lower & upper thresholds automatically, based on the max value of the Time column of the current .csv file.
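The 0.6000000000000001 above is just floating-point noise from np.arange; it doesn't affect the logic. If you prefer clean threshold values, one optional tweak (a suggestion, not part of the original answer) is to round the range:

myrange = np.round(np.arange(0, max_time, mystep), 6)  # e.g. 0.0, 0.2, 0.4, 0.6, 0.8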

EDIT 2:

import glob, os
path = '/Volumes/'

mystep = 0.2 

for filename in glob.glob(os.path.join(path, '*.csv')):
    df = pd.read_csv(filename)
    data_splitter(df)

PUTTING ALL TOGETHER:

import pandas as pd
import numpy as np
import glob
import os

path = '/root/Desktop/TT1/'
mystep = 0.4

#define the function
def data_splitter(df, name):
    max_time = df['Time'].max() # get max value of Time for the current csv file (df)
    myrange= np.arange(0, max_time, mystep) # build the threshold range
    for k in range(len(myrange)):
        # build the upper values 
        temp = df[(df['Time'] >= myrange[k]) & (df['Time'] < myrange[k] + mystep)]
        temp.to_csv("/root/Desktop/T1/{}_{}.csv".format(name, k))

for filename in glob.glob(os.path.join(path, '*.csv')):
    df = pd.read_csv(filename)
    name = os.path.split(filename)[1] # get the name of the file
    data_splitter(df, name) # call the splitting function
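Note that `name` above keeps the `.csv` extension, so the outputs come out named like `aaa.csv_0.csv`. If you want names like `aaa1.csv` instead (as asked in the question), here is a minimal optional sketch of the same loop, assuming the same input and output folders as above:

import glob
import os

import numpy as np
import pandas as pd

path = '/root/Desktop/TT1/'    # input folder, assumed same as above
out_dir = '/root/Desktop/T1/'  # output folder, assumed same as above
mystep = 0.4

os.makedirs(out_dir, exist_ok=True)  # create the output folder if it does not exist

for filename in glob.glob(os.path.join(path, '*.csv')):
    df = pd.read_csv(filename)
    base = os.path.splitext(os.path.basename(filename))[0]  # 'aaa' from 'aaa.csv'
    myrange = np.arange(0, df['Time'].max(), mystep)
    for k in range(len(myrange)):
        temp = df[(df['Time'] >= myrange[k]) & (df['Time'] < myrange[k] + mystep)]
        # writes aaa1.csv, aaa2.csv, ...; index=False omits the pandas row index
        temp.to_csv(os.path.join(out_dir, '{}{}.csv'.format(base, k + 1)), index=False)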

seralouk
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/196880/discussion-on-answer-by-serafeim-how-to-split-a-csv-file-into-multiple-csv-based). – Samuel Liew Jul 23 '19 at 15:27
  • @serafeim I tried to write the output files to a different folder with the same subfolder names as the original files, but no luck. Anyway, thanks for all the help. I accepted your answer. If you know how to fix the above question, please post. Thanks again – Hasa Jul 23 '19 at 16:26
  • Done. I accepted both answers, but I think Stack Overflow allows me to accept only one answer; that's why one answer got unaccepted. It would be a great help if you could help me solve the issue. :) Thanks. – Hasa Jul 23 '19 at 17:23
  • What exactly do you want to do? Write the new files where? With what name? Does the code work on your system? – seralouk Jul 23 '19 at 17:47
  • I got a folder '/root/Desktop/InputFiles' (this is my path). In this InputFiles folder there are subfolders (let's say 2 subfolders), Sub1 and Sub2. The Sub1 folder has a1.csv and a2.csv; the Sub2 folder has b1.csv and b2.csv. I want to split all these csv files in 0.4 sec. The output files need to get written to the '/root/Desktop/OutputFiles' path. In this OutputFiles folder I need to get two subfolders with the same names as earlier, Sub1 and Sub2. The split files of a1.csv and a2.csv need to get written to the Sub1 folder, and the split files of b1.csv and b2.csv need to get written to the Sub2 folder. – Hasa Jul 23 '19 at 18:34
  • @serafeim In my InputFiles folder (the TT1 folder in the code) there is more than one csv file, but with the existing code only one csv file gets split. I really appreciate your help. Thank you so much for being patient with me :) and helping. – Hasa Jul 23 '19 at 19:03
  • My current code uses `'*.csv'` and captures all the `.csv` files inside `path`. One solution in your case is to set `path='/root/Desktop/InputFiles/Sub1/'` directly, run the code, and then `path='/root/Desktop/InputFiles/Sub2/'`. If you want to make it more automatic, you need another loop that finds Sub1 and Sub2 and changes the `path` automatically, but that could be another question – seralouk Jul 23 '19 at 19:30
  • No dear, I tried several times. With '*.csv' it does not capture all the .csv files inside the path. Inside the TT1 folder there are two subfolders, and within those subfolders there are csv files. When I give the path as path='/root/Desktop/TT1' it won't process all the files within the subfolders – Hasa Jul 24 '19 at 10:03
  • I know. Read my previous comment again; I explained what to do – seralouk Jul 24 '19 at 10:09
  • Okay, I have posted it as a new question: https://stackoverflow.com/questions/57180566/capture-csv-files-within-subfolders-python-3-x – Hasa Jul 24 '19 at 10:18
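For the Sub1/Sub2 layout discussed in the comments above, a minimal sketch (assuming the /root/Desktop/InputFiles and /root/Desktop/OutputFiles paths mentioned there) could walk the input tree and mirror each subfolder under the output root:

import os

import numpy as np
import pandas as pd

in_root = '/root/Desktop/InputFiles'    # assumed paths from the comments above
out_root = '/root/Desktop/OutputFiles'
mystep = 0.4

for dirpath, dirnames, filenames in os.walk(in_root):
    for fname in filenames:
        if not fname.lower().endswith('.csv'):
            continue
        # recreate the same subfolder (Sub1, Sub2, ...) under the output root
        rel_dir = os.path.relpath(dirpath, in_root)
        out_dir = os.path.normpath(os.path.join(out_root, rel_dir))
        os.makedirs(out_dir, exist_ok=True)

        df = pd.read_csv(os.path.join(dirpath, fname))
        base = os.path.splitext(fname)[0]  # 'a1' from 'a1.csv'
        myrange = np.arange(0, df['Time'].max(), mystep)
        for k in range(len(myrange)):
            chunk = df[(df['Time'] >= myrange[k]) & (df['Time'] < myrange[k] + mystep)]
            chunk.to_csv(os.path.join(out_dir, '{}{}.csv'.format(base, k + 1)), index=False)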
-1

Assume that you have 2 directories: Source and Test. Source contains all the source csv files and the Test directory will have all the output files.

import os
import glob

os.chdir("/home/prasanth-8508/Downloads/Source")
for csv_file in glob.glob("*.csv"):
    contents, output_list = list(), list()
    with open(csv_file) as f:
        contents.append(f.read().replace('"', ""))

    contents = ''.join(contents).split('\n')
    header = contents[0]
    contents = contents[1:]
    op_file_counter = 1
    split_factor = float(input("Enter split factor:"))
    split_num = split_factor
    i = 0
    contents = list(filter(None, contents))

    # walk through the data rows, grouping them into chunks per time threshold
    while i < len(contents):
        try:
            row = contents[i].split(",")
            # A row stays in the current chunk until a Time value (row[1]) whose
            # leading digits match the current threshold string (e.g. "0.4") shows up.
            if not(str(float(row[1])).startswith(str(split_num)[0:str(split_num).index(".")+2], 0, str(split_num).index(".")+2)):
                output_list.append(contents[i])
                i += 1
            else:
                # Threshold reached: flush the current chunk to a numbered output file
                # and advance the threshold (the row is re-checked on the next pass).
                if len(output_list) > 0:
                    with open("/home/prasanth-8508/Downloads/Test/file" + str(op_file_counter) + ".csv", "a+") as f:
                        f.write(header+'\n')
                        for j in output_list:
                            f.write(j+'\n')
                    op_file_counter += 1
                    output_list = list()
                split_num += split_factor
                split_num = round(split_num,1)
                print(split_num)
        except IndexError:
            break

    with open("/home/prasanth-8508/Downloads/Test/file" + str(op_file_counter) + ".csv", "a+") as f:
        f.write(header+'\n')
        for j in output_list:
            f.write(j+'\n')

    print(csv_file+" processed successfully")

I got more than 600 csv files after running the program, which is too many to share here.

Prasanth Ganesan