split csv file into multiple files csv file based on column values - column count is not fixed for each entity

Question

I have a CSV file that has the 3rd column as an entity like A1, B1, C1, etc. but all entity has different columns count for example --

Entity A1 has 7 columns
Entity B1 has 10 columns
Entity C1 has 5 columns
Entity D1 has 20 columns The CSV file has data like the below combine.csv

New,2923,A1,3443,44,333,4Y
New,2924,A1,3443,43,333,3Y
New,2925,A1,3443,42,333,3Y
New,2925,A1,3443,41,333,2Y
New,2122,B1,3413,21,313,3Y,WESS,F,Date=20230510
New,2121,B1,3423,21,334,3Y,WESS,F,Date=20230510
New,2120,B1,3423,21,335,2Y,WESS,F,Date=20230510
New,2122,C1,3413,21,313
New,2121,C1,3423,21,334
New,2120,C1,3423,21,335

three separate CSV files
A1.csv 
New,2923,A1,3443,44,333,4Y 
New,2924,A1,3443,43,333,3Y 
New,2925,A1,3443,42,333,3Y 
New,2925,A1,3443,41,333,2Y 

B1.csv 
New,2122,B1,3413,21,313,3Y,WESS,F,Date=20230510
New,2121,B1,3423,21,334,3Y,WESS,F,Date=20230510
New,2120,B1,3423,21,335,2Y,WESS,F,Date=20230510 

C1.csv 
New,2122,C1,3413,21,313 
New,2121,C1,3423,21,334 
New,2120,C1,3423,21,335

so I need to find any solution.

I need to find any solution to generate separate CSV files

split the stream by " ", after that on each data element search for A1,B1,C1,D1 and make a dictionary. using Entity as key and respective list item to it. — Karan Gehlod, Jun 23 '23 at 12:52

score 1 · Answer 1 · answered Jun 23 '23 at 13:33

Python approach

I take a different approach from the others.

import csv
import os

# Define where the data should go
_outFolder = r'C:\ssisdata\Output'
# What is our source file
_src = r'C:\ssisdata\Input\so_76534070.csv'

# I am going to use a dictionary to accumulate all the rows by Entity (key)
# This is a terrible approach for a machine with limited memory and very large source files
_memory_eater = {}

# Use the CSV libraries to parse the CSV and enumerate through the lines
with open(_src, 'r') as file:
    csvreader = csv.reader(file, delimiter=',', quotechar='"')
    for row in csvreader:
        # what is the 3rd element
        _element = row[2]
        if (not _element in _memory_eater):
            # add it
            _memory_eater[_element] = []

        # add the row to the dictionary
        # Lazy encoding, if you need escapting, etc, then use the library
        _memory_eater[_element].append(",".join(row))
        

# What are the unique entities?
_entities = list(_memory_eater.keys())
print(_entities)

# at this point, we now need to save as output
for _entity in _entities:
    # Generates a file like so_A1.csv, so_B1.csv etc in the Output folder
    _outFileName = os.path.join(_outFolder, "so_{0}.csv".format(_entity))
    #print(_outFileName)
    with open(_outFileName, 'w') as fout:
       # Convert the list of value in the dictionary to a 
       #newline delimited string and write to file
        _data = "\n".join(_memory_eater[_entity])
        fout.writelines(_data)

.NET

You'll have a similar approach here but I don't know of a native CSV library. Unless you add a reference to the VB.net libraries and then something like Parse Delimited CSV in .NET might get you there

SSIS

As a soul with 18 years of experience with writing SSIS packages, you could make this work in a data flow but as someone with 18 years of experience, you do not want to make this work with a Data Flow.

A data flow is all about consistent metadata and this is not consistent.

You could use SSIS as the orchestration and execution framework but in that case, you're going to write a Script Task which is covered under the .NET approach.

richard · Answer 2 · 2023-06-23T13:36:53.547

0

https://onlinegdb.com/_MXLJp8kX3

import csv

csv_map = {}
with open('source.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        key = row[2]
        if key not in csv_map:
            fh = open(f'{key}.csv', 'w', newline='')
            csv_map[key] = {
                'fh': fh, 'writer': csv.writer(fh)
                }
        csv_map[key]['writer'].writerow(row)

for key in csv_map: csv_map[key]['fh'].close()

edited Jun 23 '23 at 13:36

answered Jun 23 '23 at 13:00

richard

144
3

thanks, but no of entity is also not fixed it would different each file, I not entity count , some a new entity should be introduced – hardik rawal Jun 23 '23 at 13:06
updated reply with new code given your new details. – richard Jun 23 '23 at 13:38

score 0 · Answer 3 · answered Jun 23 '23 at 13:13

#create a diction with key as A1, B1, C1 and values as list string
dict = {"A1": [], "B1": [], "C1": []}
#create a stream of data
data = "New,2923,A1,3443,44,333,4Y New,2924,A1,3443,43,333,3Y New,2925,A1,3443,42,333,3Y New,2925,A1,3443,41,333,2Y New,2122,B1,3413,21,313,3Y,WESS,F,Date=20230510 New,2121,B1,3423,21,334,3Y,WESS,F,Date=20230510 New,2120,B1,3423,21,335,2Y,WESS,F,Date=20230510 New,2122,C1,3413,21,313 New,2121,C1,3423,21,334 New,2120,C1,3423,21,335"
#split the data into a list
data = data.split(" ")

#for loop on each element in the list
for i in data:
    if(i.split(",")[2] == "A1"):
        #add value to the dictionary with key as A1 whenever data contains A1
        dict["A1"] = dict.get("A1", []) + [i]
    elif(i.split(",")[2] == "B1"):
        #add value to the dictionary with key as B1 whenever data contains B1
        dict["B1"] = dict.get("B1", []) + [i]
    elif(i.split(",")[2] == "C1"):
        #add value to the dictionary with key as C1 whenever data contains C1
        dict["C1"] = dict.get("C1", []) + [i]

#print data for each key in the dictionary
print(dict["A1"])
print(dict["B1"])
print(dict["C1"])

score 0 · Answer 4 · answered Jun 23 '23 at 14:47

Asuming that every third entry in your combined CSV file always represent an entity or category, the following works for unlimited categories.

csv_content = []
with open('combined.csv','r') as file:
    csv_content = file.readlines()
csv_dict = {}
for line in csv_content:
    temp = line.split(',')
    if len(temp)  > 1:
        if temp[2] in csv_dict.keys():
            csv_dict[temp[2]].append(line)
        else:
            csv_dict[temp[2]] = []
            csv_dict[temp[2]].append(line)
    else:
        pass

for key in csv_dict:
    file_name = key+'.csv'
    with open(file_name,'a+') as file:
        for item in csv_dict[key]:
            file.write(item)

Note file read and write location is also assumed to be same as the script location. Otherwise We need to specify input and output location as follows:

input_path = 'C:\\Source\\combined.csv'
output_directory = 'C:\\Destination'

csv_content = []
with open(input_path,'r') as file:
    csv_content = file.readlines()
csv_dict = {}
for line in csv_content:
    temp = line.split(',')
    if len(temp)  > 1:
        if temp[2] in csv_dict.keys():
            csv_dict[temp[2]].append(line)
        else:
            csv_dict[temp[2]] = []
            csv_dict[temp[2]].append(line)
    else:
        pass

for key in csv_dict:
    file_name = output_directory+'\\'+key+'.csv'
    with open(file_name,'a+') as file:
        for item in csv_dict[key]:
            file.write(item)

split csv file into multiple files csv file based on column values - column count is not fixed for each entity

4 Answers4

Python approach

.NET

SSIS