Split csv into smaller files based on row value

Question

I have a large csv with the following header columns id, type,state,location

and the following values:

124, preschool, Pennsylvania, Pittsburgh
421, secondary school, Ohio, Cleveland
213, primary school, California, Los Angeles
155, secondary school, Pennsylvania, Pittsburgh
etc...

The file is not ordered and I want a csv file for each type of school.

The answers that I found were regarding to ordered csv files, or splitting them after a specific number of rows.

EDIT: I discovered what I wanted:

import csv

csv_file = 'school.csv'
value = 'preschool'

with open(csv_file, 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        if value in row:
            with open(value + '.csv', 'ab') as myfile:
                spamwriter = csv.writer(myfile)
                spamwriter.writerow(row)
                myfile.close()

and to keep the header columns I just copy paste them from the original one

Hello Gabriel and welcome to the site. I'm not sure what exactly you're asking, could you please explain exactly what you want to happen, and what you have tried so far? — MackM, May 30 '18 at 16:38
I have a file named schools.csv which contains the id, type of school, the state of the school and the location. The file has 30.000 rows and I want to split it in smaller files based on the type of the school. Is this a better title for this question. Splitting the file based on the values in a specific column ? — monchisan, May 30 '18 at 18:07

score 0 · Accepted Answer · 2018-06-04T16:21:57.773

0

All you need is a row comparison. Then based on the match it appends to a different file.

Also The try/except checks if the file exists, if not writes the header. This is the simplest way to achieve the results you are looking for. You could also do a comparison if the file does exist to ensure the header is formatted in the manner of your choosing, but this answer already goes way beyond the scope of the original question.

import csv


csv_file = 'file/path/file_name'

values = ['preschool', 'secondary school']


def csv_header(x):
    with open(x + '.csv', 'ab') as myfile:
        myfile.write("%s %s %s %s \n" % ('id', 'type', 'state', 'location'))
        myfile.close()


def csv_writer(y, value):
    for row in y:
        if value in row:
            with open(value + '.csv', 'ab') as myfile:
                spamwriter = csv.writer(myfile)
                spamwriter.writerow(row)
                myfile.close()


def csv_reader(z):
    with open(z + '.csv', 'rb') as spam:
        spamreader = csv.reader(spam, delimiter=',', quotechar='|')
        csv_writer(spamreader, value)


for value in values:
    try:
        csv_reader(value)
        csv_reader(csv_file)

    except:
        csv_header(value)
        csv_reader(csv_file)

https://docs.python.org/2/library/csv.html

edited Jun 04 '18 at 16:21

answered May 30 '18 at 16:42

what library do I need to use? can you give me a full code? – monchisan May 30 '18 at 18:09
I edited with a complete working script and included documentation link. – May 30 '18 at 18:25
thank you very much. Unfortunately I don't have 15 reputation so I can't upvote you :( – monchisan May 30 '18 at 18:42
Well get out there and get yourself some rep points. Then come back and revisit this post. Glad I could help. – May 30 '18 at 18:45
Hello again. The output csv has no header columns and all the data is on the same row. Something like this: `124, preschool, Pennsylvania, Pittsburgh, 213, preschool, Pennsylvania, Pittsburgh` – monchisan May 31 '18 at 07:27
I say this because I need the csv in the same format, because I will convert it later in GeoJSON and I need to keep the properties. The output csv got with this code returns an empty geojson when it is converted – monchisan May 31 '18 at 07:32
I added the new line to the writer, but for the headers this is a separate question really. Do you want to write a new file each time or append the file. If you append then you would need comparison statements. This has already been answered though... https://stackoverflow.com/questions/20347766/pythonically-add-header-to-a-csv-file#20348305 – Jun 01 '18 at 13:55
I want each output file to have the header in it – monchisan Jun 01 '18 at 14:45
I understand what you want. I'm saying that you need to decide if each time you run the script if you will overwrite your old data with open(value + '_.csv', 'wb') or append the data the already exists with open(value + '_.csv', 'ab'). If you append you will need to write a comparison statement to check if the headers already exist. – Jun 01 '18 at 14:49
append is better – monchisan Jun 03 '18 at 12:42
Then you need to edit the question you posted...It should be something like "how to split CSV into child files based on row values and write header if it doesn't exist in child CSVs?"....either way I have edited my answer. – Jun 04 '18 at 16:18

Split csv into smaller files based on row value

1 Answers1