-1

I am trying to split a CSV file containing stock data of 1500+ companies. The first column contains dates and subsequent columns contain company data.

Goal 1: I'm trying to split the huge CSV file into smaller pieces. Let's say 30 companies per smaller file. To do this, I need to split the CSV by column number, not rows. I've been looking up code snippets but I haven't found anything that does this exactly. Also, each separate file would need to contain the first column, i.e. the dates.

Goal 2: I want to make the company name a column of its own, the date a column of its own and the indicators columns of their own. So, I can call the data for a company as a single record (row) in Django - I don't need all the dates, just the last day of every quarter. Right now, I'm having to filter the data by date and indicator and set that as an object to display in my frontend.

enter image description here

If you have questions, just ask.

EDIT:

Following is some code I patched together.

import os
import csv
from math import floor
from datetime import datetime
import re

class SimFinDataset:
    def __init__(self, dataFilePath, csvDelimiter = "semicolon"):

        self.numIndicators = None
        self.numCompanies = 1

        # load data
        self.loadData(dataFilePath, csvDelimiter)

    def loadData(self, filePath, delimiter):

        numRow = 0

        delimiterChar = ";" if delimiter == "semicolon" else ","

        csvfile = open(filePath, 'rb')
        reader = csv.reader(csvfile, delimiter=delimiterChar, quotechar='"')
        header = next(reader)
        row_count = sum(1 for _ in reader)
        csvfile.seek(0)

        for row in reader:
            numRow += 1
            if numRow > 1 and numRow != row_count and numRow != row_count-1:
                # company id row
                if numRow == 2:
                    rowLen = len(row)
                    idVal = None
                    for index, columnVal in enumerate(row):
                        if index > 0:
                            if idVal is not None and idVal != columnVal:
                                self.numCompanies += 1
                                if self.numIndicators is None:
                                    self.numIndicators = index - 1
                            if index + 1 == rowLen:
                                if self.numIndicators is None:
                                    self.numIndicators = index
                            idVal = columnVal
                if numRow > 2 and self.numIndicators is None:
                    return
                else:
                    filename = 1
                    with open(str(filename) + '.csv', 'wb') as csvfile:
                        if self.numCompanies % 30 == 0:
                            print ("im working")
                            spamwriter = csv.writer(csvfile, delimiter=';')
                            spamwriter.writerow(header)
                            spamwriter.writerow(row)
                            filename += 1

                #print (self.numIndicators)

dataset = SimFinDataset('new-data.csv','semicolon')
Tanzir Rahman
  • 165
  • 1
  • 2
  • 19
  • Did you try anything? Show us your code. –  Apr 25 '18 at 07:42
  • I can't figure out how to split CSV by column numbers. The code I have is all wrong, I've been splitting by row limits using code snippets I found. Please just tell me how to divide this file by columns. – Tanzir Rahman Apr 25 '18 at 07:45

1 Answers1

0

a solution for goal1 is here. splitting CSV file by columns

However you have the pandas way:

import pandas as pd

# let's say first 10 columns

csv_path="mycsv.csv"
out_path ="\\...\\out.csv"
pd.read_csv(csv_path).iloc[:, :10].to_csv(out_path)

You can also do something like

mydf.groupby("company_name").unstack()`

To make each company a column of its own

sslloo
  • 521
  • 2
  • 10