
I read this question, but it does not seem to answer my situation (maybe I misread it): Splitting very large csv files into smaller files

I have a large CSV file (1.0 GB) with many individual rows (over 1 million) and 8 columns. Two columns represent date and time, while the others hold stock-related information (price, etc.). I would like to save individual files separated by the date and time attributes. So, if there are 100 different date-time combinations, I would extract the rows for each combination and save them as a separate CSV file under a subfolder like C:/Date/Time/filename.csv.

I am currently using pandas: I loop through a list of the date-time combinations, filter the DataFrame for each combination to get the required rows, and save the result to its own file. This is taking a very long time.
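Here is a minimal sketch of roughly what my current loop looks like. The file names are placeholders, and quote_datetime and expiration are the two date/time columns (see the sample below):

```python
import os

import pandas as pd

df = pd.read_csv("stock_data.csv")  # placeholder name for the 1 GB file

# One pass to find the unique date-time combinations, then one full-frame
# filter per combination -- this per-combination filtering is the slow part.
combos = df[["expiration", "quote_datetime"]].drop_duplicates()
for expiration, quote_dt in combos.itertuples(index=False):
    subset = df[
        (df["expiration"] == expiration) & (df["quote_datetime"] == quote_dt)
    ]
    # Slashes and colons are not valid in Windows folder names.
    out_dir = os.path.join(
        "C:/",
        str(expiration).replace("/", "-"),
        str(quote_dt).replace("/", "-").replace(":", "-"),
    )
    os.makedirs(out_dir, exist_ok=True)
    subset.to_csv(os.path.join(out_dir, "filename.csv"), index=False)
```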

Is there a better way to accomplish this?

(I will look into multithreading as well, but I don't expect it to improve the speed significantly.)

Thanks!

Adding a sample of input data:

[image: sample of the input data; the columns include quote_datetime (a date plus hours:minutes) and expiration (an expiration date), among others]

  • What does *"combinations of date-time"* mean, exactly? Show a meaningful sample of the input. Also explain if the input is ordered by date. – Tomalak Nov 08 '21 at 17:21
  • @Tomalak, I added a sample of input data. The quote_datetime column has a date and time (hours:minutes). There is also a date column called expiration; there may be many different expiration dates, even though only one is shown in the sample I added. I need to save separate files with all the columns listed in the sample, where each hour/minute from the quote_datetime column and each expiration date form a unique combination: a file for 9:31 and 9/7/2016, another file for 9:32 and 9/7/2016, etc. – Vasiliy Nov 08 '21 at 17:53
  • "Is there a better way to accomplish this?" - Yes, it's called SQL. – Antimon Nov 08 '21 at 18:26
  • @Antimon, can you please elaborate on this? I am new to working with such large quantities of data. Are you suggesting putting everything in a database instead of csv files? Or putting it in a database and then creating csv files from the database? – Vasiliy Nov 08 '21 at 18:35
  • I'd suggest putting your 1 GB csv file into a SQL database. That will definitely be a better tool than wrangling everything from a single csv file with python. If you must, you can still export subsets of that data to csv files later on. – Antimon Nov 08 '21 at 18:49
  • That's solid advice. Read all of it into an 8-column table, set up indexes on the date columns, and use SQL to query the data set. A million rows is nothing for a half-way decent database. – Tomalak Nov 08 '21 at 19:06 (see the sketch below these comments)
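For reference, a minimal sketch of the SQLite route suggested in the comments above. The file, table, and index names are assumptions; the CSV is loaded in chunks so the 1 GB file never has to fit in memory at once, and the two date columns are indexed before querying:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("stock_data.db")  # hypothetical database file

# Append the CSV to a table in manageable chunks instead of reading it whole.
for chunk in pd.read_csv("stock_data.csv", chunksize=100_000):
    chunk.to_sql("quotes", conn, if_exists="append", index=False)

# Index the two date columns so per-combination lookups do not scan the table.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_expiration_quote_datetime "
    "ON quotes (expiration, quote_datetime)"
)
conn.commit()

# Individual CSV files can still be exported one combination at a time.
subset = pd.read_sql_query(
    "SELECT * FROM quotes WHERE expiration = ? AND quote_datetime = ?",
    conn,
    params=("9/7/2016", "9:31"),
)
subset.to_csv("subset.csv", index=False)
conn.close()
```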

0 Answers