
I have huge .CSV files with more than 3000 columns each that I need to load into a database. Because the database's table limit is 1024 columns, I want to split each .CSV file into multiple files of 1024 or fewer columns.

Here is what I have tried so far, with the help of previous questions on this topic.

Python:

import csv
import os

source = 'H:\\Programs\\Exploratory Items\\Outdated\\2018\\csvs\\'
for root, dirs, filenames in os.walk(source):
    for f in filenames:
        fullpath = os.path.join(source, f)
        output_specifications = (
            (fullpath[:-4] + 'pt1.csv', slice(1000)),
            (fullpath[:-4] + 'pt2.csv', slice(1000, 2000)),
            (fullpath[:-4] + 'pt3.csv', slice(2000, 3000)),
            (fullpath[:-4] + 'pt4.csv', slice(3000, 4000)),
            (fullpath[:-4] + 'pt5.csv', slice(4000, 5000)),
        )
        # one csv writer per part file, paired with the column slice it covers
        output_row_writers = [
            (
                csv.writer(open(file_name, 'w', newline=''),
                           quoting=csv.QUOTE_MINIMAL).writerow,
                selector,
            )
            for file_name, selector in output_specifications
        ]

        # read the source file once, fanning each row out to every part file
        with open(fullpath, newline='') as source_file:
            for row in csv.reader(source_file):
                for row_writer, selector in output_row_writers:
                    row_writer(row[selector])

The issue with the above Python code is that it takes forever to split and write these files, because (as I understand it) it writes them row by row. That is not ideal for my case, as I have more than 200 .CSV files with 1000+ rows in each.
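
A column-wise approach may avoid the per-row fan-out. Here is a minimal sketch, assuming pandas is installed and each file fits in memory; 'input.csv' and the part-file names are placeholders:

import pandas as pd

chunk_width = 1024  # the database's column limit

# read the whole file once; dtype=str keeps every value exactly as written
df = pd.read_csv('input.csv', header=None, dtype=str)

# write each block of at most 1024 columns to its own part file
for part, start in enumerate(range(0, df.shape[1], chunk_width), 1):
    df.iloc[:, start:start + chunk_width].to_csv(
        'input_pt{}.csv'.format(part), index=False, header=False)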

What I am trying now:

- The cut command (POSIX), but I use Windows, so I will try this on an Ubuntu machine.

I also want to try this in R:

I have code that converts all my SPSS (.sav) files to .csv, which works efficiently, so I want to extend it at this stage so that it also splits each file by column into multiple .csvs (see the sketch after the code below).

setwd("H:\\Programs\\2018")
getwd()
list.files()

files <- list.files(path = '.', pattern = '\\.sav$')

library(foreign)
for (f in files) {  # iterate over the .sav files
  data <- read.spss(f, to.data.frame = TRUE, use.value.labels = FALSE)
  write.csv(data, paste0(strsplit(f, split = '.', fixed = TRUE)[[1]][1], '.csv'))
}
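
For reference, the same convert-and-split pipeline could be sketched in Python instead of R. This assumes the pyreadstat package (which I have not used above) is installed; the glob pattern and part-file naming are placeholders:

import glob

import pyreadstat  # assumed available: reads SPSS .sav files into pandas DataFrames

chunk_width = 1024  # the database's column limit
for path in glob.glob('*.sav'):  # placeholder pattern
    df, meta = pyreadstat.read_sav(path)
    base = path[:-len('.sav')]
    df.to_csv(base + '.csv', index=False)  # full conversion, as in the R loop
    # additionally write part files of at most 1024 columns each
    for part, start in enumerate(range(0, df.shape[1], chunk_width), 1):
        df.iloc[:, start:start + chunk_width].to_csv(
            base + '_pt{}.csv'.format(part), index=False)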

Thank you

References - Python code ref

vap0991
  • where is the 1024 column limit from exactly? the database? – MichaelChirico Mar 13 '18 at 14:19
  • First and foremost question: why does your table have thousands of columns? This is both a database design issue and a dataset issue. Most analyses require long format (not wide). – Parfait Mar 13 '18 at 14:19
  • This is how I receive the data for processing, something I have no control over. It is still in .CSV format; in order to load it into a DB that has the 1024 limitation, I want to split it. @Parfait – vap0991 Mar 13 '18 at 14:22
  • @MichaelChirico yes the DB – vap0991 Mar 13 '18 at 14:22
  • @vap0991 Can you upload at least part of this csv to something like [bpaste](https://bpaste.net/)? Maybe the top 100 rows or something? – G_M Mar 13 '18 at 15:35
  • @DeliriousLettuce Sure let me do that now – vap0991 Mar 13 '18 at 15:38
  • [link](https://bpaste.net/raw/4fda3d64230e) @DeliriousLettuce – vap0991 Mar 13 '18 at 16:20
  • @vap0991 That isn't a csv file (comma-separated); that looks like a tab-delimited file. That could be part of the problem with the Python code (and maybe even those shell commands), since they are looking to split on commas. – G_M Mar 13 '18 at 16:23
  • @DeliriousLettuce Sorry, I should have mentioned that I copied it directly from Excel, as the .csv format was killing the bpaste page, but it is a .csv. – vap0991 Mar 13 '18 at 16:26
  • @vap0991 Oh, I see. It's just hard for me to experiment with it when it isn't in the proper format. – G_M Mar 13 '18 at 16:27
  • @DeliriousLettuce Let's try this again – [link](https://bpaste.net/show/1760955f5c3b). Let me know if that works. – vap0991 Mar 13 '18 at 16:30

1 Answer


Late is better than never :) Here's a solution based on the code-generating library convtools:

from convtools import conversion as c
from convtools.contrib.tables import Table


columns_per_file = 1000
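# "filenames" is assumed to be an iterable of paths to the csv files to split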
for filename in filenames:
    # reading columns
    columns = Table.from_csv(filename, header=False).columns

    # slicing
    for part_number, i in enumerate(
        range(0, len(columns), columns_per_file), 1
    ):
        Table.from_csv(filename, header=False).take(
            # taking only needed ones
            *(col for col in columns[i : i + columns_per_file])
        ).into_csv(
            # streaming to the part
            "{}.pt{}.csv".format(filename.replace(".csv", ""), part_number),
            include_header=False,
        )
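
One design note: because Table.from_csv is called inside the slicing loop, each source file is re-read once per output part; the upside is that rows are processed as a stream, so whole files should not be held in memory.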

westandskif