
I have huge .CSV files with more than 3000 columns each that I need to load into a database. Because the database's table limit is 1024 columns, I want to split each .CSV file into multiple files of 1024 or fewer columns.

Here is what I have tried so far, with the help of previous questions on this topic.

Python:

import csv
import os

source = 'H:\\Programs\\Exploratory Items\\Outdated\\2018\\csvs\\'
for root, dirs, filenames in os.walk(source):
    for f in filenames:
        fullpath = os.path.join(source, f)
        output_specifications = (
            (fullpath[:-4] + 'pt1.csv', slice(1000)),
            (fullpath[:-4] + 'pt2.csv', slice(1000, 2000)),
            (fullpath[:-4] + 'pt3.csv', slice(2000, 3000)),
            (fullpath[:-4] + 'pt4.csv', slice(3000, 4000)),
            (fullpath[:-4] + 'pt5.csv', slice(4000, 5000)),
        )
        # one csv writer per part file, paired with the column slice it covers
        output_row_writers = [
            (
                csv.writer(open(file_name, 'w', newline=''),
                           quoting=csv.QUOTE_MINIMAL).writerow,
                selector,
            )
            for file_name, selector in output_specifications
        ]

        # read the source file once, fanning each row out to every part file
        with open(fullpath, newline='') as source_file:
            for row in csv.reader(source_file):
                for row_writer, selector in output_row_writers:
                    row_writer(row[selector])

The issue with the above Python code is that it takes forever to split and write these files, because (as I understand it) it writes them row by row. That is not ideal for my case, as I have more than 200 .CSV files with 1000+ rows in each.
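
A column-wise approach may avoid the per-row fan-out. Here is a minimal sketch, assuming pandas is installed and each file fits in memory; 'input.csv' and the part-file names are placeholders:

import pandas as pd

chunk_width = 1024  # the database's column limit

# read the whole file once; dtype=str keeps every value exactly as written
df = pd.read_csv('input.csv', header=None, dtype=str)

# write each block of at most 1024 columns to its own part file
for part, start in enumerate(range(0, df.shape[1], chunk_width), 1):
    df.iloc[:, start:start + chunk_width].to_csv(
        'input_pt{}.csv'.format(part), index=False, header=False)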

What I am trying now:

- The cut command (POSIX), but I use Windows, so I will try this on an Ubuntu machine.

I also want to try this in R:

I have code that converts all my SPSS (.sav) files to .csv, which works efficiently, so I want to extend it at this stage so that it also splits each file by column into multiple .csvs (see the sketch after the code below).

setwd("H:\\Programs\\2018")
getwd()
list.files()

files <- list.files(path = '.', pattern = '\\.sav$')

library(foreign)
for (f in files) {  # iterate over the .sav files
  data <- read.spss(f, to.data.frame = TRUE, use.value.labels = FALSE)
  write.csv(data, paste0(strsplit(f, split = '.', fixed = TRUE)[[1]][1], '.csv'))
}
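
For reference, the same convert-and-split pipeline could be sketched in Python instead of R. This assumes the pyreadstat package (which I have not used above) is installed; the glob pattern and part-file naming are placeholders:

import glob

import pyreadstat  # assumed available: reads SPSS .sav files into pandas DataFrames

chunk_width = 1024  # the database's column limit
for path in glob.glob('*.sav'):  # placeholder pattern
    df, meta = pyreadstat.read_sav(path)
    base = path[:-len('.sav')]
    df.to_csv(base + '.csv', index=False)  # full conversion, as in the R loop
    # additionally write part files of at most 1024 columns each
    for part, start in enumerate(range(0, df.shape[1], chunk_width), 1):
        df.iloc[:, start:start + chunk_width].to_csv(
            base + '_pt{}.csv'.format(part), index=False)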

Thank you

References - Python code ref

vap0991
  • where is the 1024 column limit from exactly? the database? – MichaelChirico Mar 13 '18 at 14:19
  • First and foremost question: why does your table have thousands of columns? This is both a database design issue and a dataset issue. Most analyses require long format (not wide). – Parfait Mar 13 '18 at 14:19
  • This is how I receive the data for processing, something I have no control over. It is still in .CSV format; in order to load it into a DB that has the 1024 limitation, I want to split it. @Parfait – vap0991 Mar 13 '18 at 14:22
  • @MichaelChirico yes the DB – vap0991 Mar 13 '18 at 14:22
  • @vap0991 Can you upload at least part of this csv to something like [bpaste](https://bpaste.net/)? Maybe the top 100 rows or something? – G_M Mar 13 '18 at 15:35
  • @DeliriousLettuce Sure let me do that now – vap0991 Mar 13 '18 at 15:38
  • [link](https://bpaste.net/raw/4fda3d64230e) @DeliriousLettuce – vap0991 Mar 13 '18 at 16:20
  • @vap0991 That isn't a csv file (comma-separated); that looks like a tab-delimited file. That could be part of the problem with the Python code (and maybe even those shell commands), since they are looking to split on commas. – G_M Mar 13 '18 at 16:23
  • @DeliriousLettuce Sorry, I should have mentioned that I copied it directly from Excel, as the .csv format was killing the bpaste page, but it is a .csv. – vap0991 Mar 13 '18 at 16:26
  • @vap0991 Oh, I see. It's just hard for me to experiment with it when it isn't in the proper format. – G_M Mar 13 '18 at 16:27
  • @DeliriousLettuce Let's try this again – [link](https://bpaste.net/show/1760955f5c3b). Let me know if that works. – vap0991 Mar 13 '18 at 16:30

1 Answer


Late is better than never :) Here's a solution based on the code-generating library convtools:

from convtools import conversion as c
from convtools.contrib.tables import Table


columns_per_file = 1000
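# "filenames" is assumed to be an iterable of paths to the csv files to split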
for filename in filenames:
    # reading columns
    columns = Table.from_csv(filename, header=False).columns

    # slicing
    for part_number, i in enumerate(
        range(0, len(columns), columns_per_file), 1
    ):
        Table.from_csv(filename, header=False).take(
            # taking only needed ones
            *(col for col in columns[i : i + columns_per_file])
        ).into_csv(
            # streaming to the part
            "{}.pt{}.csv".format(filename.replace(".csv", ""), part_number),
            include_header=False,
        )
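
One design note: because Table.from_csv is called inside the slicing loop, each source file is re-read once per output part; the upside is that rows are processed as a stream, so whole files should not be held in memory.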

westandskif