
I have a really huge CSV file. It has about 1700 columns and 40000 rows, like below:

x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)

I need to split this CSV file into multiple files, each containing fewer columns, like:

# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)

# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1000
// (about 40000 more rows below)

# file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)

Is there any program or library that does this?

I've googled for it, but the programs I found only split a file by rows, not by columns.

Or which language could I use to do this efficiently?

I can use R, shell script, Python, C/C++, or Java.

Jeon

4 Answers


A one-liner per output file, for your example data and desired output:

cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1003 huge.csv > file2.csv
cut -d, -f 1004- huge.csv > file3.csv

The cut program is available on most POSIX platforms and is part of GNU Core Utilities. There is also a Windows version.
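
Since cut splits on every raw comma, it can be worth confirming the actual field count before picking the ranges. A quick sanity check, assuming the input is named huge.csv as above and contains no quoted commas:

head -1 huge.csv | tr ',' '\n' | wc -l   # should print 1703 (3 key columns + 1700 data columns)

If fields can contain quoted commas, cut will miscount them, and a CSV-aware tool such as the Python csv module below is safer.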

Update in Python, since the OP asked for a program in one of the acceptable languages:

# Python 3 (for Python 2, open the output files in 'wb' mode instead)
import csv
import fileinput

output_specifications = (  # csv file name, selector function
    ('file1.csv', slice(3)),
    ('file2.csv', slice(3, 1003)),
    ('file3.csv', slice(1003, 1703)),
)
output_row_writers = [
    (
        csv.writer(open(file_name, 'w', newline=''), quoting=csv.QUOTE_MINIMAL).writerow,
        selector,
    ) for file_name, selector in output_specifications
]

reader = csv.reader(fileinput.input())
for row in reader:
    for row_writer, selector in output_row_writers:
        row_writer(row[selector])

This works with the sample data given and can be called with the input CSV as an argument or by piping it in on stdin.
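
For example, assuming the script is saved as split_columns.py (a placeholder name) and the input is huge.csv as in the cut example, either invocation should work:

python3 split_columns.py huge.csv
python3 split_columns.py < huge.csv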

Sean Summers

Use a small Python script like:

fin = 'file_in.csv'
fout1 = 'file_out1.csv'
fout1_fd = open(fout1, 'w')
# ... open the other output files the same way

with open(fin) as fin_fd:
    # read line by line instead of loading all 40000 rows into memory at once
    for line in fin_fd:
        l_arr = line.rstrip('\n').split(',')
        fout1_fd.write(','.join(l_arr[0:3]))
        fout1_fd.write('\n')
        # ... write the other column ranges to the other output files

fout1_fd.close()
# ... close the other output files
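
A sketch of how the elided parts might be filled in, assuming the same three-way split described in the question (columns 1-3, x1-x1000, x1001-x1700); the output file names are placeholders:

# Hypothetical completion: file names and slice boundaries are assumptions
# based on the column layout described in the question.
column_groups = [
    ('file_out1.csv', 0, 3),        # x, y, z
    ('file_out2.csv', 3, 1003),     # x1 .. x1000
    ('file_out3.csv', 1003, 1703),  # x1001 .. x1700
]
outputs = [(open(name, 'w'), start, stop) for name, start, stop in column_groups]

with open('file_in.csv') as fin_fd:
    for line in fin_fd:
        l_arr = line.rstrip('\n').split(',')
        for fd, start, stop in outputs:
            fd.write(','.join(l_arr[start:stop]) + '\n')

for fd, _, _ in outputs:
    fd.close()

Like the snippet above, this splits on raw commas and will not handle quoted fields that contain commas; for that, the csv module used in the earlier answer is safer.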

You can open the file in Microsoft Excel, delete the extra columns, and save as CSV for file #1. Repeat the same procedure for the other two files.

MovieFan

I usually use OpenOffice (or Microsoft Excel if you are on Windows) to do that without writing any program: just edit the file and save it. The following are two useful links showing how to do that.

https://superuser.com/questions/407082/easiest-way-to-open-csv-with-commas-in-excel

http://office.microsoft.com/en-us/excel-help/import-or-export-text-txt-or-csv-files-HP010099725.aspx

Reza