Are there any good programs for dealing with reading large CSV files? Some of the datafiles I deal with are in the 1 GB range. They have too many lines for Excel to even deal with. Using Access can be a little slow, as you have to actually import them into a database to work with them directly. Is there a program that can open large CSV files and give you a simple spreadsheet layout to help you easily and quickly scan through the data?
- Yes, there is. You can use [OpenRefine][1] (or Google Refine). OpenRefine is like a spreadsheet on steroids. The file size that you can manipulate depends on your computer's memory. [1]: http://openrefine.org – Estevão Lucas Oct 05 '15 at 21:52
7 Answers
I've found reCSVeditor to be a great program for editing large CSV files. It's ideal for stripping out unnecessary columns. I've used it on 1,000,000-record files quite easily.

- +1 reCSVeditor worked for me with a nearly 2 GB file of >2,000,000 rows – Stuart Allen Jul 07 '13 at 09:03
- Hey, I downloaded the zip but I can't figure out how to use it. Can you please guide me? – aasthetic Jun 02 '14 at 10:20
- @richi_18007 Unzip the reCSVeditor archive's contents, then run the installer – Bruce Martin Jun 26 '14 at 04:54
MySQL can import CSV files very quickly into tables using the LOAD DATA INFILE command. It can also read from CSV files directly, bypassing any import procedure, by using the CSV storage engine.

Importing into native tables with LOAD DATA INFILE has a start-up cost, but after that you can INSERT/UPDATE much faster, as well as index fields. Using the CSV storage engine is almost instantaneous at first, but only sequential scans will be fast.

Update: This article (scroll down to the section titled Instant Data Loads) talks about using both approaches to loading CSV data into MySQL, and gives examples.
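As a rough illustration of the LOAD DATA LOCAL INFILE approach (not part of the original answer), here is a minimal Python sketch assuming the mysql-connector-python package; the credentials, database, table name, and file path are all placeholders, and the server must allow local infile loading:

import mysql.connector

# Placeholder credentials and database; adjust to your setup.
conn = mysql.connector.connect(
    host="localhost",
    user="me",
    password="secret",
    database="mydb",
    allow_local_infile=True,   # needed for LOAD DATA LOCAL INFILE
)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/path/to/huge.csv'
    INTO TABLE big_table      -- hypothetical table whose columns match the CSV
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES            -- skip the header row
""")
conn.commit()
cur.close()
conn.close()

Once the data is in a native table, you can add indexes and query it far faster than scanning the raw file.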

- I worked with Real Estate MLS data sets that consisted of 15-30 MB CSV files. Without MySQL's LOAD DATA INFILE, each feed would have taken an hour or more to process, but using MySQL and raw tables I cut processing down to 5-6 minutes for even the larger data sets. – David Sep 18 '08 at 21:35
vEdit is great for this. I routinely open 100+ MB files with it (I know you said up to one gig; I think they advertise on their site that it can handle twice that). It has regex support and loads of other features. Seventy dollars is cheap for the amount you can do with it.

GVim can handle files that large for free, if you are not attached to a true spreadsheet view with static field sizes.

vEdit is great, but don't forget that you can always go back to basics: check out Cygwin and start grepping.
Helpful commands (see the Python sketch after this list):
- grep
- head
- tail
- of course, Perl!
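If you would rather stay in a scripting language, the head/tail idea from the list above can be sketched in a few lines of Python (an illustrative sketch, not from the answer; the file name is a placeholder):

from collections import deque
from itertools import islice

path = 'huge.csv'  # placeholder path to the large CSV

with open(path, newline='') as f:
    first_rows = list(islice(f, 5))   # roughly `head -n 5`
with open(path, newline='') as f:
    last_rows = deque(f, maxlen=5)    # roughly `tail -n 5`; streams the file once

print(''.join(first_rows))
print(''.join(last_rows))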

It depends on what you actually want to do with the data. Given a large text file like that, you typically only want a smaller subset of the data at any one time, so don't overlook tools like grep for pulling out just the pieces you want to work with.
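As a rough Python equivalent of that workflow (an illustrative sketch, not part of the answer; the file names, column index, and search term are placeholders), you can stream the large file and write only the matching rows to a much smaller CSV:

import csv

# Stream the big file and keep only rows of interest in a smaller output file.
with open('huge.csv', newline='') as src, open('subset.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))        # copy the header row
    for row in reader:
        if 'SEARCH_TERM' in row[0]:      # placeholder filter, like grep on the first column
            writer.writerow(row)

The resulting subset.csv is usually small enough to open in Excel or any normal spreadsheet.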

If you can fit the data into memory and you like Python, then I recommend checking out the UniTable portion of Augustus. (Disclaimer: Augustus is open source (GPLv2), but I work for the company that writes it.)
It's not very well documented, but this should help you get going.
from augustus.kernel.unitable import *
a = UniTable().from_csv_file('filename')   # loads the whole CSV into memory
b = a.subtbl(a['key'] == some_value)       # creates a subtable of matching rows
It won't directly give you an Excel-like interface, but with a little bit of work you can get many statistics out quickly.
