
Baseline: I have CSV data with 10,000 entries. I save this as one CSV file and load it all at once.

Alternative: I have the same 10,000 entries, but I save them as 10,000 separate CSV files and load each file individually.

Approximately how much more inefficient is this computationally? I'm not hugely interested in memory concerns. The reason for the alternative approach is that I frequently need to access subsets of the data and don't want to have to read the entire array.

I'm using Python.

Edit: I can use other file formats if needed.

Edit 1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.

3 Answers

6

SQLite is the ideal solution for your application.

Simply import your CSV data into an SQLite database table (it will be a single file), then add indexes as necessary.

To access your data, use Python's sqlite3 library. You can follow this tutorial on how to use it.

Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
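As a rough sketch of that workflow using only the standard-library csv and sqlite3 modules (the file names, table name, and two-column schema below are made up, so adapt them to your data):

import csv
import sqlite3

# One-time import: copy the CSV rows into a single-file SQLite database.
conn = sqlite3.connect("mydata.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, value TEXT)")
with open("mydata.csv", newline="") as f:
    rows = [(int(r[0]), r[1]) for r in csv.reader(f)]
conn.executemany("INSERT INTO records (id, value) VALUES (?, ?)", rows)

# Index the column you filter on so subset queries don't scan the whole table.
conn.execute("CREATE INDEX IF NOT EXISTS idx_records_id ON records (id)")
conn.commit()

# Later: pull just the subset you need instead of loading all 10,000 entries.
subset = conn.execute(
    "SELECT id, value FROM records WHERE id BETWEEN ? AND ?", (100, 200)
).fetchall()
conn.close()

The index is what makes the subset queries cheap; without it SQLite has to scan every row, which is essentially what re-reading the whole CSV does.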

mvp
    IMO this is hands down the best option. Adding or editing entries is far easier and data retrieval will be far more robust than trying to work straight with a CSV. – user1231232141214124 Oct 23 '13 at 03:36
    I second this, with the caveat that if your CSV rows are very large individually, you may need to move up to a "real" database like PostgreSQL. – Christian Ternus Oct 23 '13 at 03:44
1

I would write all the lines to one file. For only 10,000 lines it's probably not worth the trouble, but you can pad all the lines to the same length - say 1,000 bytes.

Then it's easy to seek to the nth line: just multiply n by the line length.
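A rough sketch of that idea (the file name, record length, and sample rows here are made up for illustration):

LINE_LEN = 1000                      # every record padded to this many bytes
rows = ["1,alice,42", "2,bob,17"]    # stand-ins for your 10,000 CSV lines

# Write: pad each line (newline included) out to exactly LINE_LEN bytes.
# Any line longer than LINE_LEN - 1 bytes would break the fixed-width scheme.
with open("padded.csv", "wb") as out:
    for line in rows:
        out.write(line.encode().ljust(LINE_LEN - 1) + b"\n")

# Read: record n starts at byte n * LINE_LEN, so one seek finds it directly.
def read_record(path, n, line_len=LINE_LEN):
    with open(path, "rb") as f:
        f.seek(n * line_len)
        return f.read(line_len).rstrip().decode()

print(read_record("padded.csv", 1))  # -> "2,bob,17"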

John La Rooy
  • This approach has many problems. What if 1000 bytes is not enough? What if most records only occupy 100 bytes and just a few need the full 1000 bytes? Are you going to waste 90% of the disk space? What if you decide to increase the record size to 2000 bytes - how do you reload the data? What if you need to sort the data or get some subset? A real database like SQLite is really the only option – mvp Oct 23 '13 at 03:45
  • @mvp, Of course sqlite is more flexible, but the flexibility comes at a cost. For particular cases such as needing random access to a consecutive block of lines, sqlite is doing a bunch of extra work. – John La Rooy Oct 23 '13 at 04:07
0

10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing them will require a much larger number of seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there is still the overhead of 10,000 files' worth of metadata for the operating system to deal with.)

Also, with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.

My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file and, if possible, pad all of the lines of your CSV file with spaces so that they all have the same length. For example, say that when writing out the CSV file you make every line exactly 80 bytes long. Then reading the nth line of the file becomes relatively fast and easy:

myFileObject.seek(n * 80)        # line n starts at byte offset n * 80
theLine = myFileObject.read(80)  # read exactly one fixed-width 80-byte line
Jeremy Friesner
  • 70,199
  • 15
  • 131
  • 234