
I have data coming from a CSV file with a few thousand columns and ten thousand (or so) rows. Within each column the data is of the same type, but different columns hold data of different types*. Previously I have been pickling the data from numpy and storing it on disk, but that's quite slow, especially because I usually want to load only a subset of the columns rather than all of them.

I want to put the data into hdf5 using pytables, and my first approach was to put the data in a single table, with one hdf5 column per csv column. Unfortunately this didn't work, I assume because of the 512 (soft) column limit.

What is a sensible way to store this data?

* I mean, the type of the data after it has been converted from text.

acrophobia

5 Answers


This might not, in fact, be possible to do in a naive way. HDF5 allocates 64 KB of header space for the metadata of every dataset, and that metadata includes the types of the columns. So while the number of columns is only a soft limit, somewhere in the 2-3 thousand column range you typically run out of space to store the metadata (depending on the length of the column names, etc.).

Furthermore, doesn't numpy limit the number of columns to 32? How are you representing the data with numpy now? Anything that you can get into a numpy array should correspond to a pytables Array class.
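
For example, a minimal sketch of that last point (using the PyTables 3.x spelling open_file/create_array; the file and column names are invented). Storing each homogeneous numpy array as its own Array node under a group would also avoid the per-Table column limit:

import numpy as np
import tables

# Two made-up columns of different dtypes.
price = np.random.rand(10000)
count = np.random.randint(0, 100, size=10000)

with tables.open_file('columns.h5', mode='w') as h5:
    grp = h5.create_group('/', 'columns')
    h5.create_array(grp, 'price', price)  # one Array node per numpy array
    h5.create_array(grp, 'count', count)

# Later, read back only the columns you need.
with tables.open_file('columns.h5', mode='r') as h5:
    price_only = h5.root.columns.price.read()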

Anthony Scopatz

No pytables, but with h5py instead, this could work:

import numpy as np
import h5py

# Each CSV column becomes a named field of the record array, with its own dtype.
data = np.recfromcsv(args[0], delimiter=',',
                     case_sensitive=True, deletechars='', replace_space=' ')
with h5py.File(args[1], 'w') as h5file:
    h5file.create_dataset('table', data=data)  # one dataset with a compound dtype

I borrowed the first line from this answer; not sure if that works for you. The HDF5 table looks fine (from a quick look with HDFView); of course, I don't know whether you can use it with pytables, and perhaps pandas.
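
If loading only a subset of the columns is the main concern, h5py can (as far as I know) index a compound dataset by field name, which reads just that column; 'some_field' below is a placeholder:

with h5py.File(args[1], 'r') as h5file:
    dset = h5file['table']
    print(dset.dtype.names)          # the available field (column) names
    one_column = dset['some_field']  # reads only this field into memory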

Community

Perhaps you can increase the number of columns without much performance degradation. See: http://www.pytables.org/docs/manual-2.2.1/apc.html

C.1.1. Recommended maximum values

MAX_COLUMNS

Maximum number of columns in Table objects before a PerformanceWarning is issued. This limit is somewhat arbitrary and can be increased.

If you want to go this route, simply find the parameters.py file in the pytables directory and change the MAX_COLUMNS value.
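
Alternatively, the PyTables docs describe passing these parameters as keyword arguments to open_file, which overrides them for that file only and avoids editing the installed package. A rough sketch, assuming this works for MAX_COLUMNS in your version (the PyTables 3.x name open_file is used here):

import tables

# Override the tunable for this file only; MAX_COLUMNS just controls when
# the PerformanceWarning fires, it is not a hard limit.
h5file = tables.open_file('wide.h5', mode='w', MAX_COLUMNS=5000)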

Paul
  • Yes, I did consider this, but thought that if I had to change this kind of config from the outset, I was probably doing things wrong. – acrophobia Nov 18 '13 at 12:26
  • If your data is sparse (meaning many of the rows only have a small subset of column values present), you could try refactoring your tables. Alternatively, I know that numpy has options to load only certain columns via x, y = np.loadtxt(file, delimiter=',', usecols=(0, 2)). So once you have the data on disk you could load various subsets of columns. I'm not sure if the function is optimized for such a case. – Paul Nov 18 '13 at 14:22

You should be able to use a pandas DataFrame; it can be saved to disk without converting to CSV.
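
For example, something along these lines (a sketch; the file and column names are invented, and format='table' is what lets you read back only selected columns later):

import pandas as pd

df = pd.read_csv('data.csv')  # per-column dtypes are inferred

# Store in HDF5 (uses PyTables under the hood); 'table' format keeps it queryable.
df.to_hdf('data.h5', key='data', format='table')

# Later: load only the columns you care about.
subset = pd.read_hdf('data.h5', 'data', columns=['col_a', 'col_b'])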

yemu

IMHO it depends on what you want to do with the data afterwards and how much of it you need at one time. I had to build a program for statistical validation a while ago and we had two approaches:

  1. Split the columns into separate tables (e.g. linked by a foreign key). The overhead of loading them is not too high.
  2. Transpose the table, resulting in something like a key-value store, where the key is a tuple of (column, row) (sketched below).

For both we used postgres.
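
The second approach isn't tied to Postgres; as a rough pandas illustration of the same idea (column names invented), the wide table is melted into one (row, column, value) record per cell:

import pandas as pd

wide = pd.DataFrame({'col_a': [1, 2], 'col_b': [3.5, 4.5]})

# Key-value ("long") layout: the key is effectively the (row, column) pair.
long_form = wide.reset_index().melt(id_vars='index', var_name='column')
#    index column  value
# 0      0  col_a    1.0
# 1      1  col_a    2.0
# 2      0  col_b    3.5
# 3      1  col_b    4.5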

Laur Ivan