8

I recently came across PyTables and find it very cool. It is clearly superior to the CSV format for very large data sets. I am running some simulations using Python, and the output is not so large: say, 200 columns and 2000 rows.

If someone has experience with both, can you suggest which format would be more convenient in the long run for data sets that are not very large? PyTables has data manipulation capabilities, and the data can be browsed with ViTables, but that browser does not have as much functionality as, say, Excel, which can be used with CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.

Thanks.

Curious2learn
  • It probably depends on what you're doing with the data, i.e. how you define convenience. Do you (1) want to view the data with a good visual editor whenever you want, or (2) maybe check the data once visually and then pass it on to some other program? If you want the first and need that functionality to 'view' the data, then the CSV format is probably not a bad choice. Given (2), PyTables is probably good. In either case, your data is small; why not save both? – Marm0t Oct 26 '10 at 13:18

6 Answers

6

Have you considered Numpy arrays?

PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8-byte floats only requires about 3 MB of memory (200 × 2000 × 8 bytes = 3.2 MB). So I think PyTables may be overkill.

You can save numpy arrays to files using np.savetxt or np.savez (for compression), and can read them from files with np.loadtxt or np.load.
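For instance, here is a minimal sketch of the round trip (the file names are just placeholders):

import numpy as np

data = np.random.random((2000, 200))

# Human-readable text, one matrix row per line
np.savetxt('results.txt', data)
loaded_txt = np.loadtxt('results.txt')

# Compressed binary; an .npz archive can hold several named arrays
np.savez('results.npz', results=data)
loaded_npz = np.load('results.npz')['results']

assert np.allclose(data, loaded_txt)
assert np.array_equal(data, loaded_npz)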

If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy `.npz` files. By the way, to store a 200x2000 matrix in a database, you only need three table columns: row, col, value:

import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
                  (row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (row,col))''')
ROWS=4
COLUMNS=6
matrix = np.random.random((ROWS,COLUMNS))
print(matrix)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?) ',
                   ((r,c,value) for r,row in enumerate(matrix) 
                                for c,value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row,col')
data = (row[0] for row in cursor.fetchall())  # flatten to a stream of values
matrix2 = np.fromiter(data, dtype=float).reshape((ROWS, COLUMNS))
print(matrix2)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
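As a hypothetical extension of the schema above (the `run` column name is just an illustration), reusing the `db` and `cursor` from the earlier snippet:

cursor.execute('''CREATE TABLE runs
                  (run INTEGER,
                   row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (run,row,col))''')

# Store several matrices, tagged by run number
for run_id in range(3):
    m = np.random.random((ROWS, COLUMNS))
    cursor.executemany('INSERT INTO runs(run, row, col, value) VALUES (?,?,?,?)',
                       ((run_id, r, c, v) for r, row in enumerate(m)
                                          for c, v in enumerate(row)))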

unutbu
  • This sounds interesting. I don't know much about databases, but will look into this and post back. What is not clear to me from your example is how the coordinates of each value in the 2000 rows x 200 columns matrix are being assigned to the database table. I will try to figure that out. – Curious2learn Oct 26 '10 at 14:49
  • I don't think I'd use a separate `id` column for the primary key. The row/column pair makes a much better primary key, since it really is the identification for that bit of data. For a 2000x2000 matrix, you could even get clever and pack both row and column into the same integer column, something like `(row << 16) + col`. – SingleNegationElimination Jun 23 '11 at 02:46
2

As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
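As a rough sketch, writing an array that MATLAB (or any other HDF5-aware tool) can then open might look like this. It uses the PyTables 3.x method names; older versions spell them differently (e.g. openFile):

import numpy as np
import tables

results = np.random.random((2000, 200))

# Write the array to an HDF5 file that any HDF5 reader can open
with tables.open_file('simulation.h5', mode='w') as h5:
    h5.create_array(h5.root, 'results', results,
                    title='Simulation output, one row per run')

# Read it back
with tables.open_file('simulation.h5', mode='r') as h5:
    again = h5.root.results.read()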

PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.

The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers usually aren't stored exactly in text with the default decimal formatting. That's usually good enough, though.
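If you do write CSV from Python, letting the csv module handle the quoting avoids most of those problems. A small sketch:

import csv

rows = [['run', 'comment', 'value'],
        [1, 'baseline, no noise', 0.25],
        [2, 'operator said "ok"', 0.5]]

with open('out.csv', 'w', newline='') as f:
    # QUOTE_MINIMAL quotes only fields that contain delimiters or quotes
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)

with open('out.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)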

kenm
  • Thanks for the feedback! Would you say that with ViTables, even PyTables files become human readable? – Curious2learn Oct 26 '10 at 14:49
  • ViTables is great. BUT, it is something of a pain to install (until perhaps recently). – Dav Clark Jun 23 '11 at 01:10
  • Floats *can* be stored exactly in ASCII decimal, but it requires quite a few decimal digits to do so. This is often not the default for string formatting of floats, though. – SingleNegationElimination Jun 23 '11 at 02:48
  • @TokenMacGuy, you can also use `float.hex` and `float.fromhex` to store and parse standard C99-style hexadecimal float literals. They store exactly. `(22.0/7.0).hex() => '0x1.9249249249249p+1'` – kenm Jun 23 '11 at 11:28
2

One big plus for PyTables is the storage of metadata, such as simulation parameters. If you run the simulations more than once with different parameters, you can store each result as an array entry in the same h5 file.

We use it to store measurement data together with the experiment scripts that produced the data, so it is all self-contained.
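For instance, a minimal sketch of attaching run parameters as HDF5 attributes with PyTables (the attribute names here are made up for illustration):

import numpy as np
import tables

with tables.open_file('experiment.h5', mode='w') as h5:
    arr = h5.create_array(h5.root, 'run1', np.random.random((2000, 200)))
    # Arbitrary metadata travels with the data itself
    arr.attrs.temperature = 300.0
    arr.attrs.timestep = 0.01
    arr.attrs.script = 'simulate.py rev 42'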

BTW: if you need to look quickly into an HDF5 file you can use HDFView, a free Java app from the HDF Group. It's easy to install.

mrossi
1

I think it's very hard to compare PyTables and CSV. PyTables is a data structure, while CSV is an exchange format for data.

mossplix
1

This is actually quite related to another answer I've provided regarding reading/writing CSV files with numpy:

Python: how to do basic data manipulation like in R?

You should definitely use numpy, no matter what else! The ease of indexing, etc. far outweighs the cost of the additional dependency (well, I think so). PyTables, of course, relies on numpy too.

Otherwise, it really depends on your application, your hardware and your audience. I suspect that reading in csv files of the size you're talking about won't matter in terms of speed compared to PyTables. But if that's a concern, write a benchmark! Read and write some random data 100 times. Or, if read times matter more, write once, read 100 times, etc.
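A rough sketch of such a benchmark, here just timing CSV writes with np.savetxt against binary .npy writes (substitute PyTables calls as needed):

import timeit
import numpy as np

data = np.random.random((2000, 200))

# Time 100 writes of the same array in each format
csv_time = timeit.timeit(lambda: np.savetxt('bench.csv', data, delimiter=','),
                         number=100)
npy_time = timeit.timeit(lambda: np.save('bench.npy', data), number=100)

print('100 CSV writes:  %.2f s' % csv_time)
print('100 .npy writes: %.2f s' % npy_time)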

I strongly suspect that PyTables will outperform SQL. SQL will rock on complex multi-table queries (especially if you do the same ones frequently), but even on single-table (so-called "denormalized") queries, PyTables is hard to beat in terms of speed. I can't find a reference for this off-hand, but you may be able to dig something up if you mine the links here:

http://www.pytables.org/moin/HowToUse#HintsforSQLusers

I'm guessing that execution performance for you at this stage will pale in comparison to coder performance. So, above all, pick something that makes the most sense to you!

Other points:

As with SQL, PyTables has an undo feature. CSV files won't have this, but you can keep them in version control, and your VCS doesn't need to be too smart (CSV files are text).

On a related note, CSV files will be much bigger than binary formats (you can certainly write your own tests for this too).
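That claim is easy to check yourself; a quick size-comparison sketch:

import os
import numpy as np

data = np.random.random((2000, 200))
np.savetxt('size_test.csv', data, delimiter=',')  # text, default '%.18e' formatting
np.save('size_test.npy', data)                    # raw binary

print('CSV: %d bytes' % os.path.getsize('size_test.csv'))  # roughly 10 MB
print('npy: %d bytes' % os.path.getsize('size_test.npy'))  # roughly 3.2 MB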

Dav Clark
0

These are not "exclusive" choices.

You need both.

CSV is just a data exchange format. If you use PyTables, you still need to import and export in CSV format.
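A minimal sketch of that round trip (file names are placeholders, and the PyTables 3.x method names are assumed):

import numpy as np
import tables

# Import: CSV in, HDF5 out
data = np.loadtxt('incoming.csv', delimiter=',')
with tables.open_file('store.h5', mode='w') as h5:
    h5.create_array(h5.root, 'data', data)

# Export: HDF5 in, CSV out
with tables.open_file('store.h5', mode='r') as h5:
    np.savetxt('outgoing.csv', h5.root.data.read(), delimiter=',')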

S.Lott