
I'm looking for a way to retrieve sorted records from an HDF5 table. Here is a Python MWE:

import tables
import numpy as np

class Measurement(tables.IsDescription):
    time = tables.Float64Col()
    value = tables.Float64Col()

h5 = tables.open_file('test.hdf', 'w')
h5.create_table('/', 'test', Measurement)

table = h5.root.test
data = np.array([(0, 6), (5, 1), (1, 8)], dtype=[('time', '<f8'), ('value', '<f8')])
table.append(data)
table.cols.time.create_csindex()  # completely sorted index on 'time' (PyTables 3.x spelling)

Now I'd like to retrieve all records with time > 0, sorted by time. If I do:

table.read_where('time > 0')

then I get:

array([(5.0, 1.0), (1.0, 8.0)], dtype=[('time', '<f8'), ('value', '<f8')])

which is not sorted by time. If I attempt to use read_sorted instead, I get the entire table rather than a subset (read_sorted has no condition argument).

What is the common practice? Should I ensure that my tables are stored sorted in the file? Or should I sort the retrieved set myself after read_where?

remus

1 Answer


I don't think there is a one-size-fits-all answer to your question.

If you are in a write-once, read-many situation, it is definitely a good idea to store the tables sorted. For already existing files, you can use the ptrepack utility, which can copy existing tables in sorted order.
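A minimal sketch of such a repack, assuming the test.hdf file from the question (with its CSI index on time) and ptrepack's --sortby/--checkCSI options; sorted.hdf is just a destination file name I picked for illustration:

ptrepack --sortby=time --checkCSI test.hdf:/test sorted.hdf:/test

Because the copy is physically ordered by time, a later read_where on the new table should return matching rows already in time order.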

If you only read the data a few times, storing it sorted might not be worth the effort. Just use read_where to get the data into memory and sort it afterwards.
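For that in-memory route, a rough sketch using the table from the question (read_where returns a NumPy structured array, so it can be sorted on the time field directly):

result = table.read_where('time > 0')  # filtered subset as a structured array
result.sort(order='time')              # in-place sort on the 'time' column

After the sort, result is ordered by time, e.g. [(1.0, 8.0), (5.0, 1.0)] for the sample data.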

If your data is too big to fit into memory, you'll have to store it sorted on disk.
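If the table keeps a CSI index on time, as in the question, one alternative to loading everything is to iterate in index order and filter row by row; a rough sketch, where handle_row is just a hypothetical placeholder for whatever you do with each record:

for row in table.itersorted('time'):   # stream rows in 'time' order via the CSI index
    if row['time'] > 0:
        handle_row(row['time'], row['value'])  # hypothetical per-row handler

itersorted walks the index instead of materialising the whole table, so memory use stays bounded even for large tables.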

And there are more possibilities, depending on your system's performance characteristics (SSD, HDD, network storage, CPU, ...).

Ben K.