I am using pandas to join several huge CSV files through an HDFStore, merging each of the other tables into a base table, `base`. Right now I write the output of each merge to a new table in the HDFStore, which I call `temp`. Then I delete the old `base` table. Finally, I copy `temp` to `base` and repeat the process for the next table I need to join.

This would be much more efficient if I could simply rename `temp` to `base`. Is this possible?

Luke
  • Luke, I'm curious why you wouldn't just append additional csv's directly to the base table, rather than have the intermediate (slow) step of creating a new table? – fantabolous Aug 12 '14 at 12:08

1 Answer

Yes, it is possible. You have to delve into the methods from PyTables, on which HDFStore depends.

In [20]: store
Out[20]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/a            frame        (shape->[3,1])

In [21]: store.get_node('a')._f_rename('b')

In [22]: store
Out[22]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/b            frame        (shape->[3,1])

The same method works on frame_table appendable nodes.
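
As a rough sketch of how the rename could slot into the merge loop from the question: the helper name `promote_temp_to_base`, the sample frames, and the temporary file path are all made up for illustration, and the code assumes PyTables is installed alongside pandas.

```python
import os
import tempfile

import pandas as pd


def promote_temp_to_base(store, temp_key="temp", base_key="base"):
    """Hypothetical helper: drop the old base node, then rename temp to base."""
    if base_key in store:
        store.remove(base_key)
    # get_node returns the underlying PyTables node; _f_rename renames it in place
    # instead of copying its data.
    store.get_node(temp_key)._f_rename(base_key)


# Illustrative data only; the real tables would come from the CSV files.
path = os.path.join(tempfile.mkdtemp(), "test.h5")
base = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]})
other = pd.DataFrame({"key": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

with pd.HDFStore(path) as store:
    store.put("base", base)
    # Write the merge result to temp, then swap it in as the new base.
    store.put("temp", store["base"].merge(other, on="key"))
    promote_temp_to_base(store)
    keys = store.keys()
    result = store["base"]
```

After the swap the store should hold a single node, `/base`, containing the merged columns. The space used by the removed node is not given back to the file.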

Dan Allan
  • Thanks, oddly there doesn't appear to be any speed improvement. – Luke Apr 01 '14 at 23:23
  • Hmm. I'm not deeply familiar with the internals. If @Jeff drops by he might be able to shed some light on this. – Dan Allan Apr 01 '14 at 23:24
  • using your procedure the file will continue to grow; you should ptrepack if you are deleting a lot. not clear where you think a speed up would be – Jeff Apr 01 '14 at 23:38
  • I think the speedup would be renaming the node `temp` to `base` instead of copying the node `temp` into `base`, naively analogous to `mv` vs `cp`. – Dan Allan Apr 01 '14 at 23:44
  • @Jeff, Dan Allan is right about where I thought the speed-up might be. Would ptrepack be faster than deleting and changing the name or is that essentially what ptrepack is doing? – Luke Apr 02 '14 at 00:01
  • No, renaming is fine and @Dan Allan's answer is right. Deleting doesn't reclaim space, nor does it make the store more efficient. ptrepack repacks the file to compute an optimal chunksize and reclaims space. See here: http://pandas.pydata.org/pandas-docs/stable/io.html#compression – Jeff Apr 02 '14 at 00:03
  • I found this method helpful for other things too (such as getting a list of children in a store group). List of properties/methods: http://pytables.github.io/usersguide/libref/hierarchy_classes.html – fantabolous Aug 13 '14 at 02:41
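
For the repacking step mentioned in the comments, a minimal sketch of calling `ptrepack` from Python might look like the following. The helper names `build_ptrepack_command` and `repack` and the file names are hypothetical, the compression settings are just one possible choice, and the sketch assumes the `ptrepack` command-line tool that ships with PyTables is on the PATH.

```python
import subprocess


def build_ptrepack_command(src, dst, complevel=9, complib="blosc"):
    """Hypothetical helper: assemble the ptrepack argument list."""
    return [
        "ptrepack",
        "--chunkshape=auto",  # let ptrepack recompute the chunk shape
        "--propindexes",      # carry existing indexes over to the new file
        f"--complevel={complevel}",
        f"--complib={complib}",
        src,
        dst,
    ]


def repack(src, dst):
    """Write a repacked copy of src to dst; raises if ptrepack fails."""
    subprocess.run(build_ptrepack_command(src, dst), check=True)


command = build_ptrepack_command("in.h5", "out.h5")
```

Calling `repack("in.h5", "out.h5")` writes a new file and leaves the original untouched, so the old file has to be replaced separately once the copy has been checked.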