Questions tagged [hdf5]

The Hierarchical Data Format (HDF5) is a binary file format designed to store large amounts of numerical data.

HDF5 refers to:

  • A binary file format designed to store large amounts of numerical data efficiently
  • Libraries of functions to create and manipulate these files

Main features

  • Free
  • Completely portable
  • Very mature
  • No limit on the number or size of datasets
  • Flexible in the kind and structure of data and metadata
  • Complete, well-documented libraries in C and Fortran
  • Many wrappers and tools available (Python, MATLAB, Java, …)

Some links to get started

2598 questions
16 votes, 3 answers

Sparse array support in HDF5

I need to store a 512^3 array on disk in some way and I'm currently using HDF5. Since the array is sparse, a lot of disk space gets wasted. Does HDF5 provide any support for sparse arrays?
andreabedini • 1,295
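HDF5 has no native sparse dataset type, but chunked storage gets most of the way there: chunks that are never written are simply not allocated in the file. A minimal h5py sketch (file path, chunk sizes, and values are made up for the demo):

```python
import os
import tempfile

import numpy as np
import h5py  # assumed installed

# Unwritten chunks of a chunked dataset take no space on disk, so a
# sparse 512^3 array stored this way stays small.
path = os.path.join(tempfile.mkdtemp(), "sparse.h5")
with h5py.File(path, "w") as f:
    dset = f.create_dataset("cube", shape=(512, 512, 512), dtype="f4",
                            chunks=(64, 64, 64), compression="gzip")
    dset[10, 20, 30] = 1.5    # only the chunks containing written
    dset[400, 100, 7] = -2.0  # points are allocated on disk

with h5py.File(path, "r") as f:
    value = f["cube"][10, 20, 30]

# A dense 512^3 float32 array would occupy 512 MB; this file stays small.
size = os.path.getsize(path)
```

Compression (`gzip` here) additionally shrinks the mostly-zero chunks that do get written.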
16 votes, 1 answer

Writing a large hdf5 dataset using h5py

At the moment, I am using h5py to generate hdf5 datasets. I have something like this import h5py import numpy as np my_data=np.genfromtxt("/tmp/data.csv",delimiter=",",dtype=None,names=True) myFile="/tmp/f.hdf" with h5py.File(myFile,"a") as f: …
NinjaGaiden • 3,046
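For datasets too large to build in memory first, one common pattern is to preallocate the dataset and fill it block by block. A sketch, assuming h5py is installed; all sizes and names are toy values:

```python
import os
import tempfile

import numpy as np
import h5py  # assumed installed

path = os.path.join(tempfile.mkdtemp(), "big.h5")
rows, cols, block = 10_000, 8, 1_000  # made-up demo sizes

with h5py.File(path, "a") as f:
    # Preallocate the full dataset, then write one block at a time so
    # the whole array never has to sit in memory at once.
    dset = f.create_dataset("data", shape=(rows, cols), dtype="f8",
                            chunks=(block, cols))
    for start in range(0, rows, block):
        dset[start:start + block] = np.random.rand(block, cols)

with h5py.File(path, "r") as f:
    shape = f["data"].shape
```

If the final number of rows is unknown up front, `maxshape=(None, cols)` plus `dset.resize()` supports appending instead.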
15 votes, 9 answers

How to best write out a std::vector<std::string> container to an HDF5 dataset?

Given a vector of strings, what is the best way to write them out to an HDF5 dataset? At the moment I'm doing something like the following: const unsigned int MaxStrLength = 512; struct TempContainer { char string[MaxStrLength]; }; …
Richard Corden • 21,389
15 votes, 2 answers

Sharing large datasets between Matlab and R

I need a relatively efficient way to share data between Matlab and R. I have checked SaveR and MATLAB R-link, but SaveR formats Matlab's binary data as text strings first and then prints them to an ASCII file, which is not efficient for large…
Amelio Vazquez-Reina • 91,494
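Plain HDF5 works well as a binary interchange format here: MATLAB reads it with `h5read`, and R has the rhdf5 and hdf5r packages. A sketch of the writing side in Python (file and dataset names are made up):

```python
import os
import tempfile

import numpy as np
import h5py  # assumed installed

# Write a numeric matrix once as plain HDF5; no text round-trip needed.
path = os.path.join(tempfile.mkdtemp(), "exchange.h5")
matrix = np.arange(12.0).reshape(3, 4)
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=matrix)

# e.g. in MATLAB:  M = h5read('exchange.h5', '/matrix');
# e.g. in R:       m <- rhdf5::h5read('exchange.h5', 'matrix')
with h5py.File(path, "r") as f:
    roundtrip = f["matrix"][:]
```

Note that MATLAB and h5py disagree on row- versus column-major order, so matrices may come back transposed on the other side.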
15 votes, 2 answers

When reading a huge HDF5 file with pandas.read_hdf(), why do I still get MemoryError even though I read in chunks by specifying chunksize?

Problem description: I use python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10GB. The problem happens when reading it back. Even though I tried to read it back in chunks, I still get…
Ewan • 415
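A common cause of this: `chunksize` only takes effect when the data was stored in PyTables "table" format; a fixed-format store is always read in one piece. A sketch (requires pandas with PyTables installed; sizes and names are toy values):

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support needs PyTables installed

path = os.path.join(tempfile.mkdtemp(), "rows.h5")
df = pd.DataFrame({"x": np.arange(10_000), "y": np.random.rand(10_000)})

# format="table" is what makes chunked reading possible later.
df.to_hdf(path, key="df", format="table")

total = 0
for chunk in pd.read_hdf(path, key="df", chunksize=2_000):
    total += len(chunk)  # process each piece, then let it go
```

Each iteration yields an ordinary DataFrame of at most `chunksize` rows, so peak memory stays bounded by the chunk, not the file.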
15 votes, 7 answers

Converting hdf5 to csv or tsv files

I am looking for sample code that can convert .h5 files to csv or tsv. I have to read .h5 files and the output should be csv or tsv. Sample code would be much appreciated; please help, as I have been stuck on this for the last few days. I followed wrapper classes but…
Sanjay Tiwari • 221
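For a single 2-D numeric dataset, h5py plus numpy is enough. A minimal sketch; the file and dataset names are made up for the demo, and the first block only exists to create an input file:

```python
import os
import tempfile

import numpy as np
import h5py  # assumed installed

d = tempfile.mkdtemp()
h5path, csvpath = os.path.join(d, "in.h5"), os.path.join(d, "out.csv")

# Create a small demo .h5 file to convert.
with h5py.File(h5path, "w") as f:
    f.create_dataset("table", data=np.arange(6.0).reshape(2, 3))

# The actual conversion: read the dataset, write delimited text.
with h5py.File(h5path, "r") as f:
    np.savetxt(csvpath, f["table"][:], delimiter=",")  # use "\t" for TSV

n_commas = open(csvpath).read().count(",")
```

For tabular data written by pandas, `pd.read_hdf(path, key).to_csv(out)` is the shorter route.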
15 votes, 3 answers

Pandas HDF5 as a Database

I've been using python pandas for the last year and I'm really impressed by its performance and functionalities, however pandas is not a database yet. I've been thinking lately on ways to integrate the analysis power of pandas into a flat HDF5 file…
prl900 • 4,029
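The closest pandas gets to database behaviour on HDF5 is "table" format plus `data_columns`, which lets `select` filter rows on disk without loading the whole store. A sketch (requires PyTables; names and sizes are made up):

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support needs PyTables installed

path = os.path.join(tempfile.mkdtemp(), "store.h5")
df = pd.DataFrame({"id": np.arange(1_000), "val": np.random.rand(1_000)})

with pd.HDFStore(path, complevel=9, complib="blosc") as store:
    # data_columns makes "id" queryable in where clauses.
    store.append("df", df, data_columns=["id"])
    hits = store.select("df", where="id >= 990")  # filtered on disk
```

`append` also allows growing the same table across multiple writes, which is the usual pattern for database-style stores.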
15 votes, 3 answers

Faster reading of time series from netCDF?

I have some large netCDF files that contain 6 hourly data for the earth at 0.5 degree resolution. There are 360 latitude points, 720 longitude points, and 1420 time points per year. I have both yearly files (12 GB ea) and one file with 110 years of…
David LeBauer • 31,011
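Since netCDF-4 files are HDF5 underneath, the usual culprit is chunk layout: with one chunk per time step, a single grid cell's time series touches every chunk. Rechunking along the time axis makes that read cheap. A toy illustration with h5py (dimensions scaled down, names made up):

```python
import os
import tempfile

import numpy as np
import h5py  # netCDF-4 files are HDF5 underneath

path = os.path.join(tempfile.mkdtemp(), "series.h5")
with h5py.File(path, "w") as f:
    # Whole time axis in one chunk per small lat/lon tile, so a
    # point time series is a single chunk-aligned read.
    f.create_dataset("tas", shape=(1420, 36, 72), dtype="f4",
                     chunks=(1420, 4, 4))
    f["tas"][:, 10, 20] = np.arange(1420, dtype="f4")

with h5py.File(path, "r") as f:
    series = f["tas"][:, 10, 20]  # fast: touches one chunk
```

Tools like `nccopy` with its chunking options can rewrite an existing netCDF file into a time-series-friendly layout once, then all subsequent reads benefit.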
15 votes, 5 answers

Saving dictionaries to file (numpy and Python 2/3 friendly)

I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure, that may contain other dictionaries, numpy arrays, serializable Python objects, and…
Gustav Larsson • 8,199
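Nested dictionaries map naturally onto HDF5 groups, with arrays and scalars as datasets. A hedged sketch of that mapping (it deliberately skips value types HDF5 cannot hold; `save_dict` is a hypothetical helper, not a library function):

```python
import os
import tempfile

import numpy as np
import h5py  # assumed installed

def save_dict(group, d):
    """Recursively mirror a nested dict into an HDF5 group."""
    for key, value in d.items():
        if isinstance(value, dict):
            save_dict(group.create_group(key), value)  # dict -> group
        else:
            group[key] = value  # numpy arrays and plain scalars

path = os.path.join(tempfile.mkdtemp(), "nested.h5")
with h5py.File(path, "w") as f:
    save_dict(f, {"a": np.arange(3), "sub": {"b": 1.5}})

with h5py.File(path, "r") as f:
    b = f["sub"]["b"][()]
    a = f["a"][:]
```

The inverse (walking groups back into dicts) is symmetric; libraries such as deepdish grew out of exactly this pattern.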
15 votes, 1 answer

Storing Pandas objects along with regular Python objects in HDF5

Pandas has a nice interface that facilitates storing things like Dataframes and Series in an HDF5: random_matrix = np.random.random_integers(0,10, m_size) my_dataframe = pd.DataFrame(random_matrix) store = pd.HDFStore('some_file.h5',complevel=9,…
Amelio Vazquez-Reina • 91,494
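One trick from the pandas cookbook: put the DataFrame in the store, then hang ordinary Python objects off the storer's attributes, which PyTables serializes alongside the data. A sketch (requires PyTables; the attribute name `metadata` is made up):

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support needs PyTables installed

path = os.path.join(tempfile.mkdtemp(), "mixed.h5")
df = pd.DataFrame(np.random.randint(0, 10, size=(4, 3)))

with pd.HDFStore(path, complevel=9, complib="blosc") as store:
    store.put("df", df, format="table")
    # Attach a plain Python object next to the DataFrame.
    store.get_storer("df").attrs.metadata = {"source": "demo", "n": 4}

with pd.HDFStore(path) as store:
    meta = store.get_storer("df").attrs.metadata
```

Anything pickleable can ride along this way, at the cost of being opaque to non-Python HDF5 readers.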
15 votes, 3 answers

g++ compile error: undefined reference to a shared library function which exists

I recently installed the hdf5 library on an ubuntu machine, and am now having trouble linking to the exported functions. I wrote a simple test script readHDF.cpp to explain the issue: #include int main(int argc, char * argv[]) { hid_t …
dermen • 5,252
14 votes, 1 answer

Difference between HDF5 file and PyTables file

Is there a difference between HDF5 files and files created by PyTables? PyTables has two functions .isHDFfile() and .isPyTablesFile() suggesting that there is a difference between the two formats. I've done some looking around on Google and have…
dtlussier • 3,018
14 votes, 1 answer

Floating Point Exception with Numpy and PyTables

I have a rather large HDF5 file generated by PyTables that I am attempting to read on a cluster. I am running into a problem with NumPy as I read in an individual chunk. Let's go with the example: The total shape of the array within the HDF5 file…
Tarun Chitra • 241
14 votes, 1 answer

Setting Attributes on Datasets using HDF5 C++ api

I'm using HDF5 C++ API in HDF5 1.8.7 and would like to use an H5::Attribute instance to set a couple of scalar attributes in an H5::DataSet instance, but cannot find any examples. It's pretty cut and dry using the C API: /* Value of the scalar…
Marc • 4,546
14 votes, 1 answer

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5. My problem is I have to aggregate the data into one format, and then dump into HDF5. This is ~1 TB-sized data, so I naturally cannot fit this into RAM.…
ShanZhengYang • 16,511