15

I am looking for sample code that converts .h5 files to CSV or TSV: it has to read an .h5 file and write CSV or TSV output.

Sample code would be much appreciated. Please help, as I have been stuck on this for the last few days. I looked at wrapper classes but don't know how to use them. I am not an experienced programmer, so I am running into a lot of problems.

Please help. Thanks a lot in advance.

Sanjay Tiwari

7 Answers

4

Another python solution using pandas.

#!/usr/bin/env python3

import pandas as pd
import sys
fpath = sys.argv[1]
if len(sys.argv)>2:
    key = sys.argv[2]
    df = pd.read_hdf(fpath, key=key)
else:
    df = pd.read_hdf(fpath)

df.to_csv(sys.stdout, index=False)

This script is available here

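The first argument to this script is the HDF5 file. If a second argument is passed, it is used as the key of the object to read from the file (the `key` argument of `pd.read_hdf`); otherwise the file is read without a key, which works when it holds a single pandas object. The script dumps the CSV to stdout, which you can redirect to a file.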

For example, if your data is stored in hdf5 file called data.h5 and you have saved this script as hdf2df.py then

$ python3 hdf2df.py data.h5 > data.csv

will write the data to a csv file data.csv.
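Since the question asks for CSV or TSV, one possible variation is to pick the delimiter from the output file name. This is only a minimal sketch and not part of the script above: `delimiter_for` and `hdf_to_text` are hypothetical helper names, and it assumes the file holds a pandas object that `pd.read_hdf` can read.

```python
import os


def delimiter_for(out_path):
    """Return a tab for .tsv/.tab outputs and a comma for everything else."""
    ext = os.path.splitext(out_path)[1].lower()
    return "\t" if ext in (".tsv", ".tab") else ","


def hdf_to_text(h5_path, out_path, key=None):
    """Read one pandas object from h5_path and write it as CSV or TSV."""
    import pandas as pd  # imported lazily; needs pandas with PyTables installed

    # Without a key, read_hdf only works if the file holds a single pandas object.
    df = pd.read_hdf(h5_path) if key is None else pd.read_hdf(h5_path, key=key)
    df.to_csv(out_path, sep=delimiter_for(out_path), index=False)
```

With this, `hdf_to_text("data.h5", "data.tsv")` would write tab-separated output, while an output name ending in `.csv` would give commas.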

Dilawar
2

You can also use h5dump -o dset.asci -y -w 400 dset.h5

  • -o dset.asci specifies the output file
  • -y suppresses the array indices in the output, and -w 400 sets the output width in columns. Pick a very large width so that each row of data fits on one line.
  • dset.h5 is of course the hdf5 file you want to convert

This converts it to an ASCII file, which is easily imported into Excel, from where you can save it as a .csv (use Save As within Excel and specify the file format). I did it a couple of times, and it worked for me. source
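If you want to run the same command from a script, it can be wrapped with `subprocess`. This is a rough sketch that assumes `h5dump` is installed and on your PATH; the function names `build_h5dump_command` and `dump_to_ascii` are made up for illustration, and the flags are exactly the ones used above.

```python
import subprocess


def build_h5dump_command(h5_path, out_path, width=400):
    """Build the argument list for the h5dump call shown above."""
    return ["h5dump", "-o", out_path, "-y", "-w", str(width), h5_path]


def dump_to_ascii(h5_path, out_path, width=400):
    """Run h5dump; raises CalledProcessError if h5dump exits with an error."""
    subprocess.run(build_h5dump_command(h5_path, out_path, width), check=True)
```

Calling `dump_to_ascii("dset.h5", "dset.asci")` would then be equivalent to the command line at the top of this answer.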

Mathias711
  • 3
    Hi Mathias, I followed what you said but am not getting a satisfactory result. The problem with h5dump is that it gives the data in hierarchical form, and when we open it in Excel the output is not as expected. I am working on the Million Song Dataset. – Sanjay Tiwari May 21 '14 at 11:06
  • Is the `-y -w 400` value high enough? It seems like a pretty huge database, and the number may be too low. If it's just a simple table like you see in Excel, it should work. What is wrong with the output in Excel? I noticed that there are several options in Excel when importing a .asci file; maybe something there messes it all up – Mathias711 May 21 '14 at 11:13
  • 1
    Yes, I used -y -w 800, and I was testing it on a 377 KB file before using it on the whole dataset. The input is in .h5 format, as you know, and I guess it is in tabular form. It has 52 fields. I have sample data of 20 records, and when I compare it with the output in the ASCII file, it is completely different (in terms of format, not data). I am just opening the ASCII file with Excel. – Sanjay Tiwari May 21 '14 at 12:13
1

An example of HDF5 to CSV conversion can be found at https://github.com/amgreenstreet/Million-Song-Dataset-HDF5-to-CSV

It uses Python and converts the Million Song Dataset from HDF5 to CSV format.

I strongly recommend using the Python(x,y) distribution, http://python-xy.github.io/, because this example uses additional Python packages such as NumPy and PyTables, which Python(x,y) includes.

  • As of today numpy and pytables are instantly installable through `pip install numpy tables` (PyTables is published on PyPI as `tables`). And Python(x,y) has been unmaintained since 2015 – Antony Hatchkins Jan 31 '18 at 04:01
1
import numpy as np
import h5py

with h5py.File('chunk0003.hdf5','r') as hf:
    print('List of arrays in this file: \n', list(hf.keys()))
# This lists the arrays in the file, e.g. ['_self_key', 'chrms1', 'chrms2', 'cuts1', 'cuts2', 'misc', 'strands1', 'strands2']

r1 = h5py.File('chunk0003.hdf5','r')
a = r1['chrms1'][:]
b = r1['chrms2'][:]
c = r1['cuts1'][:]
d = r1['cuts2'][:]
e = r1['strands1'][:]
f = r1['strands2'][:]
r1.close()
table=np.array([a,b,c,d,e,f])
table2=table.transpose()
np.savetxt('chunk0003.txt',table2,delimiter='\t')
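Note that `np.array([a,b,c,d,e,f])` coerces all columns to one dtype, and `np.savetxt` with its default format writes every value in scientific notation without a header row. As a possible alternative, here is a small sketch that writes the columns with the standard `csv` module and adds a header. It assumes all datasets are 1-D and of equal length; `write_columns` and `export_datasets` are hypothetical names, not part of h5py or NumPy.

```python
import csv


def write_columns(columns, out_file, delimiter="\t"):
    """Write equal-length 1-D columns as delimited text with a header row.

    `columns` maps a column name to a sequence of values.
    """
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(zip(*(columns[name] for name in names)))


def export_datasets(h5_path, names, out_path, delimiter="\t"):
    """Read the named 1-D datasets from h5_path and write them side by side."""
    import h5py  # imported lazily so write_columns can be used without h5py

    with h5py.File(h5_path, "r") as hf:
        columns = {name: hf[name][:].tolist() for name in names}
    with open(out_path, "w", newline="") as out_file:
        write_columns(columns, out_file, delimiter)
```

For the file above this could be called as `export_datasets('chunk0003.hdf5', ['chrms1', 'chrms2', 'cuts1', 'cuts2', 'strands1', 'strands2'], 'chunk0003.txt')`.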
0

Python:

import sys
import numpy as np
import h5py
np.savetxt(sys.stdout, h5py.File('foo.h5', 'r')['dataname'], '%g', ',')

Some notes:

  1. sys.stdout can be any file, or a file name string like "out.csv".
  2. %g is used to make the formatting human-friendly.
  3. If you want TSV just use '\t' instead of ','.
  4. I've assumed you have a single dataset name within the file (dataname).
John Zwinck
  • 1
    Hi John, I followed your solution but it says AttributeError: 'module' object has no attribute 'savetxt'. I am using Numpy-1.8.1, h5py-2.3 and Python 3.3. – Sanjay Tiwari May 21 '14 at 12:16
  • Can you just try saying `np.savetxt` in the REPL? http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html says this function exists in 1.8, and it definitely existed in 1.7 as well. Something must be very wrong with your setup or you typed it in wrong. – John Zwinck May 21 '14 at 12:20
  • 1
    Hi John, I have gone through the link you provided. Sorry to say, I am not familiar with the REPL. I can't work out what the reason could be. My code is: import numpy as n import h5py file='C:\Users\user10\Desktop\foo' n.savetxt('example_output.csv', file, delimiter=',') – Sanjay Tiwari May 21 '14 at 12:37
  • The REPL is the Python interactive interpreter. Simply run "python" in your terminal, then say "import numpy" then "numpy.savetxt". What does it tell you? Does the function exist? – John Zwinck May 21 '14 at 12:49
  • 1
    It says "Traceback (most recent call last): File "", line 1, in numpy.savetxt AttributeError: 'module' object has no attribute 'savetxt' ". I think the problem is with the package. Can you suggest a link where I can actually download it? – Sanjay Tiwari May 21 '14 at 13:07
  • What does it say if you do `dir(numpy)`? Do you have a file like numpy.py in your working directory or something? I cannot imagine how you could have numpy but not savetxt. – John Zwinck May 21 '14 at 14:21
  • Hi John, I resolved that issue. My script runs fine, but it creates an empty output file. My input file has data; I verified that using h5dump. My script is as below: import numpy as n n.savetxt("output.csv",h5py.File('foo.h5')['C:\\Users\\user10\\Desktop\\'], '%g', ',') – Sanjay Tiwari May 21 '14 at 14:54
0

Using pandas HDFStore worked for me while read_hdf did not:

import h5py
import pandas as pd 

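paths = []
with h5py.File('examples/test.h5','r') as hf:
    hf.visit(paths.append)  # collect the name of every group and dataset
# paths[1] is the second object found; adjust the index for your file
with pd.HDFStore('examples/test.h5', 'r') as store:
    dt = store.get(paths[1])
dt.to_csv('test.csv')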
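If the file was written by pandas and you are not sure which key you need, one option is to export every key the store reports. This is just a sketch under that assumption; `csv_name_for` and `export_all_keys` are hypothetical helper names, and the output file naming is an arbitrary choice.

```python
def csv_name_for(key):
    """Turn an HDFStore key such as '/group/table' into 'group_table.csv'."""
    return key.strip("/").replace("/", "_") + ".csv"


def export_all_keys(h5_path):
    """Write every pandas object stored in h5_path to its own CSV file."""
    import pandas as pd  # imported lazily; needs pandas with PyTables installed

    with pd.HDFStore(h5_path, mode="r") as store:
        for key in store.keys():
            store.get(key).to_csv(csv_name_for(key))
```

For example, `export_all_keys('examples/test.h5')` would create one CSV per stored object in the current directory.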
jsta
0

If you don't know the structure of the h5 file, you can examine it by finding the first key, which is often a single group that holds the other keys or the labels of the actual data.

This example uses an h5 file of LA traffic data from: https://drive.google.com/drive/folders/10FOTa6HXPqX8Pf5WRoRwcFnW9BrNZEIX

Reading and exploring the unknown h5 file by its keys. Here the first key is df, which wraps the other entries such as axis0 and axis1:

import pandas as pd
import h5py

#h5 file path
filename = 'metr-la.h5'

#read h5 file
dataset = h5py.File(filename, 'r')

#print the first unknown key in the h5 file
print(dataset.keys())

#print the keys inside the first unknown key
df = dataset['df']
print(df.keys()) #prints sub list keys such as axis0 and axis1

#print the attributes of keys such as axis0 inside the first unknown key
print("axis0 data: {}".format(df['axis0']))
print("axis0 data attributes: {}".format(list(df['axis0'].attrs)))

Save the entire h5 file to csv with pandas HDFStore using the first key df:

import pandas as pd

#h5 file path (same file as above)
filename = 'metr-la.h5'

#save the h5 file to csv using the first key df
with pd.HDFStore(filename, 'r') as d:
    df = d.get('df')
    df.to_csv('metr-la.csv')

You can also save parts of the data using the different sub keys.
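To find those sub keys without printing each level by hand, the file can be walked recursively. The sketch below is only an illustration: `dataset_paths` is a hypothetical helper, and it relies on the assumption that anything with an `.items()` method is a group (true for `h5py.File` and `h5py.Group`) while everything else is a dataset.

```python
def dataset_paths(group, prefix=""):
    """Recursively collect the paths of all leaves below a group-like object.

    Anything with an .items() method is treated as a group; everything else
    is treated as a dataset.
    """
    paths = []
    for name, obj in group.items():
        path = f"{prefix}/{name}"
        if hasattr(obj, "items"):
            paths.extend(dataset_paths(obj, path))
        else:
            paths.append(path)
    return paths
```

With h5py this could be used as `with h5py.File('metr-la.h5', 'r') as f: print(dataset_paths(f))`, which should list entries such as /df/axis0 and /df/axis1.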

ThomasAFink