Importing Illustris HDF5 file into NumPy array using h5py

Question

I am working on importing a rather large HDF5 file from the Illustris galaxy simulation using h5py. The file is linked here if anyone wants to see it; it is 1.96 GB.

https://drive.google.com/file/d/0B1Kj475OJBnuaFBIS2FhTFpvNkk/view?usp=sharing

I want to use h5py and numpy to show tables of the data, use numpy.sum to sum columns and output vectors, and extract only the parts I want. The data is a 2D table of size 16x56 for each galaxy. It is a large dataset (almost 2 GB) that contains data for millions of galaxies, with over a million rows.

Along the dimension of size 56, each bin represents an age range. Summing along the dimension of size 16 gives a 1-dimensional vector of size 56 for each galaxy, which represents the stellar mass (in units of 10^10 M(sun)) formed within each age bin.

I am aiming to use python and h5py to:

1- use numpy to view the array of data and sum along the dimension of size 16 to get the 56-element vectors displayed in Python

2- have numpy eliminate galaxies with steady star formation rates, so that I specifically extract galaxies that had a sudden starburst between 1 Gyr and 2 Gyr ago and then stopped. Is there a way to do this? This would eliminate a huge number of galaxies that I would otherwise have to look through. This relates to E+A galaxies, which experienced a sudden starburst and then stopped forming stars.
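
For illustration, step 1 could be sketched on synthetic data as follows. This assumes the tables are stacked as an array of shape (n_galaxies, 16, 56); that layout is an assumption for the example and is not confirmed for the actual file.

```python
import numpy as np

# Synthetic stand-in: 3 galaxies, each a 16 x 56 table (assumed layout).
rng = np.random.default_rng(0)
tables = rng.random((3, 16, 56))

# Sum over the size-16 axis (axis=1 in this layout) so that each galaxy
# is reduced to a single 56-element vector, one value per age bin.
age_vectors = tables.sum(axis=1)

print(age_vectors.shape)  # (3, 56)
```

If the data were stored with the axes in a different order, only the `axis` argument would need to change.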

The age bins that will be displayed once the vectors are summed through numpy are as follows:

in Gyr

<0.005, 
0.005 - 0.015, 
0.015 - 0.025, 
0.025 - 0.035, 
0.035 - 0.045, 
0.045 - 0.055, 
0.055 - 0.065, 
0.065 - 0.075, 
0.075 - 0.085, 
0.085 - 0.095, 
0.095 - 0.125, 
0.125 - 0.175,
0.175 - 0.225,
0.225 - 0.275,
0.275 - 0.325,
0.325 - 0.375,
0.375 - 0.425,
0.425 - 0.475,
0.475 - 0.55,
0.55 - 0.65,
0.65 - 0.75,
0.75 - 0.85,
0.85 - 0.95,
0.95 - 1.125,
1.125 - 1.375,
1.375 - 1.625,
1.625 - 1.875,
1.875 - 2.125,
2.125 - 2.375,
2.375 - 2.625,
2.625 - 2.875,
2.875 - 3.125,
3.125 - 3.375,
3.375 - 3.625,
3.625 - 3.875,
3.875 - 4.25,
4.25 - 4.75,
4.75 - 5.25,
5.25 - 5.75,
5.75 - 6.25,
6.25 - 6.75,
6.75 - 7.25,
7.25 - 7.75,
7.75 - 8.25,
8.25 - 8.75,
8.75 - 9.25,
9.25 - 9.75,
9.75 - 10.25,
10.25 - 10.75,
10.75 - 11.25,
11.25 - 11.75,
11.75 - 12.25,
12.25 - 12.75,
12.75 - 13.25,
13.25 - 13.75,
>13.75
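
As a rough sketch of step 2, the bins listed above can be encoded as edge arrays and used to flag candidate galaxies. The selection rule here is purely hypothetical: the thresholds `BURST_MIN` and `RECENT_MAX`, and the choice to count every bin that overlaps the 1 to 2 Gyr window, are assumptions for illustration and not an established E+A criterion. The input is synthetic, with one 56-element vector per galaxy.

```python
import numpy as np

# The 55 internal bin edges in Gyr, taken from the list above.
edges = np.concatenate([
    np.linspace(0.005, 0.095, 10),   # 0.01 Gyr steps
    np.linspace(0.125, 0.475, 8),    # 0.05 Gyr steps
    np.linspace(0.55, 0.95, 5),      # 0.1 Gyr steps
    np.linspace(1.125, 3.875, 12),   # 0.25 Gyr steps
    np.linspace(4.25, 13.75, 20),    # 0.5 Gyr steps
])
lower = np.concatenate(([0.0], edges))      # lower edge of each of the 56 bins
upper = np.concatenate((edges, [np.inf]))   # upper edge of each of the 56 bins

window = (lower < 2.0) & (upper > 1.0)  # bins overlapping 1-2 Gyr ago
recent = upper <= 1.0                   # bins entirely younger than 1 Gyr

# Synthetic star formation histories, shape (n_galaxies, 56).
sfh = np.ones((3, 56))                  # galaxy 0: steady formation
sfh[1] = 0.1
sfh[1, window] = 10.0                   # galaxy 1: burst in the window ...
sfh[1, recent] = 0.0                    # ... and nothing formed since
sfh[2] = 0.1
sfh[2, window] = 10.0                   # galaxy 2: burst, but still forming

total = sfh.sum(axis=1)
burst_frac = sfh[:, window].sum(axis=1) / total
recent_frac = sfh[:, recent].sum(axis=1) / total

BURST_MIN = 0.5     # hypothetical threshold
RECENT_MAX = 0.01   # hypothetical threshold
selected = np.flatnonzero((burst_frac > BURST_MIN) & (recent_frac < RECENT_MAX))
print(selected)  # [1]
```

With real data, galaxies whose total formed mass is zero would need to be masked out before dividing.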

I know how to read the meaning of the data, but since I'm an amateur at using hdf5 files in coding and coding in general, I'm having trouble figuring out the specific commands to get h5py and numpy to sum along the dimensions I want, display the vectors, etc.

Is there anyone with experience that knows how to do this?

Thank you,

Winonah

You have one dataset of size (56, 4366546). So how is your data really organized? Does, for example, [:,0:16] represent one bin? — max9111, Jul 26 '17 at 16:20
First question - have you successfully loaded one or more datasets from the file? — hpaulj, Jul 26 '17 at 16:21
I have run these commands: f = h5py.File('SFRH_binsv2_for_ClaireDickey_L75n1820FP135_Zcollapsed.hdf5', 'r') data = f.get('/Users/Jim/Documents/illustris_python/SFRH_binsv2_for_ClaireDickey_L75n1820FP135_Zcollapsed.hdf5') data_as_array = np.array(data) But I haven't loaded them into a dataset yet; that's what I'm trying to find the commands to do — W. Ojanen, Jul 26 '17 at 16:43
Please edit the question if you have further information. All I can see is that the first dimension is 56, which matches what you described. The file name ("Zcollapsed") may also give some hints. What is the text file about? The simplest explanation would be that the third axis was collapsed, which could be restored with a simple reshape. But why, then, is the second axis not divisible by 16? You will have to ask the person who gave you the HDF5 file. — max9111, Jul 26 '17 at 20:37

score 0 · Answer 1 · answered Jul 26 '17 at 17:24

How to open a dataset

import numpy as np
import h5py

f = h5py.File("yourFileName", "r")  # Open the file read-only
list(f.keys())  # Get the names of the datasets
# there is one dataset called 'FormedStellarMass'
dset = f['FormedStellarMass']  # Open this dataset (no data is read yet)
shape_of_array = dset.shape  # Gives you the shape of the array
# (56, 4366546)

This is obviously different from what you are expecting, so a further explanation is needed here. The last dimension is also not divisible by 16. I have investigated the data a bit.
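
The divisibility check can be reproduced with plain integer arithmetic, using the column count reported by the shape above:

```python
n_columns = 4366546  # second dimension reported by dset.shape

# If every galaxy occupied 16 consecutive columns, the remainder would be 0.
whole_blocks, remainder = divmod(n_columns, 16)
print(whole_blocks, remainder)  # 272909 2
```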

dset[55,n:n+16] #n=0,16,32.... gives zeros

If you decrease the index in the first dimension, the values appear to increase:

dset[55,0:16]
dset[54,0:16] 
dset[53,0:16]

At the end of your dataset there seem to be only zeros.

dset[:,dset.shape[1]-17:dset.shape[1]-1]
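
One possible way to quantify this, sketched on a small synthetic array rather than the real dataset, is to count how many columns at the end contain nothing but zeros. The array contents and its shape of (56, 12) are made up for the example.

```python
import numpy as np

# Synthetic stand-in: 12 columns, of which the last 3 are all zero.
data = np.zeros((56, 12))
data[:, :9] = 1.0
data[:, 4] = 0.0  # an interior all-zero column, which must not be counted

col_is_zero = ~data.any(axis=0)          # True where a whole column is zero
nonzero_cols = np.flatnonzero(~col_is_zero)

# Trailing zero columns are those after the last column with any data.
if nonzero_cols.size:
    n_trailing = data.shape[1] - (nonzero_cols[-1] + 1)
else:
    n_trailing = data.shape[1]

print(n_trailing)  # 3
```

For the real file, the same test would probably be applied to blocks of columns read one at a time, so that the whole array never has to sit in memory.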

Please correct your question and explain how your data is really organized. Are there incomplete chunks of data in the dataset?
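
Until the layout is clarified, the dataset can still be explored block by block, since slicing an h5py dataset reads only the requested part. The sketch below builds a tiny in-memory file as a stand-in; the file name `demo.hdf5` and the 56 x 40 contents are invented for the example, and only the dataset name matches the real file.

```python
import numpy as np
import h5py

# Create a small in-memory HDF5 file (nothing is written to disk).
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    fake = np.arange(56 * 40, dtype="f4").reshape(56, 40)
    f.create_dataset("FormedStellarMass", data=fake)

    names = list(f.keys())          # dataset names in the file
    dset = f["FormedStellarMass"]   # a handle; no data is loaded yet
    full_shape = dset.shape

    # Slicing reads just these 16 columns and returns a NumPy array.
    block = dset[:, 0:16]

print(names, full_shape, block.shape)
```

The same slicing syntax applies to a file opened from disk in read mode.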

The code was sent to me in that exact form, stating it was a 16X56 array for each galaxy. If it doesn't look that way, then something must have happened with the formatting. — W. Ojanen, Jul 26 '17 at 20:02
link : https://drive.google.com/file/d/0B1Kj475OJBnuTGpGby1iRTd4Yzg/view?usp=sharing link to text file: https://drive.google.com/file/d/0B1Kj475OJBnuaWgzYVRHUnhQdkU/view?usp=sharing — W. Ojanen, Jul 26 '17 at 20:03

score 0 · Answer 2 · answered Jul 26 '17 at 20:08

The code was sent to me in that exact form, stating it was a 16X56 array for each galaxy. If it doesn't look that way, then something must have happened with the formatting. link : https://drive.google.com/file/d/0B1Kj475OJBnuTGpGby1iRTd4Yzg/view?usp=sharing link to text file: https://drive.google.com/file/d/0B1Kj475OJBnuaWgzYVRHUnhQdkU/view?usp=sharing

f.keys() #Get the names of the datasets
# there is one dataset called 'FormedStellarMass'
dset=f['FormedStellarMass'] #Open this dataset
shape_of_Array=dset.shape #Gives you the shape of the array
# (56, 4366546)
dset[1,0:56]

    array([  5.90118565e-04,   0.00000000e+00,   2.02415307e-03,
         1.97571842e-03,   9.65413419e-05,   0.00000000e+00,
         3.54059404e-04,   0.00000000e+00,   0.00000000e+00,
         4.17659608e-03,   0.00000000e+00,   0.00000000e+00,
         1.16594089e-03,   0.00000000e+00,   0.00000000e+00,
         5.53713227e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   4.25061543e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   2.06665633e-03,
         2.65859143e-03,   1.31853058e-03,   6.08496601e-04,
         0.00000000e+00,   1.06139400e-03,   4.08286955e-03,
         3.30939831e-03,   2.52765852e-03,   2.83553329e-03,
         3.42842575e-03,   0.00000000e+00,   1.43292415e-03,
         1.41140283e-03,   1.46918988e-03,   0.00000000e+00,
         4.53191859e-03,   1.17285761e-03,   4.20416283e-04,
         0.00000000e+00,   2.76596771e-03,   2.18793727e-03,
         0.00000000e+00,   2.99561035e-03,   9.63958330e-04,
         1.64320586e-03,   1.38792950e-03,   2.01430215e-04,
         0.00000000e+00,   1.59171385e-03])
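
Note that `dset[1,0:56]` takes one row and its first 56 columns, which is not the same as taking all 56 rows of one column. If each column turned out to hold one galaxy (an assumption, not something confirmed in this thread), the second form would give that galaxy's 56 age bins. A small synthetic array shows the difference:

```python
import numpy as np

# Synthetic array with the same first dimension as the dataset.
arr = np.arange(56 * 100).reshape(56, 100)

row_slice = arr[1, 0:56]  # row 1, first 56 columns
column = arr[:, 1]        # all 56 rows of column 1

# Both have 56 elements, but they hold different values.
print(row_slice.shape, column.shape)
print(row_slice[:3], column[:3])
```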
