Sorting large numbers of galaxies into spheres of a certain radius

Question

I have a large number of galaxies. I need to sort these galaxies into spheres of radius N, calculate the average number of galaxies in each sphere, and plot a graph of this against the radius N.

The galaxies are stored in a .fits file as radial coordinates (right ascension, declination and redshift). I'm using pyFITS and astropy to convert the galaxies' coordinates into Cartesian coordinates with Earth at (0,0,0), and then storing the coordinates in a numpy array with the structure: ((x,y,z),(x1,y1,z1),etc.)

In order to separate the galaxies into spheres of radius N, I am randomly selecting a galaxy from the array, then iterating through the array calculating the distance between the randomly selected galaxy and the current galaxy. If the distance is less than or equal to the radius, the current galaxy is added to the sphere. This is repeated as many times as the number of bubbles that need to be calculated.

My current method for this is really slow. I'm unfamiliar with numpy (I've been figuring things out as I'm going along), and I can't really see a better method than just iterating through all the galaxies.

Is there a way to do this any faster (something to do with numpy arrays - I'm converting them to a normal python list right now)? This is what I'm doing right now (https://github.com/humz2k/EngineeringProjectBethe/blob/humza/bubbles.py).

m13op22 · Accepted Answer · 2019-05-22T13:15:55.033

First, it's generally better to post samples of your code in your question where your issue is (such as the part where you select the radii you want to keep), rather than links to your entire script :)

Second, numpy arrays are great for scientific programming! They allow you to easily store data and perform matrix operations on that data without having to loop through native Python lists. If you know MATLAB, they basically allow you to do most of the same things MATLAB's arrays do. Some more information can be found here and here. pandas DataFrames are also good to use.

On to your code. At the end of your read_data function, you can combine some of those coordinate statements, and you probably don't need the tolist() call: the result is already a numpy.array, which is faster and uses less memory than a list (see the links above).
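As a rough sketch of what that could look like, assuming read_data ends up with three 1-D arrays of Cartesian coordinates (x, y and z are hypothetical names here; your script may call them something else), np.column_stack builds the whole coordinate array in one call, with no tolist() needed:

```python
import numpy as np

# Hypothetical stand-ins for the Cartesian coordinates computed in read_data.
x = np.array([1.0, 4.0, 7.0])
y = np.array([2.0, 5.0, 8.0])
z = np.array([3.0, 6.0, 9.0])

# One row per galaxy: [[x0, y0, z0], [x1, y1, z1], ...]
coords = np.column_stack((x, y, z))
print(coords.shape)
```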

In your get_bubbles function, I don't think you need to make copies of the data. The copies will also take up memory. The biggest issue I see here is using the variable i twice in your loops. That's bad because the second (inner) loop overwrites i, so the outer loop's value is no longer available inside the loop body. For example,

import numpy as np

for i in [1, 2, 3, 4]:
    # the inner loop reuses the name i and overwrites the outer value
    for i in np.array([5, 6, 7, 8]):
        print(i)

prints 5, 6, 7, 8 four times; the values 1 to 4 never appear. It's also bad because a reader can't tell which i is meant at any given point (having no comments doesn't help either ;) ). Replace the i variable in the second loop with another name, like j.

Here are two options for building your lists faster: list comprehensions and initializing numpy.arrays up front. You can read about list comprehensions here. An example of initializing numpy.arrays is

import numpy as np

data = [3.0, 1.5, 2.0]  # made-up example values
new_data = np.zeros(len(data))  # allocate the full array once

for i in range(len(data)):
    new_data[i] = data[i]
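For comparison, the list-comprehension option might look something like this (data here is a made-up example list, not something from your script):

```python
data = [3.0, 1.5, 2.0]  # made-up example values

# Build the new list in a single expression instead of appending in a loop.
new_data = [value for value in data]

# A comprehension can also filter as it goes, e.g. keep only values <= 2.
small = [value for value in data if value <= 2.0]

print(new_data, small)
```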

Finally, you could create a separate array for the radii and look into using numpy.where to select the indexes of the radii that match your criteria.
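A minimal sketch of that idea, assuming the galaxies are in an (N, 3) numpy array. The names coords, radius and centre and all the values below are made up for illustration; none of them come from your script:

```python
import numpy as np

# Made-up example data: 1000 galaxies in a 100 x 100 x 100 box.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 100.0, size=(1000, 3))

radius = 20.0  # example sphere radius N

# Pick a random galaxy as the centre of the sphere.
centre = coords[rng.integers(len(coords))]

# Distance from the centre to every galaxy at once (no Python loop).
distances = np.linalg.norm(coords - centre, axis=1)

# Indexes of the galaxies inside the sphere (the centre itself is included).
inside = np.where(distances <= radius)[0]

print(len(inside))
```

To get the average count for a given radius, you could repeat this for several random centres and take the mean of len(inside), then do the same for each radius you want to plot.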

That was kind of a lot, hope it helps.

Wow that was some detailed help. Thanks! – Humza Qureshi, Feb 24 '19 at 11:18
