Get random subsample from pyfits data table

Question

I have a very simple question, but Google does not seem to be able to help me here. I want a subsample of a pyfits table... basically just remove 90% of the rows, or something like that. I read the table with:

data_table = pyfits.getdata(base_dir + filename)

I like the pyfits table organization where I access a field with data_table.field(fieldname), so I would like to keep the data structure, but remove rows.

For what it's worth, you don't even need to use `.field(fieldname)`: you can just use subscript syntax like `data_table[fieldname]` (whereas `data_table[x]`, where `x` is an integer, returns a row of the table). Also, none of this is unique to pyfits; the table is just a glorified [`numpy.recarray`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.recarray.html) - Iguananaut, Oct 04 '17 at 16:31

MSeifert · Accepted Answer · 2017-09-27T20:31:55.900

You can use numpy.random.choice to create an array containing several random choices from another array.

In your case you want "x" rows from your data_table. You can't use choice directly on the table, but you can pass it the length of the table instead: given an integer n, choice draws from np.arange(n), which gives you random row indices:

import numpy as np
rows_numbers_to_keep = np.random.choice(len(data_table), 2, replace=False)

And then index your table:

subsample = data_table[rows_numbers_to_keep]
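To drop roughly 90% of the rows, as the question asks, the number of rows to keep can be computed from the table length. The following is a minimal sketch, not a definitive implementation: it uses a plain `numpy.recarray` as a stand-in for the table returned by `pyfits.getdata` (the column names `id` and `flux` are hypothetical), and it assumes a NumPy version recent enough to provide `np.random.default_rng`.

```python
import numpy as np

# Stand-in for the table returned by pyfits.getdata(); a FITS_rec is a
# numpy.recarray subclass, so integer-array indexing is assumed to behave
# the same way. Column names here are hypothetical.
data_table = np.rec.fromarrays(
    [np.arange(100), np.arange(100) * 2.0],
    names=["id", "flux"],
)

fraction_to_keep = 0.1  # keep 10% of the rows, i.e. drop 90%
n_keep = int(round(len(data_table) * fraction_to_keep))

rng = np.random.default_rng(seed=0)  # seeded only for reproducibility
rows_to_keep = rng.choice(len(data_table), size=n_keep, replace=False)
rows_to_keep.sort()  # optional: keep the surviving rows in original order

subsample = data_table[rows_to_keep]
```

The result has the same columns as the original table, so `subsample["id"]` and `subsample.field("id")` both still work.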

For example (I'm using astropy because PyFITS isn't developed anymore and has been migrated to astropy.io.fits):

>>> data
FITS_rec([(1, 4, 7), (2, 5, 8), (3, 6, 9), (4, 7, 0)],
         dtype=(numpy.record, [('a', 'S21'), ('b', 'S21'), ('c', 'S21')]))

>>> data[np.random.choice(len(data), 2, replace=False)]  # keep 2 distinct rows
FITS_rec([(1, 4, 7), (4, 7, 0)],
         dtype=(numpy.record, [('a', 'S21'), ('b', 'S21'), ('c', 'S21')]))

If you want to allow getting the same row several times you can use replace=True instead.
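As a rough illustration of sampling with replacement, the sketch below draws more rows than the table contains, which is only possible with `replace=True` (with `replace=False`, asking for more rows than exist raises a `ValueError`). The small `numpy.recarray` is an assumed stand-in for the FITS table shown above.

```python
import numpy as np

# Stand-in table with the same values as the example above.
data = np.rec.fromarrays(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 0]],
    names=["a", "b", "c"],
)

rng = np.random.default_rng(seed=0)  # seeded only for reproducibility

# Six draws from four rows: at least one row must appear more than once.
row_indices = rng.choice(len(data), size=6, replace=True)
resampled = data[row_indices]
```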
