
I'm trying to compute SHAP values for my model using a large background dataset, but I'm running into memory issues. Here's the error I'm encountering:

Using 32663 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
  0%|          | 0/32663 [00:28<?, ?it/s]
...
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 245. GiB for an array with shape (54698, 600257) and data type float64
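
(That allocation size checks out: a float64 array of shape (54698, 600257) takes roughly 54698 × 600257 × 8 bytes ≈ 245 GiB.)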

I believe this is due to the size of the transformed_background_data. Here's the relevant code snippet:

# [explainer is constructed here from the model and the background data; code omitted]
shap_values = explainer.shap_values(transformed_background_data)

I understand the error suggests using shap.sample(data, K) or shap.kmeans(data, K), but I'm unsure about the implications of this or how to implement it correctly. Could someone provide guidance on these methods or suggest other ways to efficiently compute the SHAP values with large datasets?
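
If it helps, here is how I currently understand the warning's suggestion, assuming a KernelExplainer (which is where that warning usually comes from). Everything below is only a placeholder sketch: the toy RandomForestRegressor, X_background, X_to_explain, and K = 100 stand in for my real pipeline and transformed_background_data, and I'm not sure this is the right approach:

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for my real model and transformed_background_data
rng = np.random.default_rng(0)
X_background = rng.normal(size=(2000, 20))
y = rng.normal(size=2000)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_background, y)

# Option 1: keep a random subsample of K rows as the background set
background = shap.sample(X_background, 100)

# Option 2: summarize the background with K weighted k-means centroids instead
# background = shap.kmeans(X_background, 100)

explainer = shap.KernelExplainer(model.predict, background)

# Explain a small batch of rows rather than the entire background set at once
X_to_explain = X_background[:10]
shap_values = explainer.shap_values(X_to_explain)

Is this roughly what the warning intends, and does the choice between shap.sample and shap.kmeans meaningfully change the resulting SHAP values or expected_value?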

  • Seems like the array you are creating is very large; does it need to be that large? Can you use a different data structure? – Ftagliacarne Aug 08 '23 at 12:40
  • Yes, I know it's large; I have 256 GB of RAM. I want to find a way to solve it. @Ftagliacarne – MelinA Aug 09 '23 at 05:42
  • In the background data used to estimate the expected_value of the SHAP explainer, you include only a subset of randomly selected samples from the training data, not all of the training data. – user2586955 Aug 10 '23 at 11:11
  • Just randomly subsample 200 - 1000 rows as background data. – Michael M Aug 12 '23 at 16:09
  • @MichaelM I don't want to randomly subsample; I want to use all of the data. – MelinA Aug 13 '23 at 05:48

0 Answers