
I'm trying to compute SHAP values for my model using a large background dataset, but I'm running into memory issues. Here's the error I'm encountering:

Using 32663 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
  0%|          | 0/32663 [00:28<?, ?it/s]
...
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 245. GiB for an array with shape (54698, 600257) and data type float64
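
(That allocation size checks out: a float64 array of shape (54698, 600257) takes roughly 54698 × 600257 × 8 bytes ≈ 245 GiB.)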

I believe this is due to the size of the transformed_background_data. Here's the relevant code snippet:

# [explainer is constructed here from the model and the background data; code omitted]
shap_values = explainer.shap_values(transformed_background_data)

I understand the error suggests using shap.sample(data, K) or shap.kmeans(data, K), but I'm unsure about the implications of this or how to implement it correctly. Could someone provide guidance on these methods or suggest other ways to efficiently compute the SHAP values with large datasets?
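
If it helps, here is how I currently understand the warning's suggestion, assuming a KernelExplainer (which is where that warning usually comes from). Everything below is only a placeholder sketch: the toy RandomForestRegressor, X_background, X_to_explain, and K = 100 stand in for my real pipeline and transformed_background_data, and I'm not sure this is the right approach:

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for my real model and transformed_background_data
rng = np.random.default_rng(0)
X_background = rng.normal(size=(2000, 20))
y = rng.normal(size=2000)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_background, y)

# Option 1: keep a random subsample of K rows as the background set
background = shap.sample(X_background, 100)

# Option 2: summarize the background with K weighted k-means centroids instead
# background = shap.kmeans(X_background, 100)

explainer = shap.KernelExplainer(model.predict, background)

# Explain a small batch of rows rather than the entire background set at once
X_to_explain = X_background[:10]
shap_values = explainer.shap_values(X_to_explain)

Is this roughly what the warning intends, and does the choice between shap.sample and shap.kmeans meaningfully change the resulting SHAP values or expected_value?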

  • Seems like the array you are creating is very large; does it need to be that large? Can you use a different data structure? – Ftagliacarne Aug 08 '23 at 12:40
  • Yes, I know it's large; I have 256 GB of RAM. I want to find a way to solve it. @Ftagliacarne – MelinA Aug 09 '23 at 05:42
  • In the background data used to estimate the expected_value of the SHAP explainer, you include only a subset of randomly selected samples from the training data, not all of the training data. – user2586955 Aug 10 '23 at 11:11
  • Just randomly subsample 200 - 1000 rows as background data. – Michael M Aug 12 '23 at 16:09
  • @MichaelM I don't want to randomly subsample; I want to use all of the data. – MelinA Aug 13 '23 at 05:48

0 Answers