
I have a column: "Rented Bike Count" in my data frame, which is the dependent variable of my linear regression project. I found that its distribution is highly skewed, so I wanted to transform it into a more normal distribution as a data preparation step for linear regression.

[Image: histogram of "Rented Bike Count"]

But when I applied a log transformation with `sns.distplot(np.log10(temp_df['Rented Bike Count']))`, it showed me the following error:

[Image: screenshot of the error message]

"Rented Bike Count" is already of int data type. Can anyone help?

  • Please share `temp_df.info()`, the `dtypes` of the columns, and a small reproducible `df`, so that others can copy the code, test it, and check. Do not paste images of code errors. See https://stackoverflow.com/help/how-to-ask – Divyank Sep 16 '22 at 07:03

1 Answer


This is speculative as no data is provided, but from the histogram you show I assume you have zero values in `temp_df['Rented Bike Count']`. The logarithm of 0 is undefined, so NumPy returns -inf (with a RuntimeWarning instead of throwing an error). In order to plot the data you would need to replace these infinite values first.
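A minimal sketch of what is probably happening, assuming the column contains zeros. The small `temp_df` here is made-up stand-in data, since the real frame was not shared. It counts the infinite values produced by `np.log10` and shows two possible ways to get finite values for plotting:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for temp_df; the real data was not shared.
temp_df = pd.DataFrame({"Rented Bike Count": [0, 0, 12, 150, 980, 2400]})
counts = temp_df["Rented Bike Count"]

# log10(0) evaluates to -inf; NumPy warns instead of raising.
# errstate only silences that warning for this block.
with np.errstate(divide="ignore"):
    logged = np.log10(counts)

# Count how many entries became infinite (one per zero in the column).
n_infinite = int(np.isinf(logged).sum())
print("infinite values after log10:", n_infinite)

# Option 1: keep only the finite values before plotting.
finite_logged = logged[np.isfinite(logged)]

# Option 2: use log(1 + x), which is defined at zero.
log1p_counts = np.log1p(counts)
print("all finite after log1p:", bool(np.isfinite(log1p_counts).all()))
```

Whether to drop the zeros or shift them is a modelling decision. The sketch only shows where the infinite values come from.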

max_jump
  • Yes thank you, I have 500 values as '0' in the Rented Bike Column. But I'm skeptical of replacing them, as they are representing a holiday. Meaning people are renting bikes only on functioning-days. This is an important relationship when it comes to my Linear Regression Model. I tried doing log(x+1) but it is far from a normal distribution. Moreover, I'm facing issues when calculating metrics like Mean Squared Error or R2 score. It says "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')" – Pooja Dalwani Sep 23 '22 at 05:10
  • That's not really a topic for Stackoverflow, but you may want to try different transformations. Have you considered other approaches, e.g. using square root or Box-Cox transformation (cf. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html)? – max_jump Sep 23 '22 at 05:58
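As a rough illustration of the transformations suggested in the last comment, here is a sketch on made-up counts that include zeros. `scipy.stats.boxcox` only accepts strictly positive input, so the shift of 1 is an assumption made for this example, not something taken from the thread. Checking that the transformed values are finite is one way to rule out the "Input contains NaN, infinity" error before computing metrics:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Hypothetical counts including zeros (not the asker's data).
counts = np.array([0, 0, 12, 150, 980, 2400], dtype=float)

# Square root is defined at zero, so no shift is needed.
sqrt_counts = np.sqrt(counts)

# Box-Cox needs strictly positive values; shifting by 1 is an assumption.
shift = 1.0
boxcox_counts, fitted_lambda = stats.boxcox(counts + shift)

# Both transforms should contain only finite values.
print("sqrt finite:", bool(np.isfinite(sqrt_counts).all()))
print("boxcox finite:", bool(np.isfinite(boxcox_counts).all()))

# Predictions made on the transformed scale can be mapped back like this.
restored = inv_boxcox(boxcox_counts, fitted_lambda) - shift
print("round trip ok:", bool(np.allclose(restored, counts)))
```

Neither transform is guaranteed to make the distribution normal when a large share of the values are exactly zero.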