import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

transformer = PowerTransformer(method='yeo-johnson', standardize=True)
arr = [330117.5,
 651193.35,
 364335.63,
 2136036.01,
 1184539.05,
 1186871.87,
 2310647.36,
 860183.78,
 237451.79,
 2324365.47,
 1942665.42,
 1441017.74,
 1214875.44,
 530633.22,
 2528684.53,
 371882.3,
 400359.28,
 798128.31,
 2458850.02,
 35565.16,
 655361.06,
 979121.35,
 2455851.58,
 656799.58,
 551429.2,
 122855.01,
 714573.03,
 1065608.98,
 656657.61,
 327573.11,
 697887.49,
 3853463.06,
 60303.21,
 778135.06,
 509140.84,
 617577.08,
 2112523.9,
 164003.18,
 484017.51,
 1250302.48,
 2342622.41,
 349077.45,
 1069976.02,
 1005329.1,
 836722.74,
 1126835.94,
 6773842.44,
 554150.9,
 18207498.84,
 2413814.68,
 3056937.64,
 1493907.08,
 420165.71,
 424720.48,
 506684.87,
 3138440.77,
 4737292.56,
 6619302.87,
 178811.87,
 1931526.68,
 155927053.78,
 735076.02,
 20403952.84,
 2712149.03,
 329014.58,
 894241.92,
 966598.77,
 1105177.67,
 1122957.48,
 3435244.08,
 3485325.79,
 1424915.64,
 684150.05,
 977746.26,
 37386.1,
 1616938.1,
 1517666.31,
 753096.39]

df_test = pd.DataFrame(np.array(arr), columns=['Column_A'])

standardized = transformer.fit_transform(df_test[["Column_A"]]).reshape(-1)

df_test.loc[:, "Column_A_std"] = pd.Series(standardized, index=df_test.index, name="Column_A_std")

df_test.head()

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_data.py in _neg_log_likelihood(lmbda)
   2980         n_samples = x.shape[0]
   2981
-> 2982         loglike = -n_samples / 2 * np.log(x_trans.var())
   2983         loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
   2984

FloatingPointError: divide by zero encountered in log

Tõnis

1 Answer


This appears to be a numerical precision issue. When the lambda is optimized with scipy.optimize.brent, some candidate lambdas produce (nearly) constant transformed data; its variance underflows to zero, np.log(0.0) breaks the log-likelihood calculation, and such a lambda gets chosen as "optimal".
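You can see the mechanism directly with scipy.stats.yeojohnson, which applies the transform for a fixed lambda (a minimal sketch of my own, reusing the arr from the question):

import numpy as np
from scipy.stats import yeojohnson

# Apply Yeo-Johnson at a few fixed lambdas to the question's data.
# For large positive x and a strongly negative lambda, every point maps to
# (1 - (x + 1)**lmbda) / -lmbda, which rounds to the same float64 value,
# so the variance of the transformed data underflows to exactly 0 and
# np.log(0.0) in _neg_log_likelihood diverges (a FloatingPointError when
# numpy errors are set to raise, as in the traceback above).
x = np.array(arr)
for lmbda in (0.5, 0.0, -2.0, -5.0):
    x_trans = yeojohnson(x, lmbda=lmbda)
    print(lmbda, x_trans.var())  # the variance hits 0.0 by lmbda = -5.0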

Scaling your original data down by, say, 100000 fixes the issue. I don't think the scale of the original data should make a difference in the Yeo-Johnson transformation; checking this by dividing by other factors, as low as ~100, also fixes the problem and produces output whose correlation with the other scaling amounts is 0.995.
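A minimal version of that workaround (my own sketch, reusing the df_test and transformer defined in the question; the exact divisor is arbitrary):

scale = 100000  # any factor from roughly 100 upward appears to work
standardized = transformer.fit_transform(df_test[["Column_A"]] / scale).reshape(-1)
df_test["Column_A_std"] = standardized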

Ben Reiniger