I'm trying to determine how strongly the chemical properties in a wine dataset influence its quality score.
The error:
ValueError: y data is not in domain of logit link function. Expected domain: [0.0, 1.0], but found [3.0, 9.0]
The code:
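import pandas as pd
from pygam import LogisticGAM

# The file is semicolon-delimited
white_data = pd.read_csv("winequality-white.csv", sep=";")

# Features: the 11 chemical properties
X = white_data[[
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"
]]
print(X.describe)  # no parentheses, so this prints the bound method and the frame's repr

# Target: the integer quality score
y = white_data["quality"]
print(y.describe)

# This is the line that raises the ValueError
white_gam = LogisticGAM().fit(X, y)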
The output of said code:
<bound method NDFrame.describe of fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.0 0.27 0.36 20.7 0.045
1 6.3 0.30 0.34 1.6 0.049
2 8.1 0.28 0.40 6.9 0.050
3 7.2 0.23 0.32 8.5 0.058
4 7.2 0.23 0.32 8.5 0.058
... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039
4894 6.6 0.32 0.36 8.0 0.047
4895 6.5 0.24 0.19 1.2 0.041
4896 5.5 0.29 0.30 1.1 0.022
4897 6.0 0.21 0.38 0.8 0.020
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 45.0 170.0 1.00100 3.00 0.45
1 14.0 132.0 0.99400 3.30 0.49
2 30.0 97.0 0.99510 3.26 0.44
3 47.0 186.0 0.99560 3.19 0.40
4 47.0 186.0 0.99560 3.19 0.40
... ... ... ... ... ...
4893 24.0 92.0 0.99114 3.27 0.50
4894 57.0 168.0 0.99490 3.15 0.46
4895 30.0 111.0 0.99254 2.99 0.46
4896 20.0 110.0 0.98869 3.34 0.38
4897 22.0 98.0 0.98941 3.26 0.32
alcohol
0 8.8
1 9.5
2 10.1
3 9.9
4 9.9
... ...
4893 11.2
4894 9.6
4895 9.4
4896 12.8
4897 11.8
[4898 rows x 11 columns]>
<bound method NDFrame.describe of 0 6
1 6
2 6
3 6
4 6
..
4893 6
4894 5
4895 6
4896 7
4897 6
Name: quality, Length: 4898, dtype: int64>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-e1c5720823a6> in <module>
16 print(white_quality.describe)
17
---> 18 white_gam = LogisticGAM().fit(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/pygam.py in fit(self, X, y, weights)
893
894 # validate data
--> 895 y = check_y(y, self.link, self.distribution, verbose=self.verbose)
896 X = check_X(X, verbose=self.verbose)
897 check_X_y(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/utils.py in check_y(y, link, dist, min_samples, verbose)
227 .format(link, get_link_domain(link, dist),
228 [float('%.2f'%np.min(y)),
--> 229 float('%.2f'%np.max(y))]))
230 return y
231
ValueError: y data is not in domain of logit link function. Expected domain: [0.0, 1.0], but found [3.0, 9.0]
The files (I'm using Jupyter Notebook, but you shouldn't need it to reproduce this): https://drive.google.com/drive/folders/1RAj2Gh6WfdzpwtgbMaFVuvBVIWwoTUW5?usp=sharing
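As far as the message itself goes, a logit link models a probability, so it seems the target has to lie in [0, 1], while quality is an integer score from 3 to 9. Below is a minimal, pygam-free sketch of that kind of domain check, using the ten quality values visible in the printed Series. The helper in_logit_domain and the cutoff GOOD_THRESHOLD = 7 are hypothetical names and values chosen for illustration; they are not pygam's actual implementation.

```python
# Sketch of the kind of domain check behind the error (not pygam's own code).
# A logit link models a probability, so every target must lie in [0, 1].

def in_logit_domain(y):
    """Return True if every target value lies in the closed interval [0, 1]."""
    return all(0.0 <= value <= 1.0 for value in y)

# The ten quality scores visible in the printed Series above.
quality = [6, 6, 6, 6, 6, 6, 5, 6, 7, 6]

# Hypothetical cutoff (an assumption): call a wine "good" when quality >= 7.
GOOD_THRESHOLD = 7
is_good = [1 if q >= GOOD_THRESHOLD else 0 for q in quality]

print(in_logit_domain(quality))  # raw scores fall outside [0, 1]
print(in_logit_domain(is_good))  # binarized labels fall inside [0, 1]
```

If a good/bad label is acceptable, the pandas equivalent would presumably be (white_data["quality"] >= 7).astype(int) passed as y. If the numeric score itself should be modelled, pygam's LinearGAM may be the better fit, since it is not restricted to [0, 1].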