11

I have a dataframe with columns year, month, day, hour, minute, second, and Daily_KWH_System. I need to predict the daily KWH using a neural network. Please let me know how to go about it.

      Daily_KWH_System  year  month  day  hour  minute  second
0          4136.900384  2016      9    7     0       0       0
1          3061.657187  2016      9    8     0       0       0
2          4099.614033  2016      9    9     0       0       0
3          3922.490275  2016      9   10     0       0       0
4          3957.128982  2016      9   11     0       0       0

I'm getting a ValueError when fitting the model.

code so far:

X = df[['year','month','day','hour','minute','second']]
y = df['Daily_KWH_System']

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)

#y_train.shape
#X_train.shape

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))

#y_train = np.asarray(df['Daily_KWH_System'], dtype="|S6") 

mlp.fit(X_train,y_train)

Error:

ValueError: Unknown label type: (array([  2.27016856e+02,   3.02173014e+03,   4.29404190e+03,
     2.41273427e+02,   1.76714247e+02,   4.23374425e+03,
Anagha

4 Answers

14

First of all, this is a regression problem and not a classification problem, as the values in the Daily_KWH_System column do not form a set of labels. Instead, they seem to be (at least based on the provided example) real numbers.
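
One way to see the difference is scikit-learn's `type_of_target` helper, which reports how a target vector is interpreted. This is only a small sketch: the float array holds three values copied from the question, and the integer array is a made-up contrast, not part of the original data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Three target values copied from the question: arbitrary real numbers
y_continuous = np.array([4136.900384, 3061.657187, 4099.614033])

# Hypothetical integer class labels, for contrast only
y_labels = np.array([0, 1, 2])

kind_continuous = type_of_target(y_continuous)
kind_labels = type_of_target(y_labels)

print(kind_continuous)  # continuous
print(kind_labels)      # multiclass
```

A target reported as continuous is what a regressor expects; a classifier expects discrete labels.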

If you want to approach it as a classification problem regardless, then according to sklearn documentation:

When doing classification in scikit-learn, y is a vector of integers or strings.

In your case, y is a vector of floats, and therefore you get the error. Thus, instead of the line

y = df['Daily_KWH_System']

write the line

y = np.asarray(df['Daily_KWH_System'], dtype="|S6")

and this will resolve the issue. (You can read more about this approach here: https://stackoverflow.com/questions/34246336/python-randomforest-unknown-label-error)

However, since regression is more appropriate in this case, instead of making the above change, replace the lines

from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))

with

from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))

The code will run without throwing an error (but there certainly isn't enough data to check whether the model that we get performs well).
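
Putting the pieces together, a minimal end-to-end sketch could look like the following. It rebuilds the five rows shown in the question, imports `train_test_split` from `sklearn.model_selection` (its location in current scikit-learn releases), and sets `random_state` and `max_iter` to values that are assumptions chosen only to make the run repeatable. With so few rows the optimizer may warn about convergence, and the predictions are not meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# The five rows shown in the question
df = pd.DataFrame({
    "Daily_KWH_System": [4136.900384, 3061.657187, 4099.614033,
                         3922.490275, 3957.128982],
    "year": [2016] * 5,
    "month": [9] * 5,
    "day": [7, 8, 9, 10, 11],
    "hour": [0] * 5,
    "minute": [0] * 5,
    "second": [0] * 5,
})

X = df[["year", "month", "day", "hour", "minute", "second"]]
y = df["Daily_KWH_System"]

# random_state is an assumption, used only for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A regressor accepts continuous float targets; max_iter is an assumption
mlp = MLPRegressor(hidden_layer_sizes=(30, 30, 30), max_iter=500, random_state=0)
mlp.fit(X_train_scaled, y_train)

predictions = mlp.predict(X_test_scaled)
print(predictions.shape)
```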

With that being said, I don't think that this is the right approach for choosing features for this problem.

In this problem we deal with a sequence of real numbers that form a time series. One reasonable feature is the number of seconds (or minutes, hours, days, etc.) that have passed since the starting point. Since this particular data varies only in day, month, and year (hour, minute, and second are always 0), we could use the number of days that have passed since the first observation. Your data frame would then look like:

      Daily_KWH_System  days_passed
0          4136.900384            0
1          3061.657187            1
2          4099.614033            2
3          3922.490275            3
4          3957.128982            4

You could take the values in the column days_passed as features and the values in Daily_KWH_System as targets. You may also add some indicator features. For example, if you think that the end of the year may affect the target, you can add an indicator feature that indicates whether the month is December or not.
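
As a rough sketch of how these two features could be derived with pandas, assuming the column names from the question: `days_passed` and `is_december` are illustrative column names, and the data is just the five rows shown above.

```python
import pandas as pd

# The five rows shown in the question
df = pd.DataFrame({
    "Daily_KWH_System": [4136.900384, 3061.657187, 4099.614033,
                         3922.490275, 3957.128982],
    "year": [2016] * 5,
    "month": [9] * 5,
    "day": [7, 8, 9, 10, 11],
    "hour": [0] * 5,
    "minute": [0] * 5,
    "second": [0] * 5,
})

# pd.to_datetime can assemble timestamps from year/month/day/... columns
dates = pd.to_datetime(df[["year", "month", "day", "hour", "minute", "second"]])

# Whole days elapsed since the first observation
df["days_passed"] = (dates - dates.min()).dt.days

# Indicator feature: 1 if the observation falls in December, else 0
df["is_december"] = (dates.dt.month == 12).astype(int)

print(df[["Daily_KWH_System", "days_passed", "is_december"]])
```

The feature matrix would then be `df[["days_passed", "is_december"]]` and the target `df["Daily_KWH_System"]`.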

If the data is indeed daily (at least in this example you have one data point per day) and you want to tackle this problem with neural networks, then another reasonable approach would be to handle it as a time series and try to fit a recurrent neural network. Here are a couple of great blog posts that describe this approach:

http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

http://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
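
Before a sequence model can be trained, the series is commonly reshaped into pairs of (recent values, next value). The sketch below shows that windowing step with NumPy only; `make_windows` and `look_back` are hypothetical names chosen for illustration, and no recurrent network is built here.

```python
import numpy as np


def make_windows(series, look_back):
    """Split a 1D series into sliding windows and their next-step targets."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - look_back
    # Each row holds look_back consecutive values
    X = np.array([series[i:i + look_back] for i in range(n_windows)])
    # The target is the value that immediately follows each window
    y = series[look_back:]
    return X, y


# The five daily values shown in the question
kwh = [4136.900384, 3061.657187, 4099.614033, 3922.490275, 3957.128982]

X_windows, y_next = make_windows(kwh, look_back=2)
print(X_windows.shape, y_next.shape)
```

Each row of `X_windows` holds two consecutive days, and the matching entry of `y_next` is the following day's value.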

Miriam Farber
2

The fit() function expects y to be a 1D list. By slicing a Pandas dataframe you always get a 2D object. This means that, in your case, you need to convert the 2D object you got from slicing the DataFrame into an actual 1D list, as expected by the fit function:

y = list(df['Daily_KWH_System'])
zeebonk
  • This will not resolve the problem. If you want to handle it as a classification problem, you should specify the type of Y (similarly to this answer: stackoverflow.com/questions/34246336/…). However, based on the values in Daily_KWH_System, this shouldn't be a classification problem, but rather a regression problem (see more details in my answer). – Miriam Farber Jul 20 '17 at 08:51
1

Use a regressor instead. This will solve the float 2D data issue.

from sklearn.neural_network import MLPRegressor
model = MLPRegressor(solver='lbfgs', alpha=0.001, hidden_layer_sizes=(10, 10))

# X_train, X_test, y_train are the scaled splits from the question's code
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
-1

Instead of mlp.fit(X_train, y_train), use mlp.fit(X_train, y_train.values).

Chandra
  • This will not resolve the error. See my comment to zeebonk – Miriam Farber Jul 20 '17 at 07:17
  • Your solution to use regression instead of classification is correct in this case. However, from a programming point of view, every number can be a class label: y_train can be an array of numbers and still represent an array of labels. So even MLPClassifier should work from a programming point of view, though functionally it's not a good idea to use a classifier on this kind of data. – Chandra Jul 20 '17 at 08:24
  • In such a case, one needs to specify the type of Y accordingly, as in this answer: https://stackoverflow.com/questions/34246336/python-randomforest-unknown-label-error. Just converting it into a list of values will not resolve the issue. – Miriam Farber Jul 20 '17 at 08:36