I am using some time-series power consumption data and trying to do a linear regression analysis on it.
The data has the following columns:
Date, Denmark_consumption, Germany_consumption, Czech_consumption, Austria_consumption.
It is time-series data with a frequency of hours.
There are, however, values for each column that are NaN's My goal is to create a linear regression model that does training and testing on a subset of the data that has no null values and then try to predict a value for the Denmark consumption column, for example, that currently has a NaN value.
I am planning to use as my training/testing features one country consumption column as well as the date in ordinal values to try and predict a second countries consumption value.
Here is an example of the data.
Date Denmark Germany Czech Austria
2018-01-01 00:00:00 1607.0 42303.0 5520 6234.0
2018-01-01 01:00:00 1566.0 41108.0 5495 6060.0
2018-01-01 02:00:00 1460.0 40554.0 5461 5872.0
2018-01-01 03:00:00 1424.0 38533.0 5302 5564.0
2018-01-01 04:00:00 1380.0 38494.0 5258 5331.0
I did a couple of things.
I dropped rows with any null values to create my training and testing dataset.
I set the date column to be the data frame index.
I upsampled the data from hourly to weekly. I used the default 'mean' aggregate function.
I added the date back in as a column into the training and testing data and converted it to ordinal values.
Because the various consumption values are all highly correlated, I only used the Germany consumption column for the X_train and X_test dataset
I created a linear regression model using sklearn and fit the data using the Germany consumption and ordinal date as my 'X' and the Denmark consumption as my 'Y'.
I attempted to plot the output via a scatter plot and line, but I'm getting a graph that looks like this:
Why does my plot look like somebody scribbled lines all over it? I was expecting a single line of some sort.
Here is an example of my x_train dataset
Germany Date
consumption
Date
2018-07-08 44394.125000 736883
2019-01-16 66148.125000 737075
2019-08-03 45718.083333 737274
2019-06-09 41955.250000 737219
2020-03-04 61843.958333 737488
Here is an example of my y_train dataset.
Date
2018-01-01 1511.083333
2018-01-02 1698.625000
2018-01-03 1781.291667
2018-01-04 1793.458333
2018-01-05 1796.875000
Name: Denmark_consumption, dtype: float64
Here's the actual relevant code.
lin_model = LinearRegression()
lin_model.fit(X_train,y_train)
y_pred = lin_model.predict(X_test)
plt.scatter(X_test['Date'].map(dt.datetime.fromordinal),y_pred,color='black')
plt.plot(X_test['Date'],y_pred)
The coefficients, R squared and mean squared error are:
Coefficients:
[0.01941453 0.01574128]
Mean squared error: 14735.12
Coefficient of determination: 0.51
Can someone let me know what I'm not doing correctly? Also, is my methodology accurate? Does it make sense to try and predict consumption values from a combination of a second country's consumption + the date?
Any help appreciated.