2

I am using some time-series power consumption data and trying to do a linear regression analysis on it.

The data has the following columns:

Date, Denmark_consumption, Germany_consumption, Czech_consumption, Austria_consumption.

It is time-series data with a frequency of hours.

There are, however, values for each column that are NaN's My goal is to create a linear regression model that does training and testing on a subset of the data that has no null values and then try to predict a value for the Denmark consumption column, for example, that currently has a NaN value.

I am planning to use as my training/testing features one country consumption column as well as the date in ordinal values to try and predict a second countries consumption value.

Here is an example of the data.

Date                   Denmark    Germany    Czech   Austria

2018-01-01 00:00:00     1607.0  42303.0     5520    6234.0

2018-01-01 01:00:00     1566.0  41108.0     5495    6060.0

2018-01-01 02:00:00     1460.0  40554.0     5461    5872.0

2018-01-01 03:00:00     1424.0  38533.0     5302    5564.0

2018-01-01 04:00:00     1380.0  38494.0     5258    5331.0

I did a couple of things.

  1. I dropped rows with any null values to create my training and testing dataset.

  2. I set the date column to be the data frame index.

  3. I upsampled the data from hourly to weekly. I used the default 'mean' aggregate function.

  4. I added the date back in as a column into the training and testing data and converted it to ordinal values.

  5. Because the various consumption values are all highly correlated, I only used the Germany consumption column for the X_train and X_test dataset

I created a linear regression model using sklearn and fit the data using the Germany consumption and ordinal date as my 'X' and the Denmark consumption as my 'Y'.

I attempted to plot the output via a scatter plot and line, but I'm getting a graph that looks like this:

enter image description here

Why does my plot look like somebody scribbled lines all over it? I was expecting a single line of some sort.

Here is an example of my x_train dataset

                        Germany    Date
                      consumption
Date                                   
2018-07-08         44394.125000  736883
2019-01-16         66148.125000  737075
2019-08-03         45718.083333  737274
2019-06-09         41955.250000  737219
2020-03-04         61843.958333  737488

Here is an example of my y_train dataset.

Date
2018-01-01    1511.083333
2018-01-02    1698.625000
2018-01-03    1781.291667
2018-01-04    1793.458333
2018-01-05    1796.875000
Name: Denmark_consumption, dtype: float64

Here's the actual relevant code.

lin_model = LinearRegression()
lin_model.fit(X_train,y_train)
y_pred = lin_model.predict(X_test)
plt.scatter(X_test['Date'].map(dt.datetime.fromordinal),y_pred,color='black')
plt.plot(X_test['Date'],y_pred)

The coefficients, R squared and mean squared error are:

Coefficients: 
 [0.01941453 0.01574128]
Mean squared error: 14735.12
Coefficient of determination: 0.51

Can someone let me know what I'm not doing correctly? Also, is my methodology accurate? Does it make sense to try and predict consumption values from a combination of a second country's consumption + the date?

Any help appreciated.

redmage123
  • 413
  • 8
  • 15

1 Answers1

2

Your methodology is complicated, but doable. Personally I think it's probably easier to create a linear mapping between Germany's dates and Germany's consumption, then try to make predictions for Denmark's consumption from their dates that way.

However, sticking with your method, you should keep in mind that there are two independent variables (Germany's date converted to an integer, and Germany's consumption) and Denmark's consumption depends on these two variables. So by plotting the test dates against the predictions in a 2D plot like you are now, you are actually missing the consumption variable. What you should be plotting is Germany's date, and Germany's consumption against Denmark's consumption in a 3D plane.

Also you shouldn't expect to get a line: with multiple linear regression and two independent variables you are predicting a plane.

Here is a brief example I put together which is similar to what you are probably trying to achieve. Feel free to change the format of the date as needed.

import pandas as pd
import numpy as np
import datetime as dt
from mpl_toolkits.mplot3d import *
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.linear_model import LinearRegression

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# starts 2018/11/02
df_germany = pd.DataFrame({
    'Germany consumption': [45000, 47000, 48000, 42000, 50000],
    'Date': [737000, 737001, 737002, 737003, 737004]})
df_germany_test = pd.DataFrame({
    'Germany consumption': [42050, 42000, 57000, 30000, 52000, 53000],
    'Date': [737000, 737001, 737002, 737003, 737004, 737005]})
df_denmark = pd.DataFrame({
    'Denmark consumption':  [1500, 1600, 1700, 1800, 2000]
    })

X_train = df_germany.to_numpy()
y_train = df_denmark['Denmark consumption']

# make X_test the same as X_train to make sure all points are on the plane
# X_test = df_germany

# make X_test slightly different
X_test = df_germany_test

lin_model = LinearRegression()
lin_model.fit(X_train,y_train)
y_pred = lin_model.predict(X_test)

fig = plt.figure()
ax = fig.gca(projection='3d')          
# plt.hold(True)

x_surf=np.linspace(min(X_test['Date'].values), max(X_test['Date'].values), num=20)               
y_surf=np.linspace(min(X_test['Germany consumption'].values), max(X_test['Germany consumption'].values), num=20)
x_surf, y_surf = np.meshgrid(x_surf, y_surf)
b0 = lin_model.intercept_
b1, b2 = lin_model.coef_ 
z_surf = b0+ b2*x_surf + b1*y_surf
ax.plot_surface(x_surf, y_surf, z_surf, cmap=cm.cool, alpha = 0.2)    # plot a 3d surface plot

ax.scatter(X_test['Date'].values, X_test['Germany consumption'].values, y_pred, alpha=1.0)
plt.show()

enter image description here

Derek O
  • 16,770
  • 4
  • 24
  • 43
  • 1
    Thank you very much for this. This really helped a lot. – redmage123 Jul 23 '20 at 05:53
  • I am glad to hear it! If it was helpful, please consider [accepting the answer](https://stackoverflow.com/help/someone-answers) so that anyone else with a similar question knows where to start :) – Derek O Jul 23 '20 at 06:09