Fitting a non-linear univariate regression to time-series data in Python

Question

I've recently started machine learning using Python. Below is a dataset I picked up as an example, along with the code I've worked on so far. I've chosen [2000...2015] as the training data and [2016, 2017] as the test data.

Dataset  
      Years        Values
    0    2000      23.0
    1    2001      27.5
    2    2002      46.0
    3    2003      56.0
    4    2004      64.8
    5    2005      71.2
    6    2006      80.2
    7    2007      98.0
    8    2008     113.0
    9    2009     155.8
    10   2010     414.0
    11   2011    2297.8
    12   2012    3628.4
    13   2013   16187.8
    14   2014   25197.8
    15   2015   42987.8
    16   2016   77555.5
    17   2017  130631.9

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

df = pd.DataFrame([[i for i in range(2000,2018)], 
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])


df = df.T
df.columns = ['Years', 'Values']

The above code creates the DataFrame. Another important thing to keep in mind is that my Years column is a time series and not just a continuous value. I haven't made any changes to accommodate this.

I want to fit non-linear models that may help, and to plot the results as I've done for my linear model example. Here is what I've tried using a linear model. Also, in my own example, I do not seem to be accounting for the fact that my Years column is a time series and NOT continuous.

Once we have the model, I would like to use it to predict values for at least the next couple of years.

X = df.iloc[:, :-1].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, lm.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')
plt.ylabel('Values')
plt.show()

I don't understand what you want. Do you want non-linear regression, or do you want to know how to fit your output to the `X_train` and `y_train` data? You already seem to have done that! — ababuji, Jul 01 '18 at 11:19
Hi Abhishek, I need a non-linear regression. I've already tried `SVM(kernel = 'poly')` but it didn't work. Can you help? — PratikSharma, Jul 01 '18 at 11:21
Alright, can you also do `DataFramename.dtypes`, and tell me what you get? — ababuji, Jul 01 '18 at 11:21
Here is a step further on what I was able to do, based on my limited understanding of the matter: [link](https://stackoverflow.com/questions/51122688/scipy-optimal-parameters-not-found-number-of-calls-to-function-has-reached-maxf) — PratikSharma, Jul 01 '18 at 11:26
Check it out! I'm done with the solution. Use a RandomForest Regressor. It just works. It may overfit though. — ababuji, Jul 01 '18 at 11:53
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174113/discussion-between-abhishek-and-greenarrow). — ababuji, Jul 01 '18 at 12:23
I will set a bounty in two days (sacrificing a bit of MY reputation, so this question gets more attention from experts). This is a lot more complex than you think. — ababuji, Jul 01 '18 at 12:28
What you're asking for is a completely separate domain called time-series regression. — ababuji, Jul 01 '18 at 12:31
But the posts I saw on the community were mostly tagged either 'linear regression' or 'non-linear regression'. — PratikSharma, Jul 01 '18 at 12:33
Yes, but your linear or non-linear regression is VERY SPECIFIC to time-series data. It is still 100% linear or non-linear regression, but the way you apply it to time-series data differs from how it's conventionally done for continuous input-output. — ababuji, Jul 01 '18 at 12:35
Ah, okay! Is there any source that makes time series in Python easy to understand? — PratikSharma, Jul 01 '18 at 12:36
if I had known that, I would've read it, understood it, and answered by now! ;) — ababuji, Jul 01 '18 at 13:10

score 2 · Answer 1 · answered Jul 02 '18 at 12:36

Try this. It also prints the predicted values and extends the fitted curve 5 years past the data.

import numpy.polynomial.polynomial as poly
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

df = pd.DataFrame([[i for i in range(2000,2018)],
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])
df = df.T
df.columns = ['Year', 'Values']
df['Year'] = df['Year'].astype(int)
df['Values'] = df['Values'].astype(int)
no_of_predictions = 5


X = np.array(df.Year, dtype = float)
y = np.array(df.Values, dtype = float)
Z = [2019,2020,2021,2022]
coefs = poly.polyfit(X, y, 4)
X_new = np.linspace(X[0], X[-1]+no_of_predictions, num=len(X)+no_of_predictions)
ffit = poly.polyval(X_new, coefs)
pred = poly.polyval(Z, coefs)
# Tabulate the forecast with the year and the predicted value as columns
predictions = pd.DataFrame({'Year': Z, 'Predicted': pred})
print(predictions)
plt.plot(X, y, 'ro', label="Original data")
plt.plot(X_new, ffit, label = "Fitted data")
plt.legend(loc='upper left')
plt.show()
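
Because the values span several orders of magnitude, a straight-line fit to the logarithm of the values is another option worth comparing against the polynomial. The following is only a sketch, and it assumes the growth is roughly exponential, which the answer above does not claim. The names `t`, `slope`, `intercept`, `future_years` and `forecast` are illustrative.

```python
import numpy as np

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Shift the years to start at zero so the fit works with small numbers
t = years - years[0]

# Ordinary least squares on log(values) = intercept + slope * t
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(t, np.log(values), 1)

# Transform back to the original scale for the forecast years
future_years = np.array([2018, 2019, 2020, 2021, 2022], dtype=float)
forecast = np.exp(intercept + slope * (future_years - years[0]))

print(f"Estimated growth factor per year: {np.exp(slope):.2f}")
for year, value in zip(future_years, forecast):
    print(f"{int(year)}: {value:.1f}")
```

A fit on the log scale minimises relative rather than absolute error, so the small early values count as much as the large recent ones. The forecasts are also always positive.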

This is the best solution so far. Is there a way to also have some control over the predicted values? By that I mean, can we somehow decrease the value predicted for a year? Here, for example, the predicted value for 2019 is 271917.56; can we somehow bring the predictions down for all years? I also tried the ARIMA model on the dataset yesterday, but since there is no seasonality in the data I wasn't able to make it stationary, even after the first difference, second difference, and seasonal first difference. Nothing worked. — PratikSharma, Jul 02 '18 at 13:02
It is great to hear that my answer helped you. Since it answers the problem statement, I think it is better if you accept my answer and post a new question for the new problem. I will help you for sure. — Surani Matharaarachchi, Jul 02 '18 at 13:48

ababuji · Answer 2 · 2018-07-01T12:19:50.240

EDIT: My answer is wrong; I used a classifier instead of a regressor. I'm not deleting it because I'm scared of getting banned from posting more answers. DO NOT USE THIS ANSWER.

Try this out

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame([[i for i in range(2000,2018)], 
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])


df = df.T
df.columns = ['Year', 'Values']
df['Year'] = df['Year'].astype(int)
df['Values'] = df['Values'].astype(int)

Your DataFrame

X = df[['Year']]
y = df[['Values']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)


y_pred = clf.predict(X_test)

plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, clf.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')

plt.xticks(rotation=90)
plt.ylabel('Values')
plt.show()
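
As the edit note says, a classifier is the wrong estimator for a continuous target. Below is a minimal sketch of the same idea with `RandomForestRegressor` swapped in. It is an illustration, not a recommended solution: tree ensembles predict averages of training targets, so they cannot extrapolate past the largest value seen in training. The `n_estimators` and `random_state` settings are arbitrary choices, and the split is done by slicing instead of `train_test_split`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

years = np.arange(2000, 2018).reshape(-1, 1)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Chronological split: hold out the last two years, as shuffle=False would
X_train, X_test = years[:-2], years[-2:]
y_train, y_test = values[:-2], values[-2:]

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("Held-out years:", X_test.ravel())
print("Actual:        ", y_test)
print("Predicted:     ", y_pred)

# Predictions are bounded by the largest training target, so the forecast flattens out
print("Largest training value:", y_train.max())
```

On this dataset the held-out values for 2016 and 2017 lie well above anything in the training years, so the forest underestimates both.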

there is no danger of being banned for deleting your answer, and since it is wrong, you should do it... — desertnaut, Jul 03 '18 at 21:29

score 0 · Answer 3 · answered Jul 01 '18 at 12:12

Meanwhile, I also tried

import numpy.polynomial.polynomial as poly
X = np.array(df.Years, dtype = float)
y = np.array(df.Values, dtype = float)
coefs = poly.polyfit(X, y, 4)
X_new = np.linspace(X[0], X[-1], num=17)
ffit = poly.polyval(X_new, coefs)
plt.plot(X, y, 'ro', label="Original data")
plt.plot(X_new, ffit, label = "Fitted data")
plt.legend(loc='upper left')
plt.show()

It gave an almost perfect fit. But now I'm unclear on how I can use it to predict values for the next five years.
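
One way to get those forecasts is to evaluate the fitted polynomial at future years. The sketch below is an illustration under assumptions and not part of the original post. It swaps `poly.polyfit` for `numpy.polynomial.Polynomial.fit`, which rescales the years internally, so a degree-4 fit on inputs near 2000 is better conditioned. The names `model`, `future_years` and `forecast` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Fit a degree-4 polynomial; Polynomial.fit maps the years onto [-1, 1] internally
model = Polynomial.fit(years, values, deg=4)

# Assumed forecast horizon: the five years after the last observation
future_years = np.arange(2018, 2023, dtype=float)

# Calling the fitted polynomial applies the same internal rescaling to new inputs
forecast = model(future_years)

for year, value in zip(future_years, forecast):
    print(f"{int(year)}: {value:.1f}")
```

A polynomial extrapolates without bound, so forecasts more than a year or two past 2017 should be treated with caution.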
