I've recently started learning machine learning with Python. Below is a dataset I picked as an example, along with the code I've written so far. I've used 2000 to 2015 as the training data and 2016 and 2017 as the test data.
Dataset
Years Values
0 2000 23.0
1 2001 27.5
2 2002 46.0
3 2003 56.0
4 2004 64.8
5 2005 71.2
6 2006 80.2
7 2007 98.0
8 2008 113.0
9 2009 155.8
10 2010 414.0
11 2011 2297.8
12 2012 3628.4
13 2013 16187.8
14 2014 25197.8
15 2015 42987.8
16 2016 77555.5
17 2017 130631.9
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.DataFrame([[i for i in range(2000,2018)],
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])
df = df.T
df.columns = ['Years', 'Values']
The above code creates the DataFrame. An important thing to keep in mind is that my Years column is a time series and not just a continuous value. I haven't made any changes to accommodate this.
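For reference, a minimal sketch of what treating Years as dates might look like. I have not used this in the code below. It assumes each year can be represented by a timestamp at the start of that year, and the names `index` and `ts` are just illustrative.

```python
import pandas as pd

years = list(range(2000, 2018))
values = [23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0, 155.8,
          414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8, 77555.5, 130631.9]

# Interpret each year as a timestamp at the start of that year
# (an assumption: the data does not say which date within the year applies).
index = pd.to_datetime([str(y) for y in years], format="%Y")

# A Series indexed by dates keeps the observations in chronological order.
ts = pd.Series(values, index=index, name="Values")

print(ts.tail(3))
```

With a date index like this, the ordering of the observations is explicit, which is the property a plain numeric Years column does not express.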
I want to fit non-linear models that might suit this data better, and plot the results the way I've done for my linear model example. Once I have a model, I'd like to use it to predict values for at least the next couple of years.
As noted, my own example does not account for the fact that my Years column is a time series and not continuous. Here is what I've tried using a linear model.
X = df.iloc[:, :-1].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, lm.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')
plt.ylabel('Values')
plt.show()
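A minimal sketch of one non-linear approach I could try next, assuming the growth is roughly exponential (that is an assumption, not something established above): fit a straight line to log(Values) with `np.polyfit`, then exponentiate to get predictions. The helper `predict` and the choice of 2018 and 2019 as forecast years are illustrative only.

```python
import numpy as np

years = np.arange(2000, 2018)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0, 155.8,
                   414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8, 77555.5, 130631.9])

# Hold out the last two years, keeping chronological order (no shuffling).
train_years, test_years = years[:-2], years[-2:]
train_vals, test_vals = values[:-2], values[-2:]

# Assumed model: log(value) = slope * (year - 2000) + intercept.
# Shifting the years to start at 0 keeps the fit numerically well behaved.
slope, intercept = np.polyfit(train_years - 2000, np.log(train_vals), deg=1)

def predict(year):
    """Map year(s) to predicted values under the assumed exponential model."""
    t = np.asarray(year) - 2000
    return np.exp(slope * t + intercept)

# Compare against the held-out years, then extrapolate two years ahead.
test_pred = predict(test_years)
future_years = np.array([2018, 2019])
future_pred = predict(future_years)

for yr, actual, pred in zip(test_years, test_vals, test_pred):
    print(f"{yr}: actual={actual:.1f}, predicted={pred:.1f}")
for yr, pred in zip(future_years, future_pred):
    print(f"{yr}: forecast={pred:.1f}")
```

The curve `predict(train_years)` can be plotted against the training points in the same way as the linear fit above. Extrapolating an exponential is very sensitive to the fitted slope, so the forecasts should be read as rough estimates.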