I have to create a function that will simaltenously split my dataframe into test and train sets and then run the simplest regression model and return RMSE as a result. The task is to predict the revenue column by id and year.
I am quite new to this and I am stuck at the part where I have to split test and train sets, and figuring out how to join everything together.
Required split:
- train: all from 2021
- test: the rest of the years
Data example:
df = pd.DataFrame({'year': [2020, 2021, 2021, 2021, 2022],
'id': [3, 1, 1, 2, 1],
'price': [12, 20, 30, 40, 50],
'age': [20, 30, 30, 44, 31],
'product': [book, toy, book, tv, book],
'quantity': [2,1,2,5,9})
my code so far:
df['revenue'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year
#this creates extra variables and puts them the way I need it
def table(df):
df2 = (df.groupby(['year','id'])
.agg(revenue = ('revenue', 'sum'),
age = ('age', 'unique'),
product_year_before = ('product', 'nunique'),
revenue_year_before =('revenue','sum')))
s = df2['product_year_before'].rename(lambda x: x + 1, level=0)
df2 = df2.drop('product_year_before',axis=1).join(s)
s2 = df3['revenue_year_before'].rename(lambda x: x + 1, level=0)
return df3= df3.drop('revenye_year_before',axis=1).join(s2)
table_result = table(df)
table(df)
Question 1: I tried to split this way, but I get an error "only list-like objects are allowed to be passed to isin(), you passed a [int]". Is there a way to fix this?
test = df3[df3['year'].isin(2018)]
train = df3[~df3['year'].isin(2018)]
Question 2: How do I incorporate linear regression model to predict "revenue" column to the above function if my results need to be this:
MSE = np.square(np.subtract(y_actual,y_predicted)).mean()
RMSE = math.sqrt(MSE)
return RMSE