0

I have to create a function that will simaltenously split my dataframe into test and train sets and then run the simplest regression model and return RMSE as a result. The task is to predict the revenue column by id and year.

I am quite new to this and I am stuck at the part where I have to split test and train sets, and figuring out how to join everything together.

Required split:

  • train: all from 2021
  • test: the rest of the years

Data example:

df = pd.DataFrame({'year': [2020, 2021, 2021, 2021, 2022],
                   'id': [3, 1, 1, 2, 1],
                   'price': [12, 20, 30, 40, 50],
                   'age': [20, 30, 30, 44, 31],
                   'product': [book, toy, book, tv, book],
                   'quantity': [2,1,2,5,9})

my code so far:

df['revenue'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year 

#this creates extra variables and puts them the way I need it 
def table(df):
   df2 = (df.groupby(['year','id'])
            .agg(revenue = ('revenue', 'sum'), 
                 age = ('age', 'unique'),
                 product_year_before = ('product', 'nunique'),
                 revenue_year_before =('revenue','sum')))
   s = df2['product_year_before'].rename(lambda x: x + 1, level=0)
   df2 = df2.drop('product_year_before',axis=1).join(s)
   s2 = df3['revenue_year_before'].rename(lambda x: x + 1, level=0)
   return df3= df3.drop('revenye_year_before',axis=1).join(s2)
table_result = table(df)                      
table(df)

Question 1: I tried to split this way, but I get an error "only list-like objects are allowed to be passed to isin(), you passed a [int]". Is there a way to fix this?

   test = df3[df3['year'].isin(2018)] 
   train = df3[~df3['year'].isin(2018)]

Question 2: How do I incorporate linear regression model to predict "revenue" column to the above function if my results need to be this:

MSE = np.square(np.subtract(y_actual,y_predicted)).mean() 
RMSE = math.sqrt(MSE)
return RMSE
Pixel
  • 97
  • 7
  • Does [Pandas filter dataframe rows with a specific year](https://stackoverflow.com/questions/46878156/pandas-filter-dataframe-rows-with-a-specific-year) answer your question? – wwii Oct 17 '22 at 13:44
  • Please limit your questions to one question. – wwii Oct 17 '22 at 13:45
  • What is the purpose of `df['year'] = pd.DatetimeIndex(df['date']).year`? Isn't `df['year'] already numeric? – wwii Oct 17 '22 at 15:57

1 Answers1

1

Question 1:

As the error message states, .isin() requires a list, e.g. a list of years such as [2017, 2018, 2020]. You only pass an integer 2018. What you can do is:

test = df[df['year'] == 2018] 
train = df[df['year'] != 2018]

... or whatever year you actually want in your test data.

Question 2:

Not entirely sure what you mean, but you could use scikit-learn's LinearRegression function. Use reg = LinearRegression().fit(X, y) to fit your regression, where y is your revenue column and X are your predictors. Scikit-learn also has a function for RMSE calculation. Together, use something like:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Xtrain = train.drop('revenue', axis = 1).values
ytrain = train.revenue.values.ravel()
Xtest = test.drop('revenue', axis = 1).values
ytest = test.revenue.values.ravel()

reg = LinearRegression().fit(Xtrain, ytrain)
ypred = reg.predict(Xtest)
RMSE = mean_squared_error(ytest, ypred, squared = False)
RamsesII
  • 404
  • 3
  • 10