1

I try to practice Linear Regression by analyzing data file Google Apps Store to predict Rating, the file csv is on Kaggle.

After cleaning and trying to apply KNeighborsRegressor to run the model, as the results, the accuracy and r-squared are too low and I don't know why.
However the difference between predictions and y-test is not much and MSE is quite low.

I think there is some mistakes here, I hope you could help me to fix it. I would like to reach the accuracy about 90%.

import re
import sys

import time
import datetime

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
df = pd.read_csv('googleplaystore.csv')

df['Rating'] = df['Rating'].fillna(df['Rating'].median())


replaces = [u'\u00AE', u'\u2013', u'\u00C3', u'\u00E3', u'\u00B3', '[', ']', "'"]
for i in replaces:
    df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : x.replace(i, ''))

regex = [r'[-+|/:/;(_)@]', r'\s+', r'[A-Za-z]+']
for j in regex:
    df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : re.sub(j, '0', x))

df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : x.replace('.', ',',1).replace('.', '').replace(',', '.',1)).astype(float)
df['Current Ver'] = df['Current Ver'].fillna(df['Current Ver'].median())
 df.drop([10472], axis = 0, inplace = True)
le = preprocessing.LabelEncoder()
df['App'] = le.fit_transform(df['App'])
category_list = df['Category'].unique().tolist() 
category_list = ['cat_' + word for word in category_list]
df = pd.concat([df, pd.get_dummies(df['Category'], prefix='cat')], axis=1)
df['Genres'] = df['Genres'].str.split(';').str[0]
df['Genres'].replace('Music & Audio', 'Music', inplace =True)
le = preprocessing.LabelEncoder()
df['Genres'] = le.fit_transform(df['Genres'])
le = preprocessing.LabelEncoder()
df['Content Rating'] = le.fit_transform(df['Content Rating'])
df['Price'] = df['Price'].apply(lambda x : x.strip('$'))
df['Installs'] = df['Installs'].apply(lambda x : x.strip('+').replace(',', ''))
df['Type'] = pd.get_dummies(df['Type'])

def change_size(size):
    if 'M' in size:
        x = size[:-1]
        x = float(x)*1000000
        return(x)
    elif 'k' == size[-1:]:
        x = size[:-1]
        x = float(x)*1000
        return(x)
    else:
        return None

df['Size'] = df['Size'].apply(change_size)
df['Size'] = df['Size'].fillna(value=df['Size'].median(), axis = 0)
df['new'] = pd.to_datetime(df['Last Updated'])
df['lastupdate'] = (df['new'] - df['new'].max()).dt.days
features = ['App', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'lastupdate','Content Rating', 'Genres', 'Current Ver']
features.extend(category_list)
X = df[features]
y = df['Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
model = KNeighborsRegressor(n_neighbors=28)
predictions = model.predict(X_test)
model.fit(X_train, y_train)

accuracy = model.score(X_test,y_test)
'Accuracy: ' + str(np.round(accuracy*100, 2)) + '%'


from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

result = pd.DataFrame({'Actual': y_test, 'Predicted': predictions}) 
result
Thoa Ng
  • 49
  • 1
  • 7

1 Answers1

0

The way accuracy is being measured here is to exactly match the rating (as in a classifier). For example, if the actual rating is 4.3 and the prediction is 4.3001, it will be counted as an error. But, what is being done here is regression and in this case MSE and MAE are better metrics.

You could try bucketing the y_test and the predictions to get an idea of the "accuracy" like so:

>>> metrics.accuracy_score(np.around(y_test), np.around(predictions))
0.7666051660516605
user650654
  • 5,630
  • 3
  • 41
  • 44