Why are all of the different scatter points gathered closely in the same area in a logistic regression graph?

Question

I am currently doing a project where my team and I have to pick a dataset and apply some machine learning methods to it (SLR, MLR etc.); my part is logistic regression. My dataset covers the top hit songs on Spotify from 2010-2019, and I want to see how the duration and danceability of a song affect its popularity. Since the popularity values are numerical, I converted the popularity of each song to a binary value: "0" if it is 65 or below, and "1" if it is above 65.

I then plotted a 2D logistic regression plot of the two features. The end result is that the "0" and "1" points are all gathered in the same area, whereas I expected them to be separated from each other, with a decision boundary showing at 0.5. What does this show about the relationship between the popularity of the songs and their duration and danceability? Is this normal, or did I make a mistake?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
from sklearn.linear_model import LogisticRegression 

df = pd.read_csv('top10s [SubtitleTools.com] (2).csv')

BPM = df.bpm
BPM = np.array(BPM)
Energy = df.nrgy
Energy = np.array(Energy)
dB = df.dB
dB = np.array(dB)
Live = df.live
Live = np.array(Live)
Valence = df.val
Valence = np.array(Valence)
Acous = df.acous
Acous = np.array(Acous)
Speech = df.spch
Speech = np.array(Speech)

def LogReg0732():
    Dur = df.dur
    Dur = np.array(Dur)
    Dance = df.dnce
    Dance = np.array(Dance)
    # Binarize popularity without overwriting df['popu'], so the cell can be
    # re-run safely: 1 if popularity > 65, otherwise 0.
    Pop = np.where(df['popu'] > 65, 1, 0)

    X = Dur

    X = np.stack((X, Dance))

    y = Pop

    clf = LogisticRegression().fit(X.T, y)
    print("Coef ", clf.intercept_, clf.coef_)
    xx, yy = np.mgrid[np.min(Dur):np.max(Dur), np.min(Dance):np.max(Dance)]
    gridxy = np.c_[xx.ravel(), yy.ravel()]
    probs = clf.predict_proba(gridxy)[:,1].reshape(xx.shape)
    f, ax = plt.subplots(figsize=(20,8))
    contour = ax.contourf(xx, yy, probs, 25, cmap="BrBG", vmin=0, vmax=1)
    ax_c = f.colorbar(contour)
    ax_c.set_ticks([0, 1/4, 1/2, 3/4, 1])
    # Split the samples by class with boolean masks (X has shape (2, n_samples)).
    y1 = X[:, y == 1]
    y0 = X[:, y == 0]
    ax.scatter(y1[0,:], y1[1,:], c='green')
    ax.scatter(y0[0,:], y0[1,:], c='blue')
    # Row 0 of X (duration) is plotted on the x-axis, row 1 (danceability) on the y-axis.
    plt.xlabel('Duration')
    plt.ylabel('Danceability')
    plt.savefig('LR1.svg')
    plt.show()

LogReg0732()

My current graph

Example of type of graph I'm expecting

My dataset

For your reference, capitalized identifiers are reserved for classes. Do not use them to name variables, do not confuse your readers (us). — DYZ, Jan 06 '20 at 04:23

score 0 · Answer 1 · answered Jan 07 '20 at 04:16

The main reason, assuming there is no mistake in the model-building process, is that duration and danceability are not good features for your problem. You will likely have to add more features.

To understand the model in detail you would have to run a variety of statistical tests, but I think in short they would all give the same answer: this isn't a good fit.
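One quick check of whether the two features carry any signal, short of formal statistical tests, is to compare the model's cross-validated accuracy with that of a classifier that always predicts the majority class. The sketch below is only an illustration: the helper name `compare_to_baseline` is hypothetical, and the data is synthetic, not the Spotify dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_to_baseline(X, y, cv=5):
    """Return (model accuracy, majority-class accuracy), both cross-validated."""
    model_acc = cross_val_score(LogisticRegression(), X, y, cv=cv).mean()
    baseline = DummyClassifier(strategy="most_frequent")
    base_acc = cross_val_score(baseline, X, y, cv=cv).mean()
    return model_acc, base_acc


# Synthetic stand-in data (an assumption, NOT the Spotify dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y_noise = (rng.random(600) < 0.4).astype(int)   # labels unrelated to X
y_signal = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels driven by X

print("uninformative features:", compare_to_baseline(X, y_noise))
print("informative features:  ", compare_to_baseline(X, y_signal))
```

If the model's accuracy on your real data is about the same as the baseline's, as in the first case, the overlapping points in the plot reflect weak features rather than a plotting mistake.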

1) If you don't want to add more features, you can also try changing the cutoff value from 65 to something else; it might help.

2) Try normalizing your data.
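Both suggestions can be sketched roughly as follows. The helpers `binarize` and `fit_scaled_logreg` are hypothetical names, and the numbers are made up to stand in for the duration, danceability and popularity columns, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def binarize(popularity, cutoff):
    """Label songs above `cutoff` as 1 and the rest as 0."""
    return (np.asarray(popularity) > cutoff).astype(int)


def fit_scaled_logreg(X, y):
    """Standardize each feature to zero mean and unit variance, then fit."""
    return make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)


# Made-up values (an assumption) standing in for the real columns.
rng = np.random.default_rng(0)
duration = rng.normal(225, 35, size=500)
danceability = rng.normal(65, 13, size=500)
popularity = rng.normal(66, 14, size=500)
X = np.column_stack([duration, danceability])

# Try a few cutoffs and report the share of "popular" songs and the accuracy.
for cutoff in (50, 65, 80):
    y = binarize(popularity, cutoff)
    model = fit_scaled_logreg(X, y)
    print(cutoff, round(float(y.mean()), 2), round(float(model.score(X, y)), 3))
```

Keep in mind that moving the cutoff also changes the class balance, so accuracy alone can look better simply because one class dominates. Scaling puts both features on a comparable range, which matters for the regularized `LogisticRegression` that scikit-learn uses by default, but it cannot separate classes that genuinely overlap.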
