
I implemented a gradient descent algorithm, but when I try it on some sklearn examples the results are incorrect and I do not know how to fix it. This is the full algorithm:

First of all, I have a class for an exception and a function called rendimiento() that computes the accuracy:

class ClasificadorNoEntrenado(Exception): pass

def rendimiento(clasificador, X, y):
    aciertos = 0
    total_ejemplos = len(X)

    for i in range(total_ejemplos):
        ejemplo = X[i]
        clasificacion_esperada = y[i]
        clasificacion_obtenida = clasificador.clasifica(ejemplo)
        
        if clasificacion_obtenida == clasificacion_esperada:
            aciertos += 1
    
    accuracy = aciertos / total_ejemplos
    return accuracy
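As a quick sanity check of rendimiento, a trivial classifier that always predicts the same class should score exactly the fraction of examples with that label (SiemprePositivo below is a made-up classifier, just for illustration):

```python
import numpy as np

def rendimiento(clasificador, X, y):
    aciertos = 0
    total_ejemplos = len(X)
    for i in range(total_ejemplos):
        if clasificador.clasifica(X[i]) == y[i]:
            aciertos += 1
    return aciertos / total_ejemplos

# Hypothetical classifier that always predicts class 1
class SiemprePositivo:
    def clasifica(self, ejemplo):
        return 1

X = np.zeros((4, 2))          # features are ignored by this dummy classifier
y = np.array([1, 1, 1, 0])    # three of four labels are 1
print(rendimiento(SiemprePositivo(), X, y))  # 0.75
```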

Second, I have a function that computes the sigmoid:

from scipy.special import expit

def sigmoide(x):
    return expit(x)
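For reference, expit computes 1 / (1 + exp(-x)) elementwise and, unlike a naive implementation, does not overflow for large negative inputs:

```python
import numpy as np
from scipy.special import expit

def sigmoide(x):
    return expit(x)

print(sigmoide(0.0))                           # 0.5
print(sigmoide(np.array([-1000.0, 1000.0])))   # [0. 1.], no overflow warning
```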

Third, I have the main algorithm:

import numpy as np

class RegresionLogisticaMiniBatch():

    def __init__(self, clases=[0,1], normalizacion=False,
                 rate=0.1, rate_decay=False, batch_tam=64):

        self.clases = clases
        self.rate = rate
        self.normalizacion = normalizacion
        self.rate_decay = rate_decay
        self.batch_tam = batch_tam
        self.pesos = None
        self.media = None
        self.desviacion = None
    
    def entrena(self, X, y, n_epochs, reiniciar_pesos=False, pesos_iniciales=None):
        self.X = X
        self.y = y
        self.n_epochs = n_epochs
        
        if reiniciar_pesos or self.pesos is None:
            self.pesos = pesos_iniciales if pesos_iniciales is not None else np.random.uniform(-1, 1, size=X.shape[1])
    
        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
    
        indices = np.random.permutation(len(X))
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, len(X), self.batch_tam):
            batch_X = X_shuffled[i:i + self.batch_tam]
            batch_y = y_shuffled[i:i + self.batch_tam]
        
            # Compute logistic function (sigmoid)
            z = np.dot(batch_X, self.pesos)
            y_pred = sigmoide(z)
    
            # Compute gradient
            error = batch_y - y_pred
            gradiente = np.dot(batch_X.T, error) / len(batch_X)
    
            # Update weights
            self.pesos += self.rate * gradiente
    
    def clasifica_prob(self, ejemplo):
        if self.pesos is None:
            raise ClasificadorNoEntrenado("El clasificador no ha sido entrenado")
    
        if self.normalizacion:
            ejemplo = (ejemplo - self.media) / self.desviacion
    
        probabilidad = sigmoide(np.dot(ejemplo, self.pesos))
        if probabilidad >= 0.5:
            return 1
        else:
            return 0
    
        
        #return {'no': 1 - probabilidad, 'si': probabilidad} 
     
    def clasifica(self,ejemplo):
        probabilidad = self.clasifica_prob(ejemplo)
        return probabilidad

And finally, I test whether it is correct on a dataset from sklearn:

from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()

X_cancer,y_cancer=cancer.data,cancer.target

lr_cancer=RegresionLogisticaMiniBatch(rate=0.1,rate_decay=True,normalizacion=True)

from sklearn.model_selection import train_test_split
Xe_cancer, Xt_cancer, ye_cancer, yt_cancer = train_test_split(X_cancer, y_cancer)

lr_cancer.entrena(Xe_cancer,ye_cancer,10000)

print(rendimiento(lr_cancer,Xe_cancer,ye_cancer))

print(rendimiento(lr_cancer,Xt_cancer,yt_cancer))

But the results are low and vary randomly between runs.

I tried to implement logistic regression with mini-batch gradient descent, but it does not predict correctly. I hope someone can help me fix this.


1 Answer

There are two problems that I see here.

The first problem is normalization. The self.normalizacion flag controls whether normalization is used. However, when it is set, normalization is applied during prediction but not during training.

If the weights of your model are learned on a non-normalized dataset, they will perform very poorly on a normalized dataset.

I suggest changing your code like this to normalize during training:

        # within entrena
        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
            X = (X - self.media) / self.desviacion
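The key point is that the media and desviacion computed from the training data must be stored and reused on every prediction input (which clasifica_prob already does). A standalone sketch of the standardization step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # features on a large scale

media = np.mean(X, axis=0)
desviacion = np.std(X, axis=0)
X_norm = (X - media) / desviacion

# The standardized training data has zero mean and unit variance per feature
print(np.allclose(np.mean(X_norm, axis=0), 0.0))  # True
print(np.allclose(np.std(X_norm, axis=0), 1.0))   # True

# A new example must be transformed with the *same* stored statistics,
# not with statistics computed from the example itself
ejemplo = np.array([55.0, 45.0, 60.0])
ejemplo_norm = (ejemplo - media) / desviacion
```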

The second problem is epochs. An epoch is one full pass over the training dataset; training for many epochs lets the randomly initialized weights move toward their ideal values. You accept an n_epochs argument, but it is never used.

Therefore, I suggest having two loops: the outer one iterates over epochs, and the inner one iterates over mini-batches.

        # within entrena
        for j in range(n_epochs):
            for i in range(0, len(X), self.batch_tam):
                # same loop as before
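Putting both fixes together, here is a condensed, self-contained sketch (reusing the names from the question; clases and rate_decay are omitted for brevity, and the latter was unimplemented in the original anyway), trained on hypothetical linearly separable toy data:

```python
import numpy as np
from scipy.special import expit

class RegresionLogisticaMiniBatch:
    def __init__(self, rate=0.1, normalizacion=False, batch_tam=64):
        self.rate = rate
        self.normalizacion = normalizacion
        self.batch_tam = batch_tam
        self.pesos = None
        self.media = None
        self.desviacion = None

    def entrena(self, X, y, n_epochs, reiniciar_pesos=False, pesos_iniciales=None):
        if reiniciar_pesos or self.pesos is None:
            self.pesos = (pesos_iniciales if pesos_iniciales is not None
                          else np.random.uniform(-1, 1, size=X.shape[1]))
        # Fix 1: standardize the training data with stored statistics
        if self.normalizacion:
            self.media = np.mean(X, axis=0)
            self.desviacion = np.std(X, axis=0)
            X = (X - self.media) / self.desviacion
        # Fix 2: outer loop over epochs, re-shuffling on each pass
        for _ in range(n_epochs):
            indices = np.random.permutation(len(X))
            X_shuffled, y_shuffled = X[indices], y[indices]
            for i in range(0, len(X), self.batch_tam):
                batch_X = X_shuffled[i:i + self.batch_tam]
                batch_y = y_shuffled[i:i + self.batch_tam]
                y_pred = expit(batch_X @ self.pesos)
                gradiente = batch_X.T @ (batch_y - y_pred) / len(batch_X)
                self.pesos += self.rate * gradiente

    def clasifica(self, ejemplo):
        if self.normalizacion:
            ejemplo = (ejemplo - self.media) / self.desviacion
        return int(expit(np.dot(ejemplo, self.pesos)) >= 0.5)

# Toy data: class 1 is shifted along the first feature, so it is separable
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 2))
X[:, 0] += 4.0 * y

clf = RegresionLogisticaMiniBatch(rate=0.1, normalizacion=True)
clf.entrena(X, y, n_epochs=200)
acc = np.mean([clf.clasifica(X[i]) == y[i] for i in range(n)])
print(acc)  # high accuracy on this separable toy data
```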

With these changes I can get 99% accuracy on the training set, and 95% accuracy on the test set.

Nick ODell