I am having some trouble understanding the standardisation process in this KNN classifier. I can see that the mean and standard deviation of the training examples are computed and stored in variables, but I don't understand what happens after that. Any help explaining what the standardisation steps actually do would be greatly appreciated.

classdef myknn
methods(Static)

                % fit takes the training examples, the training labels
                % and the no. of nearest neighbours, and returns the model m.
    function m = fit(train_examples, train_labels, k)

            % start of standardisation process
        m.mean = mean(train_examples{:,:});  %mean variable
        m.std = std(train_examples{:,:}); %standard deviation variable
        for i=1:size(train_examples,1)
            train_examples{i,:} = train_examples{i,:} - m.mean;
            train_examples{i,:} = train_examples{i,:} ./ m.std;
        end
            % end of standardisation process

        m.train_examples = train_examples;
        m.train_labels = train_labels;
        m.k = k;

    end

    function predictions = predict(m, test_examples)

        predictions = categorical;

        for i=1:size(test_examples,1)

            fprintf('classifying example %i/%i\n', i, size(test_examples,1));

            this_test_example = test_examples{i,:};

            % start of standardisation process
            this_test_example = this_test_example - m.mean;
            this_test_example = this_test_example ./ m.std;
            % end of standardisation process

            this_prediction = myknn.predict_one(m, this_test_example);
            predictions(end+1) = this_prediction;

        end

    end

    function prediction = predict_one(m, this_test_example)

        distances = myknn.calculate_distances(m, this_test_example);
        neighbour_indices = myknn.find_nn_indices(m, distances);
        prediction = myknn.make_prediction(m, neighbour_indices);

    end

    function distances = calculate_distances(m, this_test_example)

        distances = [];

        for i=1:size(m.train_examples,1)

            this_training_example = m.train_examples{i,:};
            this_distance = myknn.calculate_distance(this_training_example, this_test_example);
            distances(end+1) = this_distance;
        end

    end

    function distance = calculate_distance(p, q)

        differences = q - p;
        squares = differences .^ 2;
        total = sum(squares);
        distance = sqrt(total);

    end

    function neighbour_indices = find_nn_indices(m, distances)

        [sorted, indices] = sort(distances);
        neighbour_indices = indices(1:m.k);

    end

    function prediction = make_prediction(m, neighbour_indices)

        neighbour_labels = m.train_labels(neighbour_indices);
        prediction = mode(neighbour_labels);

    end

end

end

rayryeng
MichaelG

1 Answer

Standardization is the process of normalizing each feature in your training examples so that it has a mean of zero and a standard deviation of one. To do this, we first find the mean and the standard deviation of each feature. We then take each feature, subtract its corresponding mean and divide by its corresponding standard deviation.
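
Written as a formula, the procedure described above looks like this. The symbols are my own notation, not names from your code: x_ij is the value of feature j in training example i, n is the number of training examples, and mu_j and sigma_j are the mean and standard deviation of feature j. Note that MATLAB's std normalizes by n - 1 by default, which is what the second line assumes.

$$
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \mu_j\right)^2}, \qquad
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$

Each column of the standardized values z_ij then has a mean of zero and a standard deviation of one.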

That can clearly be seen by this code:

    m.mean = mean(train_examples{:,:});  %mean variable
    m.std = std(train_examples{:,:}); %standard deviation variable
    for i=1:size(train_examples,1)
        train_examples{i,:} = train_examples{i,:} - m.mean;
        train_examples{i,:} = train_examples{i,:} ./ m.std;
    end

m.mean remembers the mean of each feature while m.std remembers the standard deviation of each feature. Take note that you must remember both of these because you need them again when you perform classification at test time. That can be seen in your predict method, which takes each test example, subtracts the mean of each feature and then divides by the standard deviation of each feature, where both quantities come from the training examples:

    function predictions = predict(m, test_examples)

        predictions = categorical;

        for i=1:size(test_examples,1)

            fprintf('classifying example %i/%i\n', i, size(test_examples,1));

            this_test_example = test_examples{i,:};

            % start of standardisation process
            this_test_example = this_test_example - m.mean;
            this_test_example = this_test_example ./ m.std;
            % end of standardisation process

            this_prediction = myknn.predict_one(m, this_test_example);
            predictions(end+1) = this_prediction;

        end

    end

Take note that we're using m.mean and m.std on the test examples and these quantities come from the training examples.
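
To make that concrete, here is a minimal sketch with made-up numbers. The matrices and variable names (X_train, x_same, x_far and so on) are hypothetical and not from your code, and I use plain numeric arrays instead of tables to keep it short. It shows that a test example equal to the training mean maps to zero, while one far from the training data maps to a large value, because both are measured against the training statistics.

```matlab
% Hypothetical toy data: 4 training examples (rows), 2 features (columns).
X_train = [1 10; 2 20; 3 30; 4 40];

% Statistics come from the training examples only.
mu    = mean(X_train);   % per-feature means: [2.5 25]
sigma = std(X_train);    % per-feature standard deviations (n-1 normalization)

% A test example that equals the training mean standardizes to [0 0].
x_same = [2.5 25];
z_same = (x_same - mu) ./ sigma;

% A test example outside the training range standardizes to roughly 1.94
% in each feature, i.e. about 1.94 training standard deviations above the mean.
x_far = [5 50];
z_far = (x_far - mu) ./ sigma;
```

If the test example were standardized with its own statistics instead, its values would no longer be on the same scale as the stored training examples and the distances used by KNN would not be meaningful.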

My post on standardization should provide some more context. In addition, it achieves the same effect as the code you have provided, but in a more vectorized fashion: "How does this code for standardizing data work?"
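
As a rough idea of what a vectorized version could look like for the table-based code in the question, here is a small sketch. It is not the code from the linked post. The toy table and the names looped, vectorized and max_diff are hypothetical, and the single-line version assumes MATLAB R2016b or later, where subtracting a row vector from a matrix expands automatically.

```matlab
% Hypothetical toy table: 4 examples (rows), 2 features (columns).
train_examples = array2table([1 10; 2 20; 3 30; 4 40], ...
    'VariableNames', {'f1', 'f2'});

mu    = mean(train_examples{:,:});  % per-feature means, 1-by-2
sigma = std(train_examples{:,:});   % per-feature standard deviations, 1-by-2

% Loop version, as in the question: standardize one row at a time.
looped = train_examples;
for i = 1:size(looped, 1)
    looped{i,:} = looped{i,:} - mu;
    looped{i,:} = looped{i,:} ./ sigma;
end

% Vectorized version: mu and sigma are applied to every row at once.
vectorized = train_examples;
vectorized{:,:} = (vectorized{:,:} - mu) ./ sigma;

% Largest absolute difference between the two results (should be ~0).
max_diff = max(max(abs(looped{:,:} - vectorized{:,:})));
```

On releases older than R2016b, the same effect can be had with bsxfun, for example bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma) for a numeric matrix X.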

rayryeng