I've been seeing a lot of examples of computing Euclidean distance for KNN, but none for sentiment classification.

For example, I have the sentence "a very close game".

How do I compute the Euclidean distance between it and the sentence "A great game"?

Davide Patti
xx4xx4
  • It's unclear what you mean by 'Euclidean distance' for sentences. To get any sort of distance, you need to fix some encoding. For example, you could use vectors of word counts, their binary version, or TF-IDF vectors. – Jakub Bartczuk Sep 05 '17 at 19:30
  • Suppose that you have training data like [this](https://i.stack.imgur.com/PrqAF.png) and you have to classify the sentence "A very close game" using KNN ... something like that – xx4xx4 Sep 05 '17 at 19:38
  • This data consists of sentence strings. There are many ways to vectorize them, as I mentioned earlier. – Jakub Bartczuk Sep 05 '17 at 19:49
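
As a rough illustration of the encoding suggested in the comments above, here is a minimal sketch that turns the two sentences from the question into word-count vectors over a shared vocabulary and then takes the ordinary Euclidean distance between them. The helper names `vectorize` and `euclidean_distance` and the `binary` flag are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter


def vectorize(sentence, vocabulary, binary=False):
    """Map a sentence to a vector of word counts over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    if binary:
        # Binary version: 1 if the word occurs at all, else 0.
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


s1 = "a very close game"
s2 = "A great game"

# The vocabulary is every distinct (lowercased) word in either sentence.
vocabulary = sorted(set(s1.lower().split()) | set(s2.lower().split()))
v1 = vectorize(s1, vocabulary)
v2 = vectorize(s2, vocabulary)

print(vocabulary)                  # ['a', 'close', 'game', 'great', 'very']
print(v1)                          # [1, 1, 1, 0, 1]
print(v2)                          # [1, 0, 1, 1, 0]
print(euclidean_distance(v1, v2))  # sqrt(3), about 1.732
```

The two sentences differ in three vocabulary positions ("close", "great", "very"), so the distance is the square root of 3. For KNN, the same distance would be computed between the unlabeled sentence and every training sentence, with the vocabulary built from the whole training set.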

1 Answer

Think of a sentence as a point in a multi-dimensional space. Only after you have defined a coordinate system can you calculate a Euclidean distance. For instance, you can introduce:

  1. O1 - The sentence length in characters (Length)
  2. O2 - The number of words (WordsCount)
  3. O3 - The alphabetical center (I just thought of it). It can be calculated as the arithmetic mean of the alphabetical centers of the words in the sentence.

    CharsIndex = Sum(Char.indexInWord) / CharsCountInWord;
    CharsCode = Sum(Char.charCode) / CharsCount;
    AlphWordCoordinate = [CharsIndex, CharsCode];
    WordsIndex = Sum(Words.CharsIndex) / WordsCount;
    WordsCode = Sum(Words.CharsCode) / WordsCount;
    AlphaSentenceCoordinate = (WordsIndex^2 + WordsCode^2 + WordIndexInSentence^2)^(1/2);

So the Euclidean distance can now be calculated as follows:

EuclideanSentenceDistance = (WordsCount^2 + Length^2 + AlphaSentenceCoordinate^2)^(1/2)

Now every sentence can be transformed into a point in three-dimensional space, such as P[Length, Words, AlphaCoordinate]. Having a distance, you can compare and classify sentences.

It is not an ideal approach, I guess, but I wanted to show you the idea.

import math

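def calc_word_alpha_center(word):
    """Return (mean character position, mean character code) for one word."""
    chars_index = 0
    chars_codes = 0
    for index, char in enumerate(word):
        chars_index += index
        chars_codes += ord(char)
    chars_count = len(word)
    # Floor division keeps the integer arithmetic that Python 2 applied to `/`.
    index = chars_index // chars_count
    code = chars_codes // chars_count
    return (index, code)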


def calc_alpha_distance(words):
    """Combine the per-word centers into one 'alphabetical' coordinate."""
    word_chars_index = 0
    word_code = 0
    word_index = 0
    for index, word in enumerate(words):
        point = calc_word_alpha_center(word)
        word_chars_index += point[0]
        word_code += point[1]
        word_index += index
    # Floor division keeps the integer arithmetic that Python 2 applied to `/`.
    chars_index = word_chars_index // len(words)
    code = word_code // len(words)
    index = word_index // len(words)
    return math.sqrt(math.pow(chars_index, 2) + math.pow(code, 2) + math.pow(index, 2))

def calc_sentence_euclidean_distance(sentence):
    length = len(sentence)

    # split() without arguments skips repeated spaces, so no empty "words" appear
    words = sentence.split()
    words_count = len(words)

    alpha_distance = calc_alpha_distance(words)

    return math.sqrt(math.pow(length, 2) + math.pow(words_count, 2) + math.pow(alpha_distance, 2))


sentence1 = "a great game"
sentence2 = "A great game"

distance1 = calc_sentence_euclidean_distance(sentence1)
distance2 = calc_sentence_euclidean_distance(sentence2)

print(sentence1)
print(str(distance1))

print(sentence2)
print(str(distance2))

Console output (from Python 2, where `/` on integers is integer division):

a great game
101.764433866
A great game
91.8477000256
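
Each number above is the distance of one sentence's point from the origin. To compare two sentences directly, one option is to take the Euclidean distance between their points P[Length, Words, AlphaCoordinate]. The following is only a sketch under that assumption: `sentence_point` and `distance` are hypothetical helpers written for this illustration, and they use Python 3 true division, so the coordinates differ slightly from the integer-division output shown above.

```python
import math


def word_center(word):
    """Mean character position and mean character code of one word."""
    n = len(word)
    return (sum(range(n)) / n, sum(ord(c) for c in word) / n)


def sentence_point(sentence):
    """Hypothetical helper: the point P[Length, Words, AlphaCoordinate]."""
    words = sentence.split()
    centers = [word_center(w) for w in words]
    mean_index = sum(c[0] for c in centers) / len(words)
    mean_code = sum(c[1] for c in centers) / len(words)
    mean_position = sum(range(len(words))) / len(words)
    alpha = math.sqrt(mean_index ** 2 + mean_code ** 2 + mean_position ** 2)
    return (len(sentence), len(words), alpha)


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


p1 = sentence_point("a great game")
p2 = sentence_point("A great game")
p3 = sentence_point("a very close game")

print(distance(p1, p2))  # differs only through the character codes
print(distance(p1, p3))  # also differs in length and word count
```

Because character codes dominate the third coordinate, changing a single letter's case moves the point noticeably, which is the sensitivity mentioned in the comments.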
slesh
  • I'm confused... can you try to show a computation using the example I have? For example, like this link: https://stackoverflow.com/questions/17053459/how-to-transform-a-text-to-vector – xx4xx4 Sep 05 '17 at 17:58
  • I've added a code sample. You can play with it and try to improve the quality of the function, because for now, as you can see, the function is quite sensitive to minor changes such as letter case. – slesh Sep 05 '17 at 18:52
  • I've read the code, but I think it's different from what I'm trying to do... Suppose that: Training Sentence: "A Great Game", Unlabeled Sentence: "A Very Close Game". I want to calculate the Euclidean distance between the two sentences. From what I've read, I'm supposed to convert each sentence into a binary vector, just like in the link in my previous comment... – xx4xx4 Sep 05 '17 at 18:57
  • You can try to apply the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), it is very close to what you need – slesh Sep 05 '17 at 20:09
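
For reference, the Levenshtein distance suggested in the last comment can be sketched as below. Note that it is an edit distance rather than a Euclidean one: it counts the insertions, deletions, and substitutions needed to turn one sequence into another. The function name `levenshtein` and the choice to also apply it to lists of words are assumptions made for this example.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (each edit costs 1)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Character level, on a classic textbook pair.
print(levenshtein("kitten", "sitting"))  # 3

# Word level, on the sentences from the question.
s1 = "a great game".lower().split()
s2 = "A very close game".lower().split()
print(levenshtein(s1, s2))  # 2
```

At the word level, "great" is substituted by "very" and "close" is inserted, giving a distance of 2.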