
I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of the script I use to run the similarity analysis. In the end I get an array (a matrix) I can plot, etc.

I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.

Secondly, I need a list of these pairs based on file names.

So the output for the example below would look like:

1

and

["doc1", "doc4"]

I will really appreciate your help, as I feel a bit lost and don't know which direction to go in.

This is an example of my script to get the matrix:

doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]

# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        _file = open(filename, 'r')
#        text = _file.read()
#        documents.append(text)
#    return documents
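
Since the goal is to report the pairs by file name, one option is to have read_text return the file names alongside the texts. A minimal sketch (a hypothetical variant, not the exact script; adjust the path handling as needed):

import glob, os

def read_text(path):
    docs_names, documents = [], []
    for filename in glob.iglob(os.path.join(path, '*.txt')):
        # 'with' closes each file automatically
        with open(filename, 'r') as _file:
            documents.append(_file.read())
        # file name without folder or extension, e.g. 'doc1'
        docs_names.append(os.path.splitext(os.path.basename(filename))[0])
    return docs_names, documents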

import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
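
For a quick sanity check, LemNormalize lowercases the text, strips punctuation, tokenizes, and lemmatizes each token. On a short string it produces something like this (the exact tokens depend on your NLTK data, and the default noun POS leaves verbs such as "running" untouched):

>>> LemNormalize("The cats are running.")
['the', 'cat', 'are', 'running']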

from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()

from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
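
(Side note: scikit-learn can also compute this matrix directly; a minimal equivalent using the tfidf_matrix above:)

from sklearn.metrics.pairwise import cosine_similarity

# Same result as (tfidf_matrix * tfidf_matrix.T).toarray(),
# since the tf-idf rows are already L2-normalized
cos_similarity_matrix = cosine_similarity(tfidf_matrix)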

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)

Out:

array([[ 1.        ,  0.1459739 ,  0.03613371,  0.76357693],
       [ 0.1459739 ,  1.        ,  0.11459266,  0.19117117],
       [ 0.03613371,  0.11459266,  1.        ,  0.04732164],
       [ 0.76357693,  0.19117117,  0.04732164,  1.        ]])
aviss
  • Look! In the previous snippet, I can see that you are creating functions left and right. We only create functions when we reuse the same piece of code over and over, to save time and effort. When a function is just one line, or will be used only once, it's better not to create it; that makes the code cleaner, more readable and more understandable. – Anwarvic Dec 19 '17 at 00:20
  • You are absolutely right! I just took this snippet from an online tutorial but I will tidy it up as you suggested. – aviss Dec 19 '17 at 11:32

1 Answer


As I understood your question, you want a function that takes the output numpy array and a certain value (threshold) and returns two things:

  • how many pairs have a similarity greater than or equal to the given threshold
  • the names of the documents in those pairs.

So, here I've made the following function which takes three arguments:

  • the output numpy array from the cos_similarity() function.
  • the list of document names.
  • a certain number (threshold).

And here it is:

def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        # Only scan the columns to the right of the diagonal (upper triangle):
        # this skips self-similarity (always 1) and mirrored duplicate pairs.
        lst = [row + 1 + idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append((docs_names[row], docs_names[item]))

    return len(output_tuples), output_tuples

Let's see it in action:

>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1.        ,  0.1459739 ,  0.03613371,  0.76357693],
       [ 0.1459739 ,  1.        ,  0.11459266,  0.19117117],
       [ 0.03613371,  0.11459266,  1.        ,  0.04732164],
       [ 0.76357693,  0.19117117,  0.04732164,  1.        ]])
>>> threshold = 0.5   
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])

Let's see how this function works:

  • First, I iterate over every row of the numpy array.
  • Second, I iterate over every item in the row whose index is bigger than the row's index, so we are iterating over the upper triangle of the matrix. That's because each pair of documents is mentioned twice in the whole array; we can see that the two values arr[0][1] and arr[1][0] are the same. You should also notice that the diagonal items aren't included, because we know for sure that they are 1, as every document is identical to itself :).
  • Finally, we get the items whose values are greater than or equal to the given threshold and return their indices. These indices are used later to get the document names. (A vectorized alternative is sketched below.)
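
If you prefer a vectorized version, numpy can extract the same upper-triangle pairs without the explicit loops. A sketch assuming the same arr and docs_names as above (get_docs_np is a hypothetical name):

import numpy as np

def get_docs_np(arr, docs_names, threshold):
    # Indices of the upper triangle, excluding the diagonal (k=1),
    # so self-similarity and mirrored duplicates are skipped
    rows, cols = np.triu_indices(len(arr), k=1)
    mask = arr[rows, cols] >= threshold
    pairs = [(docs_names[r], docs_names[c])
             for r, c in zip(rows[mask], cols[mask])]
    return len(pairs), pairs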
Anwarvic
  • Exactly what I was looking for. Thank you so much. I'm reading the function you created and not sure I understand how you remove diagonal pairs (num == 1). Could you explain, please? – aviss Dec 19 '17 at 11:38
  • Brilliant! Thank you very much. – aviss Dec 19 '17 at 17:10