Co-occurrence matrix in Python

Question

I have a data set with two columns of text. I need to concatenate these two columns, find the top 2k words by their idf_ values, and then use those words to build a co-occurrence matrix. The code below raises an IndexError. Can anyone help me get a working co-occurrence matrix?

singular value decomposition: SVD

def get_words_in_window(sent, w, window = 5):
    context_words = []
    for index, word in enumerate(sentence.split()):
        if word  == w:
            if index < window:
                lower_index = 0
                upper_index = window+index
            elif len(sentence.split()) - index <= window:
                lower_index = index - window
                upper_index = len(sentence.split())-1
            else:
                lower_index = index - window
                upper_index = index + window
            for i in range(lower_index, upper_index+1):
                if i != index:
                    context_words.append(sentence.split()[i])
    return context_words

from tqdm import tqdm
for sentence in tqdm(essays_titles['essay_title']):
    for w in sentence.split():
        if w in top_2k_words:
            context_words = get_words_in_window(sentence, w)
            for w2 in context_words:
                if w2 in top_2k_words:
                    cooc_matrix[top_2k_words.index(w)][top_2k_words.index(w2)]+=1

The error:

IndexError: list index out of range

We're not a code on request service. Please show effort yourself before asking "please code for me". Tell us what you tried to get our attention and dig into the issue you propose. — ZF007, Aug 26 '19 at 22:34
@ZF007 Okay, the above code is what I have tried, and it is giving me an index error. — nikhil, Aug 27 '19 at 00:57
cooc_matrix = np.zeros(shape =(len(top_2k_words), len(top_2k_words))) — nikhil, Aug 27 '19 at 01:27
Providing the code helps a lot; however, "sentence" is not defined and your code does not follow the PEP 8 standard (import statements first, then classes/defs, followed by the working code or an if __name__ == "__main__" block). — ZF007, Aug 27 '19 at 06:03

score 0 · Answer 1 · answered Aug 27 '19 at 06:10

The function receives the text through its parameter sent, but its body kept reading the outer loop variable sentence instead of that parameter. The version below uses sent inside the function; see my inline comments for where it is repaired.

from tqdm import tqdm

def get_words_in_window(sent, w, window = 5):              # sentence -> sent
    context_words = []
    for index, word in enumerate(sent.split()):            # sent = sentence
        if word  == w:
            if index < window:
                lower_index = 0
                upper_index = window+index
            elif len(sent.split()) - index <= window:      # sent = sentence
                lower_index = index - window
                upper_index = len(sent.split())-1          # sent = sentence
            else:
                lower_index = index - window
                upper_index = index + window
            for i in range(lower_index, upper_index+1):
                if i != index:
                    context_words.append(sent.split()[i])  # sent = sentence
    return context_words


for sentence in tqdm(essays_titles['essay_title']):
    for w in sentence.split():
        if w in top_2k_words:
            context_words = get_words_in_window(sentence, w)    # here "sentence" = linked to "sent" correctly.
            for w2 in context_words:
                if w2 in top_2k_words:
                    cooc_matrix[top_2k_words.index(w)][top_2k_words.index(w2)]+=1
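
A separate possible cause of the IndexError, assuming some titles are shorter than the window: in the question's function the first branch sets upper_index = window + index without checking the sentence length, so a three-word sentence with window = 5 would try to read positions 3, 4 and 5. The sketch below clamps both ends with max/min instead of the three-way branch. The names build_cooc_matrix and word_to_idx are hypothetical, a plain list of lists stands in for the np.zeros matrix so the example has no dependencies, and iterating over set(sent.split()) counts each distinct word once per sentence, which is an assumption about the intended counting rather than what the original loop does.

```python
def get_words_in_window(sent, w, window=5):
    """Return the words within `window` positions of each occurrence of `w`."""
    words = sent.split()  # split once instead of on every access
    context_words = []
    for index, word in enumerate(words):
        if word == w:
            # Clamp both ends so the indices never leave the sentence.
            lower_index = max(0, index - window)
            upper_index = min(len(words) - 1, index + window)
            for i in range(lower_index, upper_index + 1):
                if i != index:
                    context_words.append(words[i])
    return context_words


def build_cooc_matrix(sentences, vocab, window=5):
    """Count co-occurrences; `vocab` plays the role of top_2k_words."""
    # Hypothetical helper: a dict lookup replaces repeated list.index() calls.
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for sent in sentences:
        # Assumption: visit each distinct word once per sentence, because
        # get_words_in_window already handles every occurrence of `w`.
        for w in set(sent.split()):
            if w not in word_to_idx:
                continue
            for w2 in get_words_in_window(sent, w, window):
                if w2 in word_to_idx:
                    matrix[word_to_idx[w]][word_to_idx[w2]] += 1
    return matrix
```

With these definitions, get_words_in_window("a b c", "a") returns the two neighbouring words instead of raising, and build_cooc_matrix(essays_titles['essay_title'], top_2k_words) would be the equivalent of the loop above.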

Actually, I have used sentence everywhere other than in the def. I think some problem occurred while pasting it here. — nikhil, Aug 27 '19 at 09:25
