Counting number of co-occurrences of words for a specified vocabulary and within a specified radius?

Question

I have a vocabulary V = ["anarchism", "originated", "term", "abuse"] and a list of words test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism'].

I'd like to count, for each pair of words in the vocabulary V, the number of co-occurrences within a radius R in the list of words test. If R=5, say, then for each occurrence of a vocabulary word in test we look 5 words to its left and 5 words to its right, and count how many times each word in V occurs within that window.

For example, take the first word in V, "anarchism". It occurs twice in test: as the first word and as the last word. For the first occurrence, we look 5 words to the left (nothing) and 5 words to the right ('originated', 'as', 'a', 'term', 'of'). Are any of these "anarchism"? No. For the last occurrence, we look 5 words to the left ('diggers', 'abuse', 'the', 'english', 'term') and 5 words to the right (again, nothing). Hence "anarchism" never occurs within a radius of 5 words of itself, so the (0, 0) entry of the output matrix, corresponding to ("anarchism", "anarchism"), is 0.

However, "originated" occurs once within 5 words of "anarchism", so the (0, 1) entry, i.e. the ("anarchism", "originated") cell, is 1. Similarly, "term" occurs once within radius 5 of the first occurrence of "anarchism" and once within radius 5 of the last occurrence, so the (0, 2) entry is 2. We continue in this way for each word in the vocabulary V.
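To make the two windows in this example concrete, here is a small sketch. The helper name `neighbours` is made up for illustration and is not part of any library; it just slices the list around a given position.

```python
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse',
        'first', 'used', 'against', 'early', 'working', 'class',
        'radicals', 'including', 'the', 'diggers', 'abuse', 'the',
        'english', 'term', 'anarchism']

def neighbours(words, i, radius):
    """Return (left, right) context of position i, excluding words[i] itself.

    Hypothetical helper for illustration only.
    """
    left = words[max(0, i - radius):i]       # clamp at the start of the list
    right = words[i + 1:i + 1 + radius]      # slicing clamps at the end
    return left, right

R = 5
for i, w in enumerate(test):
    if w == "anarchism":
        print(i, neighbours(test, i, R))
# 0 ([], ['originated', 'as', 'a', 'term', 'of'])
# 21 (['diggers', 'abuse', 'the', 'english', 'term'], [])
```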

The resulting output is therefore a 4x4 matrix (since there are 4 words in V), and it is symmetric, since for example the counts of co-occurrences for ("anarchism", "originated") are the same as ("originated", "anarchism").

For this example, the output (e.g. numpy array) looks like:


0	1	2	1
1	0	1	1
2	1	0	2
1	1	2	0

Each row and column corresponds to the respective entries of V. How can I implement this in Python?

Hi! Sorry your question was closed. It can be reopened if you edit it to make it easier to understand. Right now it's actually really hard to understand what problem you are trying to solve. Several things are not obvious in your example: why do we look 5 words to the left but 3 to the right? Why is the output a 4x4 symmetric matrix? Also, what is your actual question? "Any help" is not really a question. — Stef, May 10 '22 at 07:55
In particular can you explain what you would do for a specific word. Take word `"term"` in the vocabulary. What do you do with that word, what occurrences do you count, and where does this show in the final 4x4 matrix? — Stef, May 10 '22 at 07:57
@Stef I have clarified and fixed the question--sorry for the confusion. — Jake, May 10 '22 at 18:13
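
One possible approach, as a minimal sketch rather than a definitive implementation: walk over `test`, and for every position holding a vocabulary word, scan the positions within `radius` on either side and increment the matching cell. The function name `cooccurrence_matrix` is hypothetical. The sketch assumes the words in V are unique, and it assumes that two occurrences of the same word within the radius should each count the other (so such a pair would add 2 to the diagonal); the example in the question has no such pair, so it does not settle that point.

```python
def cooccurrence_matrix(words, vocab, radius):
    """Count, for each ordered pair of vocab words (a, b), how often b
    appears within `radius` positions of an occurrence of a.

    Returns a list of lists; rows and columns follow the order of `vocab`.
    Assumes `vocab` contains no duplicates.
    """
    index = {w: k for k, w in enumerate(vocab)}      # word -> row/column
    n = len(vocab)
    counts = [[0] * n for _ in range(n)]

    for i, w in enumerate(words):
        if w not in index:
            continue                                 # not a vocabulary word
        lo = max(0, i - radius)
        hi = min(len(words), i + radius + 1)
        for j in range(lo, hi):
            # skip the centre position itself; ignore non-vocabulary words
            if j != i and words[j] in index:
                counts[index[w]][index[words[j]]] += 1
    return counts


V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse',
        'first', 'used', 'against', 'early', 'working', 'class',
        'radicals', 'including', 'the', 'diggers', 'abuse', 'the',
        'english', 'term', 'anarchism']

for row in cooccurrence_matrix(test, V, 5):
    print(row)
# [0, 1, 2, 1]
# [1, 0, 1, 1]
# [2, 1, 0, 2]
# [1, 1, 2, 0]
```

This reproduces the matrix given in the question. If a numpy array is wanted, passing the result to `numpy.array` should be enough. The running time is roughly proportional to `len(words) * radius`, which is likely fine for moderate inputs; for very large corpora a sparse or vectorised approach may be worth considering.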
