I have a large dataset of news articles, 48,000 to be precise. I have generated n-grams for each article, with n = 3.
My n-grams look like this:
[[(tikro, enters, into), (enter, into, research), (into, research, and),...]]
Now I need to build a binary matrix of shingles versus articles, with a 1 wherever a shingle occurs in an article:
            article1  article2  article3
shingle1    1         0         0
shingle2    1         0         1
shingle3    0         1         0
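
To make the target concrete, here is a minimal sketch of the matrix I am after. It assumes each article is a plain string that can be tokenized by whitespace; the names `shingles` and `binary_matrix` are hypothetical helpers made up for illustration, not anything from my actual code.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) in one article string."""
    tokens = text.split()  # assumption: whitespace tokenization is good enough
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def binary_matrix(articles, n=3):
    """Build rows = shingles, columns = articles, entries 0 or 1."""
    per_article = [shingles(text, n) for text in articles]
    # Sorted union of all shingles gives a stable row order.
    vocab = sorted(set().union(*per_article))
    matrix = [[1 if sh in s else 0 for s in per_article] for sh in vocab]
    return vocab, matrix


# Toy input, invented for this example only.
vocab, matrix = binary_matrix(["a b c d", "b c d e", "x y z"])
for sh, row in zip(vocab, matrix):
    print(sh, row)
```

A dense list of lists like this would probably be far too large for 48,000 articles, so in practice a sparse representation (for example from `scipy.sparse`) seems more realistic; the sketch only shows the shape of the result.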
At first I kept all the shingles in a single list. After that, I tried the following to check whether it works:
    for art in article:
        for sh in ngrams:
            if sh in art:
                print('found')
Since one of them is a set and the other is a string, this does not work. Any suggestions on how to make it work, or is there a better approach?
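
One alternative to the nested loop that I have been considering is an inverted index, so that both sides of the comparison are tuples of words rather than a set and a string. This is only a sketch under the assumption that articles are whitespace-separated strings; `trigram_set` and `inverted_index` are hypothetical names.

```python
from collections import defaultdict


def trigram_set(text, n=3):
    """Convert an article string into a set of n-gram tuples."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def inverted_index(articles, n=3):
    """Map each shingle to the set of article indices that contain it."""
    index = defaultdict(set)
    for col, text in enumerate(articles):
        for sh in trigram_set(text, n):
            index[sh].add(col)
    return index


# Toy articles, invented for this example only.
articles = ["tikro enters into research", "enters into research and"]
index = inverted_index(articles)
print(index[("enters", "into", "research")])
```

Each entry in the index lists the columns that would hold a 1 in that shingle's row, so the binary matrix never has to be scanned article by article.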
Thank you.