I need to perform binary search on a frozenset, but since a frozenset does not support indexing, I cannot use the bisect
library. I thought of converting the frozenset to a list to make things easy, but the conversion (list(frozenset))
does not preserve the order, so I still cannot perform binary search. What solution do you suggest?
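For illustration, here is a minimal sketch of the conversion issue. It assumes a small hypothetical frozenset named sample_words in place of the real stopwords. list() makes no ordering guarantee for a set, whereas the built-in sorted() accepts any iterable and returns a new list in ascending order:

```python
# Hypothetical stand-in for the real stopword set (an assumption for illustration).
sample_words = frozenset(["the", "a", "of", "and", "is", "to"])

# list() walks the set in its internal hash order, which is not guaranteed
# to be alphabetical and may differ between interpreter runs.
as_list = list(sample_words)

# sorted() builds a new list in ascending (here: alphabetical) order.
as_sorted = sorted(sample_words)

print(as_list)
print(as_sorted)
```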
To be clearer, let me explain exactly what I'm doing: in an NLP task, I need to remove stopwords from my text, so I imported the stopwords from scikit-learn (in my opinion it has a better collection of stopwords than NLTK):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
This gives me a frozenset in which the stopwords appear to be in alphabetical order. To remove stopwords from my text, I would like to check whether each token is in the stopwords using binary search (since the stopwords are in alphabetical order, binary search should be efficient). My attempt is as follows:
import bisect
bisect.bisect(ENGLISH_STOP_WORDS, word)
And this is where I'm stuck! I was expecting the above code to give me the insertion index in the stopwords list, so that I could compare my word with the entries just before and after that index. But I get this error:
TypeError: 'frozenset' object does not support indexing
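For illustration, here is a minimal sketch of the kind of lookup described above. It assumes the set is first copied into a sorted list. The names sample_stop_words, sorted_stop_words and is_stop_word are hypothetical, and the small set stands in for ENGLISH_STOP_WORDS so the snippet runs without scikit-learn:

```python
import bisect

# Hypothetical stand-in for ENGLISH_STOP_WORDS (an assumption for illustration).
sample_stop_words = frozenset(["a", "an", "and", "is", "of", "the", "to"])

# Sort once up front: bisect needs an indexable sequence in sorted order.
sorted_stop_words = sorted(sample_stop_words)


def is_stop_word(word, sorted_words=sorted_stop_words):
    """Return True if word occurs in sorted_words, using binary search."""
    # bisect_left gives the leftmost index at which word could be inserted
    # while keeping the list sorted; if word is present, it sits at that index.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


tokens = ["the", "cat", "is", "on", "a", "mat"]
filtered = [t for t in tokens if not is_stop_word(t)]
print(filtered)  # ['cat', 'on', 'mat']
```

Sorting costs O(n log n) once, and each lookup afterwards is O(log n). Using bisect_left rather than bisect.bisect means only the element at the returned index has to be compared, instead of the neighbours on both sides.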
FYI, I have not tried the stopword lists of other libraries (spaCy, gensim, etc.), so I don't know whether they work better in this case. But the main point here is to learn how to perform binary search on a frozenset. Thanks in advance.