An alternative for binary search on a frozenset in Python

Question

I need to perform binary search on a frozenset, but because a frozenset doesn't support indexing, I cannot use the bisect module. I thought of converting the frozenset to a list to make things easy, but the conversion (list(frozenset)) scrambles the order, so I cannot perform binary search on the result. What solution do you suggest?
Just to be more clear, let me explain what exactly I'm doing: In an NLP task, I need to remove stopwords from my text, so I have imported the stopwords from scikit-learn (it has a better collection of stopwords than NLTK in my opinion):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
And it returns a frozenset in which the stopwords are in alphabetical order. And now that I want to remove stopwords from my text, it's better to check if a token is in the stopwords using binary search (obviously because I have stopwords in alphabetical order and it's efficient to perform binary search). So it is as follows:

import bisect

bisect.bisect(ENGLISH_STOP_WORDS, word)

And this is where I'm stuck! I was expecting to find the desired index in stopwords list with the above code, and then compare my word with the one before and after it in the list. But I get this error: TypeError: 'frozenset' object does not support indexing.

FYI, I have not tried the stopword lists from other libraries (spaCy, gensim, etc.), so I don't know if they work better in this case. But the main point here is to learn how to handle binary search on a frozenset. Thanks in advance.

`it returns a frozenset in which the stopwords are in alphabetical order` is a surprising sentence. Sets and frozensets are *unordered* collections — Sylvaus, May 27 '20 at 13:52
You don't NEED to do a binary search on a set. Sets directly support efficient membership testing via the `in` operator, that's the whole point of them! — jasonharper, May 27 '20 at 13:53
@jasonharper I didn't know this fact. Thank you for the point. — Arash Ashrafzadeh, May 27 '20 at 14:27
For those interested, I found [this video](https://www.youtube.com/watch?v=C4Kc8xzcA68) sent to me by my friend @amirhossein really helpful. — Arash Ashrafzadeh, May 28 '20 at 06:16

score 3 · Accepted Answer · answered May 27 '20 at 13:54

If you want to know if the word is a stop word, simply do:

if word in ENGLISH_STOP_WORDS:
    pass
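
For the stopword-removal task described in the question, the same membership test can be applied to every token. The following is only a minimal sketch: `STOP_WORDS` is a small hypothetical stand-in for `ENGLISH_STOP_WORDS` (so it runs without scikit-learn), and `remove_stop_words` is an illustrative helper name, not part of any library.

```python
# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS, so the sketch
# runs without scikit-learn installed.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    # Each `in` check is a hash lookup (O(1) on average), so no sorting
    # or binary search is needed.
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The cat is in the garden".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing each token before the lookup is an assumption about how the text is tokenized; drop it if your tokens are already normalized.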

Sylvaus

Thanks @Sylvaus, but can we be sure that `in` performs a binary search so it will be efficient? – Arash Ashrafzadeh May 27 '20 at 14:05
@ArashAshrafzadeh `in` with a set is O(1) on average, even faster than binary search, which is O(log n). But you also want the index, right? – RoadRunner May 27 '20 at 14:05
@RoadRunner-MSFT Oh, I didn't know the point you mentioned. No, I don't need the index; membership checking is enough. Thank you so much! – Arash Ashrafzadeh May 27 '20 at 14:09
You can use this [link](https://wiki.python.org/moin/TimeComplexity) when you want to know the time complexity of most operations on standard containers – Sylvaus May 27 '20 at 14:12
@Sylvaus I was not aware of these. Thanks for the link. – Arash Ashrafzadeh May 27 '20 at 14:25
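
If an index into an alphabetical list really is needed (the case raised in the comments), one possible approach is to sort the frozenset into a list once and then use `bisect` on that list. This is a sketch under assumptions: `STOP_WORDS` is a small hypothetical stand-in for `ENGLISH_STOP_WORDS`, and `find_index` is an illustrative helper, not a library function.

```python
import bisect

# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})

# A frozenset has no order, so sort it explicitly (once) before bisecting.
SORTED_STOP_WORDS = sorted(STOP_WORDS)


def find_index(sorted_words, word):
    """Return the index of word in sorted_words, or -1 if it is absent."""
    i = bisect.bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return i
    return -1


print(find_index(SORTED_STOP_WORDS, "is"))
print(find_index(SORTED_STOP_WORDS, "cat"))
```

For plain membership checks, `word in ENGLISH_STOP_WORDS` remains the simpler and faster choice; the sorted list only helps when the position itself matters.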
