-1

I have a single table in my database that contains 40 million user data entries. My goal is to determine whether a user is present or not, based on their unique ID.

However, if I were to provide an incorrect ID, the database would search through all 40 million entries before returning an empty result set.

To optimize this process, I am considering not checking the database at all if the user ID is not present in the database. This would help to reduce the number of unnecessary queries made to the database.

Would it be possible to use a bloom filter in this case?

Andreas Violaris
  • 2,465
  • 5
  • 13
  • 26
Alex Sparrow
  • 197
  • 2
  • 7
  • why don't you just use indexes? – fshabashev Mar 13 '23 at 10:38
  • @fshabashev: I agree with your point. Can I use it without an indexing column? I want to search name or email – Alex Sparrow Mar 14 '23 at 06:46
  • It doesn't have to search all 40 million entries. If using binary search, the correct entry can be determined to exist or not exist with ~26 comparisons. Why do you think that the database would have to search through all entries? – knittl Mar 16 '23 at 15:58

2 Answers2

1

Traditional approach is to build an index on data and do look ups there. That will give you sub linear lookup time (in big O notation terms).

Even with sub linear time, you still may get some meaningful gains with a bloom filter. If there are many lookup for non existing records, then a bloom filter will be a cheap approach to terminate those lookups early.

Overall, having an index and/or a bloom filter adds complexity to your system; this is where I would explore how the system performs around given requirements.

AndrewR
  • 1,252
  • 8
  • 7
0

If a key is added to a Bloom filter, the filter will not falsely report that it is absent in later queries. However, if a key is never added to a Bloom filter, the filter may falsely report its presence. In other words, you won't get false negatives, but you may get false positives. This probability of a false positive depends on the size of the filter. This doesn't seem like a huge problem; there is only a small chance of doing a search unnecessarily, and in the end, you get the correct result either way.

However, you say that your user IDs are stored in a "table" in a "database." If it's an indexed field in a relational database, you should be able to test for its presence in O(log(n)) time; for 40 million entries, this is on the order of 25 operations, and I would expect this time to be dwarfed by overhead like network traffic. And, with this approach, the index search can easily result in additional information, like the user record, while a Bloom filter check can't associate values with a key.

So, my recommendation would be to use something that supports a traditional index, and avoid the novelty of a Bloom filter for this. I would look for alternatives only when supported by profiling metrics that show that lookup misses are a bottleneck.

erickson
  • 265,237
  • 58
  • 395
  • 493