
The problem goes like this.

I need to fetch * from the table User where the column phoneNumber is in phoneNumberArray. There are around 300 to 1,000 phone numbers. The table User has an index on phoneNumber and could contain millions of rows.

Would it be better to create a temporary table UserphoneNumbers, insert all the phone numbers and do a JOIN, or to use a simple WHERE IN and pass an array of 1,000 phone numbers?
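For concreteness, here is a minimal sketch of the WHERE IN variant, assuming the table and column names from the question (User, phoneNumber); the phone number literals are made up, and the application would generate the full list:

```sql
-- Assumed schema from the question: User has a column phoneNumber with an index on it.
-- The application builds the literal list (around 300 to 1,000 values).
SELECT *
FROM User
WHERE phoneNumber IN ('+15550100', '+15550101', '+15550102' /* ..., up to ~1,000 values */);
```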

Nicolas Manzini
  • Try them both and look at the execution plans and statistics; I would guess that using a temp table would be more expensive, as it's building and populating the table – dbajtr Oct 31 '17 at 09:15
  • Can you please show us a query, some data, or both? – Tim Biegeleisen Oct 31 '17 at 09:17
  • Creating a temporary table for the phone numbers would be best if your query frequency is low and the phone number list is static. But if the phone number list is dynamic, it's better to use WHERE IN to avoid the repeated table creation and dropping – User123456 Oct 31 '17 at 09:20
  • I would basically create the table to query it only once then drop it. I would do this only when a new account is created. I'll try both and test the results – Nicolas Manzini Oct 31 '17 at 09:23
  • `WHERE IN ( ... )` implies that you need to look up a 'user' by any of lots of numbers. This seems like an unlikely use case. – Rick James Nov 05 '17 at 01:25
  • @RickJames Think of it more like mapping an array of phone numbers to an array of users with their data. – Nicolas Manzini Nov 06 '17 at 11:04

2 Answers


This is an interesting question. There are a lot of arguments for the second method -- passing the values in the IN list.

Using a temporary table is not an unreasonable option, but it adds significant complexity to the application. It adds reading and writing overhead to the database. The application has to create the table and load it (the database will take care of removing it).

If you do think that you need this option, be sure that the phone number is the primary key of the temporary table. That is optimal for performance.
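A minimal sketch of that variant, reusing the question's names (User, phoneNumber, UserphoneNumbers); the column type and sample values are assumptions:

```sql
-- Hypothetical temporary table; phoneNumber as the PRIMARY KEY, as recommended above.
CREATE TEMPORARY TABLE UserphoneNumbers (
    phoneNumber VARCHAR(20) PRIMARY KEY
);

-- The application loads the 300 to 1,000 numbers, ideally in one multi-row INSERT.
INSERT INTO UserphoneNumbers (phoneNumber) VALUES ('+15550100'), ('+15550101') /* , ... */;

SELECT u.*
FROM User AS u
JOIN UserphoneNumbers AS p ON p.phoneNumber = u.phoneNumber;
```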

Using IN -- especially in MySQL -- is a reasonable option. MySQL sorts constant IN lists and uses a binary search for matching values. That optimization is most beneficial during a full-table scan, and with a list this large a full table scan is likely to be the most efficient method anyway.

Personally, I would start with the IN list. Generating a query with an IN list with 1,000 elements is the same code complexity as a list with 3 elements. If that is not meeting your needs, then try out the performance using a temporary table. If it works much better, then go for that approach.
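If you do want to measure it yourself (as the comments also suggest), comparing the plans of the two variants is straightforward; a sketch, reusing the hypothetical names above:

```sql
-- Look at the access type and chosen key for each variant.
EXPLAIN SELECT * FROM User WHERE phoneNumber IN ('+15550100', '+15550101' /* , ... */);

EXPLAIN SELECT u.*
FROM User AS u
JOIN UserphoneNumbers AS p ON p.phoneNumber = u.phoneNumber;
```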

Gordon Linoff

I think the first method (the temporary table with a JOIN) is more readable and easier to maintain.

If one day you want to switch from a temp table to a persistent table, it will be easier.

Regarding performance, there is already a topic about it:

I suspect that if you do it in a WHERE clause, the planner is choosing a route that is more efficient (i.e. either index based, or a pre-filtered dataset).

You could probably make the join work as fast (if not faster) by adding an index on the two columns (not sure if included columns and multi-column indexes are supported on Postgres yet).
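The quoted advice is about another question's schema, but the idea carries over: make sure the join/filter column is indexed. A hedged sketch against the question's table (the index name is made up; per the question, User.phoneNumber may already have such an index, in which case nothing needs to be added):

```sql
-- Hypothetical: an ordinary secondary index on the filter/join column,
-- so the JOIN (or the IN lookup) can be resolved via index lookups
-- instead of a full scan of User.
CREATE INDEX idx_user_phonenumber ON User (phoneNumber);
```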

Antoine Stas
  • Thanks for your answer. There are several posts about the topic, but you always end up debating whether the engine will do the optimisation or not. And as many answers are years old, it's difficult to know what the engine does in 2017. – Nicolas Manzini Oct 31 '17 at 10:31
  • 1
    You right but I work with very large volume of data (billions) and the difference between both methods in term of performance is not significant. So, I think the right method is the one you prefer in term of legibility. – Antoine Stas Oct 31 '17 at 10:41