
We are facing the following problem and are trying to come up with the best possible solution.

We are using SQL Server 2008. We have a table that has more than 600 million records and about 25 columns. One of the columns is an ID and is indexed. We need to get a subset of records from this table. There are mainly two cases:

a) the subset contains anywhere between 1,000 and 200 million rows; the IDs of the rows to get are stored in an indexed table in SQL Server;

b) the subset contains fewer than 1 million rows; the IDs of the rows to get are stored in the application server's memory (.NET web app).

Our challenge is to get this subset as fast as possible. We do need this to happen in a few seconds.

Our solutions so far:

a) we do a join between the two tables. That works, but it is not fast enough. The query looks something like SELECT * FROM Big_Table JOIN IndexTable ON Big_Table.ID = IndexTable.ID.

b) we do not really have a solution. We tried a WHERE IN query, but that simply takes forever once the subset approaches a million rows (SELECT * FROM Big_Table WHERE ID IN (ID1, ID2, ID3...)). A solution could be to store the IDs in a temporary table, but then this falls back to case a.
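
For case b, one thing we are considering is passing the IDs from the .NET app in a single round trip as a table-valued parameter (SQL Server 2008 supports these) and joining against it server-side, instead of building a giant IN list. A rough, untested sketch of the T-SQL side (the type and procedure names are just placeholders):

CREATE TYPE dbo.IdList AS TABLE (ID int NOT NULL PRIMARY KEY);
GO

CREATE PROCEDURE dbo.GetBigTableSubset
    @Ids dbo.IdList READONLY
AS
BEGIN
    SET NOCOUNT ON;
    -- join the passed-in IDs against the big table
    SELECT b.*
    FROM Big_Table AS b
    JOIN @Ids AS i ON b.ID = i.ID;
END
GO

On the .NET side the parameter would be sent as SqlDbType.Structured (e.g. from a DataTable), so the whole ID list travels in one call.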

We are trying to optimize SQL Server as much as we can, notably with good indexes and partitioning. What I am really asking is what you think the best approach is to extract the subset in both cases (a and b).
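
For instance, since we only ever display a handful of columns, one idea on the indexing side is a covering non-clustered index so the join does not need a key lookup for every matched row. Something along these lines, where Col1 and Col2 are placeholders for the few columns we actually return:

CREATE NONCLUSTERED INDEX IX_BigTable_ID_Covering
ON Big_Table (ID)
INCLUDE (Col1, Col2);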

Any idea or suggestion is appreciated.

Thanks

Nicolas
  • Sorry to question this, but do you need to pull the data back? Can you just work in the db? – Preet Sangha Oct 22 '10 at 09:07
  • Out of curiosity why are you trying to fetch 200M rows in one go? – no_one Oct 22 '10 at 09:09
  • Do you need to return ALL columns? I'd recommend explicitly naming only the columns you want returned. – AdaTheDev Oct 22 '10 at 09:19
  • Is there enough memory in the machine where SQL Server runs? It has a huge impact on performance. – Stefan Steinegger Oct 22 '10 at 09:27
  • Seems that you need to change the task: do you really need to pull 200M records back? What do you want to do with 200M records? Even if all columns in the BigTable have datatype int, 200M records are 20 gigabytes, which must be processed in RAM to avoid the performance loss of going to disk. – SQLDev Oct 22 '10 at 10:10
  • OK, I guess I was not clear, sorry. I will only select about 100 records from the subset (to display in a paginated table). So if I am to work with the 200M subset, I will just select 100 records from the table resulting from the join between the 600M table and the 200M index table. For instance, if I want to display records 100000000 to 100000100, the join results in an index scan over 100M rows, and I believe that is what takes time. We have been trying various queries using row_number or rowcount to cut the result down to 100 rows. The server I'm working on only has 8 GB of RAM; prod has 70. – Nicolas Oct 22 '10 at 10:39

2 Answers


Since you're doing pagination, couldn't you subselect a range of rows from the IndexTable first, then join this to the big table?

Something like:

select * from big_table where id in (select top 100 id from indextable order by id)
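
For an arbitrary page rather than just the first 100 rows (SQL Server 2008 has no OFFSET/FETCH), you could number the IndexTable rows first and join only that window to the big table. A rough, untested sketch, assuming you page in ID order:

DECLARE @Start bigint = 100000001, @End bigint = 100000100;

WITH PagedIds AS
(
    SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS rn
    FROM IndexTable
)
SELECT b.*
FROM PagedIds AS p
JOIN Big_Table AS b ON b.ID = p.ID
WHERE p.rn BETWEEN @Start AND @End;

That way only about 100 IDs ever hit Big_Table; whether numbering the IndexTable rows is fast enough for a window that deep is something you would have to test.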
andreialecu

The most obvious question for your problem a) (the SELECT * FROM Big_Table JOIN IndexTable ON Big_Table.ID = IndexTable.ID query) is: are you sure you have indexes on the ID column in both tables?

marc_s
  • Yes, all ID columns are indexed. It takes about 1 min to join the big table (600M) with a 100k index table. This is on a development server with non-partitioned tables, so I am expecting this time to be about halved on a production server with partitioned tables. However, even 30 seconds would be too long. – Nicolas Oct 22 '10 at 10:28