9

I'm using .NET 4.0 and SQL Server 2008 R2.

I'm running a big SQL SELECT query that returns millions of results and takes a long time to run to completion.

Does anyone know how I can read just some of the results returned by the query without waiting for the whole query to complete?

In other words, I want to read the results in chunks of 10,000 records while the query is still running and producing the rest.

Ruben Bartelink
Omri
  • The main issue is probably the *returns millions of results* - that is a database design smell.... do you really need to return millions of rows?? WHY?? Couldn't you process those on the database server and then return a few hundred results instead?? – marc_s Apr 20 '11 at 07:00
  • This database needs an archiving strategy... – tobias86 Apr 20 '11 at 07:05
  • @tobias86 unless you are familiar with the needs of that system, the archiving claim is meaningless – Marc Gravell Apr 20 '11 at 07:26
  • I need all the millions of results in my program. I do not filter anything on the code. I will consider adding an identity column to my database and try retrieving in chunks if no better solution comes up – Omri Apr 20 '11 at 18:11

3 Answers

14

It depends in part on whether the query itself is streaming, or whether it does lots of work in temporary tables then (finally) starts returning data. You can't do much in the second scenario except re-write the query; however, in the first case an iterator block would usually help, i.e.

public IEnumerable<Foo> GetData() {
    // building the connection/command is "not shown" in the original answer;
    // connectionString and sql stand in for your own values
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn)) {
        conn.Open();
        using (var reader = cmd.ExecuteReader()) {
            while (reader.Read()) {
                // materialize a Foo from the current row (the column mapping here is illustrative)
                var foo = new Foo { Id = reader.GetInt32(0), Name = reader.GetString(1) };
                yield return foo;
            }
        }
    }
}

This is now a streaming iterator - you can foreach over it and it will retrieve records live from the incoming TDS data without buffering all the data first.
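
For example (purely illustrative; ProcessBatch stands in for whatever work you need to do with each chunk), you could process the rows in batches of 10,000 while the rest of the results are still arriving:

var batch = new List<Foo>(10000);
foreach (var foo in GetData()) {
    batch.Add(foo);
    if (batch.Count == 10000) {
        ProcessBatch(batch);   // handle this chunk while the query keeps streaming
        batch.Clear();
    }
}
if (batch.Count > 0) ProcessBatch(batch);   // last, partial chunk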

If you (perhaps wisely) don't want to write your own materialization code, there are tools that will do this for you - for example, LINQ-to-SQL's ExecuteQuery<T>(tsql, args) will do the above pain-free.
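
For illustration only (the Foo type, connection string and query are placeholders): ExecuteQuery<T> lives on System.Data.Linq.DataContext, maps result columns to Foo members by name, and turns the {0} placeholders into proper SQL parameters, returning a lazily-enumerated sequence:

using (var ctx = new DataContext(connectionString)) {
    var rows = ctx.ExecuteQuery<Foo>(
        "SELECT a, b, c FROM d WHERE e > {0}", someThreshold);
    foreach (var foo in rows) {
        // rows are materialized one by one as you enumerate
    }
}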

Marc Gravell
  • +1. Basically the reader starts when the server starts returning results. If it takes long then either the query is inefficient (takes a long time to start returning data) or it is more complex and uses temporary tables as the answer says.... and then does a select at the end, in which case you need to rewrite it. – TomTom Apr 20 '11 at 07:10
  • Using the ExecuteReader() method will cause the thread to block until the query finishes. I want to be able to retrieve some data while the query continues... – Omri Apr 20 '11 at 18:02
  • @Omri no, it won't; unless (as stated) your query involves temporary tables etc. It is a *reader*; it will start returning data row-by-row as it becomes available in the TDS stream. The point about temp tables is that if your query itself takes a long time before it returns even the first row, it can't help with that - you need to fix your query. – Marc Gravell Apr 20 '11 at 18:08
  • @MarcGravell any idea why a query might take 1m in SSMS to return all rows, but take a little less than 1m in ADO just to start streaming the first row (and then continue to stream the rest) from the same query? It seems almost like SQL Server is writing all the results to temp table before streaming out. – tuespetre Feb 27 '15 at 15:29
  • @tuespetre most often, that means either blocking or different `SET` options; `SET` options can make *significant* differences to how a query behaves – Marc Gravell Feb 27 '15 at 15:31
  • @MarcGravell thanks for the quick reply. I've tried tweaking `SET` options and using `OPTION(RECOMPILE)` in addition; SSMS always returns the first of the results in 1-2s while ADO always takes around the full time that SSMS takes to return all of the results just to return the first result. I might go ask on DBA. Thanks again – tuespetre Feb 27 '15 at 15:44
  • @MarcGravell didn't realize `buffered` was `true` by default with Dapper until now ;) I'm sure I read through that before but it didn't occur to me. – tuespetre Feb 27 '15 at 16:48
  • @MarcGravell will this yield approach be good for, let's say, 10GB of data? Could you suggest a best approach for such scenarios? Here is the actual case https://stackoverflow.com/questions/53846265/processing-huge-data-from-sql-server – PSR Dec 19 '18 at 10:06
2

You'd need to use data paging.

SQL Server has the TOP clause (SELECT TOP 10 a, b, c FROM d) and BETWEEN:

SELECT TOP 10000 a, b, c FROM d WHERE id BETWEEN @x AND @y  -- 'id' being whatever key column you page on

With this, I guess you'd be able to retrieve N rows, do some partial processing, then load the next N rows, and so on.

This can be achieved with a multithreaded solution: one thread retrieves the next batch of results while another processes the data that has already arrived; see the sketch below.
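
A rough sketch of that idea, assuming a monotonically increasing key column Id to page on (the table/column names, the 10,000 batch size and ProcessPage are all illustrative; it needs System.Data.SqlClient, System.Collections.Concurrent and System.Threading.Tasks):

var pages = new BlockingCollection<DataTable>(2);    // small bound so loading can't run far ahead

// consumer thread: processes one page while the next one is being loaded
var worker = Task.Factory.StartNew(() => {
    foreach (var page in pages.GetConsumingEnumerable())
        ProcessPage(page);                           // your processing logic
});

// producer: keyset paging - remember the last Id seen, then ask for the next 10,000
long lastId = 0;
while (true) {
    var page = new DataTable();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT TOP 10000 Id, a, b, c FROM d WHERE Id > @lastId ORDER BY Id", conn)) {
        cmd.Parameters.AddWithValue("@lastId", lastId);
        conn.Open();
        new SqlDataAdapter(cmd).Fill(page);
    }
    if (page.Rows.Count == 0) break;                 // no more rows
    lastId = Convert.ToInt64(page.Rows[page.Rows.Count - 1]["Id"]);
    pages.Add(page);                                 // blocks if the worker falls behind
}
pages.CompleteAdding();
worker.Wait();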

Matías Fidemraizer
  • `BETWEEN` can be tricky to use with paging unless you either introduce `ROW_NUMBER`, or you know that you have an unbroken monotonically increasing sequence (typically `IDENTITY`, but deleted/filtered data may cause issues) – Marc Gravell Apr 20 '11 at 07:25
  • You're right... but we don't know what kind of identifiers he's using right now. I've tried to give him some idea. BTW, it'd have to be done with ROW_NUMBER, wouldn't it? – Matías Fidemraizer Apr 20 '11 at 07:27
  • Doing so will be much more complicated because I need to keep track of where I left off in the previous round/chunk. That may be problematic because I don't necessarily have an increasing sequence number on all the rows. – Omri Apr 20 '11 at 18:08
0

If you really have to process millions of records, why don't you load 10,000 each round, process them, and then load the next 10,000? If not, consider using the DBMS to filter the data before loading it, as performance on the database is much better than in your logic layer.

Or follow a lazy-load concept: load only the IDs first, and load the actual data for an ID only when you need it.
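
A bare-bones sketch of that lazy-load idea (the Foo type and all table/column names are made up for the example): run a cheap query up front that returns only the keys, then fetch a full record on demand:

// 1) cheap up-front query: SELECT Id FROM d   (only the keys, not the wide rows)

// 2) later, fetch the full record only when it is actually needed
Foo LoadFoo(long id) {
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT a, b, c FROM d WHERE Id = @id", conn)) {
        cmd.Parameters.AddWithValue("@id", id);
        conn.Open();
        using (var reader = cmd.ExecuteReader()) {
            if (!reader.Read()) return null;
            return new Foo { A = reader.GetInt32(0), B = reader.GetString(1), C = reader.GetString(2) };
        }
    }
}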

CloudyMarble
  • That does not solve the problem of the query taking a long time. Processing them in C# is easy - a datareader is fully capable of dealing with 100 million rows without a problem or delay. The processing on the database needs to be looked into. And sometimes you do need that many. I work with financial market data; an option dump for analysis of an instrument has about 10 million rows, and if I need them, I need them. – TomTom Apr 20 '11 at 07:11