17

Actually I'm not sure the title accurately describes the question, but I hope it is close enough.

I have some code that performs a SELECT from a database table that I know will result in about 1.5 million rows being selected. The data in each row isn't large - maybe 20 bytes per row. But that's still 30MB of data. Each row contains a customer number, and I need to do something with each customer.

My code looks something like:

using (SqlConnection conn = new SqlConnection(connString))
using (SqlCommand command = new SqlCommand("SELECT ... my select goes here", conn))
{
    conn.Open();
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // ... process the customer number here
        }
    }
}

So I just iterate over all the customers returned by the SELECT.

My question is, does that result in multiple reads of the database, or just one? I assume the network buffers aren't big enough to hold 30MB of data, so what does .NET do here? Is the result of the SELECT squirreled away somewhere for the SqlDataReader to nibble off a row every time Read() advances the pointer? Or does it go back to the database?

The reason I'm asking is that the "... process the customer number here" part of the code can take some time, so for 1.5 million customers that code (the while loop above) will take many hours to complete. While that's happening, do I need to worry about other people blocking behind me on the database, or am I safe in the knowledge that I've done my one SELECT from the database and I'm not going back again?

JeffR
  • [How DataReader Works](http://stackoverflow.com/questions/1383920/how-datareader-works) – Steve May 05 '14 at 08:03
  • So you have a two-question question? => How does the DataReader work, and how will it affect my query and/or other users on the DB? – Schuere May 05 '14 at 08:10
  • When dealing with large volumes like this it is sometimes prudent to change the way your code works. What is the processing that is done on the customers? Can any customers be excluded using other criteria before returning them? – HeXanon May 05 '14 at 08:14
  • @Schuere Well, yes, I suppose. The latter is to scratch my immediate itch, and the former is to inform me for the future. How would I separate them? – JeffR May 05 '14 at 08:15
  • @HeXanon Processing is onerous - it's not simple. Some customers can be excluded, but not immediately. – JeffR May 05 '14 at 08:16
  • 1
    @JeffR, share with us the process of what you want to do with 1.5M customers, maybe there are better alternatives like SPROCS – Schuere May 05 '14 at 08:24
  • @Schuere that's where I was getting to. Some SPs, and now CLR SPs, remove the network component and can improve performance greatly. With respect to locks, you need to fully explain the use of the system and the likelihood that another process will be accessing or changing the data. Schemes like snapshot isolation could help (or could be worse) depending on your situation. – HeXanon May 05 '14 at 08:39

4 Answers

9

The select will be executed as a "single, monolithic transaction". The balance of the output is cached in SQL Server and passed out to the network as the protocol determines there is buffer available to receive it. SQL Server will not go back into the data tables each time, though. The state of the data at the point the original SELECT passed over it will be returned to your application. If you have (NOLOCK) specified you will have no further impact on the data. Other people can read & write it; you will not see their changes. You have not finished with SQL Server, however, until the last row is in your app server's buffers, hours later. There will be network traffic at each "I have room for more now, please" but not noticeably more than had the whole 30MB come across all at once.

With large result sets and long-running processes you are better off writing your application to process the data in batches, even if the infrastructure can support the full query output. It takes fewer resources to answer each batched query, and in the case of failure you need only process the remaining rows rather than starting again from the beginning. Your application will end up doing fractionally more work overall, but each chunk will be less disruptive to the environment.
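A minimal sketch of that batching pattern, assuming an indexed, monotonically increasing integer customer number to key on (the `Customers` table and `CustomerNumber` column names are placeholders, since the question elides the actual SELECT):

```csharp
// Keyset-pagination sketch: fetch customers in batches of 10,000, keyed on
// the last customer number seen, so each query is short-lived.
// "Customers" and "CustomerNumber" are placeholder names.
int lastCustomerNumber = 0;
bool more = true;

while (more)
{
    more = false;
    using (SqlConnection conn = new SqlConnection(connString))
    using (SqlCommand command = new SqlCommand(
        @"SELECT TOP (10000) CustomerNumber
          FROM Customers
          WHERE CustomerNumber > @last
          ORDER BY CustomerNumber", conn))
    {
        command.Parameters.AddWithValue("@last", lastCustomerNumber);
        conn.Open();
        using (SqlDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                lastCustomerNumber = reader.GetInt32(0);  // assumes an int key
                // ... process the customer number here
                more = true;
            }
        }
    }
}
```

Because each batch is its own short query, a failure partway through can resume from the last processed customer number instead of rescanning all 1.5 million rows.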

Michael Green
  • Would omitting `WITH (NOLOCK)` cause the `SELECT` to hold locks? Would SQL Server actually spool the output to a temporary table internally if the result set is big enough? – binki Apr 13 '17 at 17:03
  • 1
    @binki Generally, omitting NOLOCK will cause the locks to be held until the end of the transaction. Be aware of the transaction's [isolation level](https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql), however. – Michael Green Apr 18 '17 at 07:29
  • @binki The optimiser may choose to spool to TempDB as part of the chosen plan. I don't believe it will do so as a result of a busy network buffer, however (but have no evidence either way). – Michael Green Apr 18 '17 at 07:31
  • Hmm, I see. I managed to get a normal `SELECT` to hold locks on a table by putting it in a transaction after an `UPDATE` and making the client sleep. This really concerns me: clients that have gone away but whose sessions have not yet timed out could hold up all users just by not reading their result sets. If you could force temporary-table spilling, I think that would be prevented, at the cost of resources of course. – binki Apr 18 '17 at 13:29
5

The request is sent once, not every time your reader advances. The result is then sent back to the client in multiple network packets, depending on its size.

Default result sets are the most efficient way to transmit results to the client. The only packet sent from the client computer to the server is the original packet with the statement to execute. When the results are sent back to the client, SQL Server puts as many result set rows as it can into each packet, minimizing the number of packets sent to the client.

Ref: http://msdn.microsoft.com/en-us/library/ms187602.aspx

When a request is submitted for execution, SQL Server sends result sets back to clients in the following way:

  1. SQL Server receives a network packet from the client containing the Transact-SQL statement or batch of Transact-SQL statements to be executed.
  2. SQL Server compiles and executes the statement or batch.
  3. SQL Server begins putting the rows of the result set, or multiple result sets from a batch or stored procedure, in network packets and sending them to the client. SQL Server puts as many result set rows as possible in each packet.
  4. The packets containing the result set rows are cached in the network buffers of the client. As the client application fetches the rows, the ODBC driver or the OLE DB provider pulls the rows from the network buffers and transfers the data to the client application. The client retrieves the results one row at a time in a forward direction.

A default result set is not given to an application in one large block. The result set is cached in the network buffers on the client. The application fetches through the result set one row at a time. On each fetch, the OLE DB provider or the ODBC driver moves the data from the next row in the network buffer into variables in the application. OLE DB, ODBC, and ADO applications use the same API functions to retrieve the rows that they would use to fetch the rows from a cursor. The SqlClient managed provider uses the SqlDataReader class to expose a default result set. When MultipleActiveResultSets is set to true, more than one SqlDataReader is allowed to be open at a given time.

Ref: http://technet.microsoft.com/en-us/library/ms187602(v=sql.105).aspx
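As an illustration of that last point, here is a hedged sketch of enabling MARS so a second `SqlDataReader` can be open while the first is still being read (the `Customers` and `Orders` tables and the queries are placeholders, not from the question):

```csharp
// Sketch: MultipleActiveResultSets=True allows a second SqlDataReader on the
// same connection while the first is still streaming rows.
var builder = new SqlConnectionStringBuilder(connString)
{
    MultipleActiveResultSets = true
};

using (SqlConnection conn = new SqlConnection(builder.ConnectionString))
{
    conn.Open();

    using (SqlCommand outer = new SqlCommand(
        "SELECT CustomerNumber FROM Customers", conn))
    using (SqlDataReader outerReader = outer.ExecuteReader())
    {
        while (outerReader.Read())
        {
            // Without MARS, opening this second reader would throw an
            // InvalidOperationException while outerReader is still open.
            using (SqlCommand inner = new SqlCommand(
                "SELECT OrderId FROM Orders WHERE CustomerNumber = @c", conn))
            {
                inner.Parameters.AddWithValue("@c", outerReader.GetInt32(0));
                using (SqlDataReader innerReader = inner.ExecuteReader())
                {
                    while (innerReader.Read()) { /* ... */ }
                }
            }
        }
    }
}
```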

anbuj
  • 1
    That tells me how the data is transmitted across the network, but it doesn't tell me if all the data is extracted from the database in a single transaction. The question is, when the SELECT is executed, is that done as a single, monolithic transaction? If so, that could be 30MB of data, which is fed to me in the While(reader.Read()) loop in small chunks. Where is the 30MB of data kept? Is it really read from the database and stored in a buffer somewhere, or is the While loop actually going back to the database and causing traffic there? Am I done with the database after the ExecuteReader()? – JeffR May 05 '14 at 23:54
  • It's clear that only one request is sent for a Transact-SQL query. The server executes the query, packs the results into network packets, and keeps sending them back to the client until the whole result set has been sent. The client stores them in a cache so that your application can read row by row. That means your 30MB of data is kept on the server until the whole result set has been sent. There will be traffic for receiving network packets, not for sending a request every time your reader advances. I'll add some extra info to my answer; please check it out. – anbuj May 06 '14 at 05:17
1

First of all, I'm going to redirect you to the following question on SO, which describes how locks etc. are handled:

Understanding SQL Server LOCKS on SELECT queries

My first question here is: how often will you be running this query? If it runs daily, be sure to pick a time when the fewest users are working on the database.

The second question is: what are you going to do with the data? Keep in mind that when processing 1M+ records, a stored procedure will often be faster, since it processes everything on the database server and keeps network traffic low.
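If the per-customer work could be pushed server-side, the client call might look like this sketch (`usp_ProcessCustomers` is a hypothetical procedure name, not something from the question):

```csharp
// Sketch: run the whole job inside a stored procedure so no rows cross
// the network. "usp_ProcessCustomers" is a placeholder name.
// Requires: using System.Data; using System.Data.SqlClient;
using (SqlConnection conn = new SqlConnection(connString))
using (SqlCommand command = new SqlCommand("usp_ProcessCustomers", conn))
{
    command.CommandType = CommandType.StoredProcedure;
    command.CommandTimeout = 0;  // long-running job: disable the command timeout
    conn.Open();
    command.ExecuteNonQuery();
}
```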

Schuere
  • The SELECT uses the (NOLOCK) hint, and will be run at least once each day. There won't always be 1.5 million rows in the table, but there is now. Once the process is mature there will probably only be 20,000-30,000 rows to process. The processing of each customer is onerous, and involves accessing (both reading from and writing to) other databases. A stored procedure won't work. – JeffR May 05 '14 at 08:20
  • 1
    @jeffR, are you while selecting, Updating/Inserting records? – Schuere May 05 '14 at 08:32
  • 1
    Yes, but not into the same table. I am updating and inserting records into other tables in multiple databases. The processing I must do for each customer involves reading other data from multiple data sources, manipulating the data, then updating tables based on that processing. It is not something that could be done with stored procedures. – JeffR May 05 '14 at 09:02
1

Nothing is cached on the client by the DataReader; it attempts to stream data from the server each time you call Read(). The way it goes is this (from experience):

  1. ExecuteReader() blocks until the first Sql statement produces data to return to the client.
  2. NextResult() blocks until either
    1. the server indicates that nothing additional is executing and the "command" is actually complete, or
    2. a subsequent statement in the command produces data to return to the client. (Exactly one of the two happens.)
  3. Read() blocks until the server is able to stream the next record to the client. (Yes, this means that an unordered select will almost always start streaming to the client sooner than an ordered equivalent.)
    • E.g. I've watched SqlServer take 15sec to start streaming results (NextResult() returned), then, a while later, block for 15 more seconds on call to Read(); this was on a SELECT with an ORDER BY. (Identical behavior when executing the query from SSMS.)
  4. The overall command text will execute in sync with how DataReader operates.
    • I.e. if your command has 2 SELECT statements in it that both return data, the second SELECT will only begin executing on the server when NextResult() is called. But if the first returns zero results, the second will begin execution during ExecuteReader(). (In any case, you will always need to call NextResult() to get the second SELECT's data.)

FYI: My experience is with (MS) Sql2019 and .Net Framework, and IIRC this behavior was still true back in 2013.
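A small sketch of the multi-statement behavior described in point 4 (the two queries and table names are placeholders):

```csharp
// Sketch: two SELECTs in one command. Per the description above, the second
// SELECT typically only begins executing when NextResult() is called.
using (SqlConnection conn = new SqlConnection(connString))
using (SqlCommand command = new SqlCommand(
    "SELECT CustomerNumber FROM Customers; SELECT OrderId FROM Orders;", conn))
{
    conn.Open();
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // rows from the first SELECT stream here
        }

        if (reader.NextResult())  // second SELECT starts streaming here
        {
            while (reader.Read())
            {
                // rows from the second SELECT
            }
        }
    }
}
```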

So, to explicitly answer your question:

Your query is a single SELECT statement, which will execute in its own, implicit transaction. As soon as the server is finished finding all the data/rows to return, it will release all locks it may have acquired on the table(s) and at that point, your code won't have any direct impact on other queries hitting those same tables.

However, until you've finished all of your Read() calls, you're still tying up resources on the server as well as commanding exclusive access to that connection from the connection pool. So, in your example, you want to change your `while (reader.Read())` loop to just capture all the data into a local object. Then write a subsequent loop, after you've closed the connection, to do your long-running process against that data.
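A minimal sketch of that restructuring, assuming the customer number is the first column and fits in an `int` (both assumptions, since the question elides the actual SELECT):

```csharp
// Sketch: drain the reader quickly into memory, release the connection,
// then do the slow per-customer work afterwards. 1.5M ints is only a few
// megabytes, so holding them in a List<int> is cheap.
var customerNumbers = new List<int>();

using (SqlConnection conn = new SqlConnection(connString))
using (SqlCommand command = new SqlCommand("SELECT ... my select goes here", conn))
{
    conn.Open();
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            customerNumbers.Add(reader.GetInt32(0));  // no processing here
        }
    }
}   // connection goes back to the pool here

foreach (int customerNumber in customerNumbers)
{
    // ... process the customer number here (the slow part)
}
```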

Granger