1

I am writing a C# method that streams a large number of rows from a SQL query (not a straight DBSet<T>!), performs some transformation on them, then writes the results into a MongoDB database. I am trying to make this run as quickly as possible, and due to reasonably high network latency I want to avoid making multiple return trips to SQL Server.

I have a class, StreamlinedCrmTicket, that represents a DTO onto which EF projects the result of a raw SQL query that takes no parameterised inputs. I am using EF Core 3.1.6 and the .Set<StreamlinedCrmTicket> technique to execute the raw SQL query. I'm then using .AsNoTracking() for the performance gains in light of this only ever being a read operation. Finally, I am calling .AsAsyncEnumerable(), and wrapping the whole shebang in an await foreach which, in turn, lives inside a method marked async.

The whole thing looks like this:

await foreach (var ticket in _affinityContext.Set<StreamlinedCrmTicket>().FromSqlRaw(query).AsNoTracking().AsAsyncEnumerable().WithCancellation(cancellationToken))
{
   // Do something with each ticket.
}

The source table for my raw SQL query currently contains around 1.2 million rows. There are a handful of joins that seem to cause very little change in the execution time of the query when measured using SSMS.

When I execute my code, it appears that EF starts the query, but the body of the foreach loop, no matter what it contains, does not begin executing until the entire query has been executed and the resultset received from SQL Server. This defeats the object of my use of IAsyncEnumerable! My understanding was that IAsyncEnumerable should allow me to act on rows (or entities) as they return from the database without having to wait for the entire resultset.

Some ideas that support my theory that this is not behaving asynchronously at the moment are:

  • As soon as the call to _affinityContext.Set<StreamlinedCrmTicket>().FromSqlRaw(query).AsNoTracking().AsAsyncEnumerable().WithCancellation(cancellationToken) is complete, a large amount of network IO begins. I can see in Performance Monitor on my Windows machine that the IO is a SQL Server connection to the server my code should be running against.
  • I swapped the body of my foreach loop to a very simple one, and it only runs once the network IO has stopped.
  • I removed all ORDER BY clauses from the SQL query - row ordering is irrelevant in this use case, and I was concerned that this might be causing the query to take a long time before the first row was returned thus giving the illusion of running synchronously. However, the network IO suggests this wasn't (and isn't - I left the clause out!) the case.
  • If I add a TOP 1000 to the SELECT statement in my query, it executes much more rapidly.

I'm unsure why this is running synchronously, and documentation on Microsoft's site seems poor!

Daniel Arkley
  • 164
  • 1
  • 14
  • 3
    Are you sure your query [can be streamed](https://dba.stackexchange.com/a/203880/5203) in the first place? – GSerg Jul 27 '20 at 20:41
  • 1
    It's not running synchronously, it *is* running asynchronously. The method being asynchronously is a question of whether or not the method that calls this is blocked, or instead is given a task that indicates when the operation is done, and the latter is happening here, so it's asynchronous. Whether earlier values in the sequences are provided before the final values are computed is a separate property entirely. – Servy Jul 27 '20 at 20:47
  • @GSerg I have a hash match in the query :( good shout! Looks like I'm back to the good old Skip and Take days to do this in batch processing. – Daniel Arkley Jul 27 '20 at 21:02

2 Answers2

0

For reference, GSerg suggested in the comments that my query might not be streamable. In this case, the use of a LEFT OUTER JOIN in the query resulted in the use of Hash Match in SQL Server. This prevents the query resultset from being streamable.

Daniel Arkley
  • 164
  • 1
  • 14
-1

Your raw SQL does not provide a cursor for pagination, so SQL Server has to return the entire result in one go.

TimTIM Wong
  • 788
  • 5
  • 16
  • 1
    Why are you writing raw SQL if you are using EF? – TimTIM Wong Jul 27 '20 at 21:34
  • Returning the result set in one go is the desired outcome, but I was hoping that the use of IAsyncEnumerable would allow me to start iterating as the resultset was retrieved from SQL Server. The requirement for raw SQL comes from the use of an absolutely outrageously complicated query in the source database which I cannot build into a view and cannot adequately express in my EF model. The query, unfortunately, uses inline CASE statements, joins based on discriminators and all kinds of other fun stuff. – Daniel Arkley Jul 27 '20 at 22:16
  • Well, can you provide the schema and SQL and see if we can optimize the queries? – TimTIM Wong Jul 29 '20 at 00:59
  • I’ve made a change of approach now; the query which is in the database as a view already returned loads of data my project doesn’t need, so I’ve written simple queries to replace the views. Unfortunately the schema can’t be changed (I don’t have access and the system is maintained by a third party who won’t support us making modifications) so raw SQL is my only option. – Daniel Arkley Jul 29 '20 at 01:44