
I have a problem stopping a parallel for-each loop.

I am iterating over a set of about 40,000 DataRows retrieved from a table, and I need to stop the loop immediately once my result set contains 100 items. The problem is that when I call the Stop method on the ParallelLoopState, the iteration does not stop immediately, leaving my result set inconsistent (either too few or too many items).

Is there no way to make sure that all threads are killed as soon as I call Stop?

  List<DataRow> rows = new List<DataRow>(dataTable.Select());
  ConcurrentDictionary<string, object> resultSet = new ConcurrentDictionary<string, object>();

  rows.EachParallel(delegate (DataRow row, ParallelLoopState state)
  {
    if (!state.IsStopped)
    {
      using (SqlConnection sqlConnection = new SqlConnection(Global.ConnStr))
      {
        sqlConnection.Open();

        //{
        // Do some processing.......
        //}       

        var sourceKey = "key retrieved from processing";
        if (!resultSet.ContainsKey(sourceKey))
        {
          object myCustomObj = new object();

          resultSet.AddOrUpdate(
            sourceKey,
            myCustomObj,
            (key, oldValue) => myCustomObj);
        }

        if (resultSet.Values.Count == 100)
          state.Stop();
      }
    }
  });
rya
    *Why* are you trying to execute SQL commands in parallel? This won't make a bad query run faster. What is your *actual* problem? – Panagiotis Kanavos Jan 25 '17 at 12:19
  • use locking in the method – Ehsan Sajjad Jan 25 '17 at 12:19
  • BTW 40K rows is no data at all. Loading all 40K of them when you only want 100 though is a bug. Why don't you just use a `SELECT TOP 100` ? – Panagiotis Kanavos Jan 25 '17 at 12:21
  • @EhsanSajjad what does locking have to do with terminating a `Parallel.ForEach` loop ? – Panagiotis Kanavos Jan 25 '17 at 12:21
  • @UKMonkey not quite - it means that Parallel won't start any new operations. Current operations should actually check their state and terminate early. That's all academic though - the real question is why not just `SELECT TOP 100 ?` – Panagiotis Kanavos Jan 25 '17 at 12:23
  • @PanagiotisKanavos: Top 100 does not work, because he wants 100 rows processed. That depends on an if condition. In case it's not possible to put that if condition into the query, Top 100 cannot be applied. – Thomas Weller Jan 25 '17 at 12:24
  • @ThomasWeller That's correct. I can't do a top 100, otherwise it would have been easy. – rya Jan 25 '17 at 12:28
  • @ThomasWeller that `if` should be a `WHERE` condition. The big problem though is accessing the same data in parallel - it's only going to make things *slower*. Using `Parallel` to speed up a bad query isn't a good solution. – Panagiotis Kanavos Jan 25 '17 at 12:28
  • @rya *what* are you trying to do? `TOP 100` is just an example. The point is that trying to load data in parallel isn't a good solution - it only causes more contention on the server and network. Using proper indexes though is going to speed up things a *LOT*. A `UNION ALL` is going to be faster than multiple independent queries. In the end, you try to read and process 40K rows when you only need 100. Why don't you try to find the logic that will return only those 100 rows? – Panagiotis Kanavos Jan 25 '17 at 12:29
  • @PanagiotisKanavos: we don't know if that new connection connects to the same database or a new one. A `WHERE` statement can only be used in the same database. The `if` statement in code could handle cross-database rules. All in all it's unclear. I'm voting to close – Thomas Weller Jan 25 '17 at 12:31
  • @PanagiotisKanavos The reason I execute the SQL commands in parallel is due to the "Do some processing......." part of the code. There is a lot of different data retrieved from several parts of a large system (and most of the results are cached, so it won't hit a DB every time) – rya Jan 25 '17 at 12:31
  • @ThomasWeller it's the same connection string. A three-part name is enough to join or UNION across databases – Panagiotis Kanavos Jan 25 '17 at 12:32
  • @rya which can be done after loading the data. A single query is faster than many smaller ones. Anyway, just check your `state.IsStopped` from inside the loop. This won't fix the inappropriate architecture though – Panagiotis Kanavos Jan 25 '17 at 12:35
  • @rya in such cases it's far better to reduce the amount of network calls. A typical design is to use TPL Dataflow to achieve more or less what SSIS does - read the data from the source and feed it to a pipeline of processing blocks. Use a proper SQL query to retrieve *only* the data required, read it quickly with a DataReader and post each row to the pipeline (a rough sketch of that shape follows these comments). When you actually have a lot of data it's best to treat it as a stream of data packets that are processed by individual pipeline steps – Panagiotis Kanavos Jan 25 '17 at 12:39
  • @PanagiotisKanavos I have simplified the code in the example. Unfortunately it's not so easy just to reduce the amount of network calls. The initial data load has a very small overhead compared to the whole process. I just wanted to make sure I wasn't missing out on any other way to sort of kill all the threads when my max was reached. I guess I can conclude that this just isn't possible. Thanks for your input and comments. – rya Jan 25 '17 at 12:59
  • @rya it *is* possible, just not through aborting. Just check `state.IsStopped`. Killing threads is *not* a good idea - aborting is expensive, cleanup is difficult. It's far cheaper and faster to use cooperative cancellation. `Parallel` doesn't use *threads* anyway, it uses *tasks* that process data chunks. – Panagiotis Kanavos Jan 25 '17 at 13:11
  • @PanagiotisKanavos I did try using `state.IsStopped` as you can see in my code sample. It just doesn't help solve the issue. – rya Jan 25 '17 at 13:30
  • @rya it doesn't help to check at the top - `Parallel` won't even start an iteration if `Stop` is called. Check lower down, especially inside your processing code – Panagiotis Kanavos Jan 25 '17 at 13:32
  • @rya post your *actual* processing code. That's what needs to change either for cooperative cancellation or for conversion to a more appropriate pattern – Panagiotis Kanavos Jan 25 '17 at 13:34
  • Checking `state.IsStopped` is similar to checking a `while` condition inside a `while` block to force an early exit - if the condition was triggered, the iteration wouldn't have started anyway – Panagiotis Kanavos Jan 25 '17 at 13:37
  • Related: [Stopping Parallel.ForEach with Cancellation Token and Stop](https://stackoverflow.com/questions/62280038/stopping-parallel-foreach-with-cancellation-token-and-stop) – Theodor Zoulias Jun 09 '20 at 16:52
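
Regarding the TPL Dataflow pipeline mentioned a few comments up, here is a rough sketch of that shape. It assumes the `System.Threading.Tasks.Dataflow` package and an `async` method; the table and column names (`SomeTable`, `SourceKey`) and the `DoExpensiveWork` step are hypothetical placeholders, not from the question.

  // Read rows with a DataReader and feed them into a small processing pipeline.
  var process = new TransformBlock<string, string>(
      key => DoExpensiveWork(key),   // placeholder for the real per-row work
      new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

  var collect = new ActionBlock<string>(result => Console.WriteLine(result));

  process.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

  using (var connection = new SqlConnection(Global.ConnStr))
  {
      await connection.OpenAsync();
      using (var command = new SqlCommand("SELECT SourceKey FROM SomeTable", connection))
      using (var reader = await command.ExecuteReaderAsync())
      {
          while (await reader.ReadAsync())
              await process.SendAsync(reader.GetString(0));   // post each row as it streams in
      }
  }

  process.Complete();          // signal no more input
  await collect.Completion;    // wait for the pipeline to drain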

1 Answer


The documentation page of ParallelLoopState.Stop explains that calling Stop() will prevent new iterations from starting. It won't abort any existing iterations.

Stop() also sets the IsStopped property to true. Long-running iterations can check the value of IsStopped and exit prematurely if required.
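For example, a minimal sketch of that pattern using `Parallel.ForEach` directly, reusing `rows` and `resultSet` from the question (the two `Do...Step` methods are hypothetical placeholders for the real processing):

  Parallel.ForEach(rows, (row, state) =>
  {
      DoFirstStep(row);              // first chunk of expensive work (placeholder)

      // Another iteration may have called Stop() while this one was running;
      // checking IsStopped between steps lets this iteration exit early.
      if (state.IsStopped)
          return;

      DoSecondStep(row);             // second chunk of expensive work (placeholder)

      if (resultSet.Count >= 100)    // >= guards against racing past the limit
          state.Stop();              // no *new* iterations will start after this
  });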

This is called cooperative cancellation, which is far better than aborting threads. Aborting a thread is expensive and makes cleanup difficult. Imagine what would happen if a ThreadAbortException were thrown just as you were about to commit your work.

Cooperative cancellation, on the other hand, allows a task to exit gracefully after committing or rolling back transactions as necessary, closing connections, cleaning up other state and files, etc.
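A sketch of what that can look like with a `CancellationToken` (`CancellationTokenSource` lives in `System.Threading`; `rows` and `resultSet` come from the question and `ProcessRow` is a placeholder):

  var cts = new CancellationTokenSource();
  var options = new ParallelOptions { CancellationToken = cts.Token };

  try
  {
      Parallel.ForEach(rows, options, row =>
      {
          // Throws OperationCanceledException once the token is cancelled,
          // so using/finally blocks run and connections are cleaned up.
          options.CancellationToken.ThrowIfCancellationRequested();

          ProcessRow(row);           // placeholder for the real work

          if (resultSet.Count >= 100)
              cts.Cancel();          // cooperative: other iterations observe the token
      });
  }
  catch (OperationCanceledException)
  {
      // Expected after cancellation; the loop has stopped cleanly.
  }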

Furthermore, Parallel uses tasks, not threads, to process chunks of data. One of those tasks runs on the original thread that started the parallel operation. Aborting wouldn't just waste threadpool threads; it would also kill the main thread.

This is not a bug either - Parallel is meant to solve data parallelism problems, not asynchronous execution. In this scenario, one wants the system to use as many tasks as appropriate to process the data and continue once that processing is complete. 

Panagiotis Kanavos
  • "This is called cooperative cancellation which is far better than aborting threads" - sure... but it doesn't doesn't always accomplish the same thing. If your loop body consists primarily of one call which is going to take a long time (say, uploading a large file) then your token is worthless. If I abort I get what I want. – Ed S. Jan 13 '18 at 03:48
  • @EdS. that confuses parallel processing with asynchronous IO. Besides, asynchronous IO methods *do* support cancellation and cancellation tokens where it makes sense. You can't stop an HTTP server from *sending* you data, but you can stop waiting for it to arrive. You can stop reading from the response stream. You can use the cancellation token to clean up the connection after aborting. You can't do any of these things with Thread.Abort – Panagiotis Kanavos Jan 15 '18 at 08:19
  • @EdS. besides, using a *thread* to block on IO would be wasteful even if you didn't have tasks. Win32 offers callbacks, completion ports for async IO and events to await for them. – Panagiotis Kanavos Jan 15 '18 at 08:21
  • @EdS. and finally, in the OP's context, using multiple connections to "speed things up" will actually slow them down due to contention. The OP is trying to execute multiple independent SQL statements *concurrently*. PFOR is used as a quick & very dirty way of doing that. The very idea is wrong - inserting 40K rows is *nothing* if done properly. Performing an UPSERT for each individual row is the wrong way. The fastest way would be to use SqlBulkCopy to copy the data into a staging table then perform a MERGE or UPDATE/INSERT. – Panagiotis Kanavos Jan 15 '18 at 08:24
  • @EdS. by executing multiple operations in parallel the OP incurs the network latency 40,000 times and generates almost 40,000 times more log operations. Using multiple connections for this leads to *blocking* and delays amongst the connections themselves. – Panagiotis Kanavos Jan 15 '18 at 08:27