
The problem I'm tasked to resolve is (from my understanding) a typical producer/consumer problem. We have data coming in 24/7/365. The incoming data (call it raw data) is stored in a table and is unusable for the end user. We then select all raw data that has not been processed and process it one unit at a time. After each unit of data is processed, it's stored in another table and is ready to be consumed by the client application.

The process from loading the raw data to persisting the processed data takes 2-5 seconds on average, but it is highly dependent on the third-party web services we use to process the data. If the web services are slow, we no longer process data as fast as it comes in and we accumulate a backlog, causing our customers to lose the live feed. We want to make this a multithreaded process. From my research I can see that the process can be divided into three discrete parts:

  1. LOADING - A loader task (producer) that runs indefinitely and loads unprocessed data from the DB into a BlockingCollection<T> (or some other variation of a concurrent collection). I chose BlockingCollection<T> because it is designed with the producer/consumer pattern in mind and offers the GetConsumingEnumerable() method.

  2. PROCESSING - Multiple consumers that consume data from the above BlockingCollection<T>. In the current implementation I have a Parallel.ForEach loop over GetConsumingEnumerable() that on each iteration starts a task with two continuations: the first task calls a third-party web service, waits for the result, and outputs that result for the second task to consume. The second task does calculations based on the first task's output and outputs its result to the third task, which simply stores it in a second BlockingCollection<T> (this one being the output collection). So my consumers are effectively producers too. Ideally each unit of data loaded by task 1 would be queued for processing in parallel.

  3. PERSISTING - A single consumer runs against the second BlockingCollection<T> mentioned above and persists the processed data to the database. (A sketch of this three-stage layout follows the list.)
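
To make the intended layout concrete, here is a minimal sketch of the three stages wired together through two BlockingCollection<T> instances. It is a simplification under assumptions: the string item type, the degree of parallelism, and the LoadUnprocessed/Process/Persist stand-ins are placeholders for the real DB query, web-service call, and DB write.

```
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Pipeline
{
    // The default constructor uses a ConcurrentQueue<T> under the hood.
    static readonly BlockingCollection<string> rawItems = new BlockingCollection<string>();
    static readonly BlockingCollection<string> processedItems = new BlockingCollection<string>();

    static void Main()
    {
        // 1. LOADING - single producer pulling unprocessed data.
        var loader = Task.Factory.StartNew(() =>
        {
            foreach (var item in LoadUnprocessed())
                rawItems.Add(item);
            rawItems.CompleteAdding();                 // lets the consumers finish
        }, TaskCreationOptions.LongRunning);

        // 2. PROCESSING - several consumers of rawItems that are also
        //    producers for processedItems.
        var workers = Task.Factory.StartNew(() =>
        {
            Parallel.ForEach(rawItems.GetConsumingEnumerable(),
                             new ParallelOptions { MaxDegreeOfParallelism = 8 },
                             item => processedItems.Add(Process(item)));
            processedItems.CompleteAdding();
        }, TaskCreationOptions.LongRunning);

        // 3. PERSISTING - single consumer writing the results away.
        var writer = Task.Factory.StartNew(() =>
        {
            foreach (var result in processedItems.GetConsumingEnumerable())
                Persist(result);
        }, TaskCreationOptions.LongRunning);

        Task.WaitAll(loader, workers, writer);
    }

    // Hypothetical stand-ins for the real DB query, web-service call, and DB write.
    static IEnumerable<string> LoadUnprocessed() { return new[] { "a", "b", "c" }; }
    static string Process(string raw) { Thread.Sleep(100); return raw.ToUpperInvariant(); }
    static void Persist(string result) { Console.WriteLine(result); }
}
```

CompleteAdding is what lets GetConsumingEnumerable terminate once an upstream stage finishes; in a 24/7 service the loader would simply keep adding and never complete the collections.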

The problem I'm facing is item number 2 from the list above. It does not seem to be fast enough (just by using Parallel.ForEach). Inside the Parallel.ForEach I tried, instead of directly starting the task with its continuations, starting a wrapping thread that would in turn start the processing task. But this caused an OutOfMemory exception, because the thread count went out of control and soon reached 1200. I also tried scheduling the work on the ThreadPool, to no avail.

Could you please advise if my approach is good enough for what we need done, or is there a better way of doing it?

Dimitri
  • I suggest using `ConcurrentQueue` instead of `BlockingCollection`. Seems to fit better into your scenario. – Daniel Hilgarth Aug 30 '12 at 12:05
  • BlockingCollection when instantiated using the default constructor uses ConcurrentQueue under the hood, adding more features to it... – Dimitri Aug 30 '12 at 12:07

2 Answers


If the bottleneck is the 3rd party service, and it will not handle parallel execution but will just queue your requests, then you cannot do a thing about that.

But first you can try this:

  • use the ThreadPool or Tasks (those will use the ThreadPool too) - don't fire up threads yourself
  • try to make your requests async instead of tying up a thread for each one (see the sketch after this list)
  • run your service/app through a performance profiler and check where you are "wasting" your time
  • make a spike/check against the 3rd party service and see how it handles parallel requests
  • think about caching the answers from this service (if possible)
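
To illustrate the async suggestion, here is a minimal sketch of what an asynchronous web-service step could look like with C# 5 async/await. HttpClient, the URL, and the post-processing step are assumptions standing in for the real third-party client; the point is only that the thread goes back to the pool while the request is in flight, instead of being blocked for the 2-5 seconds the call takes.

```
using System;
using System.Net.Http;
using System.Threading.Tasks;

class AsyncWorker
{
    static readonly HttpClient client = new HttpClient();

    // Hypothetical async version of the web-service + calculation step.
    static async Task<string> ProcessAsync(string rawItem)
    {
        // The thread is released back to the pool while we await the reply.
        string response = await client.GetStringAsync(
            "http://example.com/service?data=" + Uri.EscapeDataString(rawItem));

        // CPU-bound calculation runs once the reply has arrived.
        return response.ToUpperInvariant();
    }

    static void Main()
    {
        // Usage example; in the real pipeline many such calls could be in
        // flight at once without holding a thread each.
        string processed = ProcessAsync("raw").Result;
        Console.WriteLine(processed);
    }
}
```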

That's all I can think of without further info right now.

Random Dev
  • Is there a way to queue all loaded units of data for processing at the same time? i.e. have one thread/task per unit of data. I will never have more than 300-400 units of data loaded on each iteration, so starting up 300-400 parallel tasks would be what I'm looking for. Parallel.ForEach is failing to do this. ThreadPool didn't do it either. And starting threads manually, I ran out of them :) – Dimitri Aug 30 '12 at 11:43
  • 1
    (aside from very special edge-cases) you should let the ThreadPool handle how many Tasks it will let you run - IMHO it makes no sense to have 300 tasks running at the same time (this will just slow you way down due to constant tasks-switchings) - the point that you are using a 3rd party service you are waiting fo is for me a indication that the best way to get better perf. and scalability is by going ASYNC (with the service and everything you can think of - like your DB operations and so on) – Random Dev Aug 30 '12 at 11:53
  • Thanks. Will give async a try. I was kinda avoiding async at all cost since each step the data goes through is highly dependent on the step prior, and I didn't want to wait for the async call to complete the processing. But now that I think about it, I'm still waiting for completion. – Dimitri Aug 30 '12 at 11:57
  • Could you possibly be overwhelming your CPUs? The Microsoft suggested maximum number of concurrent threads per processor (physical/logical) is 25. If you are running on a box with 16 processors (this can be physical cores or logical cores [think hyperthreading which gives you two logical cores per physical core]) you should have no problem having 400 concurrent threads with no fear of overwhelming the CPUs. – Kevin Aug 30 '12 at 11:58
  • Well, the problem with the OutOfMemory exception was the fact that I am compiling a 32-bit executable, which limits max memory to 2 GB, and since one thread takes up to 1 MB of memory, as soon as I was hitting 800 threads I was getting OOM. I couldn't keep my thread count under control. – Dimitri Aug 30 '12 at 12:06
  • if those tasks are just waiting for external data like the database or another service then you will just waste your server resources - async will not make the 3rd party service quicker, but it will let your app scale a lot better, and the overhead here is minor if you speak of 2-3 seconds of waiting for external IO – Random Dev Aug 30 '12 at 12:08
  • You might be interested in the dataflow lib (http://msdn.microsoft.com/en-us/devlabs/gg585582.aspx) and C#'s new async features! – Random Dev Aug 30 '12 at 12:09
  • @Carsten - On another note isn't TaskContinuation an asynchronous execution? If I start a task that calls web service and then define another as task1.ContinueWith, wouldn't task1 become an asynchronous task? – Dimitri Aug 30 '12 at 12:17
  • yes - the second task will be queued/become active when the first one completes (you can even tell it to reuse the same thread), but the problem you are facing is waiting asynchronously for the replies from your service and your database - for this you have to use the async versions of the methods you call your DB/service with. If you do it right, your task will start the operation and then hibernate (giving its thread back to the ThreadPool) and, as soon as the reply gets back, become active again - possibly on another thread. It's very easy with C# 5 but doable in older versions as well (using the async patterns) – Random Dev Aug 30 '12 at 12:21
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/16027/discussion-between-dimitri-and-carsten-konig) – Dimitri Aug 30 '12 at 12:31
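
For reference, the dataflow library mentioned in the comments above could model the processing and persisting stages roughly as follows. This is only a sketch under assumptions: the block layout, the degree of parallelism, and the CallServiceAsync stand-in are illustrative, not a prescribed usage for this scenario.

```
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;   // TPL Dataflow library

class DataflowSketch
{
    static void Main()
    {
        // Stage 2a: call the third-party service, up to 8 requests in flight.
        var callService = new TransformBlock<string, string>(
            raw => CallServiceAsync(raw),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

        // Stage 2b: CPU-bound calculation on the service reply.
        var calculate = new TransformBlock<string, string>(reply => reply.ToUpperInvariant());

        // Stage 3: single writer persisting the results (stand-in: console output).
        var persist = new ActionBlock<string>(result => Console.WriteLine(result));

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        callService.LinkTo(calculate, linkOptions);
        calculate.LinkTo(persist, linkOptions);

        foreach (var item in new[] { "a", "b", "c" })   // stand-in for the loader
            callService.Post(item);

        callService.Complete();
        persist.Completion.Wait();
    }

    // Hypothetical stand-in simulating the 2-5 second web-service latency.
    static async Task<string> CallServiceAsync(string raw)
    {
        await Task.Delay(2000);
        return raw + " processed";
    }
}
```

The blocks buffer their inputs and manage their own worker scheduling, so throttling and parallelism come from the library rather than from manual thread management.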

I recently faced a problem which was very similar to yours. Here's what I did; hope it might help:

  1. Your 1st and 3rd parts are rather simple and can be managed on their respective threads without any problem.
  2. The 2nd part should first be started on a new thread. Then use a System.Threading.Timer to make your web-service calls; the method that calls the web service passes the response (result) to the processing method by invoking it asynchronously, letting it process the data at its own pace (see the sketch below).

This solved my problem; I hope it helps you too. If you have any doubts, ask and I'll explain further...
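
As I read this answer, a minimal sketch of the timer-based approach could look like the following. The interval, the method names, and the stand-in work are assumptions for illustration; the timer callback already runs on a ThreadPool thread, and the response is handed off asynchronously so processing does not block the next tick.

```
using System;
using System.Threading;
using System.Threading.Tasks;

class TimerBasedWorker
{
    // Kept in a field so the timer is not garbage collected.
    static Timer timer;

    static void Main()
    {
        // System.Threading.Timer callbacks run on ThreadPool threads.
        // 500 ms is an assumed polling interval, not from the original answer.
        timer = new Timer(CallWebService, null, 0, 500);
        Console.ReadLine();   // keep the demo process alive
    }

    static void CallWebService(object state)
    {
        // Hypothetical stand-in for the real web-service call.
        string response = "service reply at " + DateTime.Now.ToLongTimeString();

        // Hand the response off asynchronously so the processing step
        // does not delay the next timer tick.
        Task.Factory.StartNew(() => ProcessResponse(response));
    }

    static void ProcessResponse(string response)
    {
        Console.WriteLine("processed: " + response);
    }
}
```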

Samy S.Rathore
  • why would you need a timer for this? – Random Dev Aug 30 '12 at 12:17
  • I assume he needs to connect to the web service on a regular basis... isn't that so?? O.o – Samy S.Rathore Aug 30 '12 at 12:22
  • Parallel.ForEach against BlockingCollection's ConsumingEnumerable will run forever as long as I have items in the collection. No need for timer – Dimitri Aug 30 '12 at 12:23
  • You connect to the web service as soon as you have a need, not whenever a timer ticks - if you do this with a timer you either have to wait with plenty of requests for the next "cycle", wasting a lot of CPU time on waiting, or there is nothing to do but you connect anyway, wasting your CPU time and your network connection – Random Dev Aug 30 '12 at 12:29
  • ohh... I'm still an amateur in the topic, pardon me. What I meant was that the timer helped me break the long processing into chunks and then assign them to worker threads, which rejoin the pool as soon as their work is finished. It prevented all those threads my code was creating... which were causing the OOM exception – Samy S.Rathore Aug 30 '12 at 12:48
  • Then wouldn't you defeat the purpose of multithreading? Wouldn't the timer slow things down? – Dimitri Aug 30 '12 at 13:07
  • You know, the System.Threading.Timer doesn't miss a tick. The timer delegate is specified when the timer is constructed and cannot be changed; the method does not execute on the thread that created the timer - it executes on a ThreadPool thread supplied by the system. But I'm glad Carsten solved your problem... :) – Samy S.Rathore Aug 31 '12 at 05:30