
I am working on an application which calls an external service and has to add all entries of the external collection into a local collection. The problem currently is that the external collection can exceed 1000 records, but the returned search results can only include up to twenty items.

For the sake of speed I figured using a collection of Tasks would be the way forward, so I came up with the code below:

int totalCount = returnedCol.total_count;
while (totalCount > myDict.Count)
{
    int numberOfTasks = // logic to calculate how many tasks to run

    List<Task> taskList = new List<Task>();

    for (int i = 1; i <= numberOfTasks; i++)
    {
        Interlocked.Add(ref pageNumber, pageSize);

        Task<SearchResponse> testTask = Task.Run(() =>
        {
            return ExternalCall.GetData(pageNumber, pageSize);
        });

        Thread.Sleep(100);

        taskList.Add(testTask);
        testTask.ContinueWith(o =>
        {
            foreach (ExternalDataRecord dataiwant in testTask.Result.dataiwant)
            {
                if (!myDict.ContainsKey(dataiwant.id))
                    myDict.GetOrAdd(dataiwant.id, dataiwant);
            }
        });
    }
    Task.WaitAll(taskList.ToArray());
}

However, this does not yield all results. The pageNumber variable is incrementing correctly each time, but it seems that not all task results are being analysed (as the same logic on a single thread on a smaller data set returns all expected results). Also, I have tried declaring individual tasks in a chain (rather than a loop) and the test data is all returned. It seems that the higher the value I pass into Thread.Sleep(), the more results are added to the local collection (but this isn't ideal, as it means the process takes longer!)

Currently in a sample of 600 records I'm only getting about 150-200 added to the myDict collection. Am I missing something obvious?

Chris Wright
  • It seems like "logic to calculate how many tasks to run" might be relevant here. Also, how positive are you that `(pageNumber, pageSize)` is correct? I could see the problem you're facing coming from either stopping your loop before you actually get to the end of the collection or from requesting chunks of data that are largely overlapping. – StriplingWarrior May 11 '16 at 15:28
  • `pageSize` never changes in my code as it is always set to the highest available value (ie 20). I have tested to see if the `pageNumber` value is incrementing correctly each time, and it starts off behaving itself (increasing by twenty each time), but starts to get erratic. Increasing the Thread.Sleep period has an effect on this, but as I'm incrementing it in a thread safe manner I can't see why this would happen – Chris Wright May 11 '16 at 15:34
  • Also, for the sake of a test I have made the external data source a collection of 60 items, and tried retrieving one item at a time via 60 tasks, and I still didn't get all the data in my collection (unless I increase the sleep period) – Chris Wright May 11 '16 at 15:35
  • You don't need the `if (!myDict.ContainsKey(dataiwant.id))` check; it just slows your process down. Just always call `myDict.GetOrAdd(dataiwant.id, dataiwant);`: on new records it adds the new record, and on existing records it effectively does nothing because you don't use the value it returns. – Scott Chamberlain May 11 '16 at 15:40
  • Remember that `ContinueWith()` returns another task that isn't complete yet. If you don't wait for that to finish, there's no guarantee your collection is complete yet. Also, is your `pageNumber` local to your method call? – Berin Loritsch May 11 '16 at 15:40

2 Answers


I think if you take a more functional and less imperative approach to your code, you'll be a lot less likely to run into hard-to-understand issues. I think something like this would have the same effect you're going for:

int totalCount = returnedCol.total_count;
int pageCount = (totalCount + pageSize - 1) / pageSize; // round up so a partial last page is included
var tasks = Enumerable.Range(1, pageCount)
    .Select(async page => {
        await Task.Delay(page * 100);
        return ExternalCall.GetData(page, pageSize);
    })
    .ToArray();
myDict = (await Task.WhenAll(tasks))
    .SelectMany(response => response.dataiwant) // flatten the pages into one sequence of records
    .ToDictionary(dataiwant => dataiwant.id);

The above code assumes you still want to wait 100ms between requests for throttling purposes. If you just had that Thread.Sleep() there to try fixing issues you were having, you could further simplify it:

int totalCount = returnedCol.total_count;
int pageCount = (totalCount + pageSize - 1) / pageSize;
var tasks = Enumerable.Range(1, pageCount)
    .Select(page => Task.Run(() => ExternalCall.GetData(page, pageSize)))
    .ToArray();
myDict = (await Task.WhenAll(tasks))
    .SelectMany(response => response.dataiwant)
    .ToDictionary(dataiwant => dataiwant.id);
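
The reason the LINQ version sidesteps the original symptom is that each lambda gets its own `page` value, whereas the question's loop shares a single `pageNumber` across every task, so tasks that start late can all read the same (already advanced) value. A stripped-down sketch of the difference, with plain ints standing in for the external call:

    // Shared variable: every Task.Run lambda reads pageNumber at some later
    // time, so several tasks can observe the same (final) value.
    int pageNumber = 0;
    var shared = Enumerable.Range(1, 5)
        .Select(_ => { pageNumber += 20; return Task.Run(() => pageNumber); })
        .ToArray();

    // Per-lambda parameter: each task closes over its own copy of page,
    // so the five tasks see 20, 40, 60, 80 and 100.
    var perPage = Enumerable.Range(1, 5)
        .Select(page => Task.Run(() => page * 20))
        .ToArray();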
StriplingWarrior
  • The one thing that bothers me is the implied need for `Task.Delay`. It really shouldn't be necessary if the code is correct. – Berin Loritsch May 11 '16 at 15:43
  • @BerinLoritsch: I agree. I left it in because I know some services will deny requests if they get hit too quickly by the same source, so it's possible this was there for the sake of throttling, but I added another sample without it. – StriplingWarrior May 11 '16 at 15:46
  • Thanks for the answer, Berin managed to answer first so I started with their code but should I run into further issues I will definitely take a look at this too :) – Chris Wright May 11 '16 at 15:50
  • @ChrisWright: That's fine. I'd be careful about that outer `while` loop you're using, though: if an item gets removed from the server while you're downloading items, you'll end up in an infinite loop until another item gets added. – StriplingWarrior May 11 '16 at 17:19

You're missing the fact that ContinueWith() returns another task, and you aren't adding that task to your taskList.

A better approach would be to use the async/await support available since .NET 4.5. It is a lighter-weight way to compose these calls.

You would change the algorithm to be more like this:

public async Task Process()
{
    int totalCount = returnedCol.total_count;

    while (totalCount > myDict.Count)
    {
        int numberOfTasks = // logic to calculate how many tasks to run

        List<Task> taskList = new List<Task>();

        for (int i = 1; i <= numberOfTasks; i++)
        {
            Interlocked.Add(ref pageNumber, pageSize);

            taskList.Add(ProcessPage(pageNumber, pageSize));
        }

        await Task.WhenAll(taskList.ToArray());
    }
}

private async Task ProcessPage(int pageNumber, int pageSize)
{
    SearchResponse result = await Task.Run(() =>
        ExternalCall.GetData(pageNumber, pageSize)).ConfigureAwait(false);

    foreach (ExternalDataRecord dataiwant in result.dataiwant)
    {
        myDict.GetOrAdd(dataiwant.id, dataiwant);
    }
}

The async keyword tells the compiler that there will be an await later on. await essentially handles the details around your ContinueWith call. If you really want the ExternalCall to happen in another task, then you would simply await the results from that call.
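
To make the ContinueWith() point concrete, here is a minimal sketch (the `AddToDictionary` helper is illustrative, not from the question):

    var taskList = new List<Task>();

    Task<SearchResponse> fetch = Task.Run(() =>
        ExternalCall.GetData(pageNumber, pageSize));

    // ContinueWith returns a *new* task representing the continuation.
    // Waiting only on `fetch` does not wait for the continuation to finish.
    Task continuation = fetch.ContinueWith(t => AddToDictionary(t.Result));
    taskList.Add(continuation); // track the continuation, not just the fetch

    // With await, the "continuation" is simply the code after the await,
    // and the enclosing async method's task covers all of it:
    SearchResponse result = await fetch;
    AddToDictionary(result);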

Berin Loritsch
  • Thanks, now I have all items in my collection. Now it's time for the real stress test :) – Chris Wright May 11 '16 at 15:49
    `ProcessPage` does not have any `await` inside of it so it is not going to be asynchronous. His code will not be ran in parallel because `ProcessPage` will never return early before work is done. You need either `SearchResponse result = await Task.Run(() => ExternalCall.GetData(pageNumber, pageSize)).ConfigureAwait(false);` or even better `SearchResponse result = await ExternalCall.GetDataAsync(pageNumber, pageSize).ConfigureAwait(false);` if it is available. Being a external webservice call you should be able to easily make it async without using `Task.Run(`. – Scott Chamberlain May 11 '16 at 15:54
  • Thanks very much @ScottChamberlain, the test did take longer to run than anticipated so I'll bear this in mind – Chris Wright May 11 '16 at 15:58
  • Re-added the await – Berin Loritsch May 11 '16 at 16:08