
I have a scenario in which I have to process multiple files (e.g. 30) in parallel, based on the number of processor cores. I have to assign these files to separate tasks according to the core count, but I don't know how to work out the start and end limit for each task, i.e. how each task knows which files it has to process.

    private void ProcessFiles(object e)
    {
        try
        {
            var directoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;

            var FilePaths = Directory.EnumerateFiles(directoryPath);
            int numCores = System.Environment.ProcessorCount;
            int NoOfTasks = FilePaths.Count() > numCores ? (FilePaths.Count() / numCores) : FilePaths.Count();


            for (int i = 0; i < NoOfTasks; i++)
            {
                Task.Factory.StartNew(
                        () =>
                        {
                            // This is where I'm stuck: how do I work out the
                            // start and end index of the slice each task should process?
                            int startIndex = 0, endIndex = 0;
                            for (int Count = startIndex; Count < endIndex; Count++)
                            {
                                this.ProcessFile(FilePaths);
                            }
                        });

            }
        }
        catch (Exception)
        {
            throw;
        }
    }
ehafeez
  • The task-parallel library will deal with multi-core architecture under the hood. We shouldn't need to concern ourselves with the available system cores when creating tasks. – William Dec 05 '15 at 01:02
  • I'm definitely not an expert with the Task Parallel Library, but isn't the TPL supposed to handle the number of CPU cores by itself and determine the best way to "split" the workload? – Luc Morin Dec 05 '15 at 01:02
  • Here the problem might be that if there are 100 files in the directory, it won't be a good idea to create 100 tasks. So you could use a Parallel.For loop. It will internally make partitions and establish parallel processing by relying on its own partitioner. – Usman Dec 05 '15 at 01:13
  • Please note that working with tasks and concurrent algorithms also requires knowledge of concurrent collections and thread-safe data-exchange algorithms. And here you are accessing FilePaths, an IEnumerable, from multiple tasks concurrently. Bad idea, really. – ipavlu Dec 05 '15 at 01:23
  • It is not always a good idea to depend only on the default behavior of the TPL. In many cases there is a need to limit the level of concurrency, and that could be the case here. – ipavlu Dec 05 '15 at 01:30

2 Answers


For problems such as yours, C# offers concurrent data structures. You want to use a BlockingCollection and store all the file names in it.

Calculating the number of tasks from the number of cores, and giving each task a fixed slice of the files, is not a good idea. Why? Because ProcessFile() may not take the same time for each file. It is better to start as many tasks as there are cores, then let each task take file names one by one from the BlockingCollection and process them until the collection is empty.

try
{
    var directoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;

    var filePaths = CreateBlockingCollection(directoryPath);

    // Start the same number of tasks as there are cores
    // (assuming that #files > #cores).
    int taskCount = System.Environment.ProcessorCount;

    for (int i = 0; i < taskCount; i++)
    {
        Task.Factory.StartNew(
                () =>
                {
                    string fileName;
                    // Keep taking file names until the collection is marked
                    // complete and has been fully drained.
                    while (!filePaths.IsCompleted)
                    {
                        if (!filePaths.TryTake(out fileName)) continue;
                        this.ProcessFile(fileName);
                    }
                });
    }
}
catch (Exception)
{
    throw;
}

And the CreateBlockingCollection() would be as follows:

private BlockingCollection<string> CreateBlockingCollection(string path)
{
    // Materialize the enumeration so its count is known up front.
    var allFiles = Directory.EnumerateFiles(path).ToList();
    var filePaths = new BlockingCollection<string>(allFiles.Count);
    foreach (var fileName in allFiles)
    {
        filePaths.Add(fileName);
    }
    // Tell the consumers that no more items will be added.
    filePaths.CompleteAdding();
    return filePaths;
}

You will have to modify your ProcessFile() to receive a single file name now, instead of taking all the file paths and processing its own chunk.
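A minimal sketch of the modified method, with a hypothetical body standing in for your real per-file work (File.ReadAllText requires System.IO):

private void ProcessFile(string filePath)
{
    // Hypothetical body: replace with your real per-file processing.
    var contents = File.ReadAllText(filePath);
    // ... parse / transform the contents here ...
}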

The advantage of this approach is that your CPU will be neither over- nor under-subscribed, and the load will be evenly balanced.


I haven't run the code myself, so there might be syntax errors in it. Feel free to correct any you come across.

displayName
  • Thanks mate, but how can I keep the processing in order? I have to process the files in the order in which they arrive. Also, in case of an exception, how will I handle the faulty files? And I have to pass the processed files to the UI thread to update the GUI with the file contents. – ehafeez Dec 05 '15 at 11:36
  • You can preserve the order by passing a queue into the `BlockingCollection`, as in [this case](http://stackoverflow.com/a/3825322/213550). You can examine the `Exception` property of the task for each file and see whether it is non-null. You can use the `ContinueWith` or `WhenAny` methods to update the UI (see the sketch below). – VMAtm Dec 05 '15 at 12:15
  • @ehafeez: VMAtm's suggestions are correct. Try them. – displayName Dec 05 '15 at 15:43
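A minimal sketch of the UI-update part of VMAtm's suggestion, assuming a WinForms or WPF context, a fileName already taken from the collection, and a hypothetical UpdateUi method that displays a processed file's contents:

// Capture the UI thread's scheduler; this line must run on the UI thread.
var uiScheduler = TaskScheduler.FromCurrentSynchronizationContext();

Task.Factory.StartNew(() => this.ProcessFile(fileName))
    .ContinueWith(t =>
    {
        if (t.Exception != null)
        {
            // The file failed to process; log or re-queue it as needed.
            return;
        }
        UpdateUi(fileName); // hypothetical UI-update method
    }, uiScheduler);        // run the continuation on the UI thread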

Based on my admittedly limited understanding of the TPL, I think your code could be rewritten as follows:

private void ProcessFiles(object e)
{
    try
    {
        var directoryPath = _Configurations.Descendants().SingleOrDefault(Pr => Pr.Name == "DirectoryPath").Value;

        var FilePaths = Directory.EnumerateFiles(directoryPath);

        Parallel.ForEach(FilePaths, path => this.ProcessFile(path));

    }
    catch (Exception)
    {
        throw;
    }
}
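If the degree of concurrency needs to be limited, as one of the comments on the question suggests, Parallel.ForEach also accepts a ParallelOptions. A minimal sketch, where the limit of 4 is an arbitrary example:

var options = new ParallelOptions
{
    // Cap the number of iterations that may run concurrently.
    MaxDegreeOfParallelism = 4
};

Parallel.ForEach(FilePaths, options, path => this.ProcessFile(path));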

regards

Luc Morin
  • There can be 1000 files at a time, so I can't use Parallel.ForEach, because I have to update the GUI in real time as each file is processed. – ehafeez Dec 05 '15 at 01:31
  • This was not in your OP. As you might imagine, we don't have crystal balls in which to read all your requirements. Next time, please make sure to include ALL of your requirements in your question, instead of adding them one at a time after answers have been given. Thank you. – Luc Morin Dec 05 '15 at 19:54