I'm doing what amounts to a glorified mail merge followed by a file conversion to PDF. On .NET 4.5 I see a couple of ways to do the threading. The one using a thread-safe queue (Plan A) seems interesting, but I can see a potential problem. What do you think? I'll try to keep it short, but include what is needed.

This works on the assumption that the database processing will take far more time than the PDF conversion.

In both cases, the database processing for each file is done in its own thread/task. The PDF conversion could be done either in one thread/task per file (Plan B) or in a single long-running thread (Plan A). It is the PDF conversion I am wondering about. It is all in a try/catch statement, but in Plan A that thread must not fail, or everything fails. Do you think that is a good idea? Any suggestions would be appreciated.

/* A class to process a file: */ 
public class c_FileToConvert
{
    public string InFileName { get; set; }
    public int FileProcessingState { get; set; }
    public string ErrorMessage { get; set; }
    public List<string> listData = null;
    public c_FileToConvert(string inFileName) // public so doProcessing() can construct it
    {
        InFileName = inFileName;
        FileProcessingState = 0;
        ErrorMessage = ""; // yah, yah, yah - String.Empty
        listData = new List<string>();
    }   
    public void doDbProcessing()
    {
        // get the data from database and put strings in this.listData
        DAL.getDataForFile(this.InFileName, this.ErrorMessage); // static function
        if(this.ErrorMessage != "")
            this.FileProcessingState = -1; //fatal error
        else // Open file and append strings to it
        {  
            foreach(string s in this.listData)
                ...
            FileProcessingState = 1; // enum DB_WORK_COMPLETE ...
         }
    }   
    public void doPDFProcessing()
    {
        PDFConverter cPDFConverter = new PDFConverter();
        cPDFConverter.convertToPDF(InFileName, InFileName + ".PDF");
        FileProcessingState = 2; // enum PDF_WORK_COMPLETE ...
    }       
}

/*** These only for Plan A ***/
public ConcurrentQueue<c_FileToConvert> ConncurrentQueueFiles = new ConcurrentQueue<c_FileToConvert>(); 
public volatile bool bProcessPDFs; // volatile: written by the main thread, read by the PDF thread

public void doProcessing() // This is the main thread of the Windows Service 
{
    List<c_FileToConvert> listcFileToConvert = new List<c_FileToConvert>();

    /*** Only for Plan A ***/
    bProcessPDFs = true;
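    // Start it and forget it. LongRunning hints that the task should get its own thread
    // rather than occupy a thread-pool thread for the life of the service.
    Task task1 = Task.Factory.StartNew(startProcessingPDFs, TaskCreationOptions.LongRunning);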

    while(true)
    {
        List<string> listFileNamesToProcess = new List<string>();
        DAL.getFileNamesToProcessFromDb(listFileNamesToProcess);

        foreach(string s in listFileNamesToProcess)
        {
            c_FileToConvert cFileToConvert = new c_FileToConvert(s);
            listcFileToConvert.Add(cFileToConvert);
        }       

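        // NOTE: the state stays 0 while the DB work runs, so the next pass would start the
        // same file again; an "in progress" state is needed to prevent that.
        foreach(c_FileToConvert c in listcFileToConvert)
            if(c.FileProcessingState == 0)
                new Thread(c.doDbProcessing).Start(); // ThreadStart: the method takes no argument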

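        /** This is Plan A - throw it on single long running PDF processing thread **/
        // NOTE: the state stays 1 until the conversion finishes, so the same file would be
        // enqueued again on every pass unless a "queued" state is set here.
        foreach(c_FileToConvert c in listcFileToConvert)
            if(c.FileProcessingState == 1)
                ConncurrentQueueFiles.Enqueue(c);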

        /*** This is Plan B - traditional thread for each file conversion ***/              
        foreach(c_FileToConvert c in listcFileToConvert)
            if(c.FileProcessingState == 1)
                new Thread(c.doPDFProcessing).Start(); // ThreadStart: the method takes no argument

        // Walk backwards so RemoveAt() does not skip the element after a removed one
        for(int iCount = listcFileToConvert.Count - 1; iCount >= 0; iCount--)
        {
            c_FileToConvert c = listcFileToConvert[iCount];
            if((c.FileProcessingState == -1) || (c.FileProcessingState == 2))
            {
                DAL.updateProcessingState(c.FileProcessingState);
                listcFileToConvert.RemoveAt(iCount);
            }
        }
        Thread.Sleep(1000);
    }
}   
public void startProcessingPDFs() /*** Only for Plan A ***/
{
    while (bProcessPDFs)
    {
        // Declared outside the try so the catch block can see it
        c_FileToConvert cFileToConvert = null;
        if (!ConncurrentQueueFiles.TryDequeue(out cFileToConvert))
        {
            Thread.Sleep(100); // nothing queued: back off instead of spinning a core
            continue;
        }
        try
        {
            cFileToConvert.doPDFProcessing();
        }
        catch(Exception e)
        {
            cFileToConvert.FileProcessingState = -1;
            cFileToConvert.ErrorMessage = e.Message;
        }
    }
}

Plan A seems like a nice solution, but what if the Task fails somehow? Yes, the PDF conversion can be done with individual threads, but I want to reserve them for the database processing.

I wrote this in a text editor as the simplest code I could, so there may be mistakes in it, but I think I got the idea across.

Mateen Ulhaq
Miguelito
  • Is the PDF conversion CPU or IO limited? – Richard Jul 22 '14 at 14:09
  • It should be CPU bound. It takes nearly a second per document. My assumption is that my bottleneck will be at the database. – Miguelito Jul 22 '14 at 15:30
  • You need to know which resource is limiting the performance. Concurrency approaches that work well for making use of all your CPU resources (cores) don't work well with IO (e.g. you just end up spreading the same total amount of IO across threads: each conversion takes longer, total throughput hardly changes). – Richard Jul 22 '14 at 15:43
  • OK, to clarify, I am sure the bottleneck will be the database. Testing shows that the PDF file conversion usually takes less than a second. The database access is running around 20 seconds minimum and is expected to go far higher. Really, my question is whether to use the ConcurrentQueue to feed a single thread or to use individual threads. Thanks, M – Miguelito Jul 22 '14 at 19:03
  • Don't use `ConcurrentQueue` directly like that; use a `BlockingCollection`, which uses `ConcurrentQueue` as its internal store by default. It makes your `ConncurrentQueueFiles.TryDequeue` loops much more efficient because they will block instead of spin when nothing is in the queue to be processed. – Scott Chamberlain Jan 07 '16 at 20:48
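
To illustrate the `BlockingCollection` suggestion in the last comment, here is a minimal sketch of Plan A with a single blocking consumer. It is not the asker's code: `ConvertToPdf` is a hypothetical stand-in for `doPDFProcessing()`, and plain file names replace `c_FileToConvert` to keep the example self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BlockingQueueSketch
{
    // Hypothetical stand-in for the real PDF conversion.
    static string ConvertToPdf(string inFileName)
    {
        return inFileName + ".PDF";
    }

    public static List<string> Run(IEnumerable<string> fileNames)
    {
        // Written only by the consumer task and read only after Wait().
        var converted = new List<string>();

        using (var queue = new BlockingCollection<string>())
        {
            // Single long-running consumer, as in Plan A.
            Task consumer = Task.Factory.StartNew(() =>
            {
                // Blocks while the queue is empty; the loop ends once
                // CompleteAdding() has been called and the queue is drained.
                foreach (string name in queue.GetConsumingEnumerable())
                {
                    try
                    {
                        converted.Add(ConvertToPdf(name));
                    }
                    catch (Exception e)
                    {
                        // One bad file must not end the consumer loop.
                        Console.Error.WriteLine(name + ": " + e.Message);
                    }
                }
            }, TaskCreationOptions.LongRunning);

            // Producer side: in the question, the DB threads would call Add().
            foreach (string name in fileNames)
                queue.Add(name);

            queue.CompleteAdding(); // signal that no more work is coming
            consumer.Wait();        // surfaces an AggregateException if the task faulted
        }
        return converted;
    }

    public static void Main()
    {
        foreach (string pdf in Run(new[] { "a.txt", "b.txt" }))
            Console.WriteLine(pdf);
    }
}
```

Because the per-file try/catch sits inside the loop, a failed conversion is recorded and the consumer moves on, and `consumer.Wait()` (or a continuation on the task) is where a failure of the loop itself would become visible.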

1 Answer

How many files are you working with? 10? 100,000? If the number is very large, using one thread per file to run the DB queries is not a good idea.

Threads are a very low-level control flow construct, and I advise you to avoid a lot of messy and detailed thread spawning, joining, synchronizing, etc. in your application code. Keep it stupidly simple if you can.

How about this: put the data you need for each file in a thread-safe queue. Create another thread-safe queue for results. Spawn some number of threads which repeatedly pull items from the input queue, run the queries, convert to PDF, then push the output into the output queue. The threads should share absolutely nothing but the input and output queues.

You can pick any number of worker threads which you like, or experiment to see what a good number is. Don't create 1 thread for each file -- just pick a number which allows for good CPU and disk utilization.
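
As a rough sketch of that arrangement in C#, assuming a `BlockingCollection` as the input queue and a `ConcurrentQueue` for results: `ProcessFile` is a hypothetical placeholder for "run the queries, convert to PDF", and `workerCount` is the number you would tune by experiment.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class WorkerPoolSketch
{
    // Hypothetical placeholder for the per-file work (DB queries + PDF conversion).
    static string ProcessFile(string inFileName)
    {
        return inFileName + ".PDF";
    }

    public static List<string> Run(IEnumerable<string> fileNames, int workerCount)
    {
        var results = new ConcurrentQueue<string>(); // output queue

        using (var input = new BlockingCollection<string>()) // input queue
        {
            // A fixed number of workers; they share nothing but the two queues.
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(i => Task.Factory.StartNew(() =>
                {
                    // Each worker pulls items until the input is marked complete and drained.
                    foreach (string name in input.GetConsumingEnumerable())
                        results.Enqueue(ProcessFile(name));
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            foreach (string name in fileNames)
                input.Add(name);

            input.CompleteAdding(); // lets the workers' loops finish
            Task.WaitAll(workers);
        }

        // With several workers, the order of results is not guaranteed.
        return results.ToList();
    }

    public static void Main()
    {
        foreach (string pdf in Run(new[] { "a.txt", "b.txt", "c.txt" }, 2))
            Console.WriteLine(pdf);
    }
}
```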

OR, if your language/libraries have a parallel map operator, use that. It will save you a lot of messing around.
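
In .NET, PLINQ is one candidate for such a parallel map. The sketch below assumes the per-file work can be expressed as a single function; `ProcessFile` is again a hypothetical placeholder, and `maxParallel` caps how many items run at once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelMapSketch
{
    // Hypothetical placeholder for the per-file work (DB queries + PDF conversion).
    static string ProcessFile(string inFileName)
    {
        return inFileName + ".PDF";
    }

    public static List<string> Run(IEnumerable<string> fileNames, int maxParallel)
    {
        return fileNames
            .AsParallel()
            .AsOrdered()                          // keep results in input order
            .WithDegreeOfParallelism(maxParallel) // upper bound on concurrent items
            .Select(name => ProcessFile(name))
            .ToList();
    }

    public static void Main()
    {
        foreach (string pdf in Run(new[] { "a.txt", "b.txt", "c.txt" }, 4))
            Console.WriteLine(pdf);
    }
}
```

PLINQ is aimed mainly at CPU-bound work, so for the long database waits described in the question the degree of parallelism may need to be set explicitly rather than left at the default.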

Alex D
  • I plan for the DB queries to have their own threads; that is why I would like to use only one thread for the PDF conversions. – Miguelito Jul 22 '14 at 15:27
  • You can do that if you like, but I would reiterate: one thread for each file is not a good idea. It's better to pick a number of threads which allows for good resource utilization, and use a queue to load-balance between threads. Better yet is if you can use something like a parallel map or map-reduce utility, so you don't have to deal with the details of spawning threads at all. – Alex D Jul 23 '14 at 07:10