0

I have a small dilemma. I have a function which goes through a list of Queries in a dataset with their own connection string (could be same or different) and starts filling in all the DataTables and returns a filled DataSet with all the queries filled in.

Currently, this process is done one by one, so if one of the queries took 10mins and the other 3 queries took 2 minutes each, the Run Time will be 16mins.

I was wondering if it's possible to use multithreading in this situation. This should call FillTable all at separate thread and should get the run time down to 10minutes. These are explicitly just Fill DataTable calls (there won't be any update or delete calls).

This is what I got so far:

    public void FillDataSet(ref DataSet Source)
    { 
        foreach (var r in Source.Tables["queries"].Rows)
        {
            string query = r["QueryStatement"].ToString();
            string qSource = r["QuerySource"].ToString();
            string tableName = r["TableName"].ToString();

            DBConnection db = new DBConnection();
            var TempSource = Source;
            taskConnection = Task.Factory.StartNew(() => callDB(db, query, tableName, ref TempSource));
            Source = TempSource;
        }
        Task.WaitAll(taskConnection); 
    }

    private void callDB(DBConnection db, string query, string tableName, ref DataSet Source)
    {
        using (var sql = new SqlConnection(db.ConnectionString))
        {
            sql.Open();
            using (var adp = new SqlDataAdapter(query, sql))
            {
                adp.SelectCommand.CommandTimeout = 0;
                adp.Fill(Source, tableName);
            }
        }
    }

I had to create a TempSource because lambda expression does not like passing in ref of Parameters (I cannot change this). Currently this does not work, not sure what I'm doing wrong.

civic.sir
  • 400
  • 1
  • 9
  • 26
  • A **dilemma** is a choice between two equally bad (or equally good) outcomes. You don't seem to have a dilemma, just a problem that you want to solve. – Enigmativity Feb 01 '17 at 03:52
  • 1
    In any case, it is unlikely that you'll get a performance increase with threading as database calls are IO and threads operate best on data in memory. You should really be asking us if there's a way to speed up your existing code, and not asking us if threading might improve a situation that we can't see. – Enigmativity Feb 01 '17 at 03:55
  • I don't not have control over the queries. Long story short I want each thread to run its own query at the same time so they don't essentially wait for each queries to be completed. For example if I have 5 queries that took 5mins each by threading is should be able to get this done within 5mins vs 25mins – civic.sir Feb 01 '17 at 03:59
  • Yes, that's the common fallacy. In reality you'll find that 5 queries that take 5 minutes to run will take 25 minutes in series and 25 minutes with threading. In fact, it can even be the case that they take much longer with threading. The issue is that have an IO bottleneck and your CPU is just sitting idle much of the time. Take a look at your CPU, if it's maxed out then threading is a good idea. If not, then there's not much point. – Enigmativity Feb 01 '17 at 04:03
  • You need to show your original code for us to be able to give you any help on this. The only other thing to do is to **try it and measure the speed difference.** – Enigmativity Feb 01 '17 at 04:04
  • [SQL Server already is using multiple threads](http://stackoverflow.com/questions/24426299/multi-thread-in-sql) to fulfill your data request. If it is taking too long, consider tuning the queries, adding indices, upgrading the storage, or adding cores to the database server, depending on the bottleneck. By the way your code has all sorts of thread synchronization issues. – John Wu Feb 01 '17 at 04:24
  • @Enigmativity - Incorrect. You are assuming that the DB server is IO bottlenecked. Which may or may not be the case. DB throughput is a function of many factors. For example there could be ample IO available and 8 CPU cores but the query may not be parallelizable so therefore all index seeks, joins etc... would happen on a single CPU core. In this instance 8 threads could provide 8 times the throughput. – tcwicks Feb 01 '17 at 04:34
  • @tcwicks - Yes, you're right. There's a load of factors. Generally speaking in memory activities benefit from threading and IO not so much. The OP needs to try the two approaches and time them to really see. – Enigmativity Feb 01 '17 at 04:55
  • @tcwicks - What I said wasn't incorrect per se, just not 100% correct. – Enigmativity Feb 01 '17 at 05:00
  • @Enigmativity Yes agreed :) Usually threading will not improve throughput because usually IO is the bottleneck. – tcwicks Feb 09 '17 at 02:38

1 Answers1

1

Here is a basic boilerplate you can use. Fill in the bit where I've left a comment:

// Set your connectionstring and execute the query and fill your data here

This is basic - I've used threads instead of threadpool because compared to the amount of work done the overhead of spawning a new thread is minimal. You can extend this if you want by keeping track of the threads and using thread signals etc... to implement way more advanced behavior.

Additionally if you want to pass any extra parameters to the piece of code that does the work add these to the work item definition class.

Note: This does not support multiple parallel executions of the main RunParallel method but you could easily extend it to do this.

public static class RunParallel
{
    const int NumThreadsToRunInParallel = 8;// Tune this for your DB server performance characteristics
    public static void FillDataSet(ref DataSet Source)
    {
        WorkItemDefinition Work;
        foreach (DataRow r in Source.Tables["queries"].Rows)
        {
            Work = new WorkItemDefinition();
            Work.Query = r["QueryStatement"].ToString();
            Work.QSource = r["QuerySource"].ToString();
            Work.TableName = r["TableName"].ToString();
            EnQueueWork(Work);
        }
        System.Threading.ThreadStart NewThreadStart;
        NewThreadStart = new System.Threading.ThreadStart(ProcessPendingWork);
        for (int I = 0; I < NumThreadsToRunInParallel; I ++)
        {
            System.Threading.Thread NewThread;
            NewThread = new System.Threading.Thread(NewThreadStart);
            //NewThread.IsBackground = true; //Do this if you want to allow the application to quit before these threads finish all their work and exit
            ThreadCounterInc();
            NewThread.Start();
        }
        while (ThreadCounterValue > 0)
        {
            System.Threading.Thread.Sleep(1000);
        }
    }

    private static void ProcessPendingWork()
    {
        try
        {
            WorkItemDefinition Work;
            Work = DeQueueWork();
            while (Work != null)
            {
                Work = DeQueueWork();
                DbConnection db = new OdbcConnection();
                // Set your connectionstring and execute the query and fill your data here
            }
        }
        finally
        {
            ThreadCounterDec();
        }
    }

    private static int ThreadCounter = 0;
    private static void ThreadCounterInc()
    {
        lock(SyncRoot)
        {
            ThreadCounter += 1;
        }
    }
    private static void ThreadCounterDec()
    {
        lock (SyncRoot)
        {
            ThreadCounter -= 1;
        }
    }
    private static int ThreadCounterValue
    {
        get
        {
            lock (SyncRoot)
            {
                return ThreadCounter;
            }
        }
    }

    private static object SyncRoot = new object();
    private static Queue<WorkItemDefinition> m_PendingWork = new Queue<WorkItemDefinition>();
    private static Queue<WorkItemDefinition> PendingWork
    {
        get
        {
            return m_PendingWork;
        }
    }

    private static WorkItemDefinition DeQueueWork()
    {
        lock (SyncRoot)
        {
            if (PendingWork.Count > 0) // Catch exception overhead is higher
            {
                return PendingWork.Dequeue();
            }
        }
        return null;
    }

    private static void EnQueueWork(WorkItemDefinition Work)
    {
        lock (SyncRoot)
        {
            PendingWork.Enqueue(Work);
        }
    }

    public class WorkItemDefinition
    {
        public string Query { get; set; }
        public string QSource { get; set; }
        public string TableName { get; set; }
    }
}
tcwicks
  • 495
  • 3
  • 11
  • @civic.sir Your welcome. Just out of curiosity how much speedup were you able to get? Cause in the end Physical IO and some internal SQL resources like hash buckets are the hard limit. Also note that SQL does not always know the IO concurrency level of the disc storage layer, especially fast arrays and SSD's. So you can tweak even more performance out of it by partitioning tables across multiple files, which then forces SQL to use separate IO threads. So if SQL is maxed out but performance monitor shows your disk has plenty of spare capacity then you can try this approach. – tcwicks Feb 02 '17 at 02:25
  • The queries are not that great before this implementation it took around 45mins to run the application because it was sync process. After this implementation came out to be 15mins huge performance increase. Thanks again – civic.sir Feb 02 '17 at 02:28
  • @civic.sir If you want any help boosting the performance further or optimizing the queries / DB schema etc.... Feel free to drop me an email tcwicks at gmail. Also I have an event based threading library as well as a full service bus which might help streamline many things. – tcwicks Feb 02 '17 at 20:59