7

I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory.

Below are the queries that pull data from source and target:

Sql Server - select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn

Oracle - select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn

Records in Source : 12377200

Records in Target : 12266800

Following are the approaches I have tried, with some statistics:

1) open data reader approach for reading source and target data:

Total jobs running in parallel = 3

Time taken by Job1 = 01:47:25

Time taken by Job2 = 01:47:25

Time taken by Job3 = 01:48:32

There is no index on Id Column.

Major time is spent here: var dr = command.ExecuteReader();

Problems: There are also timeout issues, for which I have had to set CommandTimeout to 0 (infinite), and that is bad.

2) Chunk by chunk reading approach for reading source and target data (see the sketch after this list):

   Total jobs = 1
   Chunk size : 100000
   Time Taken : 02:02:48
   There is no index on Id Column.

3) Chunk by chunk reading approach for reading source and target data:

   Total jobs = 1
   Chunk size : 100000
   Time Taken : 00:39:40
   Index is present on Id column.

4) open data reader approach for reading source and target data:

   Total jobs = 1
   Index : Yes
   Time: 00:01:43

5) open data reader approach for reading source and target data:

   Total jobs running in parallel = 3
   Index : Yes
   Time: 00:25:12
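
For reference, a minimal sketch of what the chunk-by-chunk approach in (2) and (3) looks like (simplified, not my exact implementation). It assumes SQL Server 2012+ OFFSET/FETCH paging; the connection string is a placeholder:

using System.Collections.Generic;
using System.Data.SqlClient;

class ChunkedReader
{
    const int ChunkSize = 100000;

    // Reads the source table one chunk at a time instead of one huge result set.
    public static List<KeyValuePair<string, string>> ReadInChunks(string connStr)
    {
        var rows = new List<KeyValuePair<string, string>>();
        long offset = 0;

        while (true)
        {
            int readInThisChunk = 0;
            using (var cn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                @"select Id as LinkedColumn, CompareColumn
                  from Source
                  order by LinkedColumn
                  offset @offset rows fetch next @chunk rows only", cn))
            {
                cmd.Parameters.AddWithValue("@offset", offset);
                cmd.Parameters.AddWithValue("@chunk", ChunkSize);
                cn.Open();
                using (var dr = cmd.ExecuteReader())
                {
                    while (dr.Read())
                    {
                        rows.Add(new KeyValuePair<string, string>(dr[0].ToString(), dr[1].ToString()));
                        readInThisChunk++;
                    }
                }
            }

            if (readInThisChunk == 0) break; // no more rows
            offset += readInThisChunk;
        }

        return rows;
    }
}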

I observed that while having an index on LinkedColumn does improve performance, the problem is that we are dealing with third-party RDBMS tables which may or may not have an index.

We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: lots of jobs running in parallel would put a great deal of pressure on the database server, which we don't want.

Hence we want to fetch the source and target records into memory on our own server and do a one-to-one record comparison there, keeping the database server free.

Note: I want to do this in my C# application and don't want to use SSIS or a linked server.

Update:

Source SQL query execution time in SQL Server Management Studio: 00:01:41

Target SQL query execution time in SQL Server Management Studio: 00:01:40

What is the best way to read a huge result set into memory?

Code:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading.Tasks;

static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);
    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);
    var drA = GetReader(command1);
    var drB = GetReader(command2);
    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit
}
Lucas B
I Love Stackoverflow
  • When you run the same query in SQL Management Studio, how long does it take to run? – mjwills Feb 14 '18 at 11:07
  • The solution I have used with SQL Server is the command-line query tool sqlcmd.exe that comes with SQL Server. I use the Process class to run the query and put the results into a CSV file, then read the CSV into C#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SqlClient class. – jdweng Feb 14 '18 at 11:27
  • @jdweng Thank you so much for the valuable guidance. Can you please provide some more info to look into? – I Love Stackoverflow Feb 14 '18 at 11:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165141/discussion-between-user-and-mjwills). – I Love Stackoverflow Feb 14 '18 at 11:42
  • See: https://learn.microsoft.com/en-us/sql/tools/command-prompt-utility-reference-database-engine. I usually use this in a Forms project, which will lock up while sqlcmd.exe is running, so I put the code into a BackgroundWorker. My C# code before using sqlcmd.exe took over 30 minutes; with sqlcmd.exe, 2 minutes. – jdweng Feb 14 '18 at 11:46
  • @jdweng But here is an answer (https://stackoverflow.com/questions/3735531/sql-c-best-method-for-executing-a-query) which says that we can't do anything with the result given by sqlcmd – I Love Stackoverflow Feb 14 '18 at 11:48
  • They are wrong. There is an option to create a CSV output file; then read the output file when sqlcmd.exe finishes. I've done it lots of times. – jdweng Feb 14 '18 at 11:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165143/discussion-between-user-and-jdweng). – I Love Stackoverflow Feb 14 '18 at 11:54
  • Can you tell us **why you need all the data in memory?** Maybe there is another way to process the data line by line without the need to load it all into memory. Example: if you want to find out which data is in one set but not in the other, you can sort the data by the identity key and advance each result row one by one. – k3b Feb 14 '18 at 13:32
  • @k3b But I have also done that with a data reader, and you can see the amount of time spent on it. Moreover, this approach also puts a lot of load on the database server, which we don't want. We want to utilize the memory of our own server and keep the database server free. The database will reside on one server and we will do this operation on another server – I Love Stackoverflow Feb 14 '18 at 13:34
  • Your database should have a clustered index on the order-by columns, and you should make sure that the select statement does not lock the table when you want to run 3 jobs in parallel all on the same table. Note: your code opens the connection/SqlCommand/reader but does not close them and therefore might cause a table lock. – k3b Feb 14 '18 at 13:50
  • @k3b But even with a clustered index and a data reader, there will be a lot of pressure on the database when we execute ExecuteReader, right? So I want to take that load away and utilize my server's resources – I Love Stackoverflow Feb 14 '18 at 13:53
  • @User the proper tool for this is SSIS. If you don't want to use it you'll have to *reproduce its functionality*. If the query is complex or expensive, executing the reader will *always* take a lot of time until you start receiving the first results. What are you trying to do with that code anyway? Exporting millions of rows to a CSV isn't very practical – Panagiotis Kanavos Feb 15 '18 at 10:28
  • @PanagiotisKanavos I am just trying to test the performance of the data reader approach and the chunking approach for loading millions of records, but as you can see, the performance of the data reader is poor when I have multiple jobs running in parallel. Is there a better approach to solve this without using SSIS? – I Love Stackoverflow Feb 15 '18 at 10:31
  • @User what does your execution plan look like? *That* will tell you what takes so long. A typical optimization in SSIS is to use `OPTION (FAST 1)` to instruct SQL Server to create an execution plan that's slower overall but starts returning results early. *Don't* try to execute two commands at the same time either - the server is already busy processing one query and pulling dozens of MB from the disk per second. Executing *another* similar query will only lead to contention – Panagiotis Kanavos Feb 15 '18 at 10:32
  • @User then you are using a bad test. You've already flooded your disk in IO requests with the first query. There's no point in running *another* similar query. The various approaches don't fix anything. BTW why do you insist on not using the tool that *already* solves all this? – Panagiotis Kanavos Feb 15 '18 at 10:35
  • @PanagiotisKanavos Sir, but I am executing the queries one by one: only after the source query has executed do I execute the target query. Is this a poor approach? – I Love Stackoverflow Feb 15 '18 at 10:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165211/discussion-between-user-and-panagiotis-kanavos). – I Love Stackoverflow Feb 15 '18 at 10:37
  • @PanagiotisKanavos Why do you think I am using a bad test? – I Love Stackoverflow Feb 16 '18 at 09:42
  • One of the problems I've seen is record count, try with SET NOCOUNT OFF – akhileshcoer Feb 17 '18 at 22:05
  • You haven't told us **why** you are doing this. Loading huge data sets into memory is only useful if you intend to do big-data analysis or similar. **What** are you going to do with all that data? Knowing this will make it easier to come up with workable solutions. We also need to know constraints, you keep saying "I want to do this", **do what**? – Lasse V. Karlsen Feb 21 '18 at 14:59
  • @LasseVågsætherKarlsen This is what I will do (https://stackoverflow.com/questions/48583201/compare-2-unordered-recordset-in-memory) after getting the 2 record sets into memory. – I Love Stackoverflow Feb 21 '18 at 16:27
  • Remove the order by in your query. –  Feb 22 '18 at 15:23
  • Do you really need all that data in memory? If so, have you considered using a distributed cache (like NCache, memcached or Redis) to store the data temporarily? With a distributed cache you have essentially unlimited memory available (only the hardware is the limit), and you can still treat it as one store with no hassle. If you don't have this option, just don't load all the data into memory. – Erlangga Hasto Handoko Feb 23 '18 at 03:52

7 Answers

12

There is nothing (I know of) faster than a DataReader for fetching db records.

Working with large databases comes with its challenges; reading 10 million records in under 2 minutes is pretty good.

If you want faster you can:

  1. jdweng's suggestion (see the sketch below this list):

Use sqlcmd.exe and the Process class to run the query and put the results into a CSV file, then read the CSV into C#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SqlClient class.

  2. Parallelize your queries and fetch concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/

  3. The easiest (and IMO the best for a SELECT * all) is to throw hardware at it: https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
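
Here is a rough sketch of option 1: shelling out to sqlcmd.exe via the Process class and dumping the result to a CSV file that can then be streamed from disk. The server name, database name, output path and use of Windows authentication (-E) are assumptions to replace with your own settings:

using System;
using System.Diagnostics;

class SqlCmdExport
{
    static void Main()
    {
        // Hypothetical server/database/output path - replace with your own.
        var arguments = "-S MyServer -d MyDatabase -E " +
                        "-Q \"select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn\" " +
                        "-o source.csv -s\",\" -W -h -1";

        var psi = new ProcessStartInfo("sqlcmd.exe", arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            process.WaitForExit(); // block until the CSV is fully written
        }

        // source.csv can now be read line by line, e.g. with File.ReadLines("source.csv")
        Console.WriteLine("Export complete.");
    }
}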

Also make sure you're testing on the PROD hardware and in release mode, as anything else could skew your benchmarks.

Jeremy Thompson
0

This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle client SDK implements the same pattern.

// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc"; // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach(DataRow row in table.Rows) 
{
    // do something with the data set.
}
Glenn Ferrie
0

I think I would deal with this problem in a different way.

But first, let's make some assumptions:

  • According to your question description, you will get data from SQL Server and Oracle
  • Each query will return a bunch of data
  • You do not specify the point of getting all that data into memory, nor how it will be used.
  • I assume that the data you will process is going to be used multiple times and you will not repeat both queries multiple times.
  • And whatever you do with the data, it is probably not going to be displayed to the user all at once.

With these points as a foundation, I would proceed as follows:

  • Think of this problem as data processing.
  • Have a third database, or some other place with auxiliary database tables, where you can store all the results of the 2 queries.
  • To avoid timeouts and the like, try to obtain the data using paging (get thousands of rows at a time) and save them in these aux DB tables, NOT in RAM (see the sketch after this list).
  • As soon as your logic completes all the data loading (import/migration), you can start processing it.
  • Data processing is what database engines are built for; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to crunch/process/merge the 2 auxiliary tables into just 1.
  • Now that you have all the merged data in a third aux table, you can use it for display or whatever else you need.
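
To make the paging bullet concrete, here is a rough sketch of loading the source into an auxiliary staging table one page at a time. It assumes a SQL Server source (OFFSET/FETCH requires SQL Server 2012+) and a hypothetical Staging_Source table in the auxiliary database; connection strings and table names are placeholders:

using System.Data;
using System.Data.SqlClient;

class PagedStagingLoad
{
    const int PageSize = 100000;

    // Copies the source rows into an auxiliary staging table, one page at a time,
    // so the application never holds the full 12M rows in RAM.
    public static void LoadSourceIntoStaging(string sourceConnStr, string auxConnStr)
    {
        long offset = 0;
        while (true)
        {
            var page = new DataTable();
            using (var src = new SqlConnection(sourceConnStr))
            using (var cmd = new SqlCommand(
                @"select Id as LinkedColumn, CompareColumn
                  from Source
                  order by LinkedColumn
                  offset @offset rows fetch next @pageSize rows only", src))
            {
                cmd.Parameters.AddWithValue("@offset", offset);
                cmd.Parameters.AddWithValue("@pageSize", PageSize);
                src.Open();
                using (var adapter = new SqlDataAdapter(cmd))
                {
                    adapter.Fill(page);
                }
            }

            if (page.Rows.Count == 0) break; // no more pages

            using (var aux = new SqlConnection(auxConnStr))
            {
                aux.Open();
                using (var bulk = new SqlBulkCopy(aux) { DestinationTableName = "Staging_Source" })
                {
                    bulk.WriteToServer(page); // append this page to the aux table
                }
            }

            offset += page.Rows.Count;
        }
    }
}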
Dryadwoods
  • This is what I will do (https://stackoverflow.com/questions/48583201/compare-2-unordered-recordset-in-memory) after getting the 2 record sets into memory. – I Love Stackoverflow Feb 21 '18 at 16:28
0

If you want to read it faster, you must use the original API to get the data. Avoid frameworks like LINQ and rely on the DataReader instead. Check whether you can accept something like a dirty read (WITH (NOLOCK) in SQL Server).

If your data is very large, try to implement partial reads, for example by adding an index to your data. Maybe you can add a condition on a date range (from - to) and move the range along until everything has been selected.

After that you should consider using threading in your system to parallelize the flow: one thread to fetch job 1, another thread to fetch job 2. This will cut a lot of time.
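
A rough sketch of that last idea: each side is pulled on its own thread with Task.Run, and each reader streams rows using WITH (NOLOCK). Both sides are shown against SQL Server for brevity (the Oracle side would use the corresponding Oracle provider); connection strings are placeholders:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

class ParallelFetch
{
    // Reads (LinkedColumn, CompareColumn) pairs from one table using a plain DataReader.
    static List<KeyValuePair<string, string>> ReadAll(string connStr, string table)
    {
        var rows = new List<KeyValuePair<string, string>>();
        using (var cn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            $"select Id as LinkedColumn, CompareColumn from {table} with (nolock) order by LinkedColumn", cn))
        {
            cmd.CommandTimeout = 0;
            cn.Open();
            using (var dr = cmd.ExecuteReader())
            {
                while (dr.Read())
                {
                    rows.Add(new KeyValuePair<string, string>(dr[0].ToString(), dr[1].ToString()));
                }
            }
        }
        return rows;
    }

    public static void FetchBothSides(string sourceConnStr, string targetConnStr)
    {
        // Source and target are pulled concurrently on separate threads.
        var sourceTask = Task.Run(() => ReadAll(sourceConnStr, "Source"));
        var targetTask = Task.Run(() => ReadAll(targetConnStr, "Target"));
        Task.WaitAll(sourceTask, targetTask);

        var source = sourceTask.Result;
        var target = targetTask.Result;
        // ... compare source and target here ...
    }
}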

0

Technicalities aside, I think there is a more fundamental problem here.

select [...] order by LinkedColumn

I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with 3rd party RDBMS tables which might have an index or might not.

We would like to keep the database server as free as possible

If you cannot ensure that the DB has a tree-based index on that column, it means the DB will be quite busy sorting your millions of elements. It's slow and resource hungry. Get rid of the order by in the SQL statement and perform the sort on the application side to get results faster and reduce the load on the DB ...or ensure the DB has such an index!!!

...depending on whether this fetching is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client-side.
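
A rough sketch of the "fetch unordered, sort client-side" option, assuming the rows fit in application memory; the connection string is a placeholder:

using System.Collections.Generic;
using System.Data.SqlClient;

class ClientSideSort
{
    // Fetches the rows without ORDER BY and sorts them in application memory,
    // moving the sorting cost off the database server.
    public static List<KeyValuePair<string, string>> FetchAndSort(string connStr)
    {
        var rows = new List<KeyValuePair<string, string>>();

        using (var cn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand("select Id as LinkedColumn, CompareColumn from Source", cn))
        {
            cn.Open();
            using (var dr = cmd.ExecuteReader())
            {
                while (dr.Read())
                {
                    rows.Add(new KeyValuePair<string, string>(dr[0].ToString(), dr[1].ToString()));
                }
            }
        }

        // Sort on the application side instead of asking the DB to do it.
        rows.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));
        return rows;
    }
}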

dagnelies
0

I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.

I took a different approach.

We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.

I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.

That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.

The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.

When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.

So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.

Russ Jackson
0

If you need to process large database result sets from Java, you can opt for JDBC to give you the low level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.

A simplified example: let's assume we have a table (mapped to the class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumeric data of about ~2KB. The JVM is run with -Xmx250m; let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not-further-specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified