
I have two databases on my SQL Server instance, each currently containing a single table.

I also have two source databases, like below:

1) Db1 (MySQL)

2) Db2 (Oracle)

Now what I want to do is fill the table in my SQL Server Db1 with the data coming from the MySQL Db1, like below:

Insert into Table1 select * from Table1

Select * from Table1 (MySQL Db1) - the data comes from the MySQL database

Insert into Table1 (SQL Server Db1) - insert the data coming from the MySQL database, assuming both tables have the same schema

I don't want to use SqlBulkCopy, because I don't want to insert the data chunk by chunk. I want to insert all the data in one go, considering there are millions of rows and my operation is not limited to just inserting records into the database. Otherwise the user has to sit and wait a long time: first while millions of rows are inserted chunk by chunk, and then again for my further operation, which is also long-running.

So if I can speed this process up, my second operation will also speed up, since all the records will then be in my one local SQL Server instance.

Is this possible to achieve in a C# application?

Update: I researched linked servers, as @GordonLinoff suggested that a linked server could be used for this scenario, but based on my research it seems I cannot create a linked server through code.

I want to do this with the help of ADO.NET.

This is exactly what I am trying to do:

Consider that I have two different client RDBMSs, each with a database and some tables, on the client's premises.

So the data looks like this:

Sql Server :

Db1

Order
Id      Amount
1       100
2       200
3       300
4       400


MySQL or Oracle:

Db1:

Order
Id      Amount
1       1000
2       2000
3       3000
4       400

Now I want to compare the Amount column of the source (SQL Server) with that of the destination database (MySQL or Oracle).

I will need to join the tables of these two different RDBMSs to compare the Amount columns.

In C# I could fetch the records chunk by chunk into a DataTable (in memory) and then compare the records in code, but that will take far too long considering there are millions of records.

So I want to do something better than this.

Hence I was thinking of bringing the records of these two RDBMSs into two databases on my local SQL Server instance, then creating a join query that joins the two tables on Id, and taking advantage of the DBMS's processing capability, which can compare these millions of records efficiently.

A query like this compares millions of records efficiently:

select SqlServer.Id, Mysql.Id, SqlServer.Amount, Mysql.Amount
from SqlServerDb.dbo.[Order] as SqlServer
left join MysqlDb.dbo.[Order] as Mysql on SqlServer.Id = Mysql.Id
where SqlServer.Amount != Mysql.Amount

The above query works when I have the data of the two different RDBMSs in my local server instance, in the databases SqlServerDb and MysqlDb, and it fetches the records below, whose amounts do not match:

So I am trying to get those records, comparing the source (SQL Server DB) against MySQL, whose Amount column values do not match.

Expected Output :

Id      Amount
1       1000
2       2000
3       3000

So, is there any way to achieve this scenario?

I Love Stackoverflow
  • Use a linked server. – Gordon Linoff Dec 14 '17 at 13:17
  • @GordonLinoff Is what I have mentioned in my project possible with a linked server? – I Love Stackoverflow Dec 14 '17 at 13:43
  • It should be possible with a linked server. – Gordon Linoff Dec 15 '17 at 01:27
  • @GordonLinoff Thank you so much for the suggestion, but it looks like a linked server works only with SQL Server; it will not work if my database is in MySQL or Oracle. – I Love Stackoverflow Dec 15 '17 at 06:14
  • I interpreted "my sql server" as being a SQL Server database. – Gordon Linoff Dec 15 '17 at 12:16
  • @GordonLinoff I updated my question to include information about linked server – I Love Stackoverflow Dec 16 '17 at 14:00
  • I disagree with your opinion that dragging data out of Oracle and into SQL Server, over a database link, is going to be faster than the dedicated bulk export/import routines of each database – Caius Jard Dec 16 '17 at 20:18
  • You can use SqlBulkCopy, with an IDataReader (custom or not) from your other Database, as explained in the comment by 'Binary Worrier' to my answer here: https://stackoverflow.com/questions/4860187/c-sharp-optimisation-inserting-200-million-rows-into-database – Simon Mourier Dec 17 '17 at 08:26
  • What is your problem with doing the "chunk by chunk" insert? Is it the speed, or that it might leave data behind if it aborts? – Peter Dec 18 '17 at 12:00
  • @Peter Considering I have millions of records it will take a lot of time, and as I said I have 2 operations: 1) load data from 2 different RDBMS tables, 2) compare the records of those 2 tables based on a common column, i.e. linking the 2 common columns. – I Love Stackoverflow Dec 18 '17 at 12:06
  • @Learning-Overthinker-Confused I don't think there is much you can do about the speed, but you could wrap the entire thing in a transaction; that way you would get all or nothing (if the process is aborted). – Peter Dec 18 '17 at 12:13
  • @CaiusJard Actually my major operation is bringing data from 2 different RDBMS database tables and then comparing the records of those 2 tables based on common columns from both. Now for this comparison I have to bring the records of these 2 different RDBMS database tables into memory, e.g. into DataTables, then join the 2 DataTables and compare the records, but this will take so much time considering millions of records, and apart from that some clients might not want their data outside of their premises. – I Love Stackoverflow Dec 18 '17 at 12:16
  • @CaiusJard We already have a module which imports data from different data sources, but clients don't want to run this import process and they don't want their data outside of their premises because of data confidentiality. – I Love Stackoverflow Dec 18 '17 at 12:16
  • @Peter So my main operation here is that I want to compare the records of 2 different RDBMS database tables by linking columns. Can you suggest something for this? I have a C# application, so what would be the best way to achieve this? – I Love Stackoverflow Dec 18 '17 at 12:20
  • @Learning-Overthinker-Confused Is data ever deleted or only added/updated? – Peter Dec 18 '17 at 12:56
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/161447/discussion-between-learning-overthinker-confused-and-peter). – I Love Stackoverflow Dec 18 '17 at 12:57
  • @GordonLinoff I have added more information along with input and expected output. – I Love Stackoverflow Dec 18 '17 at 13:24
  • @SimonMourier I have updated my question with more information along with input and output. Now do you think SqlBulkCopy makes sense? – I Love Stackoverflow Dec 18 '17 at 13:26
  • If you are only dealing with these two columns from both databases, why not process this in memory rather than writing back to another database? What is the actual goal of the processing? – Adam Carr Dec 18 '17 at 23:23
  • @AdamCarr The actual goal of the processing is to store these unmatched Amount records (source vs. target database) in my application database, so that the client can use my application to get reports in various formats and also perform some other functions on these unmatched records. – I Love Stackoverflow Dec 19 '17 at 07:05
  • If you want to compare, then no, SqlBulkCopy is not a solution, but the question was about inserting, not comparing. That being said, you could indeed make it a two-step thing: 1) bulk insert and 2) compare in the same database (with proper indexes, etc.). That may be fast. It needs some testing, I suppose. – Simon Mourier Dec 19 '17 at 08:12
  • @SimonMourier Thanks for the reply, but right now we are doing exactly this bulk insert: first bringing the client database into our local SQL Server instance and then doing the comparison with the SQL query I have mentioned in my question. This approach is very fast even if there are millions of records to compare between source and target database, but the problem is that we need to bring the client database into our local SQL Server environment, and clients don't want their confidential data to go outside of their premises. – I Love Stackoverflow Dec 19 '17 at 10:01
  • @SimonMourier Another problem is that the client has to run this bulk import process every time there is a change in the database. – I Love Stackoverflow Dec 19 '17 at 10:01
  • Is using SSIS an option? – JCM Dec 20 '17 at 14:28
  • @JCM But an SSIS solution would be paid, right? And in order to support lots of clients we would have to buy SSIS licenses, right? – I Love Stackoverflow Dec 21 '17 at 06:36
  • I think the main issue will be that the data will have to be transferred over the wire to determine the joins correctly. If it's just two columns (Id, Amount) then it should be pretty quick. Have you considered querying the data into C# objects and using LINQ? – Adam Carr Dec 22 '17 at 00:17

6 Answers


On the SELECT side, create a .csv file (tab-delimited) using SELECT ... INTO OUTFILE ...

On the INSERT side, use LOAD DATA INFILE ... (or whatever the target machine syntax is).

Doing it all at once may be easier to code than chunking, and may (or may not) be faster running.
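For instance, a rough sketch of that flow driven from C#, assuming MySQL as the source and SQL Server as the target (BULK INSERT being the SQL Server counterpart of LOAD DATA INFILE); the file paths, table names and connection strings are placeholders, and the dump file written by the MySQL server has to be reachable from the SQL Server machine (e.g. via a share):

using MySql.Data.MySqlClient;
using System.Data.SqlClient;

class FileTransferLoad
{
    public static void Run(string mySqlConn, string sqlServerConn)
    {
        // 1) Export the MySQL table to a tab-delimited file on the MySQL host.
        using (var source = new MySqlConnection(mySqlConn))
        {
            source.Open();
            var export = new MySqlCommand(
                @"SELECT Id, Amount FROM `Order`
                  INTO OUTFILE '/var/lib/mysql-files/order_dump.txt'
                  FIELDS TERMINATED BY '\t'
                  LINES TERMINATED BY '\n'", source);
            export.ExecuteNonQuery();
        }

        // ... the dump file must then be made visible to SQL Server,
        //     e.g. copied to \\fileshare\order_dump.txt ...

        // 2) Load the whole file into SQL Server in a single statement.
        using (var target = new SqlConnection(sqlServerConn))
        {
            target.Open();
            var import = new SqlCommand(
                @"BULK INSERT dbo.[Order]
                  FROM '\\fileshare\order_dump.txt'
                  WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')", target)
            { CommandTimeout = 0 }; // large loads can exceed the default timeout
            import.ExecuteNonQuery();
        }
    }
}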

Rick James
  • So if my tables contain millions of records, will this SELECT ... INTO OUTFILE be fast, along with the LOAD DATA INFILE? – I Love Stackoverflow Dec 18 '17 at 11:51
  • I have updated my question to show what exactly I am trying to do. – I Love Stackoverflow Dec 18 '17 at 13:25
  • @Rick James, it is a good idea, but sometimes it doesn't work properly – er.irfankhan11 Dec 21 '17 at 05:49
  • @Irfan - "Does not work properly" -- Can you elaborate? – Rick James Dec 22 '17 at 16:33
  • I have used this approach of doing it in chunks. But if you use a DataReader to read the source and pass the reader to SqlBulkCopy, then you don't need to worry about the size of your data. I have used the reader approach for loading hundreds of millions of rows. With the chunk approach you need to carefully plan the chunk size, or else your .NET process will chew up a lot of memory. With the data reader I don't have to worry about any of that. – suresubs Dec 24 '17 at 19:12

SqlBulkCopy can accept either a DataTable or a System.Data.IDataReader as its input.

Using your query to read the source DB, set up an ADO.NET DataReader on the source MySQL or Oracle DB and pass the reader to the WriteToServer() method of SqlBulkCopy.

This can copy almost any number of rows without limit. I have copied hundreds of millions of rows using the data reader approach.
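A rough sketch of that reader-to-WriteToServer() flow, assuming the MySql.Data provider on the source side (an Oracle reader works the same way, since both implement IDataReader); the table names, column mappings and batch size are placeholders:

using MySql.Data.MySqlClient;
using System.Data.SqlClient;

class BulkLoader
{
    public static void Load(string mySqlConn, string sqlServerConn)
    {
        using (var source = new MySqlConnection(mySqlConn))
        using (var target = new SqlConnection(sqlServerConn))
        {
            source.Open();
            target.Open();

            var cmd = new MySqlCommand("SELECT Id, Amount FROM `Order`", source);
            using (var reader = cmd.ExecuteReader())          // streams rows; nothing is buffered in a DataTable
            using (var bulk = new SqlBulkCopy(target)
            {
                DestinationTableName = "dbo.[Order]",
                BulkCopyTimeout = 0,                          // no timeout for large loads
                BatchSize = 50000                             // rows per batch sent to the server
            })
            {
                bulk.ColumnMappings.Add("Id", "Id");
                bulk.ColumnMappings.Add("Amount", "Amount");
                bulk.WriteToServer(reader);                   // WriteToServer accepts any IDataReader
            }
        }
    }
}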

suresubs
  • But then every time I have to do this import process for the comparison between the 2 different RDBMS database tables, which we don't want to do. Apart from this, is there any other, better way to do it? – I Love Stackoverflow Dec 21 '17 at 06:35
  • It is very fast. It eliminates the chunk-by-chunk approach. I believe the chunk-by-chunk loading into SQL Server was your concern. Without actually importing the data into the same SQL Server instance you cannot do the comparison efficiently. We have built additional framework support in .NET to make this seamless. You can use a global temp ## table for loading and comparison easily, so you don't need any prep work other than setting up the SQL. We have parameterized the process, so we need to provide just the source SQL, the ##temp table definition and the comparison query. – suresubs Dec 21 '17 at 12:31
  • Do you think big data like Hadoop can help in this process, or a linked server as mentioned by some users on this answer? – I Love Stackoverflow Dec 23 '17 at 06:23
  • Any Hadoop sort of solution would require more effort and the setup of additional software. What I proposed above does not need any additional setup. Especially if you use a global temp table (##) as the staging table, you don't even need to go through your DBA. We have done it in an enterprise setting with lots of controls on creating new tables, linked servers and all. You can do all of that by writing simple C# code. The speed of the load is limited only by your source DB's ability to provide the data. SqlBulkCopy is incredibly fast. Any delta comparison can also be designed into this easily. – suresubs Dec 23 '17 at 12:50
  • Upvoted for your kind efforts in helping me, but can you please elaborate on your last sentence, "Any delta comparison can also be designed into this easily"? – I Love Stackoverflow Dec 27 '17 at 08:07
  • Thanks. To support delta, if you are able to identify the rows that changed in the source table, it is a matter of setting up additional staging table to just load the delta from source and perform a merge with the destination table in SQL Server instance. – suresubs Dec 27 '17 at 12:11
  • As I mentioned in my previous comments, you can parameterize the process so you supply only the source sql (delta or full extract), ##temp table definition, clean up of staging table, post-processing after the load by sqlbulkcopy, comparison sql. I maintain all of this information as name=value configuration in a separate file that the framework will read. You just set up a new config sql file for each such effort. Essentially you are using sql server as an engine for comparison - not so much for persistence. – suresubs Dec 27 '17 at 12:12
  • My application doesn't know anything about the client database tables; hence my application would only be responsible for storing those changed records from the source to the destination table. I want to implement this comparison logic between the source table and the destination table on demand. – I Love Stackoverflow Dec 29 '17 at 10:43

What about adding a changed-date column in the remote database?

Then you could get all rows that have changed since the last sync and just compare those?
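For example, a minimal sketch of fetching only the rows changed since the last sync, assuming a ChangedDate column on the remote (here MySQL) table and a stored last-sync time; all names are placeholders:

using System;
using MySql.Data.MySqlClient;

class DeltaReader
{
    public static void ReadChangedRows(string mySqlConn, DateTime lastSync)
    {
        using (var conn = new MySqlConnection(mySqlConn))
        {
            conn.Open();
            var cmd = new MySqlCommand(
                "SELECT Id, Amount FROM `Order` WHERE ChangedDate > @since", conn);
            cmd.Parameters.AddWithValue("@since", lastSync);

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Only rows modified since the last sync arrive here, so the
                    // set that has to be compared stays small.
                    long id = reader.GetInt64(0);
                    decimal amount = reader.GetDecimal(1);
                    // ... stage or compare these rows as needed ...
                }
            }
        }
    }
}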

Peter
  • Sorry, but this will be done through our application, where we want to store these kinds of records to generate reports and perform some further operations too. – I Love Stackoverflow Dec 18 '17 at 13:31

First of all, do not use a linked server. It is tempting, but it will cause more trouble than it brings to the table: updates and inserts may fetch all of the target table over to the source server, do the insert/update there, and then post all the data back to the target.

As far as I understand, you are trying to copy changed data to the target system for further processing.

I recommend using a timestamp column on the source table. Whenever anything changes in the source table, the timestamp column is updated by SQL Server.

On the target, get the max ID and the max timestamp: two queries at most.

On the source, the rows where source.ID <= target.MaxID && source.timestamp >= target.MaxTimestamp are the rows that changed after the last sync (they need an update), and the rows where source.ID > target.MaxID are the rows that were inserted after the last sync.

Now you do not have to compare two worlds; you have just the updates and the inserts.
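A minimal sketch of that high-water-mark check, assuming the source is a SQL Server database with a rowversion column (called RowVer here); the table, column and connection names are placeholders:

using System;
using System.Data.SqlClient;

class DeltaSync
{
    public static void FindChanges(string targetConn, string sourceConn)
    {
        long maxId, maxTs;

        // 1) On the target, read the high-water marks (two cheap queries).
        using (var target = new SqlConnection(targetConn))
        {
            target.Open();
            maxId = Convert.ToInt64(new SqlCommand(
                "SELECT ISNULL(MAX(Id), 0) FROM dbo.[Order]", target).ExecuteScalar());
            maxTs = Convert.ToInt64(new SqlCommand(
                "SELECT ISNULL(MAX(CAST(RowVer AS bigint)), 0) FROM dbo.[Order]", target).ExecuteScalar());
        }

        // 2) On the source, fetch only the rows updated or inserted since then.
        using (var source = new SqlConnection(sourceConn))
        {
            source.Open();
            var cmd = new SqlCommand(
                @"SELECT Id, Amount
                  FROM dbo.[Order]
                  WHERE (Id <= @maxId AND CAST(RowVer AS bigint) >= @maxTs)  -- changed rows
                     OR  Id >  @maxId                                        -- new rows", source);
            cmd.Parameters.AddWithValue("@maxId", maxId);
            cmd.Parameters.AddWithValue("@maxTs", maxTs);

            using (var reader = cmd.ExecuteReader())
            {
                // stream these rows into a staging table or into the comparison step
            }
        }
    }
}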

Erdogan Kurtur
  • Sorry, the process is like this: when one database is migrated from one system (say SQL Server) to another (MySQL), or maybe from SQL Server (Server1) to SQL Server (a server in another location), some data might change, a problem like the one I have shown in my sample records, where the Amount column of 3 records changed from the source (SQL Server) to the destination (MySQL). So now, with the help of my application, I want to find and store this type of problematic record and then run some operations like reports and other things. – I Love Stackoverflow Dec 19 '17 at 11:17
  • Moreover, there are 3 databases involved here: 2 client databases (any RDBMS like Oracle, SQL Server, MySQL, etc.) which I want to compare, and 1 application database of mine (strictly SQL Server) in which I want to store the problematic records from those 2 databases for showing reports in my application. – I Love Stackoverflow Dec 19 '17 at 11:18
  • My point is that you do not need to compare if you can trust that timestamp column (assuming it is copied as a binary value). Just comparing the ID and timestamp columns can give you every record that changed, and best of all it does not depend on the DB you are running on the target. Your requirement is to find modified records by comparing them; I'm just trying to achieve the same result without the compare. This solution is what we use when we publish bulk data to our customers: they receive the inserted/updated data and update their DB. – Erdogan Kurtur Dec 25 '17 at 14:23
  • But those will be client databases, to which we cannot dictate things like what you are suggesting. Hence we have to find some other way of comparing 2 different RDBMS tables. We are thinking of using Hadoop HDFS and MapReduce, but the problem is that there is no way to create joins between 2 HDFS files the way we link 2 tables in an RDBMS; there are tools like Hive which allow us to write SQL queries, but they don't provide APIs to run this comparison on demand. – I Love Stackoverflow Dec 29 '17 at 10:25

You need to create a linked server connection using ODBC and the proper driver; after that you can execute the queries using OPENQUERY.

Take a look at openquery:

https://msdn.microsoft.com/en-us/library/ms188427(v=sql.120).aspx
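For example, a rough sketch of setting the linked server up from C# (via sp_addlinkedserver over an ODBC DSN) and then joining through OPENQUERY; the linked-server name, DSN and table names are placeholders, and a login mapping via sp_addlinkedsrvlogin may also be needed:

using System.Data.SqlClient;

class LinkedServerExample
{
    public static void CreateAndQuery(string sqlServerConn)
    {
        using (var conn = new SqlConnection(sqlServerConn))
        {
            conn.Open();

            // Create a linked server that points at a MySQL ODBC DSN.
            new SqlCommand(
                @"EXEC sp_addlinkedserver
                      @server     = N'MYSQL_LINK',
                      @srvproduct = N'MySQL',
                      @provider   = N'MSDASQL',
                      @datasrc    = N'MyMySqlOdbcDsn'", conn).ExecuteNonQuery();

            // Join the local table to the remote one through OPENQUERY.
            var cmd = new SqlCommand(
                @"SELECT s.Id, s.Amount AS SqlServerAmount, m.Amount AS MySqlAmount
                  FROM dbo.[Order] AS s
                  LEFT JOIN OPENQUERY(MYSQL_LINK, 'SELECT Id, Amount FROM Db1.`Order`') AS m
                         ON s.Id = m.Id
                  WHERE s.Amount <> m.Amount", conn);

            using (var reader = cmd.ExecuteReader())
            {
                // rows whose Amount differs between the two systems
            }
        }
    }
}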

justcode
  • But I already mentioned, regarding linked servers, that it seems I cannot create a linked server through code. Is it possible? – I Love Stackoverflow Dec 19 '17 at 11:54
  • I think you can use the procedure sp_addlinkedserver (https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-addlinkedserver-transact-sql) to create the linked server; on this site you have a working example: https://www.mssqltips.com/sqlservertip/4570/access-mysql-data-from-sql-server-via-a-linked-server/. – justcode Dec 20 '17 at 12:38
  • These linked servers will stay on the MSSQL server; if you don't need them anymore after running your queries, you need to delete them. – justcode Dec 20 '17 at 12:40
  • But do you think a linked server will help me achieve my expected output, considering the comparison of 2 different RDBMS database tables? – I Love Stackoverflow Dec 20 '17 at 13:19
  • Yes, it will work like a normal table; you are able to insert, update, delete and do all the necessary operations. I have done that before. But to access the tables of the other database you will always need to use OPENQUERY, like SELECT * FROM OPENQUERY(linkedserver_name, 'SELECT * FROM TABLE1') – justcode Dec 20 '17 at 14:38
  • But can you please show me how I would create a join query (basically linking on 2 common columns, like SqlServer.Id = Mysql.Id) between the 2 different RDBMS database tables, like I have done in my query? – I Love Stackoverflow Dec 21 '17 at 06:39
  • SELECT T1.* FROM TABLE1 T1, OPENQUERY(linked_server_name, 'SELECT * FROM TABLE2') T2 WHERE T1.ID = T2.ID; – justcode Dec 21 '17 at 10:28
  • So a linked server is not just configurable through SSMS; I can configure it through my code as well? Because I want this operation to be fully controlled by my code. – I Love Stackoverflow Dec 21 '17 at 10:32

Yes, SQL Server is very efficient when it's working with sets so let's keep that in play.

In a nutshell, what I'm pitching is

  1. Load data from the source to a staging table on the target database (staging table = a table that temporarily holds the raw data from the source table, with the same structure as the source table... add tracking columns to taste). This will be done by your C# code: select from source_table into a DataTable, then SqlBulkCopy into the staging table.

  2. Have a stored proc on the target database to reconcile the data between your target table and the staging table. Your C# code calls the stored proc.

Given that you're talking about millions of rows, another thing that can make things faster is dropping the indexes on the staging table before inserting into it, and recreating them after the inserts and before any select is performed.
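A rough sketch of the two steps, assuming a staging table dbo.Order_Staging and a reconciliation procedure dbo.usp_ReconcileOrders on the target database (both names are placeholders); the DataTable route mirrors the description above, though for very large loads a streaming IDataReader keeps memory flatter:

using System.Data;
using MySql.Data.MySqlClient;
using System.Data.SqlClient;

class StagingReconcile
{
    public static void Run(string mySqlConn, string sqlServerConn)
    {
        // Step 1: pull the source rows and bulk-copy them into the staging table.
        var rows = new DataTable();
        using (var adapter = new MySqlDataAdapter("SELECT Id, Amount FROM `Order`", mySqlConn))
        {
            adapter.Fill(rows); // buffers the source rows in memory
        }

        using (var target = new SqlConnection(sqlServerConn))
        {
            target.Open();

            using (var bulk = new SqlBulkCopy(target)
            {
                DestinationTableName = "dbo.Order_Staging",
                BulkCopyTimeout = 0
            })
            {
                bulk.WriteToServer(rows);
            }

            // Step 2: let the database reconcile staging vs. target in one set-based pass.
            var reconcile = new SqlCommand("dbo.usp_ReconcileOrders", target)
            {
                CommandType = CommandType.StoredProcedure,
                CommandTimeout = 0
            };
            reconcile.ExecuteNonQuery();
        }
    }
}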

Goose
  • But don't you think this is a kind of additional step (overhead)? First I have to do a SQL bulk copy, bringing the data over chunk by chunk and storing it in SQL Server staging tables for the 2 different RDBMS database tables, just for the comparison. – I Love Stackoverflow Dec 23 '17 at 06:20
  • Yes for hundreds / a few thousand rows... no for millions. Loose analogy... if I put a deck of cards on your kitchen table and another deck in your living room, then gave you the task of finding the matching card from the other deck, will you finish faster by running back and forth 52 times, or will it be faster to bring the entire deck from the kitchen to the living room and then find the matches when you have both decks in the same place? – Goose Dec 23 '17 at 07:05
  • OK, I agree, but I didn't understand your 2nd point. Can you tell me something more about that? – I Love Stackoverflow Dec 23 '17 at 07:20
  • The 2nd point is a stored procedure that would contain your query and the necessary update statements that would modify your data. – Goose Dec 23 '17 at 08:55