
I have folders where approximately 3000 new CSV files come in on a daily basis, each containing between 50 and 2000 lines of information.

Currently, there is a process in place which picks these files up one at a time, takes each line one at a time, and sends it to a stored procedure to insert the contents into a database.

This means that over the course of a day, it can struggle to get through the 3000 files before the next 3000 come in!

I'm looking to improve this process and have had the following ideas:

  • Use the new Parallel features of C# 4.0 to allow multiple files to be processed at once, still passing the lines one by one to the stored proc
  • Create a temporary database table into which all the rows of a file can be inserted at once, then call the stored procedure on the newly added rows in the temp table (there's a rough sketch of these first two ideas after this list)
  • Split the process into two jobs: one to read data from the files into the temporary database table, the other to process the rows in the temporary table
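
Roughly what I have in mind for the first two ideas (not working code; the folder, connection string, staging table, and column names are placeholders, and the CSV split is naive):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Threading.Tasks;

class CsvLoader
{
    const string ConnectionString = "...";   // placeholder

    static void Main()
    {
        var files = Directory.GetFiles(@"C:\Incoming", "*.csv");   // placeholder folder

        // Idea 1: process several files at once.
        Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 4 }, LoadFile);
    }

    static void LoadFile(string path)
    {
        // In-memory table matching the staging table's (placeholder) schema.
        var table = new DataTable();
        table.Columns.Add("Col1", typeof(string));
        table.Columns.Add("Col2", typeof(string));

        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(',');   // naive split; no quoting handled
            table.Rows.Add(parts[0], parts[1]);
        }

        // Idea 2: one bulk insert per file into a staging table instead of one call per line.
        using (var connection = new SqlConnection(ConnectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.StagingRows";   // placeholder table
                bulk.WriteToServer(table);
            }
        }
        // A set-based stored procedure could then work through dbo.StagingRows in one go.
    }
}
```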

Any other ideas on how I could approach this? Currently it can take up to 20 seconds per file; I'd really like to improve the performance considerably.

Graeme
  • Is it really important that a stored procedure is used to do the insert? – Josh Smeaton Apr 03 '11 at 23:11
  • 3000 files x 1000 lines average = you are sending 3M queries to the database engine, **one at a time**, and executing 3M stored procedures (which are usually interpreted), **one at a time**, and you are asking why it is slow? Skip all your overheads, lock your database, drop your indices, and bulk load. Then re-enable. TPL will not help you here as the bottleneck is the database engine. – Stephen Chung Apr 04 '11 at 09:03
  • are you loading a data warehouse? – Stephen Chung Apr 04 '11 at 09:04

4 Answers


SQL Server Bulk Insert might be just what you need

http://msdn.microsoft.com/en-us/library/ms188365.aspx

Another issue that may be making all of those inserts take so long is that every time a row is added, your table's indexes have to be updated. A search like this will turn up lots of good articles on ways to get better performance out of your current procedure: http://www.google.com/search?q=sql+insert+performance
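
Something along these lines, as an untested sketch; the connection string, staging table, and file path are placeholders, and the file has to be readable from the SQL Server machine since BULK INSERT runs on the server:

```csharp
using System.Data.SqlClient;

class BulkInsertExample
{
    static void Main()
    {
        const string connectionString = "...";   // placeholder

        // BULK INSERT runs on the server, so the path must be visible to SQL Server.
        const string sql = @"
            BULK INSERT dbo.StagingRows            -- placeholder staging table
            FROM 'C:\Incoming\file1.csv'           -- placeholder path
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```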

JeremyWeir

You can use SQL Server's native bcp utility.

More info about the bcp utility can be found here: Importing and Exporting Bulk Data by Using the bcp Utility

You can also take a look at: About Bulk Import and Bulk Export Operations
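
As a rough sketch (untested), you could shell out to bcp from your import process; the server, database, table, and file names are placeholders, and -T assumes a trusted connection:

```csharp
using System.Diagnostics;

class BcpExample
{
    static void Main()
    {
        // Placeholder server/database/table/file names.
        // -c = character data, -t, = comma field terminator, -T = trusted connection.
        var arguments = "MyDb.dbo.StagingRows in \"C:\\Incoming\\file1.csv\" -S MyServer -T -c -t,";

        var startInfo = new ProcessStartInfo("bcp", arguments)
        {
            UseShellExecute = false,
            RedirectStandardOutput = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.StandardOutput.ReadToEnd();   // drain output so the process doesn't block
            process.WaitForExit();
        }
    }
}
```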

HABJAN
  • 9,212
  • 3
  • 35
  • 59
  • Thanks, I've used bcp before and it does look like it could be something useful for what I need to do. With bcp, do you know if it's possible to add in the name of the file for each row too? e.g. if it's processing file2.csv, it will import all the lines in that file along with the value "file2.csv" to a column in the db. I'm guessing this maybe isn't possible and that you can only insert what is in the file. – Graeme Apr 03 '11 at 21:20
  • @Graeme: I don't think this is possible. If you know your files' schema, you can create a temporary table, use bcp to quickly import the files, and then run a stored procedure that moves the data from the temp table to the real tables. Once you get the data into SQL Server, moving it around will be fast. – HABJAN Apr 03 '11 at 21:28

Let's say that all 3000 files to be imported have 2000 rows each. That's 6 million rows per day. The bottleneck might not be at the client doing the inserts, but with the database itself. If indexes are enabled on the table(s) in question, inserts could be slow, depending upon how heavily indexed the table(s) is/are. What indications have led you to conclude that it is the database which is waiting around for something to do and that it is the import routine that is lagging behind, rather than the other way around?
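
If you haven't measured it yet, even a crude timing split will tell you whether the time goes into reading the files or into the database calls. A sketch, with the stored procedure name and parameter invented for illustration:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

class ImportTimer
{
    static void Main()
    {
        var readTime = Stopwatch.StartNew();
        var lines = File.ReadAllLines(@"C:\Incoming\file1.csv");   // placeholder file
        readTime.Stop();

        var dbTime = new Stopwatch();
        using (var connection = new SqlConnection("..."))          // placeholder connection string
        {
            connection.Open();
            foreach (var line in lines)
            {
                using (var command = new SqlCommand("dbo.InsertLine", connection))  // placeholder proc
                {
                    command.CommandType = CommandType.StoredProcedure;
                    command.Parameters.AddWithValue("@line", line);                 // placeholder parameter
                    dbTime.Start();
                    command.ExecuteNonQuery();
                    dbTime.Stop();
                }
            }
        }

        Console.WriteLine("Reading: {0} ms, database calls: {1} ms",
                          readTime.ElapsedMilliseconds, dbTime.ElapsedMilliseconds);
    }
}
```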

Tim
  • Depending on what that stored procedure is doing, it might be a lot more than 6 million rows. I'm imagining a stored procedure that chops each line of input into pieces for inserting into multiple tables. Could be 2, 3, or 37 tables involved in something like that. – Mike Sherrill 'Cat Recall' Apr 04 '11 at 00:57
  • My point is that you have to identify the problem before you solve it. If you're doing heavy string manipulations in your SP you should check server CPU. You may wish to do that string manipulation in the client not on the server. If the server CPU is being pegged, a multi-threaded client would only exacerbate the problem. If your target tables are heavily indexed, you may wish to disable indexing and re-enable after the import. And what do you mean "imagining"? Is this only a hypothetical problem at this point? – Tim Apr 04 '11 at 11:24
  • I'm not disagreeing with you. I'm just saying that your estimate of 6 million lines of input a day could require 12 million, 18 million, or 222 million database inserts a day. It's not unusual to find processes that more or less take one row from a spreadsheet and break it up for inserting into multiple SQL tables. As for imagining, I'm imagining what the OP's stored procedure might be doing by processing all this data as, it seems, one transaction per line. – Mike Sherrill 'Cat Recall' Apr 04 '11 at 11:40

You said

Currently, there is a process in place which picks these files up one at a time, takes each line **one at a time**, and sends it to a stored procedure to insert the contents into a database.

(Emphasis added.)

That seems to mean one line equals one transaction.

Fix that.

  • Pre-process the files so they're acceptable for bulk loading.
  • Pre-process the files so they form valid SQL INSERT statements, and load them that way. (In a single transaction.)

I guess both of those sound like "replace your stored procedure". But the real point is to reduce the number of transactions. Either of those options would reduce the number of transactions for this process from 6 million a day (worst case) to 3000 a day.
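
A rough sketch of what the second option boils down to: wrap all of a file's inserts in a single transaction (the table, columns, and CSV handling below are invented for illustration; the point is simply one transaction per file instead of one per line).

```csharp
using System.Data.SqlClient;
using System.IO;

class OneTransactionPerFile
{
    static void LoadFile(string path, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // One transaction per file instead of one per line.
            using (var transaction = connection.BeginTransaction())
            {
                foreach (var line in File.ReadLines(path))
                {
                    var parts = line.Split(',');   // naive CSV split for illustration

                    using (var command = new SqlCommand(
                        "INSERT INTO dbo.Target (Col1, Col2) VALUES (@c1, @c2)",  // placeholder table/columns
                        connection, transaction))
                    {
                        command.Parameters.AddWithValue("@c1", parts[0]);
                        command.Parameters.AddWithValue("@c2", parts[1]);
                        command.ExecuteNonQuery();
                    }
                }

                transaction.Commit();
            }
        }
    }
}
```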

Mike Sherrill 'Cat Recall'