27

My company gets a set of CSV files full of bank account info each month that I need to import into a database. Some of these files can be pretty big. For example, one is about 33MB and about 65,000 lines.

Right now I have a symfony/Doctrine app (PHP) that reads these CSV files and imports them into a database. My database has about 35 different tables, and during the import process I take each row, split it up into its constituent objects and insert them into the database. It all works beautifully, except it's slow (each row takes about a quarter second) and it uses a lot of memory.

The memory use is so bad that I have to split up my CSV files. A 20,000-line file barely makes it in. By the time it's near the end, I'm at like 95% memory usage. Importing that 65,000 line file is simply not possible.

I've found symfony to be an exceptional framework for building applications and I normally wouldn't consider using anything else, but in this case I'm willing to throw all my preconceptions out the window in the name of performance. I'm not committed to any specific language, DBMS, or anything.

Stack Overflow doesn't like subjective questions so I'm going to try to make this as un-subjective as possible: for those of you who have not just an opinion but experience importing large CSV files, what tools/practices have you used in the past that have been successful?

For example, do you just use Django's ORM/OOP and you haven't had any problems? Or do you read the entire CSV file into memory and prepare a few humongous INSERT statements?

Again, I want not just an opinion, but something that's actually worked for you in the past.

Edit: I'm not just importing an 85-column CSV spreadsheet into one 85-column database table. I'm normalizing the data and putting it into dozens of different tables. For this reason, I can't just use LOAD DATA INFILE (I'm using MySQL) or any other DBMS's feature that just reads in CSV files.

Also, I can't use any Microsoft-specific solutions.

Jason Swett
  • 43,526
  • 67
  • 220
  • 351
  • have you done any performance analysis on the DB end in terms of how the transactions are being created/committed? – NG. Nov 12 '10 at 16:07
  • Nope. My entire import is wrapped in one big transaction. As far as the individual `INSERT` statements themselves go, I haven't done any performance analysis. Any advice there would be appreciated. (However, that alone doesn't solve my memory use problems.) – Jason Swett Nov 12 '10 at 16:09

10 Answers

19

Forgive me if I'm not exactly understanding your issue correctly, but it seems like you're just trying to get a large amount of CSV data into a SQL database. Is there any reason why you want to use a web app or other code to process the CSV data into INSERT statements? I've had success importing large amounts of CSV data into SQL Server Express (free version) using SQL Server Management Studio and using BULK INSERT statements. A simple bulk insert would look like this:

BULK INSERT [Company].[Transactions]
    FROM "C:\Bank Files\TransactionLog.csv"
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n',
        MAXERRORS = 0,
        DATAFILETYPE = 'widechar',
        KEEPIDENTITY
    )
GO
Jeff Camera
  • 5,324
  • 5
  • 44
  • 59
  • +1 Nice answer. This also uses BCP (as does my answer) but yours doesn't require coding. @Jason: If one file populates multiple tables (I think it does) then BCP into a single table and use SQL batch statements to do the split into relevant tables - should still be faster than your current solution – Paul Hadfield Nov 12 '10 at 16:13
  • 1
    The reason is because I'm not just importing an 85-column CSV spreadsheet into one 85-column database table. I'm normalizing the data and putting it into different tables. – Jason Swett Nov 12 '10 at 16:16
  • 1
    Jason: Thanks for the update, it does change things a bit, but the actual answers could still be valid. You could use the faster method available to get data into MySQL and then do the normalisation / splitting within MySQL as batch statements. – Paul Hadfield Nov 12 '10 at 16:22
  • @Jason: You might want to take a look at using a format file with the bulk insert - http://msdn.microsoft.com/en-us/library/ms191516.aspx You might also consider inserting all 85 columns from the CSV into a temporary 85 column table and then processing the data with an SQL script in order to normalize it into other tables. When you're dealing with this much data processing it in the database will be much faster than anything else. – Jeff Camera Nov 12 '10 at 16:33
  • @Paul: You're right that I could do that. The idea isn't entirely appealing to me for some reason, but it may turn out to be the fastest way to do it. – Jason Swett Nov 12 '10 at 16:33
  • @Jason: I do know what you mean - fastest way isn't always the best or most maintainable. Maybe look to add your PHP code to this question (if possible) to see if others can see the cause of the huge memory consumption (my guess would be variables not being released). Then your solution may be good enough for your needs – Paul Hadfield Nov 12 '10 at 16:44
11

I had this exact same problem about 2 weeks ago. I wrote some .NET to do ROW BY ROW inserts and by my calculations with the amount of data I had, it would take around a week to do it this way.

So instead I used a string builder to create one HUGE query and sent it to my relational system all at once. It went from taking a week to taking 5 minutes. Now I don't know what relational system you are using, but with enormous queries you'll probably have to tweak your max_allowed_packet param or similar.
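
A rough PHP sketch of the same idea for the asker's MySQL setup (the connection details, table name, and columns here are invented for illustration): accumulate quoted rows and flush them to the server as one multi-row INSERT per batch instead of one statement per row.

// Hypothetical DSN, credentials, table and columns -- adjust to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$batchSize = 1000;   // tune this against max_allowed_packet
$values = [];

$fp = fopen('transactions.csv', 'r');
while (($row = fgetcsv($fp)) !== false) {
    // Quote every field so the assembled SQL is safe to send as-is.
    $quoted = array_map([$pdo, 'quote'], $row);
    $values[] = '(' . implode(',', $quoted) . ')';

    if (count($values) >= $batchSize) {
        $pdo->exec('INSERT INTO transactions (account, posted_on, amount) VALUES ' . implode(',', $values));
        $values = [];
    }
}
if ($values) {   // flush the final partial batch
    $pdo->exec('INSERT INTO transactions (account, posted_on, amount) VALUES ' . implode(',', $values));
}
fclose($fp);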

kmarks2
  • 4,755
  • 10
  • 48
  • 77
  • @Kmarks2: sounds like an interesting solution, but take a look at my solution to this answer - whilst not relevant to Jason it may have really helped you - Bulk Insert is extremely fast and if you're using .NET then you have full control over what data is inserted (i.e. it doesn't have to come from a file) – Paul Hadfield Nov 12 '10 at 16:27
  • Interesting. How many rows was each of your `INSERT` statements inserting? (I'm on MySQL, btw.) – Jason Swett Nov 12 '10 at 16:35
  • 1
    @Jason there were around 1.5 million. – kmarks2 Nov 12 '10 at 16:41
  • Wow, for each `INSERT` statement? And each one took about 5 minutes? And you're using MySQL too? How many columns? (Sorry for all the questions.) – Jason Swett Nov 12 '10 at 16:43
  • @Paul, our case revolved around a lot of DateTime data, and the applications that connect to our RDBMS strictly require one datapoint every 5 minutes. The client who gave us the data actually gave us Excel...so I had to sanitize it, convert it to CSV, then write a parser that was intelligent enough to account for client sloppiness (datapoints every 15 minutes, huge gaps, etc.) and incorrect data. Spot on though, if your CSV is known to be well formed and correct at the outset, Bulk is definitely a good starting place. – kmarks2 Nov 12 '10 at 16:46
  • @Jason, forgot to mention, this CSV was spread over 38 text files with inconsistent formats. Sometimes the columns would get switched from one text file to the next. Had to write some markup at the top of the CSV files so the parser would know what columns needed to go where in the actual final INSERT. A LOT of intervention on the part of the parser. The client basically emptied their trash cans of Excel, crumpled it up and threw it at me. – kmarks2 Nov 12 '10 at 16:48
  • @KMarks2: I would recommend you look at the documentation for SqlBulkCopy as it does not have to use a CSV file. It is a class that you control and can populate it however you please. What it does do is give you a way to get a lot of data into SQL quickly. If you can do it with an insert statement, then you can probably do it with SQLBulkCopy – Paul Hadfield Nov 12 '10 at 16:52
  • Oh, btw it was not 5 minutes for each INSERT. That was 5 minutes for 1.5 million semicolon separated inserts sent as a single query. Thanks @Paul I'll check this out. – kmarks2 Nov 12 '10 at 16:57
  • Ah. And you just sent all those INSERTs to the database at the same time. – Jason Swett Nov 12 '10 at 16:59
6

First: 33MB is not big. MySQL can easily handle data of this size.

As you noticed, row-by-row insertion is slow. Using an ORM on top of that is even slower: there's overhead for building objects, serialization, and so on. Using an ORM to do this across 35 tables is even slower. Don't do this.

You can indeed use LOAD DATA INFILE; just write a script that transforms your data into the desired format, separating it into per-table files in the process. You can then LOAD each file into the proper table. This script can be written in any language.

Aside from that, bulk INSERT (column, ...) VALUES ... also works. Don't guess what your row batch size should be; time it empirically, as the optimal batch size will depend on your particular database setup (server configuration, column types, indices, etc.)

Bulk INSERT is not going to be as fast as LOAD DATA INFILE, and you'll still have to write a script to transform raw data into usable INSERT queries. For this reason, I'd probably do LOAD DATA INFILE if at all possible.
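
As a hedged PHP sketch of the split-then-LOAD approach (the file names, column positions, and table layouts are invented for illustration): write one intermediate CSV per target table, then LOAD each one.

// Split the source CSV into one file per target table (column indexes are hypothetical).
$in       = fopen('bank_export.csv', 'r');
$accounts = fopen('accounts.csv', 'w');
$txns     = fopen('transactions.csv', 'w');

while (($row = fgetcsv($in)) !== false) {
    fputcsv($accounts, [$row[0], $row[1]]);        // account number, holder name
    fputcsv($txns, [$row[0], $row[2], $row[3]]);   // account number, date, amount
}
fclose($in); fclose($accounts); fclose($txns);

// Then load each file into its table, e.g. from the mysql client:
//   LOAD DATA LOCAL INFILE 'accounts.csv' INTO TABLE accounts
//       FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
//   LOAD DATA LOCAL INFILE 'transactions.csv' INTO TABLE transactions
//       FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';

In practice you'd also de-duplicate a file like accounts.csv (or load it into a staging table and INSERT ... SELECT DISTINCT from there) before the final load.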

candu
  • 2,827
  • 23
  • 18
4

FWIW the following steps caused a huge speedup of my LOAD DATA INFILE:

SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION tx_isolation='READ-UNCOMMITTED';
SET sql_log_bin = 0;
#LOAD DATA LOCAL INFILE....
SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;
SET SESSION tx_isolation='REPEATABLE-READ';

See article here

Sam
  • 4,000
  • 20
  • 27
2

You can use MySQL's LOAD DATA INFILE statement; it allows you to read data from a text file and import the file's data into a database table very fast.

LOAD DATA INFILE '/opt/lampp/htdocs/sample.csv'
INTO TABLE discounts
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(title, @expired_date, discount)
SET expired_date = STR_TO_DATE(@expired_date, '%m/%d/%Y');

for more info: http://dev.mysql.com/doc/refman/5.5/en/load-data.html and http://www.mysqltutorial.org/import-csv-file-mysql-table/

R T
  • 4,369
  • 3
  • 38
  • 49
2

I don't like some of the other answers :)

I used to do this at a job.

You write a program to create a big SQL script full of INSERT statements, one per line, and then you run the script. You can save the script for future reference (a cheap log). Use gzip and it will shrink the file size by about 90%.

You don't need any fancy tools and it really doesn't matter what database you are using.

You can do a few hundred INSERTs per transaction or all of them in one transaction; it's up to you.

Python is a good language for this, but I'm sure php is fine too.

If you have performance problems some databases like Oracle have a special bulk loading program which is faster than INSERT statements.

You shouldn't run out of memory because you should only be parsing one line at a time. You have no need to hold the whole thing in memory; don't do that!
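
A minimal PHP sketch of this, assuming a made-up table and column list: parse one CSV line at a time and stream the INSERT statements straight into a gzipped .sql file, so memory use stays flat regardless of file size.

$in  = fopen('transactions.csv', 'r');
$out = gzopen('import.sql.gz', 'w');   // requires PHP's zlib extension

gzwrite($out, "BEGIN;\n");
while (($row = fgetcsv($in)) !== false) {
    // Crude quoting for the sketch; use your driver's escaping in real code.
    $vals = implode(',', array_map(function ($v) {
        return "'" . addslashes($v) . "'";
    }, $row));
    gzwrite($out, "INSERT INTO transactions (account, posted_on, amount) VALUES ($vals);\n");
}
gzwrite($out, "COMMIT;\n");
gzclose($out);
fclose($in);

// Later: zcat import.sql.gz | mysql your_database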

Glen P
  • 719
  • 5
  • 10
  • Pure genius, solved my problem. Simpler version: don't import now; create an SQL file and import later (preferably with an SQL import tool like http://www.mysqldumper.net/ to handle the actual big import). Convert, then import. – iGNEOS Sep 14 '16 at 16:59
1

You can use a generator for memory-efficient file reading. The small snippet below might help you.

#Method
public function getFileRecords($params)
{
    $fp = fopen('../' . $params['file'] . '.csv', 'r');
    //$header = fgetcsv($fp, 1000, ','); // skip header

    while (($line = fgetcsv($fp, 1000, ',')) !== false) {
        $line = array_map(function($str) {
            return str_replace('\N', '', $str);
        }, $line);

        yield $line;
    }

    fclose($fp);

    return;
}

#Implementation
foreach ($yourModel->getFileRecords($params) as $row) {
    // each $row is a numerically indexed array of CSV fields
    $yourModel->save($row);
}
Mohammad Sharaf Ali
  • 569
  • 1
  • 4
  • 19
1

If you are using SQL Server and have access to .NET, then you can write a quick application that uses the SqlBulkCopy class. I've used this in previous projects to get a lot of data into SQL very quickly. The SqlBulkCopy class makes use of SQL Server's BCP, so if you're using something other than .NET it may be worth looking into whether that option is open to you too. I'm not sure if you're using a DB other than SQL Server.

Paul Hadfield
  • 6,088
  • 2
  • 35
  • 56
0

I am reading a CSV file which has close to 1M records and 65 columns. For every 1,000 records processed in PHP, one big multi-row MySQL statement goes into the database. The writing takes no time at all; it's the parsing that does. The memory used to process this uncompressed 600MB file is about 12 MB.

Cyril Joudieh
  • 132
  • 2
  • 15
0

I need to do this too from time to time (import large non-standardized CSVs where each row creates a dozen or so related DB objects) so I wrote a python script where I can specify what goes where and how it's all related. The script then simply generates INSERT statements.

Here it is: csv2db

Disclaimer: I'm basically a noob when it comes to databases, so there might be better ways to accomplish this.

Lukas
  • 1,533
  • 16
  • 21