4

I have a function that lets users import contacts (email address and name). Some users import files with around 70,000 contacts, and the files can be either XLS or CSV. This is what I have now:

  1. The user selects a file (from their computer) that they want to import contacts from.
  2. I save the file on the server and create a database entry with a reference to the file location.
  3. Amazon SQS is used to handle this in the background.
  4. The first time the job runs, I process the file, saving only rows that have an email address (and a name if found). The data is saved to a JSON file in the same location and cached. Then I release the job back into the queue.
  5. Now the contacts are ready to be imported. I take 1000 contacts in each job and save each contact in its own row in the database. I use array_slice to skip the contacts already processed in the JSON file; the skip count is saved to the database (see the sketch after this list).
  6. When there are no contacts left, the job is deleted and everything is done.
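
To make step 5 concrete, here is roughly what the job does. This is a simplified sketch, not my exact code; $job is the queued SQS job, and $import, contactExists() and insertContact() are placeholders for my model and queries.

```php
<?php
// Simplified sketch of step 5. $import is a placeholder for the database
// entry that tracks the import (JSON file path, skip count); contactExists()
// and insertContact() stand in for the duplicate lookup and insert queries.

const BATCH_SIZE = 1000;

function handleImportJob($job, $import)
{
    // Load the pre-processed contacts created in step 4.
    $contacts = json_decode(file_get_contents($import->jsonPath), true);

    // Skip the contacts that earlier runs of this job already handled.
    $batch = array_slice($contacts, $import->skipCount, BATCH_SIZE);

    foreach ($batch as $contact) {
        // Duplicate check: one lookup + one insert per contact.
        if (!contactExists($contact['email'])) {
            insertContact($contact['email'], $contact['name']);
        }
    }

    // Remember how far we got, so the next run can continue from there.
    $import->skipCount += count($batch);
    $import->save();

    if ($import->skipCount >= count($contacts)) {
        $job->delete();   // step 6: nothing left, we're done
    } else {
        $job->release();  // put the job back on the queue for the next batch
    }
}
```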

This is pretty much the whole process. I also have a check (a database lookup) for duplicates; only unique email addresses are allowed.

The problem I have is that the jobs seem to take too much time and I get timeouts, which leads to the import taking too long.

So my question is: is there anything I can do better?

Let me know if you need anything else. I don't have much experience with big data and many users.

EDIT: I don't need code. What I'm after is more like: is the problem a server issue? Would moving the database to its own server do the trick? Or should I just use a different approach?

EDIT 2: The user can see the progress of the import, so I need to calculate the number of contacts, and to do that I first have to filter away rows with no email address. I also trim the email and name columns. When I did this, I found it easier to save the new set of data to a JSON file.
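
For clarity, that filtering step looks roughly like this ($rows is just a placeholder for the rows parsed from the CSV/XLS file):

```php
<?php
// Simplified version of the first processing pass from EDIT 2.
// $rows is a placeholder for the raw rows parsed from the CSV/XLS file,
// where $row[0] is the email column and $row[1] is the name column.

function filterContacts(array $rows)
{
    $contacts = [];

    foreach ($rows as $row) {
        $email = isset($row[0]) ? trim($row[0]) : '';
        $name  = isset($row[1]) ? trim($row[1]) : '';

        // Rows without an email address are dropped.
        if ($email === '') {
            continue;
        }

        $contacts[] = ['email' => $email, 'name' => $name];
    }

    // count($contacts) is what the progress indicator is based on.
    return $contacts;
}
```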

EDIT 3: The timeouts happen when saving contacts to the database, not in the initial processing and creation of the JSON file.

EDIT 4: One way to speed up the job might be to save the contacts in chunks from the beginning (during the first processing pass). That way I don't need to handle a skip counter and I don't have to use array_slice on the large data set. Also, now that I think about it, it's kind of stupid to save the data to a JSON file and then cache it. Why not cache the array from the beginning?
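
Something like this is what I have in mind (Cache::put() and queueImportJob() are placeholders for whatever cache and queue layer I end up using, not real APIs):

```php
<?php
// Sketch of the idea in EDIT 4: chunk the filtered contacts during the
// first processing pass and cache each chunk directly, so the import jobs
// never need a skip counter or array_slice over the full data set.
// Cache::put() and queueImportJob() are placeholders.

function prepareImport($importId, array $contacts)
{
    $chunks = array_chunk($contacts, 1000);

    foreach ($chunks as $index => $chunk) {
        // Each chunk gets its own cache key instead of one big JSON file.
        Cache::put("import:{$importId}:chunk:{$index}", $chunk, 3600);

        // One queued job per chunk; the job only needs the import id and index.
        queueImportJob($importId, $index);
    }

    return count($contacts); // total, used for the progress indicator
}
```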

tbleckert
  • why create the json file at all, why not go from the csv file straight to the db? – Dagon Apr 27 '15 at 01:03
  • That's a good question @Dagon. The user can see the progress of the import, so I need to calculate the number of contacts, and to do that I first have to filter away rows with no email address. I also trim the email and name columns. When I did this, I found it easier to save the new set of data to a JSON file. – tbleckert Apr 27 '15 at 01:07
  • I wouldn't. I would tell the user it's running and email them when it's done, and run it all in the background (to avoid timeouts). –  Apr 27 '15 at 01:13
  • @tbleckert Why don't you use a temporary table to filter away the rows with no email address, and then copy from the temporary table into your real table (see the rough sketch after these comments)? You can show the user the progress too... – Eko Junaidi Salam Apr 27 '15 at 01:14
  • Well, the main problem is that I can't do all this in one job. I need to chop the file up somehow. Maybe I don't need to save it to a JSON file, but to something else. Imagine if the file contained around 200,000 contacts instead. – tbleckert Apr 27 '15 at 01:19
  • IMO, your queueing approach is ok. The timeout is mainly caused by the code, so you do need code improvement. – ihsan Apr 27 '15 at 01:49
  • Yes, I can see that using array_slice on a really large array, looping through 1000 rows and performing two db queries per iteration (a total of 2000 operations) can be a performance issue. Hehe... But right now I don't see how I can improve it. – tbleckert Apr 27 '15 at 01:55
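
For reference, a rough sketch of the temporary-table idea from the comment above, assuming MySQL; the table and column names here are made up:

```php
<?php
// Illustration of the temporary-table suggestion, assuming MySQL via PDO.
// Table and column names (contacts_staging, contacts, email, name) are
// made up; a UNIQUE index on contacts.email is assumed for INSERT IGNORE.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

// 1. Stage the raw rows in a temporary table.
$pdo->exec("CREATE TEMPORARY TABLE contacts_staging (
    email VARCHAR(255),
    name  VARCHAR(255)
)");

// ... bulk-load the parsed CSV/XLS rows into contacts_staging here ...

// 2. Copy only usable, non-duplicate rows into the real table in one query.
$pdo->exec("INSERT IGNORE INTO contacts (email, name)
    SELECT TRIM(email), TRIM(name)
    FROM contacts_staging
    WHERE TRIM(email) <> ''");
```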

2 Answers

2

I take 1000 contacts in each job and save each contact in its own row in the database.

I've faced this problem before too; in my case I had to import about 50,000 employee attendance records, and I figured it out using parallelization. You may have noticed it as well, since you take 1000 contacts in each queued job: the real problem is the process timeout we hit when a single job takes on too much work, right?

So, my solution is to create more child processes to do the work. If I create one job that does 1,000 imports, it takes more time and runs slower. Instead, I create 100 queued jobs, each importing 100 records, and run them together. With this method your CPU load will increase, but that's not a problem if you have a high-spec machine.

My proposal is:

  1. Create more queued jobs to do the import (see the sketch below this list).
  2. Avoid too much looping.
  3. If possible, store your data in memcached, because it will speed up your process. I guess you're thinking about that too. Read about APC.
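
For the first point, something like this (pushJobToQueue() and loadFilteredContacts() are just placeholders for your own queue client and loading code):

```php
<?php
// Many small jobs instead of one big one. pushJobToQueue() and
// loadFilteredContacts() are placeholders for your own queue/loading code.

function queueImport($importId)
{
    $contacts = loadFilteredContacts($importId); // the filtered contact list

    // 100 records per job, so no single job runs long enough to time out.
    foreach (array_chunk($contacts, 100) as $index => $chunk) {
        pushJobToQueue('import-contacts', [
            'import_id' => $importId,
            'chunk'     => $index,
            'contacts'  => $chunk,
        ]);
    }
}
```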

You can read here how to store your data in memory. Hope this helps you a little :)

Eko Junaidi Salam
  • Yes, that sounds like the way to go (along with some code optimizations). I will probably also move the database to its own server and run the queue jobs on a different server as well, so a total of 3 servers (the web server being the third one). Also, I currently use Redis for cache; is memcached better? – tbleckert Apr 27 '15 at 02:06
  • Yes, sure, you can do that too. Running the queue jobs on a different server will make your import even faster :) – Eko Junaidi Salam Apr 27 '15 at 02:10
0

Is your PHP program waiting for this task to complete? That won't work; it will time out. You've noticed that.

You need to organize your operation in such a way that your PHP program starts the job on AWS SQS and then tells your user the job is running and will be done after a while. Set low user expectations ("done in 15 minutes") and then exceed them (5 minutes), rather than the other way around.

Then you need a separate operation to query the job's status to see if it's done. You could do this by arranging for the job to update a row in a table when it is done.
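
For example, something along these lines; the imports table and its columns are made up, so adapt them to your schema:

```php
<?php
// Sketch of the status-row idea (made-up table/column names, PDO assumed).
// $importId and $batchCount are placeholders set elsewhere in your code.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

// Inside the background job: bump the progress counter after each batch.
$stmt = $pdo->prepare("UPDATE imports SET imported = imported + :count WHERE id = :id");
$stmt->execute([':count' => $batchCount, ':id' => $importId]);

// When the job has nothing left to process, flag the import as finished.
$stmt = $pdo->prepare("UPDATE imports SET status = 'done' WHERE id = :id");
$stmt->execute([':id' => $importId]);

// In the web app: a cheap query the page can poll to show progress.
$stmt = $pdo->prepare("SELECT status, imported, total FROM imports WHERE id = :id");
$stmt->execute([':id' => $importId]);
$progress = $stmt->fetch(PDO::FETCH_ASSOC);
```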

O. Jones
  • Well, the timeout happens in the background. The user imports these contacts to a specific contact list, and when this happens I change the status of that list to, let's say, "importing", so the user never experiences the timeout. What happens is that when the job is done I release it back to the queue, so when it runs next time it continues. I did this to prevent timeouts, but sometimes it happens anyway. If I change it to only import 100 contacts in each job it works fine. – tbleckert Apr 27 '15 at 01:23
  • Also, when each job is done I update the contact list with the number of imported contacts, so the user sees something like "15,000 contacts left to import". – tbleckert Apr 27 '15 at 01:26