I have a function that lets users import contacts (email address and name). Some users import files with around 70,000 contacts. The files can be either XLS or CSV. This is what I have now:
- The user selects a file (from their computer) that they want to import contacts from.
- I save the file on the server and create a database entry with a reference to the file location.
- Amazon SQS is used to handle this in the background.
- The first time the job runs I process the file, keeping only rows that have an email address (and a name, if present). The data is saved to a JSON file in the same location and cached. Then I release the job back into the queue.
- Now the contacts are ready to be imported. I take 1000 contacts in each job and save each contact in its own row in the database. I use array_slice to skip the contacts in the JSON file that have already been imported; the skip count is saved to the database (see the sketch right after this list).
- When no contacts are left, the job is deleted and everything is done.
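Roughly, the per-chunk job looks like this. It is a simplified sketch, not my exact code: `$pdo`, the `contacts`/`imports` table and column names, and the boolean return value are placeholders for my real setup.

```php
<?php
// Simplified sketch of the per-chunk job. Assumes: $pdo is a PDO connection,
// $allContacts is the cached array parsed from the JSON file, and $import is
// the import row holding the current skip count.

function runChunkJob(PDO $pdo, array $allContacts, array &$import, int $chunkSize = 1000): bool
{
    // Take the next 1000 contacts, skipping the ones already imported.
    $chunk = array_slice($allContacts, $import['skip'], $chunkSize);

    if (empty($chunk)) {
        return true; // nothing left -> the job can be deleted
    }

    $insert = $pdo->prepare(
        'INSERT INTO contacts (import_id, email, name) VALUES (:import_id, :email, :name)'
    );

    foreach ($chunk as $contact) {
        // One INSERT per contact, which is what the current implementation does.
        $insert->execute([
            ':import_id' => $import['id'],
            ':email'     => $contact['email'],
            ':name'      => $contact['name'],
        ]);
    }

    // Persist the new skip count so the next job knows where to continue.
    $import['skip'] += count($chunk);
    $pdo->prepare('UPDATE imports SET skip = :skip WHERE id = :id')
        ->execute([':skip' => $import['skip'], ':id' => $import['id']]);

    return false; // more contacts left -> release the job back into the queue
}
```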
This is pretty much the whole process. I also have a check (a database lookup) for duplicates, since only unique email addresses are allowed.
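The duplicate check is just a lookup before each insert, roughly like this (again only a sketch, assuming a `contacts` table with an `email` column):

```php
<?php
// Sketch of the duplicate check: the contact is skipped if the email already exists.

function emailExists(PDO $pdo, string $email): bool
{
    $stmt = $pdo->prepare('SELECT 1 FROM contacts WHERE email = :email LIMIT 1');
    $stmt->execute([':email' => $email]);

    return $stmt->fetchColumn() !== false;
}
```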
The problem is that the jobs take too long and I get timeouts, which makes the whole import take far too much time.
So my question is: is there anything I can do better?
Let me know if you need anything else. I don't have much experience with large data sets or with many users.
EDIT: I don't need code. What I'm after is more like: is the problem a server issue? Would moving the database to its own server do the trick? Or should I just use a different approach?
EDIT 2: The user can see the progress of the import, so I need to calculate the number of contacts. To do that I first have to filter away rows with no email address, and I also trim the email and name columns. Once I had done that, it was easier to just save the new set of data to a JSON file. Roughly like the sketch below.
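A sketch of that first pass (simplified: only the CSV path is shown, and I assume column 0 is the email and column 1 is the name, which is not necessarily how my real parser maps the columns):

```php
<?php
// Sketch of the first pass: drop rows without an email, trim the email and
// name columns, write the cleaned set to JSON, and return the contact count
// (which is what the progress display is based on).

function preprocessCsv(string $csvPath, string $jsonPath): int
{
    $contacts = [];
    $handle = fopen($csvPath, 'r');

    while (($row = fgetcsv($handle)) !== false) {
        $email = trim($row[0] ?? '');
        $name  = trim($row[1] ?? '');

        // Rows without an email address are dropped.
        if ($email === '') {
            continue;
        }

        $contacts[] = ['email' => $email, 'name' => $name];
    }
    fclose($handle);

    // The cleaned set is written next to the original file and also cached.
    file_put_contents($jsonPath, json_encode($contacts));

    return count($contacts);
}
```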
EDIT 3: The timeouts happen when saving the contacts to the database, not in the initial processing and creation of the JSON file.
EDIT 4: One way to speed up the job might be to split the data into chunks from the beginning (in the first processing pass). That way I don't need to handle a skip counter and I don't have to use array_slice on the large data set. Also, now that I think about it, it's kind of silly to save it to a JSON file and then cache it. Why not cache the array from the beginning? Something like the sketch below.
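This is only an idea sketch, not working code: `cacheSet()` is a stand-in for whatever cache layer I end up with (Redis, Memcached, APCu, ...), and the key format is made up for illustration.

```php
<?php
// Sketch of the "chunk up front" idea: split the cleaned contacts into chunks
// of 1000 during the first pass and cache each chunk under its own key, so a
// per-chunk job only fetches one small array by index -- no skip counter and
// no array_slice over the full 70,000-row array.

function cacheContactChunks(array $contacts, string $importId, int $chunkSize = 1000): int
{
    $chunks = array_chunk($contacts, $chunkSize);

    foreach ($chunks as $index => $chunk) {
        // e.g. keys "import:123:chunk:0", "import:123:chunk:1", ...
        cacheSet("import:{$importId}:chunk:{$index}", $chunk);
    }

    // The number of chunks tells me how many jobs to queue.
    return count($chunks);
}

/** Stand-in for the real cache layer; placeholder implementation only. */
function cacheSet(string $key, array $value): void
{
    file_put_contents(sys_get_temp_dir() . '/' . md5($key) . '.json', json_encode($value));
}
```

Each queued job would then be given its chunk index and read exactly one cached array, instead of slicing into the whole data set every time.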