
My job requires that I look up information on a long spreadsheet that's updated and sent to me once or twice a week. Sometimes the newest spreadsheet leaves off information that was in the last one, forcing me to look through several different spreadsheets to find what I need. I recently discovered that I could convert the spreadsheet to a CSV file and then upload it to a database table. With a few lines of script, all I have to do is type in what I'm looking for and voila! Now I've just received the newest spreadsheet and I'm wondering if I can simply import it on top of the old one. There is a unique number for each row that I have set as the primary key in the database. If I import it on top of the current info, will it just skip the rows where the primary key would be duplicated, or will it mess up my database?

Thought I'd ask the experts before I tried it. Thanks for your input!

Details: the spreadsheet consists of clients of ours. Each row contains the client's name, a unique ID number, their address and contact info. I can set the column containing the unique ID as the primary key, then upload it. My concern is that there is nothing to signify a new row in a CSV file (I think). When I upload it, it gives me the option to skip duplicates, but will it skip the entire row or just that cell, causing my data to be placed in the wrong rows? It's an Apache server; I don't know what version of MySQL. I'm using 000webhost for this.
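For reference, a simplified sketch of how the table is set up (these column names are just placeholders, not the real ones):

    CREATE TABLE clients (
        client_id INT NOT NULL PRIMARY KEY,  -- the unique number from the spreadsheet
        name      VARCHAR(100),
        address   VARCHAR(255),
        phone     VARCHAR(50),
        email     VARCHAR(100)
    );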

Leaurus
cream
  • Tough to say without knowing more specifics. Most likely you could either totally wreck your data or do exactly what you want, depending on how you handle it. Can you post more details about how the data is laid out in the spreadsheet and what DB server you're using? – JGrubb Aug 02 '12 at 02:00

2 Answers


Higgs,

In database/ETL terminology, this issue is called a deduplication strategy. There is no template answer for it, but I suggest these helpful readings:

Academic paper - Joint Deduplication of Multiple Record Types in Relational Data

Deduplication article

Some open source tools:

Duke tool

Data cleaner
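If your duplicates are simply rows that share the same unique client number, one common strategy at the MySQL level is to load the new file into a staging table and then merge it into the live table, updating existing rows rather than skipping them. A rough sketch, assuming placeholder table and column names (clients, clients_staging, client_id):

    -- Load the new CSV into an empty staging table first.
    -- clients_staging has the same columns as clients; TRUNCATE it before each load.
    -- The file path and column list here are placeholders.
    LOAD DATA LOCAL INFILE '/path/to/new_export.csv'
    INTO TABLE clients_staging
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES  -- drop this line if the file has no header row
    (client_id, name, address, phone, email);

    -- Then merge: new client numbers are inserted, existing ones are updated
    -- with the latest contact info instead of being skipped.
    INSERT INTO clients (client_id, name, address, phone, email)
    SELECT client_id, name, address, phone, email
    FROM clients_staging
    ON DUPLICATE KEY UPDATE
        name    = VALUES(name),
        address = VALUES(address),
        phone   = VALUES(phone),
        email   = VALUES(email);

The advantage over a plain "skip duplicates" import is that clients who already exist get their latest address and contact info instead of keeping the stale values.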

Edmon

There's a little checkbox near the bottom when you click on Import that says "ignore duplicates" or something like that. Simpler than I thought.
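In case it helps anyone doing this without the phpMyAdmin checkbox, that option appears to correspond to the IGNORE keyword on LOAD DATA INFILE: rows whose primary key already exists are skipped as whole rows, not cell by cell, so nothing gets shifted into the wrong row. A minimal sketch with placeholder file path and column names:

    -- Rows whose client_id already exists in the table are skipped in their entirety.
    LOAD DATA LOCAL INFILE '/path/to/new_export.csv'
    IGNORE INTO TABLE clients
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES  -- drop this line if the file has no header row
    (client_id, name, address, phone, email);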

cream