
I have a database with millions of client contacts. However, a lot of them are duplicates. Could someone please advise how to identify those duplicates using Oracle SQL, PL/SQL or Excel?

Following is the data structure:

Client_Header

id integer (Primary Key)
Client_First_Name (varchar2)
Client_Last_Name (varchar2)
Client_Date_Of_Birth (timestamp)

Client_Address

Client_Id (Foreign Key ref Client_header)
Address_Line1 (varchar2)
Address_Line2 (varchar2)
Address_Line3 (varchar2)
Suburb (varchar2)
State (varchar2)
Country (varchar2)
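
In Oracle DDL terms, the structure is roughly this (the column lengths here are only indicative):

```
CREATE TABLE client_header (
    id                   INTEGER PRIMARY KEY,
    client_first_name    VARCHAR2(100),
    client_last_name     VARCHAR2(100),
    client_date_of_birth TIMESTAMP
);

CREATE TABLE client_address (
    client_id     INTEGER REFERENCES client_header (id),
    address_line1 VARCHAR2(200),
    address_line2 VARCHAR2(200),
    address_line3 VARCHAR2(200),
    suburb        VARCHAR2(100),
    state         VARCHAR2(100),
    country       VARCHAR2(100)
);
```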

My challenge is that, other than Client_Date_Of_Birth and the key fields, all fields are free text.

For example, we have a client like the following:

Surname : Jones

First name : David

Client_Date_Of_Birth: 10/05/1975

Address: Unit 10 Floor 1, 20 Railway Parade, St Peter,  NSW 2044

However, as those fields are free text, I have a lot of data issues, and the following link (JPEG file only) illustrates some of them:

Sample of data issues

Note:

  1. Other than those issues, sometimes the first name or the last name of the client (but not both) may also be missing.
  2. Sometimes multiple problems can be found within the same record.
  3. Also, sometimes the address may simply be the name of a school, shopping center, etc.
  4. The system does not store any other id that can uniquely identify the client.

I understand it is close to impossible to gather all duplicate records where the client address is a school or shopping center. However, for the other cases, is there any way to identify most of the duplicates?

Thank you for your help!

  • Don't add external links for showing sample data, only post plain text. Oracle or Postgres, which one are you using? – Kaushik Nayak May 10 '18 at 06:20
  • Your question relates to fuzzy search. Usually some third-party implementations are used. – oreh May 10 '18 at 06:37
  • This feels like it ought to be an advert for a short-term consultancy gig rather than a StackOverflow question. – APC May 10 '18 at 06:57
  • @APC, probably more like a long-term one, if there are millions of records. – BobRodes May 10 '18 at 07:04
  • @BobRodes - depends whether the gig is to undertake the actual data analysis / cleansing or just establish a framework to do it :) – APC May 10 '18 at 07:06
  • May I suggest you read [this old post of mine](https://stackoverflow.com/a/8231225/146325)? It doesn't completely address (oh ho!) all your issues but as I said that would require a level of engagement beyond what is reasonable on StackOverflow. – APC May 10 '18 at 08:39

2 Answers


Not a pretty sight, and I'm afraid I don't have good news for you.

This is a common problem in databases, especially if the data entry personnel are insufficiently trained. One of the main objectives in data entry training is to make the problem well understood and show ways to avoid it. Something to keep in mind in the future.

Unfortunately, there isn't any "magic wand" that will clean your data for you. I'm sorry, but you have before you one of the most tedious tasks in database maintenance. You're going to have to basically remove the duplicates by hand, and the job requires more of an editor than a database administrator.

If you have millions of records, of which perhaps a million are actually duplicates, I would estimate that it will take an expert working full time for at least two years -- and probably longer -- to clean up your problem: to do it in two years would require fixing 2000 records a day, with time off on weekends and two weeks of vacation.

In the end, the only sure way to remove all the duplicates is to compare all of them and remove them one at a time. But there are plenty of tricks you can use to get rid of blocks of them at once. Here are a few that I can think of with your data sample:

  1. Change "Dave" to "David" in both first and last name fields. (Make sure that nobody actually has the last name "Dave.")
  2. Change all instances of "Jones David" to "David Jones." (Make sure that there are no people named "Jones David".)
  3. Change "1/F" to "Floor 1."

The idea is to focus on some of the fields, and in those fields get all of the duplicates to be exact duplicates. Once you have that done, you delete all the records with the target values in the fields, except the one with the primary key of the record that you want to keep (if your table isn't keyed, you'll have to find another way to do it, such as selecting the top record into a new table).
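
In Oracle terms, a rough sketch of that approach might look like this (table and column names are from the question; the specific replacements and grouping columns are only illustrative, so check them against your own data first):

```
-- Illustrative normalization: turn known variants into exact duplicates first.
UPDATE client_header
   SET client_first_name = 'David'
 WHERE UPPER(TRIM(client_first_name)) = 'DAVE';  -- after confirming nobody is really named "Dave"

UPDATE client_address
   SET address_line1 = REPLACE(address_line1, '1/F', 'Floor 1');

-- Then keep only the lowest id in each group of now-identical records.
-- Child rows in client_address would have to be deleted or re-pointed first.
DELETE FROM client_header h
 WHERE h.id NOT IN (
           SELECT MIN(id)
             FROM client_header
            GROUP BY UPPER(client_first_name),
                     UPPER(client_last_name),
                     TRUNC(client_date_of_birth)
       );
```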

This technique speeds things up for records with a large number of duplicates. Where you have only a few duplicates, it's quicker to just identify them one by one. One way to do this quickly is to go into edit mode on a table, work with a particular field (for example, the postal code field in this case), and put a unique value in that field when you want to mark it for deletion (in this case, perhaps a single zero). Then you can periodically delete all the records with that value in the field.

You'll also need to sort the data in multiple ways to find the duplicates, which it appears you already know.
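
One of those sorts can be done directly in Oracle as a grouping query that lists candidate duplicate groups (the columns grouped on here are just one example):

```
SELECT UPPER(client_last_name)                      AS last_name,
       TRUNC(client_date_of_birth)                  AS dob,
       COUNT(*)                                     AS num_records,
       -- LISTAGG can overflow on very large groups; drop it if that happens
       LISTAGG(id, ', ') WITHIN GROUP (ORDER BY id) AS client_ids
  FROM client_header
 GROUP BY UPPER(client_last_name), TRUNC(client_date_of_birth)
HAVING COUNT(*) > 1
 ORDER BY num_records DESC;
```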

As for your notes, don't try to identify all the ways that the data is messed up. Once you identify one record as a duplicate of another, you don't care what's wrong with it, you just have to get rid of it. If you have two records and each contains data that you want to keep that the other one is missing, then you'll have to consolidate them and delete one of them. And then go on to the next, and the next, and the next...

BobRodes
  • Hi Bob, thank you for your help. Yes, I know it is not that easy, but you have raised a great idea: data clean-up. I will try to work out a way to bring up the quality of the data bit by bit. Thank you very much! – E. L. May 10 '18 at 22:12
  • I respectfully disagree that data cleansing should be the first step. Is CLARK KENT the same as KENT CLARK? Who can tell? We need to look at all the other signifiers (address, date of birth, etc) and decide on the balance of probabilities. Once we have turned all the JONES DAVIDs into DAVID JONESs we have lost the ability to discover whether there really are people called JONES DAVID. – APC May 11 '18 at 05:44
  • @APC I agree. I sort of mentioned that with "make sure that there are no people named 'Jones David.'" – BobRodes May 14 '18 at 22:42
  • @APC I agree with you too. I am also thinking of using some machine learning technique to convert the addresses from unstructured to structured format. Do you have any idea how to do this? I am currently learning Python to fix this issue, because once I can convert the addresses into structured data, I will be able to group records belonging to the same person much more easily. Do you think this is the right direction? – E. L. May 16 '18 at 02:32

Some years ago I had a similar task, and it took about one year to clean the data. In short, what I did:

  1. Send the addresses to api.addressdoctor.com for validation and splitting into single fields (it is also possible with maps.googleapis.com).
  2. Use a first-name and last-name match list to check the names (we used namepedia.org). A lot depends on the quality of this list; it should be based on the country of birth or of the first address. From the results we derived a probability of what kind of name it is (first/last/company).
  3. With this improved data you should create some normalized and fuzzy attributes: normalized fields from the names and address, e.g. upper-cased and reduced to alphanumeric characters only (see the first sketch after this list).
  4. At the end I would change the data model a little to improve data quality by design. I recommend adding pre-title, post-title, middle-name and post-name fields. You should also add the split address fields such as street, street number, zip, location, longitude, latitude, etc. I would also change the relation between Client_Header and Client_Address with an extra address_id as primary key, but this depends on the requirements. Finally, I would add some constraints to prevent duplicate entries.
  5. After all that, the deduplication is not hard. Just group all the normalized or fuzzy data together and create a dense_rank (I group by person, household, ...). Make a ranking over the attributes (I used data quality, data fill rate and transaction history for a score value). Finally, it is your choice whether you just want to delete the duplicates and copy the corresponding data to the surviving client record, or virtually connect the data via Client_Id in an extra field. (The first sketch below also shows the dense_rank grouping.)
  6. For insert and update processes you should create PL/SQL functions that check whether a fuzzy last name (or first name) plus fuzzy address already exists (see the second sketch below). Split the name and address fields, check them with the address APIs and match them against the names reference. If it is a single-tuple data entry, show the best results to the user and let them decide.
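
A rough Oracle sketch of the normalization and grouping idea from steps 3 and 5 (column names are taken from the question; the normalization rules and grouping columns are just examples, not the exact ones I used):

```
-- Build normalized keys, then rank candidate duplicates within each group.
-- A client with several addresses will appear once per address row.
WITH normalized AS (
    SELECT h.id,
           UPPER(REGEXP_REPLACE(h.client_first_name, '[^A-Za-z0-9]', '')) AS norm_first,
           UPPER(REGEXP_REPLACE(h.client_last_name,  '[^A-Za-z0-9]', '')) AS norm_last,
           TRUNC(h.client_date_of_birth)                                  AS dob,
           UPPER(REGEXP_REPLACE(a.address_line1 || a.suburb || a.state,
                                '[^A-Za-z0-9]', ''))                      AS norm_addr
      FROM client_header  h
      JOIN client_address a ON a.client_id = h.id
)
SELECT id,
       DENSE_RANK() OVER (ORDER BY norm_last, norm_first, dob, norm_addr) AS person_group,
       ROW_NUMBER() OVER (PARTITION BY norm_last, norm_first, dob, norm_addr
                          ORDER BY id)                                    AS keep_rank
  FROM normalized;
-- keep_rank = 1 is the candidate survivor in each person_group;
-- higher values are the duplicate candidates to review, merge or delete.
```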
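
And a minimal PL/SQL sketch of the insert/update check from step 6, assuming Oracle's standard UTL_MATCH package; the function name and the similarity threshold of 85 are just placeholders:

```
CREATE OR REPLACE FUNCTION possible_duplicate_exists (
    p_first_name IN VARCHAR2,
    p_last_name  IN VARCHAR2,
    p_dob        IN TIMESTAMP
) RETURN BOOLEAN
IS
    l_count PLS_INTEGER;
BEGIN
    -- Count existing clients with the same date of birth and a similar name.
    SELECT COUNT(*)
      INTO l_count
      FROM client_header h
     WHERE TRUNC(h.client_date_of_birth) = TRUNC(p_dob)
       AND UTL_MATCH.JARO_WINKLER_SIMILARITY(UPPER(h.client_last_name),
                                             UPPER(p_last_name)) >= 85
       AND UTL_MATCH.JARO_WINKLER_SIMILARITY(UPPER(h.client_first_name),
                                             UPPER(p_first_name)) >= 85;

    RETURN l_count > 0;  -- the caller can then show the candidate matches to the user
END possible_duplicate_exists;
/
```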
  • Hi Matthis, thanks for your suggestion. I will give it a try with dummy test data to get some sort of idea. However, unfortunately, due to privacy issues, I am not allowed to send my real data to those APIs. – E. L. Oct 26 '18 at 02:44