I am looking for rows that are the same, and I am having a problem when the birthdate field is null
.
I currently have two scripts which are able to identify the three kinds of sameness I am looking for, but I would like to do it with one.
Yes, the data could be updated to use a known date which could mean null, but I am not wanting to change the data or copy over to a new table.
I did manage to create a single script that does all the work, but it takes a horrific amount of time to run. I was only able to do so thanks to this answer. The two scripts both run in under a second, while the single script takes 26 minutes.
Script 1 which matches null birthdates but misses first names the same
SELECT last_name, birthdate, count(distinct first_name)
FROM merged_person
having count(distinct first_name) >1
GROUP BY last_name, birthdate
ORDER BY last_name;
Script 2 which misses null birthdates
SELECT *
FROM merged_person
WHERE (last_name, birthdate) IN
(SELECT last_name, birthdate
FROM merged_person
GROUP BY last_name, birthdate
HAVING COUNT(*) > 1
)
ORDER BY last_name, birthdate;
Script 3 variants which finds all results but takes far too long.
Variant 1
SELECT *
FROM merged_person
WHERE (last_name, nvl(birthdate, '0001-01-01')) IN
(SELECT last_name, nvl(birthdate, '0001-01-01')
FROM merged_person
GROUP BY last_name, birthdate
HAVING COUNT(*) > 1
)
ORDER BY last_name, birthdate;
variant 2
SELECT *
FROM merged_person
WHERE (last_name, nvl(to_char(birthdate, 'DD-MM-YYYY'), '00-00-0000')) IN
(SELECT last_name, nvl(to_char(birthdate, 'DD-MM-YYYY'), '00-00-0000')
FROM merged_person
GROUP BY last_name, birthdate
HAVING COUNT(*) > 1
)
ORDER BY last_name, birthdate;
Using nvl(birthdate, '0001-01-01')
was unable to catch all cases.
Is there a way to improve performance or another way to have null birthdates match?
Edit:
The sameness are records that have identical last name and birthdate, the rest of the fields are unimportant for the matching but they are first name, file number, employee id, sex, a bunch of hire information.
As I am working with production data, I am unable to show any rows returned.
Examples
- Bruce Banner, Born 1966, Empid 1234, Male
- Sarah Banner, Born 1966, Empid 1345, Female
- Anne Rice, Null, Empid 1134, Female
- Ben Rice, Null, Empid 1153, Male
Identifying the Banner's is easy. Its the Rice's that are causing problems.