2

I am looking for rows that are the same, and I am having a problem when the birthdate field is null.

I currently have two scripts which are able to identify the three kinds of sameness I am looking for, but I would like to do it with one.

Yes, the data could be updated to use a known date which could mean null, but I am not wanting to change the data or copy over to a new table.

I did manage to create a single script that does all the work, but it takes a horrific amount of time to run. I was only able to do so thanks to this answer. The two scripts both run in under a second, while the single script takes 26 minutes.

Script 1 which matches null birthdates but misses first names the same

SELECT last_name, birthdate, count(distinct first_name) 
FROM merged_person 
    having count(distinct first_name) >1 
GROUP BY last_name, birthdate 
ORDER BY last_name;

Script 2 which misses null birthdates

SELECT *
FROM merged_person
WHERE (last_name, birthdate) IN
    (SELECT last_name, birthdate
    FROM merged_person
    GROUP BY last_name, birthdate
    HAVING COUNT(*) > 1
    )
ORDER BY last_name, birthdate;

Script 3 variants which finds all results but takes far too long.
Variant 1

SELECT *
FROM merged_person
WHERE (last_name, nvl(birthdate, '0001-01-01')) IN
    (SELECT last_name, nvl(birthdate, '0001-01-01')
    FROM merged_person
    GROUP BY last_name, birthdate
    HAVING COUNT(*) > 1
    )
ORDER BY last_name, birthdate;

variant 2

SELECT *
FROM merged_person
WHERE (last_name, nvl(to_char(birthdate, 'DD-MM-YYYY'), '00-00-0000')) IN
    (SELECT last_name, nvl(to_char(birthdate, 'DD-MM-YYYY'), '00-00-0000')
    FROM merged_person
    GROUP BY last_name, birthdate
    HAVING COUNT(*) > 1
    )
ORDER BY last_name, birthdate;

Using nvl(birthdate, '0001-01-01') was unable to catch all cases.

Is there a way to improve performance or another way to have null birthdates match?

Edit:

The sameness are records that have identical last name and birthdate, the rest of the fields are unimportant for the matching but they are first name, file number, employee id, sex, a bunch of hire information.

As I am working with production data, I am unable to show any rows returned.

Examples

  • Bruce Banner, Born 1966, Empid 1234, Male
  • Sarah Banner, Born 1966, Empid 1345, Female
  • Anne Rice, Null, Empid 1134, Female
  • Ben Rice, Null, Empid 1153, Male

Identifying the Banner's is easy. Its the Rice's that are causing problems.

Fering
  • 322
  • 2
  • 18
  • Can you add some sample data and expected results? A few rows of data will go a long way to help us to understand what you need. – The Impaler Jun 03 '19 at 20:57
  • Look at function based indexes. They can be extremely helpful in your situation. – Sam M Jun 03 '19 at 21:26
  • 1
    What is the precise definition of "sameness" you are looking for? Don't ask us to guess from code that doesn't even do what you need. –  Jun 03 '19 at 22:17
  • 3
    I'm a bit puzzled by `nvl(birthdate, '0001-01-01')`. Why default `birthdate` to a string and not a date? – William Robertson Jun 03 '19 at 22:42
  • @WilliamRobertson It was what I saw in the linked answer and I played around with it a bit. I am not unfamiliar with Oracle but its also not the focus of my job, so I have some gaps in my knowledge. This just happened to be the first type conversion that appeared in my search results. – Fering Jun 04 '19 at 12:18

2 Answers2

2

Analytic functions can avoid the issues with self-joins and type conversions:

--People with the same last name and birth date. select first_name, last_name, birthdate from (

--All rows, with a count of people with same last name and birth date.
select first_name, last_name, birthdate,
    count(*) over (partition by last_name, birthdate) duplicate_count
from merged_person

) where duplicate_count >= 2 order by first_name, last_name, birthdate;


The original problem started with the classic NULL issue: NULL is not equal to NULL, but NULL is also not not equal to NULL. That's confusing and will take some time to understand, but it does make sense if you think of NULL as "the absence of a value".

Using NVL to avoid NULL issues is a good idea, but the implicit type conversion may have caused problems. It would be cleaner to use an ISO 8601 date literal. You're already using the right format, just put the keyword DATE in front of it.

Change:

nvl(birthdate, '0001-01-01')

To:

nvl(birthdate, date '0001-01-01')
Jon Heller
  • 34,999
  • 6
  • 74
  • 132
  • Thank you for the help. Your first suggestion is what did the trick, it found all the results as was fast, coming in under a second. The alternative NVL however was just as slow as the others at 1508 seconds. – Fering Jun 04 '19 at 13:17
0
  1. If the intention is to retrieve rows with same last name and birth date then, exclude the rows with NULL in date of birth column from the query.
  2. If this table is huge in size then, I would not suggest using IS NULL or IS NOT NULL operators. However, if you have some flag with Y/N (DOB present or not) then, you can use this flag to filter out records with no date of birth.
  3. But IS NULL or IS NOT NULL is the only way to select rows in your case. Using row level functions in SELECT clause will have some overhead but not at large as when we use them in LHS of the WHERE clause and for that matter using IS NULL and IS NOT NULL.
  4. Also, I do not see WHERE clause in your SQL statement. How big is this table? Are the SELECT columns part of index as leading columns in some index?
SatishSK
  • 56
  • 3