0

The following result was obtained from Part 1. bigquerqy sql link a common grid_id between groups

with t1 as
(
Select 'obrien-t j' lname_forename_long,11 grid_id_ct ,'grid.416153.4' grid_id,2 name_seq ,1 group_seq UNION ALL
Select 'obrien-t j',1,'grid.1002.3',1,1 UNION ALL
Select 'obrien-terence',2,'grid.1008.9',1,2 UNION ALL
Select 'obrien-terence',4,'grid.416153.4',2,2 UNION ALL
Select 'obrien-terence',1,'grid.484852.7',3,2 UNION ALL
Select 'obrien-terence j',14,'grid.1002.3',1,3 UNION ALL
Select 'obrien-terence j',25,'grid.1008.9',2,3 UNION ALL
Select 'obrien-terence j',3,'grid.1019.9',3,3 UNION ALL
Select 'obrien-terence j',9,'grid.1623.6',4,3 UNION ALL
Select 'obrien-terence j',40,'grid.237081.f',5,3 UNION ALL
Select 'obrien-terence j',1,'grid.267362.4',6,3 UNION ALL
Select 'obrien-terence j',2,'grid.414094.c',7,3 UNION ALL
Select 'obrien-terence j',1,'grid.416060.5',8,3 UNION ALL
Select 'obrien-terence j',36,'grid.416153.4',9,3 UNION ALL
Select 'obrien-terence j',4,'grid.453219.8',10,3 UNION ALL
Select 'obrien-terence j',3,'grid.454055.5',11,3 UNION ALL
Select 'obrien-terence j',6,'grid.474069.8',12,3 UNION ALL
Select 'obrien-terence j',13,'grid.481253.9',13,3 UNION ALL
Select 'obrien-terence john',1,'grid.1002.3',1,4 UNION ALL
Select 'obrien-terence john',1,'grid.1008.9',2,4 UNION ALL
Select 'obrien-terence john',1,'grid.1623.6',3,4 UNION ALL
Select 'obrien-terence john',1,'grid.237081.f',4,4 UNION ALL
Select 'obrien-terence john',2,'grid.416153.4',5,4 UNION ALL
Select 'obrien-terrence',2,'grid.416153.4',1,5 UNION ALL
Select 'obrien-terrence j',1,'grid.416153.4',1,6 UNION ALL
Select 'obrien-terry',1,'grid.137628.9',1,7 UNION ALL
Select 'obrien-terry',2,'grid.237081.f',2,7 UNION ALL
Select 'obrien-terry',1,'grid.267362.4',3,7 UNION ALL
Select 'obrien-timothy',1,'grid.496867.2',1,8 UNION ALL
Select 'obrien-timothy',3,'grid.6142.1',2,8 
) 
select *, if(count(*) over win > 0, string_agg('' || group_seq) over win, '') links
from t1
window win as (partition by grid_id)    ;

The above does not include a counts column which I think maybe needed.

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6
obrien-t j 1 grid.1002.3 1 1 1,3,4 3
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6
obrien-terence 2 grid.1008.9 1 2 2,3,4 3
obrien-terence 1 grid.484852.7 3 2 2 1
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6
obrien-terence j 14 grid.1002.3 1 3 1,3,4 3
obrien-terence j 25 grid.1008.9 2 3 2,3,4 3
obrien-terence j 40 grid.237081.f 5 3 3,4,7 3
obrien-terence j 9 grid.1623.6 4 3 3,4 2
obrien-terence j 1 grid.267362.4 6 3 3,7 2
obrien-terence j 3 grid.1019.9 3 3 3 1
obrien-terence j 2 grid.414094.c 7 3 3 1
obrien-terence j 1 grid.416060.5 8 3 3 1
obrien-terence j 4 grid.453219.8 10 3 3 1
obrien-terence j 3 grid.454055.5 11 3 3 1
obrien-terence j 6 grid.474069.8 12 3 3 1
obrien-terence j 13 grid.481253.9 13 3 3 1
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6
obrien-terence john 1 grid.1002.3 1 4 1,3,4 3
obrien-terence john 1 grid.1008.9 2 4 2,3,4 3
obrien-terence john 1 grid.237081.f 4 4 3,4,7 3
obrien-terence john 1 grid.1623.6 3 4 3,4 2
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6
obrien-terry 2 grid.237081.f 2 7 3,4,7 3
obrien-terry 1 grid.267362.4 3 7 3,7 2
obrien-terry 1 grid.137628.9 1 7 7 1
obrien-timothy 3 grid.6142.1 2 8 8 1
obrien-timothy 1 grid.496867.2 1 8 8 1

The second part is take all the names with max(link_counts)

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6

Add the names that are not in max(link_counts) = 6 choose the nmaes with the highest grid_id_ct to give.

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts
obrien-timothy 3 grid.6142.1 2 8 8 1
obrien-terry 2 grid.237081.f 2 7 3,4,7 3
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6

If any of the new names can link to the link_counts = 6 update a column the links that can intersect.

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts is_intersect_links
obrien-timothy 3 grid.6142.1 2 8 8 1
obrien-terry 2 grid.237081.f 2 7 3,4,7 3 3,4
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6 3,4
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6 3,4
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6 3,4
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6 3,4
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6 3,4
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6 3,4

Since we can now link obrien-terry to the other obrien-t..... names update his grid_id to be the same as obrien-t..... grid.416153.4

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts is_intersect_links is_merged
obrien-timothy 3 grid.6142.1 2 8 8 1 '' FALSE
obrien-terry 2 grid.416153.4 2 7 3,4,7 3 3,4 TRUE
obrien-terrence j 1 grid.416153.4 1 6 1,2,3,4,5,6 6 3,4 FALSE
obrien-terrence 2 grid.416153.4 1 5 1,2,3,4,5,6 6 3,4 FALSE
obrien-terence john 2 grid.416153.4 5 4 1,2,3,4,5,6 6 3,4 FALSE
obrien-terence j 36 grid.416153.4 9 3 1,2,3,4,5,6 6 3,4 FALSE
obrien-terence 4 grid.416153.4 2 2 1,2,3,4,5,6 6 3,4 FALSE
obrien-t j 11 grid.416153.4 2 1 1,2,3,4,5,6 6 3,4 FALSE

I also added is_merged to indicate that a grid_id was updated. I have added multiple steps to make it clear but its possible its one or two steps. I have tried multiple ways of doing this using cartesain joins, intersect distinct to find a common grid between names but they all came up short. In simple terms I am trying to find figure out how many unique obriens I have based on being able to assign them to a common grid_id which is basically an address.

I'm not sure if I have over complicated it with all the intermediate steps. I don't need all the meta data columns I just need to end up with .

lname_forename_long grid_id is_merged
obrien-timothy grid.6142.1 FALSE
obrien-terry grid.416153.4 TRUE
obrien-terrence j grid.416153.4 FALSE
obrien-terrence grid.416153.4 FALSE
obrien-terence john grid.416153.4 FALSE
obrien-terence j grid.416153.4 FALSE
obrien-terence grid.416153.4 FALSE
obrien-t j grid.416153.4 FALSE

MY effort for samuel.

with t2 as (
with t1 as
(
Select "o'brien-t j" lname,11 grid_ct ,'grid.416153.4' grid_id,2 name_seq ,1 group_seq ,'1,2,3,4,5,6' links UNION ALL
Select "o'brien-terence",2,'grid.1008.9',1,2,'' UNION ALL
Select "o'brien-terence",4,'grid.416153.4',2,2,'' UNION ALL
Select "o'brien-terence",1,'grid.484852.7',3,2,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terence j",14,'grid.1002.3',1,3,'3,7' UNION ALL
Select "o'brien-terence j",25,'grid.1008.9',2,3,'' UNION ALL
Select "o'brien-terence j",3,'grid.1019.9',3,3,'' UNION ALL
Select "o'brien-terence j",9,'grid.1623.6',4,3,'' UNION ALL
Select "o'brien-terence j",40,'grid.237081.f',5,3,'' UNION ALL
Select "o'brien-terence j",1,'grid.267362.4',6,3,'' UNION ALL
Select "o'brien-terence j",2,'grid.414094.c',7,3,'' UNION ALL
Select "o'brien-terence j",1,'grid.416060.5',8,3,'' UNION ALL
Select "o'brien-terence j",36,'grid.416153.4',9,3,'' UNION ALL
Select "o'brien-terence j",4,'grid.453219.8',10,3,'' UNION ALL
Select "o'brien-terence j",3,'grid.454055.5',11,3,'' UNION ALL
Select "o'brien-terence j",6,'grid.474069.8',12,3,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terence j",13,'grid.481253.9',13,3,'3,4' UNION ALL
Select "o'brien-terence john",1,'grid.1002.3',1,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.1008.9',2,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.1623.6',3,4,'' UNION ALL
Select "o'brien-terence john",1,'grid.237081.f',4,4,'3,4' UNION ALL
Select "o'brien-terence john",2,'grid.416153.4',5,4,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terrence",2,'grid.416153.4',1,5,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terrence j",1,'grid.416153.4',1,6,'1,2,3,4,5,6' UNION ALL
Select "o'brien-terry",1,'grid.137628.9',1,7,'' UNION ALL
Select "o'brien-terry",2,'grid.237081.f',2,7,'3,7' UNION ALL
Select "o'brien-terry",1,'grid.267362.4',3,7,'' UNION ALL
Select "o'brien-timothy",1,'grid.496867.2',1,8,'' UNION ALL
Select "o'brien-timothy",3,'grid.6142.1',2,8,''
)
 select distinct a.lname, a.grid_id
 from t1 a, t1 b
 where a.lname <> b.lname
 and a.grid_id = b.grid_id
)
 select  distinct  lname,
 grid_id ,
 DENSE_RANK() OVER
                   (
                   --PARTITION BY a.lname_init1
                   ORDER BY grid_id
                   )  seq_num,
 from t2
)
select
'matched' is_matched,
lname
,grid_id
,seq_num
from t3
group by lname  ,grid_id,seq_num
having seq_num = (select max(seq_num )x from t3)
------------------------------------------
union all
--intersect distinct
------------------------------------------
select
'not_matched' is_matched,
lname
,grid_id
,seq_num
from t3
group by lname  ,grid_id,seq_num
having seq_num != (select max(seq_num )x from t3);

My result. I could not figure out how to merge o'brien-terry to the matched group. It also missed o'brien-timothy

is_matched lname grid_id seq_num
not_matched o'brien-terence j grid.1002.3 1
not_matched o'brien-terence john grid.1002.3 1
not_matched o'brien-terence grid.1008.9 2
not_matched o'brien-terence j grid.1008.9 2
not_matched o'brien-terence john grid.1008.9 2
not_matched o'brien-terence j grid.1623.6 3
not_matched o'brien-terence john grid.1623.6 3
not_matched o'brien-terence j grid.237081.f 4
not_matched o'brien-terence john grid.237081.f 4
not_matched o'brien-terry grid.237081.f 4
not_matched o'brien-terence j grid.267362.4 5
not_matched o'brien-terry grid.267362.4 5
matched o'brien-t j grid.416153.4 6
matched o'brien-terence grid.416153.4 6
matched o'brien-terence j grid.416153.4 6
matched o'brien-terence john grid.416153.4 6
matched o'brien-terrence grid.416153.4 6
matched o'brien-terrence j grid.416153.4 6

Samuel result.

lname_forename_long grid_id_ct grid_id name_seq group_seq links link_counts is_intersect_links
obrien-t j 1 grid.1002.3 1 1 1,3,4 3 1,3,4
obrien-terence 2 grid.1008.9 1 2 2,3,4 3 2,3,4
obrien-terence j 14 grid.1002.3 1 3 1,3,4 3 1,3,4
obrien-terence john 1 grid.1002.3 1 4 1,3,4 3 1,3,4
obrien-terry 2 grid.237081.f 2 7 3,4,7 3
obrien-timothy 1 grid.496867.2 1 8 8 1

1 Answers1

3

Consider below approach

with temp as (
  select *, array_length(split(links)) link_counts, 
    array_length(split(links)) < max(array_length(split(links))) over() merge_candidate
  from (
    select *, if(count(*) over win > 1, string_agg('' || group_seq) over win, '') links
    from t1
    window win as (partition by grid_id)
  )
  qualify 1 = row_number() over(partition by group_seq order by array_length(split(links)) desc, grid_id_ct desc)
)
select lname_forename_long, grid_id, merge_candidate as is_merged  
from temp where not merge_candidate
union all
select lname_forename_long, ifnull(merged_grid_id, grid_id), if(merged_grid_id is null, false, true) 
from (
  select any_value(t1).*, 
    any_value(( select t2.grid_id
      from unnest(split(t1.links)) link
      join unnest(split(t2.links)) link
      using(link)
      limit 1
    )) as merged_grid_id
  from (select * from temp where merge_candidate) t1
  cross join (select * from temp where not merge_candidate) t2
  group by to_json_string(t1)
)
order by grid_id desc, lname_forename_long desc    

if applied to sample data in your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • I attempted to scale up by adding more lname_forename_long combinations chen-..., wang-.... variations and it still works. I had to modify the query as I added an extra column to account for more last names. The column is lname_init1 value='obrien-t' ```window win as (partition by lname_init1,grid_id) ``` It works for additional lname_forename_long but when I run the full table obrien-terry goes from is_merged true to is_merged false and keeps the original grid_id without merging. I'm still debugging trying to find what cuases it to fail when I inlcude additional lname_forename_long. – Denis The Menace Mar 27 '22 at 19:19
  • Mikahail, do you want me to accept your answer and create a new question with the extra column with a larger dataset. I'm not sure how big the sample would have to be to find where it fails. – Denis The Menace Mar 27 '22 at 19:23
  • If you have any new/follow-up question - please post it as new question and we will be happy to further help you. – Mikhail Berlyant Mar 27 '22 at 20:28
  • Hi Mikhail, I created a part iii. As mentioned in the post the code works with modification(2 columns added) for a small dataset but fails on the full dataset. I added a link to a csv file or alternatively I can provide a link to the table. Your answers are very much appreciated, I can figure out a lot of what you have done but some parts are over my head right now, I'm still looking at Part ii to see can I fix it myself but no luck so far. If you don't mind adding the add comment here and there that would be much appreciated in helping me understand what you have done. Thanks – Denis The Menace Mar 28 '22 at 15:27