Given a table A:

    name  ip
    A     1
    B     1
    C     1
    B     2
    C     2
    D     3
    E     2

If any two names share the same ip, they belong in the same group. Likewise, any ips that share a name belong in the same group. For example, if you find all names for ip 1, {A, B, C}, then you should include all ips associated with {A, B, C} in that same group, {1, 2}, then all names with those ips that aren't already included, {E}, and so forth. In this particular example, anything in {A, B, C, E} x {1, 2} would be in the same group. The result for the above table would be:

    name  ip  group
    A     1   1
    B     1   1
    C     1   1
    B     2   1
    C     2   1
    D     3   2
    E     2   1

Just to be clear:
If names A, B, and C all have ip 1, then they are grouped together and you should have
A, 1 = group1
B, 1 = group1
C, 1 = group1
If names A and B also share ip 2, then those rows should NOT form a new group but should instead be in the same group, like this:
A, 1 = group1
B, 1 = group1
C, 1 = group1
A, 2 = group1
B, 2 = group1
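
To pin down the expected output independently of SQL, here is a minimal Python sketch of the grouping rule described above. It is not the BigQuery solution being asked for, only a reference for what the result should look like. The function name `assign_groups` and the union-find approach are illustrative assumptions, not something BigQuery provides.

```python
def assign_groups(rows):
    """Label each (name, ip) row with a group number.

    Rows connected through a shared name or a shared ip, directly or via a
    chain of other rows, receive the same number. Hypothetical helper.
    """
    parent = {}

    def find(node):
        # Return the representative of node's set, registering new nodes.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each row links one name to one ip; tag the nodes so that a name and
    # an ip with the same value can never collide.
    for name, ip in rows:
        union(("name", name), ("ip", ip))

    # Number the groups in order of first appearance, starting at 1.
    group_ids = {}
    labeled = []
    for name, ip in rows:
        root = find(("name", name))
        group = group_ids.setdefault(root, len(group_ids) + 1)
        labeled.append((name, ip, group))
    return labeled


rows = [("A", 1), ("B", 1), ("C", 1), ("B", 2), ("C", 2), ("D", 3), ("E", 2)]
result = assign_groups(rows)
```

For the example table, every row except (D, 3) lands in group 1, and (D, 3) gets group 2, which matches the desired output shown earlier.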
The goal is to solve this in Google BigQuery SQL.
So far I have:

    select ip, row_number() over () as group,
           GROUP_CONCAT(name, ',') as names
    from A
    group by ip

This yields all of the names for each ip and assigns each ip a group number, but it doesn't find all the ips for a name, nor does it find the single group that encompasses all connected names and ips.
Note: you can use split to access the concatenated names (in this case separated by ',').
UPDATE - This is called transitive closure. If this is too difficult, it would be sufficient to show how to do just the first iteration of a transitive closure (how to find all the ips associated with all the names associated with each ip) and label these as groups.
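
To make the "first iteration" concrete, here is a minimal Python sketch of that single step, again only as a reference for the expected result and not as BigQuery SQL. The function name `first_iteration_groups` and the choice of the smallest reachable ip as the group label are assumptions made for illustration.

```python
from collections import defaultdict


def first_iteration_groups(rows):
    """Map each ip to a label after one expansion step: ip -> names -> ips.

    The label is the smallest ip reachable in that single step (an
    arbitrary but deterministic choice). Hypothetical helper.
    """
    names_by_ip = defaultdict(set)
    ips_by_name = defaultdict(set)
    for name, ip in rows:
        names_by_ip[ip].add(name)
        ips_by_name[name].add(ip)

    labels = {}
    for ip, names in names_by_ip.items():
        # All ips associated with any name that uses this ip.
        reachable = set().union(*(ips_by_name[n] for n in names))
        labels[ip] = min(reachable)
    return labels


rows = [("A", 1), ("B", 1), ("C", 1), ("B", 2), ("C", 2), ("D", 3), ("E", 2)]
labels = first_iteration_groups(rows)
```

On the example table this maps ips 1 and 2 to label 1 and ip 3 to label 3. A single step is enough here, but longer chains of shared names and ips would need the step repeated until the labels stop changing.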