1

Given a table

     name  ip 
A = |A     1  |
    |B     1  |
    |C     1  |
    |B     2  |
    |C     2  |
    |D     3  |
    |E     2  |

If any two names share same ip they belong in same group. Also ip with same name belong in same group. If you find all names for ip 1, {A, B, C}, then you should include all ips associated with {A,B,C} in that same group {1,2} and all then again all names with those ips that aren't already include {E} and so forth. In this particular example, anything in {A,B,C,E} x {1, 2} would be in the same group. The results for the above table would be

     name  ip  group
A = |A     1     1    |
    |B     1     1    |
    |C     1     1    |
    |B     2     1    |
    |C     2     1    |
    |D     3     2    |
    |E     2     1    |

Just to be clear:

If names A, B, and C are all ip 1 then they are grouped together and you should have

A, 1 = group1
B, 1 = group1
C, 1 = group1

If names A, B also share ip 2, then they should NOT make a new group but instead should should be in the same group like this:

A, 1 = group1
B, 1 = group1
C, 1 = group1
A, 2 = group1
B, 2 = group1

The goal is to solve this in Google BigQuery SQL.

So far I have

select ip, row_number() over () as group,
GROUP_CONCAT(name,',') as names,
from A
group by ip

which yields all of the names for an ip and gives a group, but doesn't find all the ips for a name or find the group for all pairs that encompasses all names and ips.

Note, you can use split to access names that are concatenated (in this case with a ',').

UPDATE - This is called transitive closure. If this is too difficult, it would be sufficient to show how to do just the first iteration of a transitive closure (how to find all the ips associated with all the names associated with each ip) and label these as groups.

cgnorthcutt
  • 3,890
  • 34
  • 41
  • What you are looking for is not cluster analysis. Instead, you want what isknown as **transitive closure**. It's not possible to do this in a SQL query. Instead, you need something capable of doing iterations or recursion. – Has QUIT--Anony-Mousse Apr 23 '15 at 07:27
  • Aye - I agree you can't solve a problem which requires convergence in SQL, but I am looking for an approximation - in other words, how to compute the first transitive group (so connect all the names, then all the ips of those names). Then I can repeat that iteration N times (where N is fairly small) in sql and call it "close enough." – cgnorthcutt Apr 23 '15 at 18:15
  • Strictly speaking transitive closure is not possible with relational algebra, but it is possible with SQL - using WITH RECURSIVE clause. But this is mostly theoretical, since BigQuery doesn't support such clause. – Mosha Pasumansky Apr 23 '15 at 19:27
  • Thanks I've updated the question to include an approximation as a valid answer. – cgnorthcutt Apr 23 '15 at 19:41

1 Answers1

2

Here is my solution for the first iteration. It is a bit long and might be improved, but this is what I have.

Step 1.

select name, nest(ip) ips, group_concat(string(ip)) sip from 
(select 'a' name, 1 ip),
(select 'b' name, 1 ip),
(select 'c' name, 1 ip),
(select 'b' name, 2 ip),
(select 'c' name, 2 ip),
(select 'd' name, 3 ip),
(select 'e' name, 2 ip)
group by name

Store the results in temporary table x

Step 2.

select a.name name, group_concat(b.name) as cluster from (
select a.name, b.name from (
select a.*, b.* from dataset.x a cross join dataset.x b
) omit record if every(not b.sip contains string(a.ips))
group by 1, 2 order by 1, 2) group by 1

Store the results in temporary table y

Step 3.

select cluster from (
select group_concat(part) cluster from (
select name, part from (
select a.name name, split(b.cluster) part 
from dataset.y a cross join dataset.y b
where b.cluster contains a.name) group by 1, 2 order by 1, 2) 
group by name) group by cluster

This should produce all unique clusters, i.e.

a,b,c,e
d
Mosha Pasumansky
  • 13,206
  • 5
  • 32
  • 55
  • Thanks for the attempt. However, one does not know the names are 'a','b','c','d','e' in advance (and there may be thousands/millions of names) and thus you will not be able to use those keywords in your query. – cgnorthcutt Apr 23 '15 at 21:17
  • I haven't used the actual values for the column 'name' in the solution. The first query is illustrative to use the same data set as in your example. In real one, you will use the input table instead. – Mosha Pasumansky Apr 23 '15 at 21:19
  • Ah - I will try this out and get back to you on voting and revising my comments upon success/failure then :) – cgnorthcutt Apr 24 '15 at 01:35
  • Really awesome so far, however the main issue I see with your approach is in the third step. "where b.cluster contains a.name" this will fail if there are two names where one name is a subset of the other. For example b.cluster might contain johnny,susan and a.name = john, and this will be true, when it shouldn't be since john is a different user than johnny but the string 'johnny,susan' certainly contains 'john' – cgnorthcutt Apr 30 '15 at 03:55
  • The way to workaround it is to surround the result of group_concat with ",", i.e. concat(",", group_concat(...), ",") and then look for concat(",", a.name, ",") instead of a.name (this assumes that comma cannot happen inside field values, if it can another character should be chosen). – Mosha Pasumansky Apr 30 '15 at 04:08
  • Nice. This is what I did.....where right(b.cluster, length(a.name) + 1) = ','+a.name or left(b.cluster, length(a.name) + 1) = a.name+',' or b.cluster contains ','+a.name+',' – cgnorthcutt Apr 30 '15 at 04:37
  • I wasn't able to find a more succinct way though? SPLIT only seems to work in select clause, not in where clause. – cgnorthcutt Apr 30 '15 at 04:38