Different results when distinct count by different time periods

Question

I am trying to get a count of unique visitors. I first checked it by total without separating it by anytime frame.

Main table (big data table sample):

+-----------+----+-------+
|theDateTime|vD  | vis   |
+----------------+-------+
|2018-10-03 |123 |abc    |
|2018-10-04 |123 |abc    |
|2018-10-04 |123 |pqr    |
|2018-10-05 |123 |xyz    |
+-----------+----+-------+

the total distinct count of the above will be 3 but when I group by day abc is counted twice. First on the 3rd and then on the 2nd. I just want the first one counted.

My Query for total:

select
  d.eId AS vD
  , COUNT(DISTINCT visitorId) AS vis
 from decisions  
 WHERE d.eId = 123 
 AND timestamp BETWEEN unix_timestamp('2018-10-03 00:00:00')*1000 AND 
 unix_timestamp('2018-10-06 12:17:00')*1000
 GROUP BY d.eId
 ORDER BY vId

My Results:

+----+---------+
| vD | vis     |
+----+---------+
|123 | 3       |
+----+---------+

My Query by Day:

select DISTINCT
cast(from_unixtime(timestamp DIV 1000) AS date) AS theDateTime
, d.eId AS vD
, COUNT(DISTINCT visitorId) AS vis
from decisions  
WHERE timestamp BETWEEN unix_timestamp('2018-10-03 00:00:00')*1000 AND 
unix_timestamp('2018-10-06 12:17:00')*1000
AND d.eId IN (11550123588)
GROUP BY cast(from_unixtime(timestamp DIV 1000) AS date), 
d.vD
ORDER BY vD, theDateTime

My Results:

+-----------+----+-------+
|theDateTime|vD  | vis   |
+----------------+-------+
|2018-10-03 |123 |   1   |
|2018-10-04 |123 |   2   |
|2018-10-05 |123 |   1   |
+-----------+----+-------+

The total of this is 1122585. WHich is more than the total sum

I know this is because just incase the visitor is repeated on a different day and when I group by day he is counted twice. Is there a way for me to not count the visitor on day 2 if he has already been counted on day 1?

Please help!

What does your data look like? – sramalingam24 Oct 13 '18 at 20:43 — sramalingam24, Oct 13 '18 at 20:43
@sramalingam24 I have added that. – noobeerp Oct 13 '18 at 22:46 — noobeerp, Oct 13 '18 at 22:46

score 0 · Answer 1 · answered Oct 14 '18 at 01:45

If I understood this correctly, you just need a different view of the data.

val df = Seq(("2018-10-03",123,"abc"),
("2018-10-04",123,"abc"),
("2018-10-05",123,"pqr"),
("2018-10-06",123,"xyz")).toDF("theDateTime","vD","vis").withColumn("theDateTime", $"theDateTime".cast("timestamp"));

df.show

import org.apache.spark.sql.functions._
val df1 = df.groupBy("vis").pivot("vD").agg(min("theDateTime")).sort($"123")
df1.show

+---+-------------------+
|vis|                123|
+---+-------------------+
|abc|2018-10-03 00:00:00|
|pqr|2018-10-05 00:00:00|
|xyz|2018-10-06 00:00:00|
+---+-------------------+

Now if you group by "123" you will be able to get the unique count per day. Does this help?

score 0 · Answer 2 · answered Oct 14 '18 at 03:56

If I understand correctly, you can do this in SQL with a subquery:

select min_dt, count(distinct visitorId) AS vis
from (select eid, vis, min(thedatetime) as min_dt
      from decisions d
      where d.eid = 123 and . . .
      group by vis, eid
     ) d
group by min_dt

Different results when distinct count by different time periods

2 Answers2