6

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT) returns the same value as COUNT():

incorrect count value for userid when connector is postgres

My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce:

correct count value for userid when connector is a csv file

My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:

SELECT
  userid,
  submissionid,
  date
FROM mytable

Workaround

If I stop using the default date field for the Date Dimension and aggregate my own dates directly in within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:

SELECT
  userid,
  submissionid,
  to_char(date,'YYYY-IW') as date_byweek
FROM mytable

While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention reducing my confidence at what else may be "buggy" within the product


How to Reproduce

If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:

> SELECT * FROM mytable
  userid  submissionid
-------- -------------
       1             1
       2             2
       1             3
       1             4
       3             5

> COUNT(DISTINCT userid) -- ERROR:    Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
Paulo
  • 416
  • 4
  • 11

3 Answers3

2

I'm happy to report that as of Sep 17 2020, there's a workaround.

DataStudio added the DATETIME_TRUNC function (see here https://support.google.com/datastudio/answer/9729685?), that allows you to add a custom field that truncs the original date to whatever granularity you want, without causing the distinct bug.

Attempting to set the display granularity in the report still causes the bug (i.e., you'll still set Oct 1 2020 12:00:00 instead of Oct 2020).

This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.

yassa
  • 68
  • 4
1

I have the same issue with MySQL Connector. But my problem is solved, when I change date field format in DB from DATETIME (YYYY-MM-DD HH:MM:SS) to INT (Unixtimestamp). After connection this table to the Googe Datastudio I set type for this field as Date (YYYYMMDD) and all works, as expected. Hope, this may help you :)

  • Ditto - if the date returned by the DB isnt in a format Data Studio can parse (e.g YYYYMMDD), then it treats it like a string I suppose and does funky things with the distincts. – Sean Sep 02 '20 at 19:49
0

In this Google forum there is a curious solution by Damien Choizit that involves combining your data source with itself. It works well for me.

https://support.google.com/datastudio/thread/13600719?hl=en&msgid=39060607

It says:

I figured out a solution in my case: I used a Blend Data joining twice the same data source with corresponding join key(s), then I specified a data range dimension only on the left side and selected the columns I wanted to CTD aggregate as "dimensions" (and not metric!) on the right side.