Firebase events dedup in Big Query - best practices?

Question

There seems to be 1-2% of duplicates in the Firebase analytics events exported to Big Query. What are the best practices to remove these?

Atm the client does not send a counter with the events (per session). This would provide an unambiguous way of removing duplicate events, so I recommend Firebase implementing that. However, at the moment, what would be a good way to remove the duplicates? Look at client user_pseudo_id, event_timestamp, and event_name - fields and remove all except one with same triple?

How does event_bundle_sequence_id -field work? Will duplicates have the same value in this field, or different? That is, are duplicate events sent within the same bundle, or in different bundles?

Is Firebase planning to remove these duplicates earlier in the processing, either for Firebase analytics itself, or in the export to Big Query?

Standard SQL to check for duplicates in one days events:

with n_dups as
(
SELECT event_name, event_timestamp, user_pseudo_id, count(1)-1 as n_duplicates
FROM `project.dataset.events_20190610`
group by event_name, event_timestamp, user_pseudo_id
)
select n_duplicates, count(1) as n_cases
from n_dups
group by n_duplicates
order by n_cases desc

It would be great if someone from @google could provide an answer on this as we are facing the same issue. Firebase data should have an event_id with a logical unique key, that we could use to deduplicate. As an example, this is how it can be done in snowplow: https://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-event-ids/ — Rogier Werschkull, Jun 30 '20 at 05:42
@Rogier Werschkull you may use hash of user_pseudo_id and event_bundle_sequence_id as an event_id. — Ayrat, Aug 18 '21 at 12:17
Also facing the same issue. But why should we deduplicate (if not as a temporary measure)? It's a problem on @google 's end and I don't want to take more query cost and time because I have to cleanup their garbage first — Giorgio Terreni, Sep 10 '21 at 06:40
at a minimum, (user_pseudo_id, event_bundle_sequence_id, event_name) — astrojuanlu, Aug 15 '22 at 13:39

Maksym Skorupskyi · Answer 1 · 2022-11-01T11:55:20.397

We use the QUALIFY clause for deduplication Firebase events in BigQuery:

SELECT
  *
FROM
  `project.dataset.events_*`
QUALIFY
  ROW_NUMBER() OVER (
    PARTITION BY
      user_pseudo_id,
      event_name,
      event_timestamp,
      TO_JSON_STRING(event_params)
    ) = 1

Qualifying columns:

  - name: user_pseudo_id
    description:  Autogenerated pseudonymous ID for the user -
                  Unique identifier for a specific installation of application on a client device,
                  e.g. "938642951.1666427135".
                  All events generated by that device will be tagged with this pseudonymous ID,
                  so that you can relate events from the same user together.

  - name: event_name
    description:  Event name, e.g. "app_launch", "session_start", "login", "logout" etc.

  - name: event_timestamp
    description:  The time (in microseconds, UTC) at which the event was logged on the client,
                  e.g. "1666529002225262".

  - name: event_params
    description:  A repeated record (ARRAY) of the parameters associated with this event.

Firebase events dedup in Big Query - best practices?

1 Answers1