The GA4 update: matching big query data with Google analytics
UPDATE: to get the right number of sessions you need to contact user psudeo id and ga session id
After my series of matching big query data with Universal Analytics being helpful. I thought with the forced update to GA4, and the ability for all to get raw data I would start the series again with GA4 data.
Google provides access to their sample GA4 data from their merchandise store: https://www.googlemerchandisestore.com/. The data set is bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*
This post will look at:
Users
New Users
Sessions
Number of Session per User
Page Views
Pages/Sessions
Avg. Session Duration
Bounce Rate
Users
Though users should be fairly easy to calculate because GA4 is all user-based, there could be a discrepancy between BQ and GA4 if you use Google Signals. See this article: https://support.google.com/analytics/answer/9445345?hl=en#zippy=%2Cin-this-article
As Google states:
Google Analytics 4 exports event data associated with anonymous cookies to BigQuery. Theoretically, this can result in the same user being counted multiple times across different devices. In contrast, Google signals measures users across devices. Therefore, data exported to BigQuery might show more users when compared with reports based on Google signals data.
But let’s keep things simple and assume that you didn’t activate Google Signals. Here we are getting the users for the month of January 2021
SELECT COUNT(DISTINCT user_pseudo_id) AS user_count, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131"
New Users
When people say that GA4 is event based, you better believe it. Everything is an event. Event if it is new users. So if you want to could how many new users you count the user ids that trigger that event:
SELECT count(distinct case when event_name = "first_visit" then user_pseudo_id end) as new_users FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131"
Sessions
Also your sessions are events, but just make sure you count the timestamp and the user id or you would just get the number of users.
SELECT count(distinct case when event_name = "session_start" then concat(user_pseudo_id,event_timestamp) end) as sessions FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131"
Pageviews
Though I find this metric useless without context (like which pages) some people like it as it usually a big number you can shout about. Again your page views are events so you can do the same as sessions.
SELECT count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end) as page_views FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131"
Bounces and Bounce Rate
The bounce rate has been removed from GA4 in favour of the engagement rate (See below). Some people will still want bounce rate though. Here we are assuming that a bounce is if they only viewed 1 page, as there isn’t interactive/non-interactive events so we don’t want to count events.
SELECT count(distinct case when page_views = 1 then ga_session_id end) /count(distinct concat(user_pseudo_id, ga_session_id)) as session_bounce_rate from( SELECT user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id, count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end) as page_views FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131" group by 1,2)
Engagement rate
So what we really should be using is engagement rate. Which is easier to get than bounce rate.
SELECT sum(cast(session_engaged as int))/ count(distinct concat(user_pseudo_id, ga_session_id)) as session_engagement_rate from( SELECT user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id, max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged, FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131" group by 1,2)
Average Session Duration
Again this has changed slightly to engaged session duration. This will be lower than your Universal Analytics session duration as it only counts when the tab is in focus. Below I show how to do both.
SELECT sum(engagement_time_msec)/1000 #in milliseconds /count(distinct concat(user_pseudo_id,ga_session_id)) as ga4_session_duration, sum(end_time-start_time)/1000000 #timestamp in microseconds /count(distinct concat(user_pseudo_id,ga_session_id)) as ua_session_duration, from( SELECT user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id, max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec, min(event_timestamp) as start_time, max(event_timestamp) as end_time FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131" group by 1,2)
Number of Sessions per User
This one is pretty easy after a couple of hard ones before. Just dived session by user.
SELECT count(distinct case when event_name = "session_start" then concat(user_pseudo_id,event_timestamp) end)/ count(distinct user_pseudo_id) as sessions_per_user FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131"
Pages per Session
Again fairly similar to the previous - divid page views by sessions.
SELECT count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end)/ count(distinct case when event_name = "session_start" then concat(user_pseudo_id,event_timestamp) end) as page_per_session FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131"
Putting it all together
So now if you want them all in one query, this time I’m using with clauses for the nested queries, here you go:
with session_details as ( SELECT count(distinct case when page_views = 1 then ga_session_id end) /count(distinct concat(user_pseudo_id, ga_session_id)) as session_bounce_rate, sum(cast(session_engaged as int))/ count(distinct concat(user_pseudo_id, ga_session_id)) as session_engagement_rate, sum(engagement_time_msec)/1000 #in milliseconds /count(distinct concat(user_pseudo_id, ga_session_id)) as ga4_session_duration, sum(end_time-start_time)/1000000 #timestamp in microseconds /count(distinct concat(user_pseudo_id, ga_session_id)) as ua_session_duration from( SELECT user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id') as ga_session_id, count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end) as page_views, max((select value.string_value from unnest(event_params) where key = 'session_engaged')) as session_engaged, max((select value.int_value from unnest(event_params) where key = 'engagement_time_msec')) as engagement_time_msec, min(event_timestamp) as start_time, max(event_timestamp) as end_time FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` where _table_suffix BETWEEN "20210101" and "20210131" group by 1,2) ) SELECT COUNT(DISTINCT user_pseudo_id) AS user_count, count(distinct case when event_name = "first_visit" then user_pseudo_id end) as new_users, count(distinct case when event_name = "session_start" then concat(user_pseudo_id,event_timestamp) end) as sessions, count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end) as page_views, max(session_bounce_rate) as session_bounce_rate, max(session_engagement_rate) as session_engagement_rate, max(ga4_session_duration) as ga4_session_duration, max(ua_session_duration) as ua_session_duration, count(distinct case when event_name = "session_start" then concat(user_pseudo_id,event_timestamp) end)/ count(distinct user_pseudo_id) as sessions_per_user, count(distinct case when event_name = "page_view" then concat(user_pseudo_id,event_timestamp) end)/ count(distinct case when event_name = "session_start" then concat(user_pseudo_id,event_timestamp) end) as page_per_session FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` cross join (select * from session_details) where _table_suffix BETWEEN "20210101" and "20210131"
And there we are. I will keep on updating the queries from the UA series. If there is anything in particular you would like let me know.