Replicating The Google Analytics All Pages Report In BigQuery
Editor's Note: This article is about Universal Analytics properties, which will stop processing data on July 1, 2023 (July 2024 for Analytics 360 properties). For Tracking Single Page Applications with Google Analytics 4 Properties, please refer to this article instead.
While BigQuery is often the perfect tool for doing data science and machine learning with your Google Analytics data, it can sometimes be frustrating to query basic web analytics metrics. In this post, I’ll walk through calculating some fundamental metrics at the page level by replicating the All Pages report for the Google Merchandise Store in BigQuery.
I will be using Google’s sample dataset for the Google Merchandise store for August 1, 2016. You can compare this data to data in Google Analytics by viewing the Google Analytics demo account. I will be using the standard SQL dialect.
Word of warning: Although this post discusses the All Pages report – probably the most popular report in Google Analytics – it is not for the faint of heart. We will discuss a lot of technical details surrounding Google Analytics definitions, the SQL language, and the format of the BigQuery data. If you have not worked in BigQuery before, you may want to start off with a gentler introduction.
Primary Dimension – Page
In BigQuery the page dimension, or URL, is stored in the field hits.page.pagePath. Recall that Google Analytics data is stored at the session level. To access hit level information, we will need to unnest our table by hits. The query below shows all URLs that were visited by users.
SELECT
hits.page.pagePath
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits
GROUP BY
hits.page.pagePath
Pageviews
Next, we want to add in the number of pageviews associated with each URL. If you look at the Google Analytics schema, the only pageview related field you will see is totals.pageviews. Use extreme caution when using this field! It is a session level field (it’s counting the total number of pageviews in the entire session) and it should only be mixed with hit level fields with extreme care. This metric is not useful for replicating the All Pages report. Instead, we will need to calculate this metric ourselves.
So, what is a pageview and how do we calculate it? We often think of pageviews as the number of times that a page was loaded in the browser. In Google Analytics, this is measured by the number of times we send a “pageview” hit to Google Analytics. (Recall there are other types of hits that can be sent to Google Analytics, such as event, social, timing, and transaction hits.)
To calculate pageviews in Google Analytics, we need to count the number of times a hit of type PAGE is associated with each URL. As above, we will use the field hits.page.pagePath to identify the URL. Then filter the results by setting the hits.type field equal to ‘PAGE’ and count up the hits.
SELECT
hits.page.pagePath,
COUNT(*) AS pageviews
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits
WHERE
hits.type = 'PAGE'
GROUP BY
hits.page.pagePath
ORDER BY
pageviews DESC
Unique Pageviews
In Google Analytics, unique pageviews represent the number of sessions during which a page was viewed. Once again, we want to focus on the hits.type equals ‘PAGE’, but rather than count the hits, now we want a count distinct on the number of sessions.
To count unique sessions, we need a unique way of identifying a session. The visitId field looks promising for this. However, this identifier has some issues when a session ends a midnight, so I recommend using visitStartTime instead. Note that the visitStartTime will only be unique per user, so we need to concatenate the visitStartTime with the fullVisitorId to get a globally unique session ID.
Using this session identifier, we can now compute unique pageviews.
SELECT
pagepath,
COUNT(*) AS pageviews,
COUNT(DISTINCT session_id) AS unique_pageviews
FROM (
SELECT
hits.page.pagePath,
CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS session_id
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits
WHERE
hits.type = 'PAGE')
GROUP BY
pagePath
ORDER BY
pageviews DESC
Average Time on Page
The formula for average time on page is: Total Time on Page / (Pageviews - Exits)
.
We will need to compute each of the components of this formula separately, then compute the average using the formula above.
Pageviews
Easy – we have already calculated this above.
SELECT
hits.page.pagePath,
COUNT(*) AS pageviews
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits
WHERE
hits.type = 'PAGE'
GROUP BY
hits.page.pagePath
ORDER BY
pageviews DESC
Exits
Fortunately, exit pages are designated with the hits.isExit field. We just need to count these up to get total exits on the page.
SELECT
pagePath,
SUM(exits) AS exits
FROM (
SELECT
hits.page.pagePath,
CASE
WHEN hits.isExit IS NOT NULL THEN 1
ELSE 0
END AS exits
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
GROUP BY
pagePath
ORDER BY
exits DESC
Total Time on Page
This one is tricky. Recall that Google Analytics calculates the time on page by comparing the timestamps of hits sent in on different pages. For non-exit pageviews, the time on page is computed by subtracting the timestamp of the next pageview hit from the timestamp of the current pageview hit. For any pageview that is an exit, we compute the time on page as the difference between the timestamp of the last interaction hit minus the timestamp of the pageview hit.
If a page is an exit page and it’s the only page of a session, and there are no interaction events, then this page is considered a bounce and does not contribute to total time on page. Put more simply, bounces do not affect the avg time on page metric.
The timestamp information for each hit can be found in the hits.time field. Note that this field is measured in milliseconds, so you will need to divide the final time by 1,000 to get back to seconds. For each pageviews, we need to compute two additional columns – the timestamp for the next page as well as the timestamp for the last interaction hit.
We can compute the timestamp for the next pageview (within that session) by using a lead window function.
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
hits.time / 1000 AS hit_time
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits
WHERE
hits.type = 'PAGE')
We can create a new column with the timestamp for the last interaction hit by finding the maximum timestamp out of all of the interaction events. Interaction hits are designated by the field hits.isInteraction.
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
MAX(IF(hits.isInteraction IS NOT NULL,
hits.time,
0)) OVER (PARTITION BY fullVisitorId, visitStartTime) as last_interaction
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801',
UNNEST(hits) AS hits
WHERE
hits.type = 'PAGE'
Be careful when combining these two queries. Filtering in the where clause is a little tricky because we need to include events to get a last interaction hit value. Therefore, we need to be a little patient before calculating the next_pageview timestamp.
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
last_interaction,
LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
last_interaction
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
hits.type,
hits.isExit,
hits.time / 1000 AS hit_time,
MAX(IF(hits.isInteraction IS NOT NULL,
hits.time / 1000,
0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
WHERE
type = 'PAGE')
Now we can compute time on page by comparing these two columns to the hits.time of the pageview.
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
CASE
WHEN isExit IS NOT NULL THEN last_interaction - hit_time
ELSE next_pageview - hit_time
END AS time_on_page
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
last_interaction,
LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
last_interaction
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
hits.type,
hits.isExit,
hits.time / 1000 AS hit_time,
MAX(IF(hits.isInteraction IS NOT NULL,
hits.time / 1000,
0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
WHERE
type = 'PAGE'))
Put It All Together
Finally, we are ready to aggregate and compute the average time on page.
SELECT
pagePath,
pageviews,
exits,
total_time_on_page,
CASE
WHEN pageviews = exits THEN 0
ELSE total_time_on_page / (pageviews - exits)
END AS avg_time_on_page
FROM (
SELECT
pagePath,
COUNT(*) AS pageviews,
SUM(IF(isExit IS NOT NULL,
1,
0)) AS exits,
SUM(time_on_page) AS total_time_on_page
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
CASE
WHEN isExit IS NOT NULL THEN last_interaction - hit_time
ELSE next_pageview - hit_time
END AS time_on_page
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
last_interaction,
LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
hit_time,
type,
isExit,
last_interaction
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
hits.type,
hits.isExit,
hits.time / 1000 AS hit_time,
MAX(IF(hits.isInteraction IS NOT NULL,
hits.time / 1000,
0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
WHERE
type = 'PAGE')))
GROUP BY
pagePath)
ORDER BY
pageviews DESC
Entrances
Fortunately, entrances by page are easy to compute. There is a field called hits.isEntrance that we can use to determine whether that pageview is an entrance.
SELECT
pagePath,
SUM(entrances) AS entrances
FROM (
SELECT
hits.page.pagePath,
CASE
WHEN hits.isEntrance IS NOT NULL THEN 1
ELSE 0
END AS entrances
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
GROUP BY
pagePath
ORDER BY
entrances DESC
Bounce Rate
Bounce rate is another complicated field in BigQuery. The formula for bounce rate is: Bounces / Sessions
. We will need to compute bounces and sessions separately.
Bounces
Bounces are attributed to the first interaction hit in a session in which there is exactly one interaction event. We can determine if there was exactly one interaction event in the session by using the totals.bounces field. Now we just need to find the first interaction hit in the session.
The hits.hitNumber field will work well for this task. We will just need to use a window function to identify the hit number for the first interaction hit in the session. (Note that we can determine if a hit is an interaction hit by using the hits.isInteraction field.)
SELECT
fullVisitorId,
visitStartTime,
pagePath,
CASE
WHEN hitNumber = first_interaction THEN bounces
ELSE 0
END AS bounces
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
totals.bounces,
hits.hitNumber,
MIN(IF(hits.isInteraction IS NOT NULL,
hits.hitNumber,
0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_interaction
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
Sessions
Sessions are attributed to the first hit (interaction or not) in a session where there is at least one interaction event. (Keep in mind that sessions are closely related to but not the same as entrances). Fortunately, we can use the totals.sessions field to identify whether there is at least one interaction event. Now we just need to identify the first hit in the session.
To find the first hit in the session, it is very tempting to just check if hits.hitNumber equals 1. Unfortunately, like visitId, this field does not restart at midnight. So, we will need to use a window function to identify the first hit number in the session. We can then use this column to compute the number of sessions associated with each page.
SELECT
fullVisitorId,
visitStartTime,
pagePath,
CASE
WHEN hitNumber = first_hit THEN visits
ELSE 0
END AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
totals.visits,
hits.hitNumber,
MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits)
Aggregate and Combine
Now we just need to aggregate bounces and sessions then divide.
select
pagePath,
bounces,
sessions,
CASE
WHEN sessions = 0 THEN 0
ELSE bounces / sessions
END AS bounce_rate
from (
SELECT
pagePath,
SUM(bounces) AS bounces,
SUM(sessions) AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
pagePath,
CASE
WHEN hitNumber = first_interaction THEN bounces
ELSE 0
END AS bounces,
CASE
WHEN hitNumber = first_hit THEN visits
ELSE 0
END AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
hits.page.pagePath,
totals.bounces,
totals.visits,
hits.hitNumber,
MIN(IF(hits.isInteraction IS NOT NULL,
hits.hitNumber,
0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_interaction,
MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits))
GROUP BY
pagePath)
ORDER BY
sessions DESC
% Exit
The formula for % Exit is: Exits / Pageviews
.
We have already calculated pageviews and exits above. Now we just need to combine these into a single query.
SELECT
pagePath,
pageviews,
exits,
CASE
WHEN pageviews = 0 THEN 0
ELSE exits / pageviews
END AS exit_rate
FROM (
SELECT
pagepath,
COUNT(*) AS pageviews,
SUM(exits) AS exits
FROM (
SELECT
hits.page.pagePath,
CASE
WHEN hits.isExit IS NOT NULL THEN 1
ELSE 0
END AS exits
FROM
'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
UNNEST(GA.hits) AS hits
WHERE
hits.type = 'PAGE')
GROUP BY
pagePath)
ORDER BY
pageviews DESC
Page Value
Calculating page value is beyond the scope of this post. In general, I would recommend using the Google Analytics Core Reporting API to pull the page value or create your own custom content scoring method.
Accessing standard Google Analytics metrics through BigQuery can be more painful and time-consuming than using the Core Reporting API. However, BigQuery can give you the power to add more advanced reporting and analysis to the standard Google Analytics reports, and adding in those standard metrics can help provide context to your analysis.