What's Missing In The Google Analytics BigQuery Export Schema?
Google Analytics (via reports or the API) typically deals with aggregated data, where metrics are already summed and averaged for you, and you can easily request a tabular report of (say) Sessions by Date. On the other hand, the Google Analytics BigQuery Export Schema contains a wealth of raw data.
While the raw data opens up infinite possibilities, it also means that most Google Analytics Metrics and some Dimensions are not included in the export. These pieces of information, such as ga:hasSocialSourceReferral
, ga:channelGrouping
, ga:daysSinceLastSession
, ga:pagePathLevel1
, ga:landingPagePath
, ga:contentGroupXX
, and ga:previousPagePath
need to be computed.
Perhaps most importantly, the goals that we’ve configured inside of Google Analytics are not stored in BigQuery and will need to be computed from scratch.
Brief Recap of the BigQuery Schema
BigQuery is a structured, table-based SQL database. In the BigQuery export, each row represents a session. Inside each session is the hit, custom dimensions, and other information about the session and hits. Below is an illustration of some of the fields within the export.
Note that the session-level custom dimensions hits are repeated within the session and how the hit-level custom dimensions are repeated within each hit; this is one of the special properties of BigQuery: repeated fields.
Also note how the custom dimensions, hits, and totals have named fields within them; this is another one of BigQuery’s special properties: nested records.
Recreating Metrics
Since no metrics are contained within BigQuery, let us first examine methods to compute them. If you’re familiar with Custom Reports or the Google Analytics API, then you’re familiar with the concept of having metrics computed for a group of dimensions. In BigQuery, and SQL at large, that concept translates into aggregates and GROUP BY
.
A simple example is the number of sessions per day (metric=ga:sessions
, dimension=ga:date
). We need to aggregate the sessions in the export by counting them, GROUP
ed BY
date
.
SELECT date,
SUM(totals.visits) AS sessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY date
date | sessions |
---|---|
120130910 | 63 |
SELECT date, Filters can be performed with a WHERE
clause, when filtering data in the table, or HAVING
, when filtering an aggregate value. To count only sessions with transactions, we can filter on totals.transactions
.
SUM(totals.visits) AS sessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE totals.transactions > 0
GROUP BY date
date | sessions |
---|---|
120130910 | 16 |
HAVING
is similar to a WHERE
but works on aggregate values, e.g. metrics. Using the above example, we could find all days with more than 70 sessions. (Since the LondonCycleHelmet dataset is only a single day, the results are immediately useful.)
SELECT date,
SUM(totals.visits) AS sessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY date
HAVING sessions > 70
date | sessions |
---|---|
Zero Results returned |
Now let us consider a slightly more advanced example: computing the metric ga:percentNewSessions
with dimensions of ga:medium
, we can aggregate the number of sessions by a count of all and a count of new sessions, GROUP
ed BY
trafficSource.medium
.
(We, unfortunately, cannot aggregate by averaging the new sessions flag in the export, because it is set to NULL
, instead of 0, if the session is not new; and aggregates ignore null values, i.e. average would return 1.)
SELECT trafficSource.medium,
SUM(totals.newVisits) / SUM(totals.visits) AS percentNewSessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY trafficSource.medium
trafficSource_medium | percentNewSessions |
---|---|
referral | 0.5714285714285714 |
organic | 0.4782608695652174 |
cpc | 0.6 |
(none) | 0.7222222222222222 |
We can also compute multiple metrics at once. Additionally, we are not limited to the 7 dimensions and 10 metrics that the Google Analytics API limits us to.
SELECT
CONCAT(trafficSource.source,"/",trafficSource.medium) AS sourceMedium,
newSessions / sessions AS percentNewSessions,
bounces / sessions AS bounceRate
FROM (
SELECT
trafficSource.source,
trafficSource.medium,
SUM(totals.visits) AS sessions,
SUM(totals.newVisits) AS newSessions,
SUM(totals.bounces) AS bounces,
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY
trafficSource.source,
trafficSource.medium)
sourceMedium | percentNewSessions | bounceRate |
---|---|---|
technologysauce.com/referral | 0.5714285714285714 | 0.14285714285714285 |
google/organic | 0.4782608695652174 | 0.21739130434782608 |
google/cpc | 0.6 | 0.2 |
(direct)/(none) | 0.7222222222222222 | 0.1111111111111111 |
Interesting New Metrics
Since we can leverage all of BigQuery against our raw data, we can compute metrics that dont’t exist in Google Analytics or the API, e.g. quantiles or bucketing/binning.
Here we’ll compute the 25%ile, median, and 75%ile of ga:sessionDuration
(totals.timeOnSite
in BigQuery). We’re also showing of the ability, and common usage pattern, of using a subquery to do a calculation, or convert data for usage later on in the query.
SELECT
trafficSource.medium,
SUM(totals.visits) AS sessions,
-- 5 will give the min, 25%, 50%, 75%, max with 20% error
-- the more buckets, the better the approximation (error = 1/number of buckets)
-- at the cost of more computation
-- QUANTILES returns all of the buckets, use NTH to extract the bucket you want
NTH(2, QUANTILES(totals.timeOnSite, 5)) AS firstQuartile,
NTH(3, QUANTILES(totals.timeOnSite, 5)) AS mean,
NTH(3, QUANTILES(totals.timeOnSite, 5)) AS thirdQuartile
FROM (
SELECT
trafficSource.medium,
totals.visits,
-- Sessions with a single page view will have no time on site reported
IF(totals.timeOnSite IS NULL, 0, totals.timeOnSite) AS totals.timeOnSite
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
GROUP BY
trafficSource.medium
trafficSource_medium | sessions | firstQuartile | mean | thirdQuartile |
---|---|---|---|---|
referral | 7 | 7 | 20 | 20 |
organic | 23 | 5 | 18 | 18 |
cpc | 15 | 13 | 23 | 23 |
(none) | 18 | 8 | 18 | 18 |
Now we’ll bucket/bin our data in preparation for a histogram.
SELECT
trafficSource.medium,
SUM(totals.visits) AS sessions,
SUM(IF(timeOnSite_bucket5sec = 0,1,0)) AS B0_4,
SUM(IF(timeOnSite_bucket5sec = 1,1,0)) AS B5_9,
SUM(IF(timeOnSite_bucket5sec = 2,1,0)) AS B10_14,
SUM(IF(timeOnSite_bucket5sec = 3,1,0)) AS B15_19,
SUM(IF(timeOnSite_bucket5sec = 4,1,0)) AS B20_24,
SUM(IF(timeOnSite_bucket5sec = 5,1,0)) AS B25_29,
SUM(IF(timeOnSite_bucket5sec > 5,1,0)) AS B30_
FROM (
SELECT
trafficSource.medium,
totals.visits,
-- Sessions with a single page view will have no time on site reported
FLOOR(IF(totals.timeOnSite IS NULL, 0, totals.timeOnSite) / 5) AS timeOnSite_bucket5sec
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
GROUP BY
trafficSource.medium
trafficSource_medium | sessions | B0_4 | B5_9 | B10_14 | B15_19 | B20_24 | B25_29 | B30_ |
---|---|---|---|---|---|---|---|---|
referral | 7 | 1 | 1 | 0 | 1 | 1 | 1 | 2 |
organic | 23 | 5 | 3 | 2 | 2 | 2 | 1 | 8 |
cpc | 15 | 3 | 0 | 1 | 1 | 3 | 1 | 6 |
(none) | 18 | 2 | 4 | 1 | 2 | 1 | 0 | 8 |
Dimensions
Some dimensions are simple to compute, while others require more ingenuity. Often these require subqueries, like we have seen earlier.
Extracting Dimensions from Other Fields
ga:pagePathLevel1
is an example of an easy-to-extract dimension. It is the first segment of the ga:pagePath
(hits.page.pagePath
in BigQuery), which we can pull out using SPLIT
and NTH
.
SELECT
hits.page.pagePath,
NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1,
NTH(2, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel2
FROM
-- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath)
hits_page_pagePath | pagePathLevel1 | pagePathLevel2 |
---|---|---|
/helmets/foldable.html | helmets | foldable.html |
/ | null | null |
/vests/ | vests | null |
/vests/yellow.html | vests | yellow.html |
However, we still need to bring along values we want to use later. For instance, if we wanted to compute the bounce rate per ga:pagePathLevel1
, we should bring along totals.bounces
.
SELECT
pagePathLevel1,
ROUND(SUM(totals.bounces) / SUM(totals.visits), 2) AS bounceRate
FROM (
SELECT
totals.visits,
-- If no session bounced, then SUM of nulls is null, not 0
IF(totals.bounces IS NULL, 0, 1) AS totals.bounces,
NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1,
FROM
-- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath))
GROUP BY
pagePathLevel1
pagePathLevel1 | bounceRate |
---|---|
helmets | 0.07 |
null | 0.09 |
vests | 0.03 |
login.html | 0.0 |
Beyond simple transformations on fields, sometimes we want to create “arbitrary” groupings. In the case of Content Groupings that can be based on data already in the export (e.g. URL or Custom Dimension), we can tease it out via a CASE
statement, among other ways.
SELECT
hits.page.pagePath,
CASE
WHEN pagePathLevel1 = 'helmets' THEN "Products"
WHEN pagePathLevel1 = 'vests' THEN "Products"
WHEN pagePathLevel1 IN ('basket.html', 'shipping.html', 'billing.html', 'confirm.html') THEN "Checkout"
WHEN pagePathLevel1 IS NULL OR pagePathLevel1 = 'login.html' THEN "RestOfSite"
ELSE 'Unknown'
END AS contentGroup1,
CASE
WHEN pagePathLevel2 = 'foldable.html' THEN "Foldable"
WHEN pagePathLevel2 IN ('orange.html', 'yellow.html') THEN "Color"
END AS contentGroup2
FROM (
SELECT
fullVisitorId,
visitID,
hits.page.pagePath,
NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1,
NTH(2, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel2,
FROM
-- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath))
GROUP BY
hits.page.pagePath,
contentGroup1,
contentGroup2
hits_page_pagePath | contentGroup1 | contentGroup2 |
/helmets/foldable.html | Products | Foldable |
/ | RestOfSite | null |
/vests/ | Products | null |
/vests/yellow.html | Products | Color |
/vests/orange.html | Products | Color |
/login.html | RestOfSite | null |
/basket.html | Checkout | null |
Slightly More Interesting Example
ga:landingPagePath
is similarly easy to find using row-scoped aggregates; what that means is that for each row, we’re going to compute an aggregate based on a repeated value, in this case the hits.page.pagePath
and hits.page.type
. Since a session need not start with a pageview, we can’t simply grab all the hits with a hit.hitNumber
equal to 1. To solve this problem, we’ll take the first hits.page.pagePath
where hits.page.type
is “PAGE”.
SELECT
fullVisitorId,
visitId,
-- Uses the knowledge that hits are stored in chronological order
FIRST(IF(hits.type = "PAGE", hits.page.pagePath, NULL)) WITHIN RECORD AS landingPagePath
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
fullVisitorId | visitId | landingPagePath |
---|---|---|
380066991751227408 | 1378805776 | /helmets/foldable.html |
712553853382222331 | 1378804218 | /vests/ |
881288060286722202 | 1378803865 | /helmets/ |
881288060286722202 | 1378804975 | / |
More Complex Example
For a metric like ga:daysSinceLastSession
that requires knowledge of more than one session, in this case, a user’s previous session, we can use windowing functions.
Windowing functions allow us to compute a value for the current row given the value of other rows in the “window” of data we’re looking at. In the following example, LAG
returns the row before the row being looked at as defined by PARTITION BY fullvisitorid ORDER BY visitStartTime ASC
. PARTITION BY
acts much like a GROUP BY
clause, in that it creates new windows for each unique value of the fields listed. ORDER BY
sorts the window according to the fields and order listed.
We can read LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC)
as “looking at only the sessions for a given user, ordered by visitStartTime
, what was the visitStartTime
previous to the the row I’m currently on”.
SELECT
fullvisitorid,
visitId,
DATEDIFF(SEC_TO_TIMESTAMP(visitStartTime), SEC_TO_TIMESTAMP(prevVisitStartTime)) AS daysSinceLastSession,
FLOOR((visitStartTime - prevVisitStartTime)/60) as minutesSinceLastSession
FROM (
SELECT
fullvisitorid,
visitId,
visitStartTime,
LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC) AS prevVisitStartTime
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
fullvisitorid | visitId | daysSinceLastSession | minutesSinceLastSession |
---|---|---|---|
3960256913998800485 | 1378805348 | null | null |
4158255675143559999 | 1378818879 | null | null |
4158255675143559999 | 1378819831 | 0 | 15.0 |
We are finding the visitStartTime
of the previous session chronologically, and combining it with the visitStartTime
of the current session to compute the days between the previous and current session. Above prevVisitStartTime
will be NULL
if there is no previous session, causing DATEDIFF
return a NULL
as well.
(The LondonCycleHelmet dataset is only of a single day, so we will only see NULL
for the first session and 0 for the subsequent ones. That is why I also included minutesSinceLastSession
to make the output a little more exciting!)
Even More Complicated Dimensions
Some dimensions, such as ga:channelGrouping
and ga:hasSocialSourceReferral
, require more computation that you can (or are willing to) write as a SQL statement. For this we leverage User Defined Functions, (UDFs) in BigQuery. UDFs are pieces of JavaScript that run in V8 on the same machine your data is on. For more information see my post, Self-Joins, Windowing, and User Defined Functions in BigQuery, or the BigQuery Documentation. UDFs take in a row of data, and return 0 or more rows.
In our case, we can use them to compute the Default Channel Grouping found in Google Analytics.
UDF:
function computeDefaultChannelGroup(row, output)
{
if( row.trafficSourceSource == '(direct)'
&& ( row.trafficSourceMedium == '(not set)'
|| row.trafficSourceMedium == '(none)')
) {
output("Direct");
}
else if (row.trafficSourceMedium=="organic") {
output("Organic");
}
else if (row.trafficSourceMedium=="referral") {
output("Referral");
}
else if (row.trafficSourceMedium=="email")
{
output("Email");
}
else if ( ( row.trafficSourceMedium == "cpv"
|| row.trafficSourceMedium == "cpa"
|| row.trafficSourceMedium == "cpp"
|| row.trafficSourceMedium == "content-text")
) {
output("Other Advertising");
}
else if ( ( row.trafficSourceMedium == "cpc"
|| row.trafficSourceMedium == "ppc"
|| row.trafficSourceMedium == "paidsearch")
&& row.adNetworkType != "Content"
) {
output("Paid Search");
}
else if ( ( row.trafficSourceMedium == "social"
|| row.trafficSourceMedium == "social-network"
|| row.trafficSourceMedium == "social-media"
|| row.trafficSourceMedium == "sm"
|| row.trafficSourceMedium == "social network"
|| row.trafficSourceMedium == "social media")
|| false // Social Source Referral exactly match Yes
) {
output("Social");
}
else if ( ( row.trafficSourceMedium == "display"
|| row.trafficSourceMedium == "cpm"
|| row.trafficSourceMedium == "banner")
|| row.adNetworkType == "Content"
) {
output("Display");
}
else
{
output("Unknown");
}
}
function defaultChannelGroup(row, emit)
{
function output(channelGroup) {
emit({channelGroup:channelGroup,
fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
});
}
computeDefaultChannelGroup(row, output);
}
bigquery.defineFunction(
'defaultChannelGroup', // Name of the function exported to SQL
['trafficSourceMedium', 'trafficSourceSource', 'fullVisitorId', 'visitId','adNetworkType'], // Names of input columns
[{'name': 'channelGroup', 'type': 'string'}, // Output schemac
{'name': 'fullVisitorId', 'type': 'string'},
{'name': 'visitId', 'type': 'integer'},
],
defaultChannelGroup // Reference to JavaScript UDF
);
SQL:
SELECT
fullVisitorId,
visitId,
channelGroup
FROM defaultChannelGroup(
SELECT
trafficSource.Medium AS trafficSourceMedium,
trafficSource.Source AS trafficSourceSource,
fullVisitorId,
visitId,
-- LCH is too old to have the needed field
-- trafficSource.adwordsClickInfo.adNetworkType
NULL AS adNetworkType
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
fullVisitorId | visitId | channelGroup |
---|---|---|
380066991751227408 | 1378805776 | Referral |
712553853382222331 | 1378804218 | Organic |
881288060286722202 | 1378803865 | Organic |
881288060286722202 | 1378804975 | Organic |
881288060286722202 | 1378805870 | Organic |
1677140157296205498 | 1378803386 | Paid Search |
Note that above doesn’t take into account social referrers, as these are not in the export. A previous post provides a list we can use to build up social refers to check against.
Content Groupings, if too complicated to do with a CASE
and/or REGEXP_EXTRACT
, can also be computed via UDFs.
Wrap Up
Hopefully this post helps demystify what is and isn’t contained in the BigQuery export and gives enough examples and information on to compute missing dimensions and metrics from the export. BigQuery is an extremely flexible tool that can free your analysis from the constraints of the API, even if everything the API provides isn’t immediately available.