Self‑Joins, Windowing, And User Defined Functions In BigQuery
Google’s BigQuery offers unprecedented access to Google Analytics data for Google Analytics 360 (Premium) customers. With great data comes great challenges, especially when you start attempting complicated calculations beyond the traditional metrics in Google Analytics.
There are many ways to write queries in Google’s BigQuery, each has its strengths and weakness. Some of the basics of writing queries were covered in a previous post, but here we’re going to look at three different approaches to writing complex queries. As with any situation where there are multiple options, each option has use-cases in which they simplify the overall amount of effort needed to reach the goal.
The first method, using a self-join, is flexible and straight-forward, but can be very slow. The second, using windowing functions, is efficient and fairly straight-forward, but also limiting in what exactly can be queried. The last method we’ll discuss, User Defined Functions (UDFs), can be efficient on certain types of computations and can make expressing certain types of functions more straight-forward, but comes at the cost of speed.
Self-Joins
Self-Joins are when you join a table to itself, and one of many different ways to join a table to other datasources. Why would you ever want to join a table to itself? The most common reason is to find rows in a table that are related to each other. While this provides a great deal of flexibility, joins in BigQuery are inefficient — the larger the “smaller” table becomes, the more data needs to be shipped between nodes.
When self-joining, it’s possible to get into a situation where the entire table needs to be shipped to every node working on the query, as opposed to just the single, or small handful, that it would need otherwise.
An example of a query that could only be answered with a self-join would be a question like this: of people who viewed one specific page, which other pages did they view and how many times?
(Note: the Cross Join used below is not what makes a self join, but that the table being joined is the same. The Join Type is irrelevant.)
SELECT
sp as pageA,
pp as pageB,
COUNT(*) AS crossCount
FROM (
SELECT
sess.fullvisitorid AS sfvi,
pages.fullVisitorid AS pfvi,
sess.hits.page.pagePath AS sp,
pages.hits.page.pagePath AS pp
FROM
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS sess
CROSS JOIN
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS pages
WHERE
-- ensures each combo is only counted a single time
pages.fullvisitorid < sess.fullvisitorid
AND pages.hits.page.pagePath < sess.hits.page.pagePath
GROUP BY sfvi, pfvi, sp, pp )
GROUP BY pageA, pageB
ORDER BY pageA, crossCount DESC
This could then loaded into a spreadsheet and a pivot table could be used to look at trends in how people visit pages.
Another example would be to find the the page viewed before a page, we could find all related pages in a session using a self-join, filter out, using a WHERE
clause because in BigQuery join conditions, in the ON
, cannot have inequalities, all hits who have greater hit numbers, and then aggregating all the results for each hit and finding the greatest pageview less than the current one.
(Note: Again I’ll point out that the Inner Join used below is not what makes a self join, but that the table being joined is the same. The Join Type is irrelevant.)
SELECT
cur_hit.fullVisitorId AS fullVisitorId,
cur_hit.visitId AS visitID,
cur_hit.hits.hitNumber AS cur_hitnumber,
cur_hit.hits.page.pagePath as cur_pagePath,
cur_hit.hits.time AS cur_time,
MAX(prev_hit.hits.hitNumber) AS prev_hitNumber,
FROM
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS prev_hit
INNER JOIN
FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS cur_hit
ON
prev_hit.fullvisitorid = cur_hit.fullvisitorid
AND prev_hit.visitid = cur_hit.visitid
WHERE
prev_hit.hits.hitNumber < cur_hit.hits.hitNumber
AND prev_hit.hits.type = "PAGE"
AND cur_hit.hits.type = "PAGE"
GROUP BY fullVisitorId, visitID, cur_hitnumber, cur_pagePath, cur_time
However, as we will see in the following sections, this particular query can be done much more cleanly with windowing functions.
Windowing Functions
Windowing Functions allow you to view smaller portions of your data by allowing access to rows before and after the row currently being processed, in essence opening up “window” that allows you to see more than just the current row and to be able to see your data in different ways.
To help us understand this concept, let’s look at a simple example. Our “table” is the sequence 2,4,3,1 in a field named “num”.
SELECT
LAG(num) OVER () AS lag,
num,
LEAD(num) OVER () AS lead,
FROM
( SELECT 2 AS num),
( SELECT 4 AS num),
( SELECT 3 AS num),
( SELECT 1 AS num),
Note the odd OVER ()
for LAG
and LEAD
. This tells LAG
and LEAD
to use the entire table as their window. This is similar to the manner in which aggregate functions behave. More on that later!
Row |
lag |
num |
lead |
1 | null | 2 | 4 |
2 | 2 | 4 | 3 |
3 | 4 | 3 | 1 |
4 | 3 | 1 | null |
Note that LAG
and LEAD
are processing the rows in order of their appearance in the table. If we want them to process rows in sorted order by specifying their order in the OVER
clause, as each window can be ordered by different values and in different directions, or in an ORDER BY
clause.
SELECT
LAG(num) OVER (ORDER BY num) AS ordered_lag,
LAG(num) OVER () AS lag,
num,
LEAD(num) OVER () AS lead,
LEAD(num) OVER (ORDER BY num) AS ordered_lead,
FROM
( SELECT 2 AS num),
( SELECT 4 AS num),
( SELECT 3 AS num),
( SELECT 1 AS num),
Row | ordered_lag | lag | num | lead | ordered_lead |
---|---|---|---|---|---|
1 | null | 3 | 1 | null | 2 |
2 | 1 | null | 2 | 4 | 3 |
3 | 2 | 4 | 3 | 1 | 4 |
4 | 3 | 2 | 4 | 3 | null |
Also note that the final output of rows is ordered by the window, not in table order as before. Even though the table is ordered, the LAG
and LEAD
OVER ()
return the values in the unordered window.
One of the many useful ways to use a windowing function is to modify the aggregate functions we know and love, e.g. SUM
, AVG
, STDEV
, MIN
, and MAX
.
SELECT
num,
MAX(num) OVER (ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) as max_here_1_following
FROM
( SELECT 2 AS num),
( SELECT 4 AS num),
( SELECT 3 AS num),
( SELECT 1 AS num),
Row |
num |
max_here_1_following |
1 | 2 | 4 |
2 | 4 | 4 |
3 | 3 | 3 |
4 | 1 | 1 |
As stated before, the windows need not be the same for every function. While this complicates understanding the query (remember! the query is ordered by the final ORDER BY
), it can be useful at times. Here is a slightly convoluted query illustrating these points.
SELECT
num,
AVG(num) OVER (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) as moving_avg,
AVG(num) OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as moving_avg_middled,
LAG(num) OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as lag_mid,
LEAD(num) OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as lead_mid,
MAX(num) OVER (ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING ) as max_here_3_more,
MAX(num) OVER (ORDER BY num DESC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING ) as max_here_3_more_desc,
MIN(num) OVER (ORDER BY num ASC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING ) as max_here_3_more_asc,
FROM
( SELECT 2 AS num),
( SELECT 4 AS num),
( SELECT 3 AS num),
( SELECT 1 AS num),
Row |
num |
moving_avg |
moving_avg
|
lag_mid |
lead_mid |
max_here
|
max_here_3
|
max_here_3_
|
1 | 4 | 3.0 | 3.0 | 2 | 3 | 4 | 4 | 4 |
2 | 3 | 3.5 | 2.66 | 4 | 1 | 3 | 3 | 3 |
3 | 2 | 2.0 | 3.0 | null | 4 | 4 | 2 | 2 |
4 | 1 | 2.0 | 2.0 | 3 | null | 1 | 1 | 1 |
We’re not done with windowing functions yet! Remember how I said that aggregate functions behave as though they were OVER ()
? The GROUP BY
can be done within the window! However, they’re called “partitions” and you PARTITION BY
in a window.
SELECT
num,
COUNT(num) OVER (PARTITION BY num) AS partion_count,
COUNT(num) OVER () AS plain_count
FROM
( SELECT 2 AS num),
( SELECT 4 AS num),
( SELECT 3 AS num),
( SELECT 1 AS num),
( SELECT 1 AS num),
Row |
num |
partion_count |
plain_count |
1 | 4 | 1 | 5 |
2 | 3 | 1 | 5 |
3 | 1 | 2 | 5 |
4 | 1 | 2 | 5 |
5 | 2 | 1 | 5 |
Coming back to Google Analytics, and how to apply windowing functions to it, we can simplify the query to find time-on-page significantly, both in visual and execution complexity.
SELECT
fullVisitorId,
visitId,
hits.hitNumber,
hits.page.pagePath,
-- get next and previous hit time to be able to work out length of each hit
hits.time AS hit_time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) AS next_hit_time,
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type = "PAGE"
For another example, one of the reasons we could have used a self-join would be to get the number of a certain kind of event in the next 2 hits following a pageview. We can do this succinctly (and efficiently!) with a windowing function! (“Jim, look. Surely you must know that the page is attached to the event and we can just do a simple GROUP BY
and COUNT
!” // “Yes, but what if we’re on a different page now! Maybe I want to know how many people login shortly after visiting a product.”).
SELECT
COUNT(*) as logins_after_product_view
FROM (
SELECT
SUM(IF(hits.eventinfo.eventCategory = "Login", 1, 0)) OVER (PARTITION BY fullvisitorid, visitid ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS login_count,
hits.page.pagePath as pagePath
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
WHERE
login_count > 0
AND (pagePath CONTAINS 'vests' OR pagePath CONTAINS 'helments')
Row |
logins_after_product_view |
1 | 5 |
An interesting use of windowing functions is to compute which transaction or event a session precedes. Here we take advantage of a window function being executed over each row individually, but carrying the totals forward. If a session has a transaction, then we give it a 1
, otherwise a 0
. We then add up all of the values after the current session (we’re sorting descending). The sum will only increment if there was a transaction and stay constant for all sessions chronologically before the transaction until the next transaction. This gives us a value that is the same for the transaction and all sessions before it and after the previous transaction.
SELECT
fullVisitorId,
visitId,
SEC_TO_TIMESTAMP(visitStartTime) AS time,
transacted,
SUM(transacted) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime DESC) AS transaction_group
FROM (
SELECT
fullVisitorId,
visitId,
visitStartTime,
MAX(IF(hits.item.transactionID IS NOT NULL, 1, 0)) AS transacted
FROM (
SELECT
fullVisitorId,
visitId,
visitStartTime,
hits.item.transactionID
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
GROUP BY
fullVisitorId,
visitId,
visitStartTime)
WHERE
fullVisitorId='5711522334224447562'
fullVisitorId |
visitId |
time |
transacted |
transaction_group |
---|---|---|---|---|
5711522334224447562 | 1378805285 | 2013-09-10 09:28:05 UTC | 0 | 0 |
5711522334224447562 | 1378804654 | 2013-09-10 09:17:34 UTC | 1 | 1 |
5711522334224447562 | 1378803724 | 2013-09-10 09:02:04 UTC | 0 | 1 |
User Defined Functions
User Defined Functions (UDFs) are JavaScript functions running in the BigQuery database (using V8). These functions are run as “close” to your data as possible to minimize the amount of time spent shuttling data around. UDFs can do anything you can write a program to do, as such they are extremely powerful, and also extremely easy to abuse and over-complicate a query. More on the details of UDFs can be found in the BigQuery documentation.
The long and short is that the input to a UDF is a query whose output fields are named the same as the fields when registering the UDF (below). The UDF is run once per row, however the UDF may output multiple rows. Take for example, calculating the time on the page, the session and list of hits are a single input row, and multiple rows of output are generated giving the difference between each pageview.
SELECT
fullVisitorId,
visitId,
hitNumber,
pagePath,
diff AS timeOnPage_ms
FROM
timePerPage(
SELECT
fullVisitorId,
visitId,
NEST(hits.page.pagePath) AS pages,
NEST(hits.type) AS types,
NEST(hits.time) AS times
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY
fullVisitorId,
visitId
LIMIT
10)
This query also introduces the NEST
function, which aggregates all rows into a single repeated field. Repeated fields (similar to how hits and custom dimensions are normally stored) are exposed to our UDF as an array.
To edit the UDF, click on the UDF Editor button in the top right corner.
function timePerPage(row, emit) {
// old index -- index of the previous page
var oi = 0;
// index -- index of the current page
var i = 1;
for (; i < row.types.length; ) {
var prev_type = row.types[oi];
var prev_ts = row.times[oi];
var curr_type = row.types[i];
var curr_ts = row.times[i];
if (curr_type != 'PAGE') { i++; continue; }
if (prev_type != 'PAGE') { oi = i; i++; continue; }
emit({fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
pagepath: row.pages[oi],
hitNumber: oi + 1,
diff: curr_ts - prev_ts});
oi = i;
i++;
}
}
// This is what allows you to use the function in BigQuery
bigquery.defineFunction(
'timeperpage', // Name in BigQuery
['fullVisitorId', 'visitId', 'types', 'times', 'pages'], // Input field names
[{'name': 'diff', 'type': 'integer'}, // Output Schema
{'name': 'pagepath', 'type': 'string'},
{'name': 'fullVisitorId', 'type': 'string'},
{'name': 'visitId', 'type': 'integer'},
{'name': 'hitNumber', 'type': 'integer'}],
timePerPage // reference to the javascript function to run
);
Because UDFs are a full-fledged programming language, we can compute multiple values or introduce complexities that would be hard to model in SQL. For instance, let’s say we’d like to combine subsequent pages if they are the same URL. We could simply update our function to not increment oi
or emit a row.
function timePerPage(row, emit) {
var oi = 0
var i = 1;
for (; i < row.types.length; ) {
var prev_type = row.types[oi];
var prev_ts = row.times[oi];
var curr_type = row.types[i];
var curr_ts = row.times[i];
if (curr_type != 'PAGE') { i++; continue; }
if (prev_type != 'PAGE') { oi = i; i++; continue; }
// consider subsequent pageviews the additional time on the first hit
if (row.pages[oi] == row.pages[i]) { i++; continue; }
emit({fullVisitorId: row.fullVisitorId,
visitId: row.visitId,
pagepath: row.pages[oi],
hitNumber: oi + 1,
diff: curr_ts - prev_ts});
oi = i;
i++;
}
}
Conclusion
BigQuery offers many powerful ways for you to leverage your data, but have caveats and fortes each their own. By playing around with the small London Cycle Helmet dataset, other example data Google provides, as well as your own data, you can feel your way through these different methods and learn how to leverage them to your advantage.