I’ve recently spent a lot of time doing various forms of business analytics in BigQuery. As discussed in a previous post, I’ve been using BigQuery as the data integration environment for several business systems. I’ve found integration at the data level via an ETL/ELT/iPaaS pipeline to be a lot more stable than system-level integrations that involve chaining together dependencies on fairly volatile SaaS APIs.
The original premise was fairly straightforward: Given a table of user-level statistics over time, identify only those points in time where one or more of the statistics changed value. In our case, we had several million rows of user-level data captured on a daily cadence. Manually inspecting this data for changes in individual values by customer was simply not a viable plan. The BigQuery LAG function came to the rescue.
LAG is a navigation function that finds the value of a column on a preceding row. By default, it will find the value from the row immediately preceding the current one. The example SQL from the BigQuery documentation works well to explain the concept:
```sql
WITH finishers AS (
  SELECT 'Sophia Liu' AS name,
         TIMESTAMP '2016-10-18 2:51:45' AS finish_time,
         'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34'
)
SELECT
  name,
  finish_time,
  division,
  LAG(name) OVER (PARTITION BY division ORDER BY finish_time ASC) AS preceding_runner
FROM finishers;
```
Given some arbitrary race data, the LAG function finds, for each finisher, the runner who finished immediately before her within the same division. It’s not hard to imagine the multi-step SQL that could be used to accomplish the same analysis, but the LAG function eases the process greatly.
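Applied to the original premise of detecting metric changes, the pattern looks something like this. This is a sketch, not the production query: the table and column names (`analytics.user_metrics`, `user_id`, `capture_date`, `metric_value`) are placeholders.

```sql
-- Sketch: flag only the rows where a user's metric changed day over day.
-- Table and column names are hypothetical.
WITH with_prev AS (
  SELECT
    user_id,
    capture_date,
    metric_value,
    LAG(metric_value)
      OVER (PARTITION BY user_id ORDER BY capture_date) AS prev_value
  FROM analytics.user_metrics
)
SELECT user_id, capture_date, prev_value, metric_value
FROM with_prev
WHERE metric_value <> prev_value  -- keep only change points
   OR prev_value IS NULL;         -- and, optionally, each user's first row
```

The same shape works for any number of metrics: add one LAG per column and OR the inequality tests together in the WHERE clause.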
This was enough to get me going and I was able to produce the analysis of our various metrics in fairly short order. But, as a recovering geospatial technologist, I immediately wondered if I could apply this approach to spatial data. I quickly made some sample data to work with. Here is a screenshot of an extract:
If you stare at the ugly WKT dump of the geometry long enough, you’ll notice that this sample asset tends to spend multiple days in a location. What I wanted to determine is specifically when this asset changed location. I needed a variation on the query above.
In this case, I am using the LAG function to identify the previous value of the asset’s location and the last date it was at that location. The BigQuery geography type can’t participate in the LAG function, so I had to convert it to WKT first, as can be seen in the subquery inside the “data” object.
The primary job of the “data” object is to attach the WKT from each preceding row to the current row. The “WHERE wkt <> prev_loc” clause does the real work: it returns only the rows where the asset’s location changed.
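For reference, the shape of the query described above can be sketched as follows. The table and column names (`tracking.asset_positions`, `asset_id`, `capture_date`, `location`) are assumptions; the technique — ST_ASTEXT to make the geography comparable, LAG to attach the previous location and date, a WHERE clause to keep only the change points — is the point.

```sql
-- Sketch: find only the rows where an asset changed location.
-- Table and column names are hypothetical.
WITH data AS (
  SELECT
    asset_id,
    capture_date,
    ST_ASTEXT(location) AS wkt,  -- GEOGRAPHY converted to WKT for comparison
    LAG(ST_ASTEXT(location))
      OVER (PARTITION BY asset_id ORDER BY capture_date) AS prev_loc,
    LAG(capture_date)
      OVER (PARTITION BY asset_id ORDER BY capture_date) AS prev_date
  FROM tracking.asset_positions
)
SELECT asset_id, capture_date, wkt, prev_loc, prev_date
FROM data
WHERE wkt <> prev_loc;  -- only the rows where the asset moved
```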
Close examination of the capture date shows that the asset often spends a few days at one location before moving on. This result is far more informative than the full list of location data we started with.
If you want to take a quick look at your results on a map, you can use the BigQuery Geo Viz tool, as depicted at the top of this post. By most web mapping standards, that’s fairly rudimentary, but it’s easy enough to use BigQuery SQL to export the data to CSV or GeoJSON for use in more sophisticated mapping tools.
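For a GeoJSON export, BigQuery’s ST_ASGEOJSON function converts a geography into a GeoJSON geometry string that downstream mapping tools can consume. A minimal example, again with hypothetical table and column names:

```sql
-- Sketch: emit each location as a GeoJSON geometry string.
-- Table and column names are hypothetical.
SELECT
  asset_id,
  capture_date,
  ST_ASGEOJSON(location) AS geometry  -- GeoJSON geometry object as a string
FROM tracking.asset_positions;
```

Wrapping these geometries into full GeoJSON features is then a small post-processing step in whatever tool consumes the export.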
My next step will be to attach the number of days spent at each location to support more meaningful map visualization, and then probably to generate a traveling salesman analysis that identifies possible routes based on the time sequence.
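A first cut at the dwell-time calculation might pair LEAD with DATE_DIFF: once the data is reduced to change points, the next arrival date tells you when the asset departed. This is only a sketch under the same hypothetical table and column names as before.

```sql
-- Sketch: days spent at each location, using LEAD to find the departure date.
-- Table and column names are hypothetical.
WITH with_prev AS (
  SELECT
    asset_id,
    capture_date,
    ST_ASTEXT(location) AS wkt,
    LAG(ST_ASTEXT(location))
      OVER (PARTITION BY asset_id ORDER BY capture_date) AS prev_loc
  FROM tracking.asset_positions
),
changes AS (
  SELECT asset_id, capture_date AS arrived, wkt
  FROM with_prev
  WHERE prev_loc IS NULL OR wkt <> prev_loc  -- change points, incl. first sighting
)
SELECT
  asset_id,
  wkt,
  arrived,
  DATE_DIFF(
    LEAD(arrived) OVER (PARTITION BY asset_id ORDER BY arrived),
    arrived,
    DAY) AS days_at_location  -- NULL for the asset's current location
FROM changes;
```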
PostGIS is still my first love and go-to for all things geospatial, but I’m finding BigQuery to have just enough geo to meaningfully enrich the business analytics I am doing there.