FOSS4G North America was an opportunity for me to reconnect with both community and technology. I enjoyed being able to both learn new things and refresh my skills with technologies such as PostGIS. I was reflecting on how, a couple weeks prior to the conference, I introduced the concept of PostgreSQL window functions to a junior team member in a different context. I have used them quite a bit in both PostgreSQL and BigQuery and FOSS4G got me thinking about their applicability in a spatial context. I have previously discussed them in terms of specific use cases, but they are a powerful tool that warrants an overview.
PostgreSQL window functions can perform calculations across a set of table rows related to the current row. When combined with PostGIS spatial queries, they can enable additional levels of spatial analysis. This post examines how to use PostgreSQL window functions with PostGIS spatial queries to gain deeper insights into geospatial data.
Understanding Window Functions
Window functions allow you to create a ‘window’ of related data over which you can perform calculations while still returning individual rows. Unlike aggregate functions that group rows into a single output row, window functions retain the granularity of the original dataset.
Common window functions include:
- ROW_NUMBER(): Assigns a unique sequential integer to rows within a partition of a result set.
- RANK(): Assigns a rank to each row within a partition of a result set, with gaps in the ranking for tied rows.
- LEAD(), LAG(): Provide access to rows that precede or succeed the current row in order to perform calculations based on values in those rows.
- SUM(), AVG(), MIN(), MAX(): Perform calculations across sets of rows that are somehow related to the current row.
Spatial Window Functions with PostGIS
To apply window functions in a spatial context, consider the following general steps:
- Partition the Data: Spatial queries often involve partitioning data by location or by some spatial attribute. For instance, you might partition data by geographical boundaries such as counties or neighborhoods.
- Ordering: Within each partition, you can order the data based on a spatial attribute, such as distance from a certain point.
- Frame Specification: Define the subset of the current partition to be used for calculations.
Here are some simple examples of how to use window functions with PostGIS. They are quite useful and I encourage you to dig deeper into the documentation and add them to your toolkit.
Example 1: Cumulative Distance Calculation
Suppose we have a table public_paths that stores a series of connected line segments. We want to calculate the cumulative distance for each segment from the starting point.
SELECT id, ST_Length(geom) AS segment_length, SUM(ST_Length(geom)) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_distance FROM public_paths;
In this query, the window function SUM() calculates the total length from the start to each segment sequentially by ordering the rows by their `id`.
Example 2: Ranking Points of Interest by Proximity
Imagine a table points_of_interest with a column `geom` holding point geometries. We want to rank these points based on their proximity to a reference location.
WITH ref_location AS (SELECT ST_SetSRID(ST_MakePoint(-73.985656, 40.748817), 4326) AS geom) SELECT p.id, ST_Distance(p.geom, r.geom) AS distance_from_ref, RANK() OVER (ORDER BY ST_Distance(p.geom, r.geom)) AS proximity_rank FROM points_of_interest p, ref_location r;
The window function RANK() orders points of interest by their distance from a given reference location.
Example 3: Analyzing Sequential Events
Consider a vehicle_movements table that records the positions of buses as they travel their routes. You might want to analyze stop durations by comparing the arrival and departure times at each stop.
SELECT vehicle_id, stop_id, arrival_time, LEAD(arrival_time, 1) OVER (PARTITION BY vehicle_id, stop_id ORDER BY arrival_time) - arrival_time AS stop_duration FROM vehicle_movements;
The LEAD() window function is used here to look at the following row to calculate the duration of each stop.
Example 4: Neighborhood Analysis
If you’re analyzing demographic data per neighborhood, you might want to calculate running totals or averages within a city’s neighborhoods.
SELECT neighborhood, population, AVG(population) OVER (PARTITION BY neighborhood) AS avg_neighborhood_population FROM city_demographics;
This query calculates the average population per neighborhood using the AVG() window function partitioned by neighborhood.
Example 5: Path Interpolation
For applications like tracking wildlife movement, you might need to interpolate points along a path. Window functions can order these observations and then use spatial functions to interpolate points between them.
SELECT id, ST_LineInterpolatePoint( geom, 1.0 / COUNT(*) OVER (PARTITION BY track_id ORDER BY timestamp) ) AS interpolated_point FROM animal_tracks;
In this case, ST_LineInterpolatePoint() generates interpolated points along the movement paths of animals.
While window functions are powerful, they can be computationally intensive, especially when applied to large geospatial datasets. You should only perform window functions on the data you need. Using subqueries or common table expressions (CTEs) can narrow down your dataset prior to analysis.