GeoParquet Backup and Restore

For the past two years, I have been working with a fiber-to-the-home (FTTH) provider, supporting various data architecture and geospatial activities. This is a classic infrastructure business that cycles through phases of design, construction, and maintenance as the built environment changes. That means a lot of classic GIS editing over time, which in turn means the occasional human error requiring a reset or recovery to a previous state.

Of course, data versioning can help mitigate this problem, and this customer implements versioning. But as anyone who has resolved merge conflicts, whether in code or in geodatabase versions, can attest, versioning is not a perfect solution. A form of spatially aware backup can provide an additional layer of protection.

Spatial Backup

Backup and restore is typically thought of in terms of disaster recovery: incremental and full backups of your data are taken on a schedule, and when something goes wrong, your data is restored from the most recent backup. The goal is usually framed as a recovery point objective (RPO), which seeks to minimize data loss, or a recovery time objective (RTO), which seeks to minimize the time required to recover.

Disaster recovery is useful for large-scale failures or malicious activity such as ransomware, but it is less useful for isolated, transactional issues related to human error. Imagine, for example, a designer editing geometries in PostGIS who commits erroneous data or deletes data in a way that cannot easily be undone. Recovering data for a single table from a full disaster recovery backup can be cumbersome, if not impossible.

To mitigate this issue, a smaller-scale backup strategy focused only on key design, maintenance, and construction data sets was needed. Additionally, the solution needed to be spatially aware, providing additional validation when restoring as-built geometries. In the resulting solution, PostGIS tables are exported nightly to GeoParquet files in S3 buckets. A rolling seven-day backup of each table is maintained, and a user can restore the most recent backup in the event of user error on the live table.
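As a sketch of how such a rolling window might be keyed in S3 (the bucket prefix and helper names below are hypothetical, not the customer's actual layout):

```typescript
// Sketch of a date-stamped S3 key scheme for a rolling seven-day window.
// Prefix and naming are illustrative only.
function backupKey(schema: string, table: string, date: Date): string {
  const stamp = date.toISOString().slice(0, 10); // e.g. "2024-05-01"
  return `postgis-backups/${schema}/${table}/${stamp}.parquet`;
}

// A backup older than seven days falls outside the rolling window.
function isExpired(stamp: string, today: Date, retentionDays = 7): boolean {
  const ageMs = today.getTime() - new Date(stamp).getTime();
  return ageMs > retentionDays * 24 * 60 * 60 * 1000;
}
```

In practice, an S3 lifecycle expiration rule on the backup prefix can enforce the seven-day retention without any application code.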

Why GeoParquet?

GeoParquet is an open, efficient, and cloud-native format for storing geospatial vector data using the Parquet format. It extends the capabilities of standard Parquet files by adding metadata that defines how geometry data is represented, including coordinate reference systems and geometry encoding. This allows for interoperability between geospatial tools and data processing frameworks, enabling faster querying, better compression, and scalable analytics on large spatial datasets.
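To make that metadata concrete, here is an abbreviated sketch, written as a TypeScript object literal, of the kind of JSON document a GeoParquet file stores under the "geo" key in its Parquet file metadata. The values are illustrative; real files embed a full PROJJSON CRS definition.

```typescript
// Abbreviated, illustrative paraphrase of GeoParquet's "geo" file metadata.
const geoMetadata = {
  version: "1.0.0",
  primary_column: "geometry",
  columns: {
    geometry: {
      encoding: "WKB",                  // geometries stored as Well-Known Binary
      geometry_types: ["LineString"],   // e.g. fiber route segments
      crs: { /* full PROJJSON object in a real file */ },
      bbox: [-77.2, 38.8, -76.9, 39.1], // hypothetical extent
    },
  },
};
```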

Backup isn’t a use case that is typically discussed in connection with GeoParquet, but it is an ideal format for this kind of cloud backup of transactional geospatial tables. GeoParquet offers several advantages over traditional geospatial formats like GeoPackage and vector tiles, particularly in the context of modern data processing and cloud-native workflows. Unlike GeoPackage, which is based on SQLite and optimized for single-file use on local systems, GeoParquet is designed for distributed computing and can be efficiently read in parallel by big data engines. It also supports better compression and faster query performance due to its columnar structure. Compared to vector tiles, which are optimized for visualization and typically require pre-processing and tiling, GeoParquet maintains the full fidelity of the original geometry and attribute data, making it more suitable for analysis and data science workflows.

For purposes of backup, the superior compression of GeoParquet was attractive. Parquet is a columnar storage format, meaning data is stored column-first, as opposed to traditional databases such as PostgreSQL or SQL Server, which take a row-first approach by default. (Though indexing in traditional databases tends to be column-based.) Data within a column tends to be much more homogeneous. Take ZIP codes, for example: there are 41,683 unique ZIP codes in the United States, but roughly 163.1 million addresses. Each address would be a row, and ZIP code would be a column.

GeoParquet can use run-length encoding, among other compression techniques, to replace a run of repeated values with a single token and a count. Because the ZIP code column reuses only 41,683 distinct values across 163.1 million rows, it contains enormous repetition and compresses extremely well. When we consider other columns such as state abbreviations, city names, and street types, we can begin to see how a column orientation leads to better compression.
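A toy sketch makes the intuition concrete. (Parquet's actual encodings are more sophisticated, typically dictionary encoding combined with an RLE/bit-packing hybrid, but the principle is the same.)

```typescript
// Toy run-length encoder: a sorted column of repeated values
// collapses to (value, count) pairs.
function runLengthEncode(column: string[]): Array<[string, number]> {
  const runs: Array<[string, number]> = [];
  for (const value of column) {
    const last = runs[runs.length - 1];
    if (last && last[0] === value) {
      last[1] += 1;          // extend the current run
    } else {
      runs.push([value, 1]); // start a new run
    }
  }
  return runs;
}

console.log(runLengthEncode(["20002", "20002", "20002", "20003", "20003"]));
// => [["20002", 3], ["20003", 2]] -- five rows become two tokens
```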

Implementation Approach

Using Node.js and Express, an API was built that performs nightly backups of PostGIS tables to GeoParquet. It is called nightly for each database; the relevant tables are exported to GeoParquet using OGR bindings and uploaded to S3. The nightly backups are triggered by AWS EventBridge, a service that runs scheduled jobs in a variety of scenarios.
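Below is a minimal sketch of what one table's nightly export could look like. It approximates the service's OGR bindings by shelling out to the ogr2ogr CLI (GDAL 3.5 or later for the Parquet driver); the connection string, bucket, and key layout are placeholders.

```typescript
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

async function backupTable(table: string, stamp: string): Promise<void> {
  const localPath = `/tmp/${table}-${stamp}.parquet`;

  // Export the PostGIS table to a local GeoParquet file.
  execFileSync("ogr2ogr", [
    "-f", "Parquet",
    localPath,
    "PG:host=localhost dbname=ftth user=backup", // placeholder connection
    table,
  ]);

  // Upload under a date-stamped key to support the rolling window.
  await s3.send(new PutObjectCommand({
    Bucket: "ftth-geo-backups", // hypothetical bucket
    Key: `postgis-backups/${table}/${stamp}.parquet`,
    Body: readFileSync(localPath),
  }));
}
```

An EventBridge scheduled rule then only needs to invoke the API's backup endpoint once per night for each database.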

On the restore side, a second API was built using the same set of tools. This API enables the restore of the most recent version of a single table: it pulls the previous day’s GeoParquet backup, truncates the PostGIS table, and reloads it with the data from the backup.
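A mirror-image sketch of the restore path, under the same assumptions. For brevity it uses ogr2ogr's -overwrite flag, which drops and recreates the table, rather than the truncate-and-reload approach described above.

```typescript
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

async function restoreTable(table: string, stamp: string): Promise<void> {
  const localPath = `/tmp/${table}-${stamp}.parquet`;

  // Fetch the previous day's backup from S3.
  const res = await s3.send(new GetObjectCommand({
    Bucket: "ftth-geo-backups", // hypothetical bucket
    Key: `postgis-backups/${table}/${stamp}.parquet`,
  }));
  writeFileSync(localPath, await res.Body!.transformToByteArray());

  // Reload the live table from the GeoParquet file.
  execFileSync("ogr2ogr", [
    "-f", "PostgreSQL",
    "PG:host=localhost dbname=ftth user=restore", // placeholder connection
    localPath,
    "-nln", table, // write into the existing table name
    "-overwrite",
  ]);
}
```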

There has been an explosion of efficient geospatial formats over the last decade, ranging from vector tiles and PMTiles to cloud-native formats like GeoParquet, COG, and Zarr. While these formats enable web and cloud-native use cases, they can also play a role in augmenting traditional workflows while positioning organizations for a transition to more cloud-centric approaches. This use case is an example: while providing a backstop for geospatial data management tasks, it also places key data in the cloud, ready to use in future applications.

Header Image: By NYITbears – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=44509314