Earlier in my career, I was working on an infrastructure protection task in which we were reconciling data from several sources that described the same road network. The data from the locality was authoritative, but it lacked some information we needed, so we were conflating other data onto its linework. I commented on the general lack of metadata, and the government team lead said something like “We’ll never get this done if we wait for that. Overlay everything and use your experience to toss out the outliers. We’re shooting for consensus, not perfection.”
If you’ve spent any time around geospatial data, you’ve probably heard some version of “We really ought to write better metadata.” It’s one of those perennial truths in geospatial, right up there with the fact that someone, somewhere, is still using a shapefile from 2003. Most of us know metadata is important, but it tends to get pushed aside in favor of the more immediate work of getting maps made, services deployed, or analyses out the door. Metadata has always been the homework that no one is especially excited about.

That said, the landscape is shifting a bit. As large language models (LLMs) have begun working their way into geospatial workflows, metadata is becoming less optional. LLMs don’t have the benefit of experience or intuition. They can’t infer context from the filename or guess the intended use from the folder it lives in. For the most part, they only know what we tell them.
LLMs can suffer from a well-documented behavior: they will produce answers with seeming confidence, even when the underlying information is thin or ambiguous. We call those answers hallucinations, but they are mostly just the statistical model trying to make sense of missing context. In geospatial work, that missing context is often exactly what metadata is supposed to provide: units that weren’t stated, datums that weren’t specified, process steps that were never written down. Humans can usually navigate those gaps because we recognize familiar patterns or quirks. Models can’t. The more complete and explicit the metadata, the less room there is for the model to improvise, and the more reliably it can stay grounded in what the data actually represents.
It’s easy to forget that much of our geospatial practice relies on unstated assumptions and institutional memory. We know where a dataset came from because we’ve been using it for a decade. We know the quirks of a local parcel layer because we’ve seen every update since the Great Attribute Renaming of 2014. LLMs have none of that. From their point of view, metadata isn’t an afterthought. It’s the only window they have into the world the data describes, or in some cases, the history or lineage of the data.
So while metadata has always mattered, it matters even more now. Not in an abstract “best-practices” way, but in a practical sense. If we want models to be helpful, we need to give them something to work with. That starts with documenting what we already know.
LLMs Love Metadata
As humans, we’re constantly filling in gaps without realizing it. If a dataset doesn’t explicitly state its coordinate system, we look at the numbers and make an educated guess. If a field name is cryptic, we lean on experience or muscle memory to interpret it. Most of the time, this works well enough because we’re drawing on context that’s built up over years of working with similar data.

LLMs don’t have that luxury. They don’t get to assume that values in the millions must be Web Mercator, or that a column called “ELEV” probably means height in meters. They can learn these things over multiple training and fine-tuning runs with sufficient data, but they lack the intuition that humans can bring to such problems. (“I saw a similar inconsistent offset before and it turned out to be an issue with the ellipsoid.”)
Instead, they work strictly from what we give them. If the CRS isn’t defined, they won’t infer it. If the vertical datum isn’t stated, they won’t know how to apply a conversion. If accuracy, resolution, or lineage are vague or missing, they won’t fill in the blanks the way a seasoned analyst might. An LLM will simply proceed with whatever information is present, and that can lead to strange or incorrect conclusions (often called hallucinations), not because the model is broken, but because the data didn’t tell it any better.
This is one of the subtle shifts LLMs bring to geospatial work. They don’t reward improvisation or tribal knowledge. They reward clarity. They work best when the information they need is spelled out plainly, with as few assumptions left hanging as possible. That’s not a weakness of the models. It’s a reminder that a lot of what we consider “obvious” is only obvious because we’ve been steeped in this work for a long time.
Existing Metadata Standards and LLM Workflows
The geospatial world hasn’t exactly suffered from a shortage of metadata standards. We’ve accumulated a range of them over the years. They all try to answer the same basic question: how do we describe what this data is and where it came from? But they do it in very different ways, and those differences matter a lot more once LLMs enter the picture.
FGDC CSDGM
The FGDC standard has been with us for decades, and it shows. It can be exhaustively thorough and leans heavily on long-form narrative descriptions. When it’s done well, FGDC metadata can capture the kinds of details that rarely make it into more modern schemas: how a survey team handled obstructions or what assumptions went into manual edits.
The flip side is that FGDC files can be inconsistent and difficult for machines to parse at scale. Two agencies can follow the standard faithfully and still produce very different-looking XML files. LLMs can extract value from the narrative, but only if that narrative is available and indexable.
ISO 19115 / 19139
ISO’s metadata standards can provide more structure and rigor. They organize information into a well-defined schema and cover a broad range of spatial, temporal, and lineage concepts. In theory, this should make them ideal for machine consumption.
In practice, ISO metadata often ends up deeply nested and verbose. Important details such as vertical datums, quality statements, or process steps may be present but buried several levels down. Machines can parse it, but they need help surfacing what actually matters for a workflow. LLMs can work with ISO metadata just fine, but it benefits from some preprocessing or summarization.
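As an illustration, a small preprocessing step can surface a few of those buried elements before anything reaches a prompt. The sketch below assumes an ISO 19139 XML record on disk; the file name is hypothetical, and real records vary in which elements they actually populate.

```python
# A minimal sketch: surface a few buried ISO 19139 elements before prompting.
# The file name is hypothetical; real records vary in which elements exist.
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

tree = ET.parse("dem_metadata_iso.xml")  # hypothetical ISO 19139 record

def first_text(path: str) -> str | None:
    """Return the text of the first matching element, if any."""
    el = tree.find(path, NS)
    return el.text.strip() if el is not None and el.text else None

summary = {
    "title": first_text(".//gmd:title/gco:CharacterString"),
    "abstract": first_text(".//gmd:abstract/gco:CharacterString"),
    "lineage": first_text(".//gmd:statement/gco:CharacterString"),
}
print(summary)  # a compact dict that is far cheaper to hand to a model
```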
STAC and Cloud-Native Metadata
STAC has gained traction quickly because it feels like it belongs in the modern data ecosystem. It’s JSON-based, concise, and tailored to describing assets in cloud storage. Its extension model makes it easy to add fields for things like raster statistics or projection information without rewriting the standard.
For LLMs, STAC’s structure is a clear advantage. It exposes key concepts like extent, type, acquisition date, and links to assets in predictable places. The challenge is that STAC items vary widely in how completely they’re populated. Some include rich, well-structured metadata; others barely fill out the required fields.
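For example, a short script can pull the handful of fields a workflow is most likely to need from a STAC item. This is a minimal sketch assuming pystac is available; the item URL is a placeholder, and the projection and gsd fields only appear if the item actually populates them.

```python
# A minimal sketch using pystac (an assumption about tooling) to pull the
# fields an LLM-driven step most often needs from a STAC item.
import pystac

item = pystac.Item.from_file("https://example.com/stac/items/scene-123.json")  # placeholder URL

summary = {
    "id": item.id,
    "bbox": item.bbox,
    "datetime": item.datetime.isoformat() if item.datetime else None,
    "epsg": item.properties.get("proj:epsg"),  # projection extension, if present
    "gsd": item.properties.get("gsd"),         # ground sample distance, if present
    "assets": list(item.assets.keys()),
}
print(summary)
```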
GeoPackage, GeoParquet, and PostGIS
These formats and systems treat metadata as part of the data model itself. A GeoPackage or GeoParquet file can embed CRS definitions directly alongside the geometry in conformance with the OGC Simple Features specification. PostGIS goes a step further: schemas can include table comments, column comments, constraints, and enumerated types. All of that is metadata, even if we don’t always call it that.

This kind of schema-level metadata is incredibly useful for LLMs because it’s close to the data and typically more consistent than narrative formats. Much of the metadata, such as data types, is an intrinsic part of the data model. A column named “height_ft” conveys some information, but a comment that says “building height at eave level” adds far more detail. The downside is that these elements often go unused. They’re easy to add, but they’re just as easy to forget.
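As a sketch of what that looks like in practice, the snippet below writes and reads PostGIS column comments with psycopg2. The connection string, table, and comment text are hypothetical; the point is that the comments live in the database, right next to the data they describe.

```python
# A minimal sketch, assuming a PostGIS table named "buildings" already exists
# and psycopg2 is installed; the connection string and names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=gis user=gis")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Attach semantic metadata directly to the schema as a column comment.
    cur.execute(
        "COMMENT ON COLUMN buildings.height_ft IS "
        "'Building height at eave level, in US survey feet, derived from 2021 LiDAR'"
    )
    # Later, an LLM-facing tool can read those comments back with col_description().
    cur.execute(
        """
        SELECT a.attname,
               col_description(a.attrelid, a.attnum) AS comment
        FROM pg_attribute a
        WHERE a.attrelid = 'buildings'::regclass
          AND a.attnum > 0
          AND NOT a.attisdropped;
        """
    )
    for name, comment in cur.fetchall():
        print(name, "->", comment)
```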
Shapefile Metadata and the Sidecar Problem
Shapefiles have been with us forever, and they come with practically no built-in metadata support beyond the .prj file. In many shops, metadata may live in a sidecar FGDC XML file. When it’s present and maintained, that XML can be rich and detailed, full of the kinds of narrative context that LLMs can actually work with.
The problem is that sidecar metadata is loosely coupled at best. Files get renamed or moved; XML gets separated from the data it was meant to describe. An LLM won’t magically know that “roads_final.shp” goes with “roads_final_metadata.xml” unless we tell it or design our workflow to associate them.
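One workable approach is to make that association explicit in the workflow itself. The sketch below pairs shapefiles with sidecar XML using a few common naming patterns; the directory and the patterns are assumptions, not a standard, and your shop’s conventions may differ.

```python
# A minimal sketch: pair each shapefile with a plausible sidecar metadata file.
# The "data" directory and the naming patterns are assumptions for illustration.
from pathlib import Path

def find_sidecar_metadata(shapefile: Path) -> Path | None:
    """Return the first sidecar XML that plausibly belongs to a shapefile."""
    candidates = [
        shapefile.with_name(shapefile.name + ".xml"),            # roads_final.shp.xml (ArcGIS-style)
        shapefile.with_suffix(".xml"),                            # roads_final.xml
        shapefile.with_name(shapefile.stem + "_metadata.xml"),    # roads_final_metadata.xml
    ]
    return next((c for c in candidates if c.exists()), None)

for shp in Path("data").glob("*.shp"):
    meta = find_sidecar_metadata(shp)
    print(shp.name, "->", meta.name if meta else "no metadata found")
```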
Shapefiles aren’t going away anytime soon, but they’re a good reminder that metadata only helps if it travels with the data and stays intact over time.
The Case for Verbosity
For all the challenges that come with older, narrative-heavy metadata standards, they still offer some benefits in the era of LLMs. The long-form sections in FGDC or ISO records often capture nuances that don’t fit neatly into lighter-weight, schema-driven formats. Things like survey conditions, odd edge cases, sensor quirks, or the judgment calls a team made during manual cleanup tend to show up in those narratives. They’re the kinds of details that never seem to make it into a tidy set of fields.
That’s where verbose metadata can actually be an asset. LLMs are surprisingly good at pulling meaning out of prose. If a metadata record notes that a LiDAR return density dropped along a shoreline because of leaf-on conditions, a model can use that information to temper how it interprets elevation changes in that area (or at least pass the information along for clarity).
A lot of what we call “institutional knowledge” ends up embedded in these long narrative blocks. These details are rarely captured anywhere else. And while a human can sometimes intuit those quirks by looking at the data, an LLM needs them written down.
Verbose metadata also helps preserve history. Modern, lightweight formats like STAC or GeoParquet are great at describing what data is right now, but they often say less about how it got that way. Narrative metadata leaves a breadcrumb trail of process steps, decisions, and context that might not be required for day-to-day use but can matter a lot when you’re trying to reason about accuracy or validate unexpected results.
So while verbose metadata can be messy, it has its place. It fills in the gaps that structured fields may not cleanly address, and LLMs are well-suited to making use of that information when it is available.
Strengths and Limitations
All of these standards and approaches have advantages and disadvantages when LLMs are involved. FGDC and ISO offer depth and nuance. STAC, GeoParquet, and GeoPackage provide structure and predictability. PostGIS bakes metadata directly into the schema. Even the humble shapefile can carry valuable context if its sidecar files stay intact.
But each of them also has gaps, and those gaps start to matter a lot more once we lean on models that don’t fill in blanks the way humans do. Older standards tend to bury important information in dense narrative blocks. The details are there, but they’re not always surfaced in ways that are easy for a model or a workflow to use. Newer, lightweight formats can be almost too lean. They excel at describing what a dataset is but sometimes say less about how it got that way or what assumptions went into its creation.
Inconsistent terminology is another common issue. One dataset may describe its coordinate system with an EPSG code, another with a proj4 string, another with a local name, and yet another with a prose description that feels like it was copied from a decades-old project plan. Vertical datums can be missing, mislabeled, or buried in a process step. Accuracy may be described numerically (“Horizontal accuracy: RMSE = 0.42 m.”), vaguely (“Not to be used for targeting”), or not at all.
Humans work around such inconsistencies regularly, but LLMs may not. They won’t necessarily assume that “NAD83” in one file means the same thing as “North American Datum 1983” in another. (More advanced LLMs may actually sort that out, but I wouldn’t rely on it as a matter of course.) They won’t recognize that “elev_ft” and “height_ft” are interchangeable without being told so. They won’t know that one dataset’s “Date_Created” is really the ingestion date and not the acquisition date.
So the strengths and weaknesses of each metadata format become more visible when LLMs are part of the workflow. The structured formats give models something solid to anchor on, while the verbose ones provide nuance that structured fields can’t always capture. The challenge is that few formats do both equally well.
Metadata as “Contract”
One of the more significant shifts as LLMs work their way into geospatial workflows is that metadata starts to look less like documentation and more like a contract. A lot of the automation we ask these systems to perform, from reprojections to multistep transformations, relies on knowing what operations are valid, what assumptions are safe, and where the boundaries are. Humans tend to navigate those decisions with a mix of habit, experience, and a feel for the data. LLMs, by contrast, usually only have the metadata and their context windows. Longer-term institutional knowledge can be elusive for them, though not completely out of reach.
If a raster is in meters and another is in feet, the metadata is what tells a model that a unit conversion is needed. If a vector layer carries features in a geographic coordinate system, the CRS is what lets a model know that calculating area or buffering is going to require a reprojection first. If a DEM uses NAVD88 and a coastline layer uses MLLW, it’s the vertical datum metadata that prevents a model from quietly mixing the two.
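Here is a minimal sketch of the kind of check that contract enables, using GeoPandas; the file name is hypothetical, and the choice of a UTM zone estimated from the layer’s extent is just one reasonable default.

```python
# A minimal sketch, assuming GeoPandas is available; "parcels.gpkg" is a
# hypothetical layer. It shows the CRS check an orchestrated step could make
# before buffering or computing area.
import geopandas as gpd

gdf = gpd.read_file("parcels.gpkg")  # hypothetical dataset

if gdf.crs is None:
    raise ValueError("CRS is undefined; refusing to guess.")

if gdf.crs.is_geographic:
    # Degrees are not a sensible unit for buffering or area, so reproject first
    # (here to a UTM zone estimated from the layer's extent, in meters).
    gdf = gdf.to_crs(gdf.estimate_utm_crs())

buffered = gdf.buffer(50)      # 50 meters, now that units are linear
areas_m2 = gdf.geometry.area   # square meters
```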
Lineage matters as well. If a dataset was smoothed, snapped, interpolated, generalized, or merged from multiple sources, those steps define what downstream operations make sense and which ones may produce misleading results. A human analyst might catch those issues from context: recognizing, for example, that a dataset has been aggressively generalized because they remember when it was done. An LLM will need metadata to understand that.
This is why clearer metadata pays dividends when models are orchestrating workflows. It gives guidance about what they should or should not do. It reduces the surface area for incorrect assumptions. And it helps ensure that, when an LLM chains together several steps, the sequence aligns with the actual characteristics of the data rather than an educated (or automated) guess.
In that sense, good metadata becomes part of the workflow itself and a key piece of the processing infrastructure. It defines the contract under which automation operates. And the more LLMs coordinate or assist with geospatial processing, the more important it becomes that the contract is complete and consistent.
Context Windows and Token Management
LLMs operate within fixed context windows, which define the maximum amount of text, measured in tokens, that a model can consider at one time. Tokens are fragments of words, and tokenization is model-specific. A single word like “building” might become one token in one model and two or three in another. Everything a model processes or produces counts toward the limit: the prompt, the metadata you pass in, any intermediate reasoning, and the output itself.
To put this in context, models like GPT‑4 Turbo support context windows on the order of 128,000 tokens or more, which is large but not limitless. Long XML blocks, dense process descriptions, or deeply nested ISO structures can consume thousands of tokens before the model even begins its actual work. Even generous context windows can only take in so much at once, and token usage has both performance and cost implications.
FGDC and ISO metadata can span thousands of lines, and even well‑structured ISO records can bury important details several layers deep. Humans can skim or jump to what they need, but models don’t have that luxury. This is where token management starts to matter. A full FGDC record might contain valuable detail about survey conditions, sensor settings, or processing steps, but feeding it into a model as part of a workflow prompt isn’t always feasible. Even if it fits, you may not want to spend the token budget on it. LLMs tend to do better when the most essential information is made available directly in a compact, predictable format.
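One practical habit is to measure a record before deciding how to use it. The sketch below uses tiktoken, OpenAI’s tokenizer library, to count the tokens in a metadata file; the file name and the 20,000-token threshold are assumptions for illustration, and other providers tokenize differently.

```python
# A minimal sketch using tiktoken (an assumption about tooling) to see how much
# of a context window a raw metadata record would consume before deciding
# whether to inline it, summarize it, or retrieve it on demand.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("county_roads_fgdc.xml", "r", encoding="utf-8") as f:  # hypothetical record
    record = f.read()

token_count = len(enc.encode(record))
print(f"{token_count} tokens")

# Rough budgeting: flag records that would eat a large share of a 128k window.
if token_count > 20_000:  # threshold is arbitrary, for illustration only
    print("Consider summarizing or retrieving this record instead of inlining it.")
```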
Longer narrative sections still have value, but they work better when treated as reference material rather than primary input. That usually means some combination of two approaches. First, distill the core structured metadata into a concise representation that’s easy for a model to consume. Second, keep the verbose metadata available through retrieval, so the model can pull in the relevant portions when a workflow or question genuinely depends on that level of detail.
One option to consider is generating “model‑friendly abstracts” alongside full metadata records. These wouldn’t replace the authoritative metadata, but could serve as compact summaries covering the decision points a model is most likely to need. It’s a lightweight way to keep nuance available without overwhelming the prompt.
Models don’t need all the metadata at once. They simply need the right metadata at the right time. Structured fields support fast, reliable reasoning, while narrative detail can be surfaced as needed. Managing the balance is what keeps LLM‑enabled workflows from becoming either brittle or overly expensive in terms of tokens.
Augmenting Metadata for LLMs
Because many geospatial metadata standards were written with human readers in mind, they tend not to be optimized for use in LLM-orchestrated workflows. LLMs work best when metadata is formatted with clarity, consistency, and structure. A little augmentation can go a long way.
Machine‑Readable Lineage
Narrative lineage is valuable, but it can also be difficult for a model to interpret reliably. A paragraph describing how a dataset was snapped, merged, and resampled contains nuance, but an ordered, explicit list of those steps can give a model a clearer sense of what actually happened. Ideally, both forms exist: the narrative for depth, and the structure for action.
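A structured lineage might look something like the sketch below. The field names are illustrative assumptions rather than part of any published standard; the point is simply that each step is ordered, named, and parameterized in a way a model can act on.

```python
# A minimal sketch of machine-readable lineage, expressed as an ordered list
# of processing steps. The keys ("operation", "parameters", etc.) and values
# are assumptions for illustration only.
lineage = [
    {
        "step": 1,
        "operation": "snap",
        "description": "Snapped road endpoints to intersection nodes",
        "parameters": {"tolerance_m": 2.0},
        "date": "2023-04-12",
    },
    {
        "step": 2,
        "operation": "merge",
        "description": "Merged county and state centerline sources",
        "sources": ["county_roads_2022", "state_centerlines_2021"],
        "date": "2023-04-15",
    },
    {
        "step": 3,
        "operation": "generalize",
        "description": "Douglas-Peucker simplification for cartographic use",
        "parameters": {"tolerance_m": 5.0},
        "date": "2023-04-20",
    },
]
```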

Explicit Units, Datums, and Accuracy
Humans can often intuit units from context or recognize a datum from familiar coordinate ranges. Models will try to do the same, but the results may not be what we expect. Units should be stated plainly. Vertical and horizontal datums should be unambiguous, and CRS definitions should be provided in a form that doesn’t require external lookup. An EPSG code is useful, but a full OGC CRS‑WKT or proj4 string gives a model the explicit details it needs, such as datum, ellipsoid, axis order, and units, without relying on assumptions about what an EPSG code represents. Accuracy should be numerical when possible or at least specific enough to guide downstream decisions. These details help prevent LLMs from making incorrect assumptions about how to align or compare datasets.
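For example, pyproj can expand a bare EPSG code into explicit WKT, so the datum, ellipsoid, axis order, and units travel with the data rather than living behind an identifier. This is a minimal sketch assuming pyproj is available.

```python
# A minimal sketch using pyproj (an assumption about tooling) to expand an
# EPSG code into explicit CRS details a model can read directly.
from pyproj import CRS

crs = CRS.from_epsg(26918)        # NAD83 / UTM zone 18N
print(crs.to_wkt(pretty=True))    # full OGC WKT, including datum, ellipsoid, and units
print(crs.axis_info)              # explicit axis order and unit names
```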

Semantic Metadata
Field names like height_ft or temp_val only tell part of the story. Semantic metadata adds meaning that names alone can’t carry: whether a value represents a maximum, an average, an eave height, or a ground measurement. A brief description at the column level, such as a column comment in PostGIS, can help a model make better decisions and avoid overly literal interpretations.
Usage Intent
Often, metadata describes what a dataset is but not what it was for. Knowing that a dataset was built for flood modeling, routing, hydrology, or vegetation mapping tells a model a lot about how to handle it. Usage intent can also help a model avoid inappropriate use, for example by recognizing that a generalized boundary shouldn’t be used for parcel‑level analysis.
Lightweight Profiles for LLM Workflows
One way to bridge gaps is to define a small set of fields that are consistently present, machine‑readable, and easy to maintain. This doesn’t replace existing standards; it sits alongside them. A lightweight profile might include CRS-WKT, vertical datum, temporal coverage, resolution, geometry type, and a structured lineage list. It’s not meant to capture everything, just the subset most essential to automated reasoning.
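Such a profile could be as simple as a small JSON document generated alongside the authoritative record. The sketch below shows one possible shape; the field names and values are illustrative assumptions, not a published specification.

```python
# A minimal sketch of a lightweight, LLM-facing profile. Field names are
# illustrative assumptions and the values are hypothetical; the full
# FGDC/ISO/STAC record remains the authoritative source.
llm_profile = {
    "crs_wkt": 'PROJCRS["NAD83 / UTM zone 18N", ...]',  # full WKT in practice, truncated here
    "vertical_datum": "NAVD88 (meters)",
    "temporal_coverage": {"start": "2021-03-01", "end": "2021-11-30"},
    "resolution": "1 m ground sample distance",
    "geometry_type": "raster",
    "lineage": [
        "Acquired via airborne LiDAR, leaf-off, 2021",
        "Ground classified and gridded to 1 m DEM",
        "Hydro-flattened along mapped streams",
    ],
    "intended_use": "floodplain modeling; not suitable for parcel-level analysis",
}
```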
None of these augmentations require a new standard or a wholesale reinvention of existing ones. They simply involve documenting a bit more of what we know about our data and lightly transforming it to forms that models can use reliably.
Wrapping Up
One of the interesting side effects of bringing LLMs into geospatial work is that they quietly change the incentives around metadata. Throughout my career, metadata has been something we knew we should do but often put off in favor of more immediate needs. LLMs don’t let us get away with that as easily. They surface the cost of missing context, unclear assumptions, or inconsistent terminology in ways that other tools never quite did.
This shift doesn’t mean we need to become rigid or bureaucratic about metadata. It simply means that documenting data clearly, consistently, and close to the source pays off more than it used to. The same habits that help human analysts understand a dataset help models understand it, too. And as more geospatial workflows blend traditional tools with LLM‑assisted steps, metadata starts to feel less like administrative overhead and more like operational infrastructure.
LLMs give us a practical reason to close the gap between what we know and what we document. That’s not a bad place to be.
Homework image credit: Fir0002 (assumed, based on copyright claims; no machine-readable author provided), GFDL 1.2 (http://www.gnu.org/licenses/old-licenses/fdl-1.2.html), via Wikimedia Commons
Header image credit: Walter Reed Army Medical Center. Office Of The Quartermaster, Public domain, via Wikimedia Commons