Plausibility Is Not Provenance

Open a dataset of building footprints for a mid-sized city in sub-Saharan Africa. The polygons are clean. They snap to grid, close correctly, and sit at plausible addresses. Run them through any geometry validator and they pass. Load them in QGIS and they look, to every reasonable inspection, like a map. Some of them may be wrong in ways that are hard to see.

Not wrong in the way that hand-digitized data is wrong, where a rushed volunteer draws a rough approximation and you can see the imprecision at the edges. Wrong in the way that a confident model is wrong: internally consistent, spatially coherent, and disconnected from what is actually on the ground. A building that was demolished three years ago. A footprint that conflates two structures into one. A polygon placed with sub-meter precision on a plot that has never been built on.

This is one of the distinctive problems of AI-derived geospatial data, and it differs from many of the data quality problems the field is used to handling. Traditional errors in geographic datasets tend to be visible. A mislabeled road class, an offset coastline, a gap in a polygon ring: these announce themselves. The errors introduced by deep learning models trained on satellite imagery do not. They are geometrically valid. They are topologically sound. They look, in every formal sense, correct.

The volume of such data is now substantial. Microsoft’s Global ML Building Footprints dataset contains over 1.4 billion AI-derived building footprints covering much of the planet (Microsoft, 2022). Google’s Open Buildings dataset covers comparable territory across Africa, Asia, and Latin America (Google Research, 2023). These products are genuinely useful, and their developers are transparent about how they were produced. The problem is not that the data exists. The problem is what happens to it next.

Once AI-derived features enter downstream workflows, their origin tends to disappear. Fila et al. (2025) found that the identification of AI-generated buildings within OpenStreetMap remains challenging precisely because provenance tags are inconsistently applied, even when data providers recommend them. The same features that were flagged as model outputs at the point of creation arrive in research datasets, planning tools, and commercial products as plain geometry, indistinguishable from features that were field-verified or carefully hand-digitized.

This matters for a reason that the geospatial field has not yet fully confronted. In natural language processing, the last several years produced a substantial body of research on what happens when AI-generated outputs re-enter the training pipelines of subsequent models. The findings are not reassuring. Shumailov et al. (2024) demonstrated that training generative models recursively on their own outputs produces irreversible degradation: the tails of the original data distribution disappear, errors compound rather than cancel, and the resulting model becomes confidently, systematically wrong in ways that are difficult to detect because the outputs remain superficially plausible. They called this process model collapse.

Geospatial foundation models are earlier on this curve than their counterparts in language. Clay, Prithvi, SatCLIP (Klemmer et al., 2025), and a growing cohort of Earth observation models are being trained on satellite imagery at planetary scale, and many of the label sources feeding those training pipelines are themselves model outputs. The question of whether geospatial AI is already accumulating the conditions for its own version of model collapse has not been seriously asked. This article argues it should be.

Lessons From NLP on Model Collapse

To understand why the geospatial community should be paying attention, it helps to understand what the NLP community discovered, and when.

For much of the early 2020s, the dominant anxiety about large language models concerned what went into their training data: copyrighted text, private information, demographic bias baked into internet corpora. These were legitimate concerns, and they attracted serious research attention. A quieter problem was developing in parallel, one that was structural rather than ethical, and that would only become legible once AI-generated text began appearing at scale across the web.

The problem was feedback. As large language models became capable of producing fluent, plausible text, that text began circulating online. It was scraped, indexed, and eventually ingested by the next generation of models as training data. The models were, in effect, beginning to learn from themselves.

Shumailov et al. (2024) gave this process a name and a rigorous theoretical account. Working across variational autoencoders, Gaussian mixture models, and large language models, they demonstrated that training generative models recursively on synthetic data produces what they termed model collapse: a systematic, irreversible degradation in which the tails of the original content distribution disappear. The failure mode is not uniform decay. The model does not get worse at everything proportionally. Instead, it loses the edges of its knowledge first. Rare phenomena, underrepresented categories, distributional outliers: these are the first casualties. What remains is a model that has become very confident about a narrower and narrower slice of reality, while remaining superficially capable on the kinds of outputs that look normal.

The mechanism is worth dwelling on, because it is counterintuitive. Each generation of training on synthetic data introduces a small statistical approximation error. That error is not corrected in subsequent iterations; it is compounded. The model in generation two treats the slightly narrowed outputs of generation one as ground truth, and narrows further. By generation three or four, the tails of the distribution have not merely shrunk; they have been forgotten. The model has no basis on which to recover them, because the data that described them no longer exists in its training set.
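The dynamic can be reproduced in miniature. The sketch below is a deliberate toy, not Shumailov et al.’s experimental setup: each “generation” fits a Gaussian to the previous generation’s outputs, then samples from the fit while discarding draws beyond two standard deviations, a crude stand-in for a model that under-generates its tails.

```python
import random
import statistics

# Toy model of recursive training (a simplification, not Shumailov et
# al.'s setup). Each generation fits a Gaussian to its predecessor's
# outputs, then samples from the fit while silently dropping the tails.
random.seed(0)

def next_generation(samples, n=20_000):
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    kept = []
    while len(kept) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) < 2 * sigma:  # the "model" under-generates tails
            kept.append(x)
    return kept

data = [random.gauss(0.0, 1.0) for _ in range(20_000)]  # "real" distribution
spread = [statistics.pstdev(data)]
for _ in range(5):
    data = next_generation(data)
    spread.append(statistics.pstdev(data))

# The spread shrinks every generation, and nothing inside the loop can
# recover it: the data describing the tails no longer exists.
print([round(s, 3) for s in spread])
```

No single generation looks broken from the inside; each produces perfectly plausible draws from the previous fit. The tails simply stop existing.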

It is worth acknowledging a distinction a careful reader may raise. The model collapse literature was developed in response to a specific and massive phenomenon: AI-generated text was beginning to circulate at web scale, creating the possibility that model outputs would be scraped into later training corpora. No equivalent flood of AI-generated geodata exists. The conditions that originally motivated concern about recursive training in NLP are more acute there than in geospatial AI, at least in absolute terms. But the mathematical dynamics the research identifies turn not on absolute volume but on the proportion of synthetic data within a given training corpus. In parts of the world where field-verified geospatial data is scarce and AI-derived products represent the majority of available labeled examples, that proportion can be high enough to matter, regardless of how the global totals compare.

Seddik et al. (2024) extended the analysis statistically and arrived at a finding with direct practical implications: model collapse cannot be avoided when a model trains solely on synthetic data, but it can potentially be avoided when real and synthetic data are mixed, provided the proportion of synthetic data stays below a threshold. They do not offer a universal number, because the threshold depends on the task, the model architecture, and the characteristics of the original data distribution. But the implication is clear: the synthetic fraction matters, and exceeding it has consequences that accumulate silently across training generations.
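The same toy model can illustrate the threshold claim, with the caveat that the numbers here are purely illustrative; the real threshold depends on the task, the architecture, and the data, and Seddik et al. offer no universal value. Each generation below trains on a blend of fresh real data and the previous generation’s synthetic outputs.

```python
import random
import statistics

# Illustrative only: a tail-dropping toy "model", trained each generation
# on a mix of fresh real data and its predecessor's synthetic outputs.
random.seed(1)

def synthetic_draws(samples, n):
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    out = []
    while len(out) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) < 2 * sigma:  # the "model" under-generates tails
            out.append(x)
    return out

def final_spread(real_fraction, generations=12, n=10_000):
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        n_real = int(real_fraction * n)
        data = ([random.gauss(0.0, 1.0) for _ in range(n_real)]
                + synthetic_draws(data, n - n_real))
    return statistics.pstdev(data)

collapsed = final_spread(0.0)  # purely recursive: spread collapses
mixed = final_spread(0.5)      # half real each generation: spread stabilizes
print(round(collapsed, 3), round(mixed, 3))
```

The particular mixing fraction that stabilizes this toy is an artifact of its construction. The transferable point is that the synthetic fraction is the control variable, and that quantity is precisely what most pipelines do not track.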

Two aspects of this research program are particularly relevant to what follows. The first is that the degradation is not easily detectable from the outside. A collapsed model does not produce outputs that look broken. It produces outputs that look normal, because normality is precisely what it has been optimized to reproduce. The signal that something has gone wrong is the absence of the unusual, the rare, the edge case: and those are exactly the things that routine quality evaluation tends not to test for.

The second is that the damage is described as irreversible. Shumailov et al. (2024) are explicit on this point. Once the tails of the distribution have been trained away, there is no gradient descent path back to them. The only remedy is access to real data that was never contaminated by the feedback loop, and the ability to identify which parts of the training corpus that real data represents.

Both of these properties, the invisibility of the degradation and the irreversibility of the damage, have obvious and troubling analogues in geospatial AI. The next section examines why.

An Emerging Feedback Loop

The NLP model collapse literature identifies a sequence of conditions that, when present together, create the feedback loop that produces degradation. AI-generated outputs circulate widely and are treated, implicitly or explicitly, as reliable data. They re-enter training pipelines for subsequent models, often without any flag indicating their origin. The proportion of synthetic data in those pipelines climbs, unmonitored, until the dynamics described by Shumailov et al. (2024) begin to take hold.

There is an important distinction to make at the outset. The primary risk is not that geospatial foundation models are being pretrained on synthetic maps. Most leading Earth observation foundation models are pretrained on observational imagery from satellites and other sensors. The vulnerability enters later, when those models are fine-tuned or evaluated for specific tasks using labels that may themselves have been produced by earlier models. In other words, the feedback loop is less about the imagery and more about the labels attached to it.

Each of these conditions has a geospatial counterpart. They are worth examining in turn.

The first condition is wide circulation. AI-derived geospatial datasets are now among the most comprehensive and widely used sources of spatial data available. Microsoft’s Global ML Building Footprints covers over 1.4 billion structures across dozens of countries (Microsoft, 2022). Google’s Open Buildings dataset extends comparable coverage across Africa, Asia, and Latin America. These are not niche research products. They power commercial applications, inform humanitarian response, and seed academic research. Their wide circulation is precisely their value proposition.

The second condition is implicit treatment as ground truth. This is more subtle, but it follows naturally from the first. When a dataset is comprehensive, globally consistent, and freely available, it becomes a default. Researchers building training pipelines for new segmentation or classification models reach for these products because they exist, because they are large enough to be useful, and because no comparable field-verified alternative exists at the same scale. The AI-derived origin is known in principle but rarely tracked in practice. Fila et al. (2025) found that even within OpenStreetMap, where the community has strong norms around data sourcing, provenance tags for AI-generated buildings are inconsistently applied, making it difficult to identify which features originated from model outputs. If provenance is poorly maintained in a community specifically oriented around data quality and transparency, it is reasonable to expect it to be maintained even less rigorously in downstream research and commercial datasets.

The third condition is re-entry into training pipelines. Large-scale conflation products that blend community-mapped data with AI-derived footprints can create this condition, not through any failure of design but as a consequence of what global geospatial infrastructure requires. The Overture Maps Foundation, whose buildings dataset powers Bing Maps and Esri’s ArcGIS Living Atlas among other major products, illustrates the point precisely because it is among the more transparent actors in this space. Overture documents its conflation order: OpenStreetMap takes priority, followed by Esri Community Maps, then Google Open Buildings at high precision, then Microsoft ML Building Footprints, then Google Open Buildings at lower precision (Overture Maps Foundation, 2024). The documentation also acknowledges directly that many Overture buildings are derived from ML sources. Overture is useful here not because it is unusually risky, but because its documentation makes visible a provenance problem that is easier to miss in less transparent datasets. The problem this transparency illuminates is not specific to Overture. Even where provenance is tracked at the point of production, it can become harder to recover at the point of consumption.

Many developers who build applications on top of conflation products, and researchers who use such products as label sources for training new models, encounter a unified dataset in which the boundary between observational and AI-derived geometry may be difficult to recover. This is not a criticism of conflation as a practice. Assembling the best available sources is exactly what responsible global data infrastructure should do. It is an observation about what can happen downstream. A building detector is built, and its outputs are released openly. A conflation layer aggregates those outputs with community data to improve coverage. A research team downloads the result as a label source because it is the most comprehensive thing available. That model’s outputs enter yet another downstream product. No one needs to misrepresent what the data is for the AI-derived fraction of the training inputs to become harder to identify. The structural problem is not who builds the conflation layer. It is the absence of a standard way to track the synthetic fraction as data moves downstream.
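The missing bookkeeping is not conceptually complicated. The sketch below is hypothetical in every particular; the attribute names `origin` and `ml_derived` are invented, and no current standard defines them. It shows a priority-ordered conflation step that carries a per-feature origin flag through the merge and reports the synthetic fraction of the result:

```python
# Hypothetical sketch: conflation that preserves per-feature origin
# metadata. The "origin" attribute and its values are invented for
# illustration; no current standard defines them.
def conflate(layers):
    """Merge feature lists in priority order (earlier layers win)."""
    merged, seen = [], set()
    for layer in layers:
        for feature in layer:
            if feature["id"] not in seen:
                seen.add(feature["id"])
                merged.append(feature)
    return merged

def synthetic_fraction(features):
    ml = sum(1 for f in features if f.get("origin") == "ml_derived")
    return ml / len(features)

community = [{"id": "a", "origin": "survey"},
             {"id": "b", "origin": "survey"}]
ml_layer  = [{"id": "b", "origin": "ml_derived"},  # duplicate: community wins
             {"id": "c", "origin": "ml_derived"},
             {"id": "d", "origin": "ml_derived"}]

merged = conflate([community, ml_layer])
print(synthetic_fraction(merged))  # → 0.5
```

Nothing here is technically demanding. The difficulty is institutional: the flag has to survive every download, merge, and re-release between the original model and the eventual training pipeline.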

The data-sparse regions argument sharpens this picture considerably. Prithvi-EO-2.0 is pretrained on approximately 4.2 million samples from NASA’s Harmonized Landsat and Sentinel-2 dataset, covering more than 800 ecoregions (IBM Research, 2024). Clay is pretrained on roughly 70 million satellite image chips sampled globally from Sentinel-2, Landsat, and synthetic aperture radar sources (Development Seed, 2024). The issue is not those pretraining corpora. It is what happens when these models are adapted for specific downstream tasks, including building segmentation, land cover classification, and flood mapping. At that stage, the fine-tuning labels are typically drawn from existing labeled datasets: BigEarthNet (Sumbul et al., 2019), fMoW (Christie et al., 2018), and SpaceNet (Van Etten et al., 2018), among others. In well-mapped, data-rich regions, those labels are often observationally derived. In data-sparse regions, available labels may rely more heavily on AI-derived products, because little else exists at the required scale.

The practical consequence is geographic unevenness in risk. A foundation model fine-tuned for building detection in Western Europe or North America may be trained predominantly on field-verified or carefully hand-digitized labels. The same model fine-tuned for deployment in Central Africa or Southeast Asia is likely drawing on a much higher proportion of AI-derived labels simply because of what is available. The replace-versus-accumulate distinction from the model collapse literature, between pipelines in which each generation trains only on its predecessor’s outputs and pipelines in which synthetic data accumulates alongside the original real data, is relevant here: if each successive model trained for data-sparse regions relies more heavily on the outputs of the previous generation because observational labels are scarce and AI-derived ones are abundant, the region-specific training pipeline approaches the replace scenario faster than the global aggregate suggests.

What makes this particularly difficult to address is that it is not yet being measured in any routine or standardized way. The fraction of AI-derived labels in any given geospatial foundation model fine-tuning dataset is not a quantity that is routinely disclosed or tracked. Zhu et al. (2026), in their framework for what an ideal Earth foundation model should look like, explicitly call for adherence to FAIR data principles in GeoFM pretraining pipelines, specifically to ensure that heterogeneous geospatial datasets can be consistently traced. But FAIR principles address findability and accessibility more than provenance: knowing where a dataset came from is not the same as knowing how much of it was itself a model output. The geospatial field has not yet developed the vocabulary, let alone the tooling, to answer that second question. The next section examines why that gap is not specific to geospatial AI, and what the NLP community’s struggle to address analogous problems suggests about how hard it will be to close.

Three Failure Modes to Watch

The immediate danger is not a dramatic collapse visible from orbit. It is a quieter degradation in the systems used to validate, benchmark, and reuse model outputs. That is why the first signs are likely to appear not in obviously broken maps, but in evaluation practices that continue to report confidence after independence has been lost.

The model collapse dynamics described in Section 2 do not operate in isolation. They are enabled and amplified by specific structural weaknesses in how geospatial AI systems are built and evaluated. Three failure modes are particularly worth naming, because all three have clear precedents in the NLP literature, all three are visible in geospatial AI in recognizable form, and none of them is yet being systematically monitored.

Inflated benchmarks

The most immediate failure mode is benchmark contamination: the situation in which accuracy metrics overstate real-world performance because the evaluation data is not genuinely independent of the training data. In NLP, this problem became acute when large models began memorizing test sets that had leaked into pretraining corpora, producing benchmark scores that reflected retrieval rather than generalization. The geospatial version of this problem has a different mechanism but an equally well-documented effect.

Geographic data is spatially auto-correlated by nature. Nearby observations are more similar than distant ones: a pixel in a satellite image shares spectral characteristics with its neighbors, and a building in a dense urban neighborhood resembles the buildings beside it. When training and validation sets are constructed by random sampling rather than spatial partitioning, this auto-correlation creates a form of leakage. The model effectively memorizes local spatial patterns and is then evaluated on test samples drawn from the same local context, producing accuracy figures that would not survive deployment in a new geographic area.

The performance inflation from this source is substantial and documented. Kattenborn et al. (2022) demonstrated that randomly sampled holdout evaluation overestimated model performance by up to 28% compared to spatially blocked cross-validation, using a CNN-based tree species segmentation task across spatially distributed drone image acquisitions. The same dynamic has been demonstrated across remote sensing tasks ranging from land cover classification to biomass estimation. Ploton et al. (2020), examining a random forest predicting above-ground forest biomass, found that random splits suggested good predictive skill while spatial cross-validation suggested no predictive skill at all, a result the authors attributed directly to data leakage from spatially auto-correlated train-test splits.
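The leakage mechanism is easy to demonstrate at toy scale. The sketch below is an illustration of the mechanism, not a reconstruction of either cited study: the target variable follows a coarse regional trend plus noise, and the same 1-nearest-neighbour “model” is scored once under a random holdout and once under a spatial block.

```python
import random

# Stdlib-only illustration of spatial leakage (not Kattenborn et al.'s
# or Ploton et al.'s experiment). The "world" is a 50x50 grid whose
# target follows a coarse regional trend plus noise; the "model" is
# 1-nearest-neighbour regression with Manhattan distance.
random.seed(7)
BLOCK = 10  # the trend is constant within 10x10 regions

def target(x, y):
    return 3.0 * (x // BLOCK) + (y // BLOCK) + random.gauss(0, 0.1)

points = [(x, y, target(x, y)) for x in range(50) for y in range(50)]
random.shuffle(points)

def nn_error(train, test):
    # mean absolute error of 1-nearest-neighbour prediction
    def predict(x, y):
        return min(train, key=lambda p: abs(p[0] - x) + abs(p[1] - y))[2]
    return sum(abs(predict(x, y) - v) for x, y, v in test) / len(test)

rand_err = nn_error(points[500:], points[:500])   # random holdout
west = [p for p in points if p[0] < 40]           # spatial block:
east = [p for p in points if p[0] >= 40]          # hold out the east edge
block_err = nn_error(west, east)

print(round(rand_err, 2), round(block_err, 2))
```

The random split looks accurate because every test point has near neighbours from the same region in the training set; the spatial block forces genuine extrapolation, and the error jumps. The magnitudes are artifacts of the toy, but the direction of the gap is the documented phenomenon.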

This problem exists independently of AI-derived labels. But AI-derived labels amplify it in a specific way. When the same AI model’s outputs are used to produce both training labels and validation labels for a new downstream model, the auto-correlation structure of the original model’s errors is inherited by both sides of the split. A validation set drawn from the same AI-labeled source as the training data is not measuring generalization to ground truth; it is measuring consistency with a prior model’s systematic biases. The benchmark passes, but it is not testing what it claims to test.

Distribution shift that looks like accuracy

The second failure mode concerns what happens when a model encounters geographic contexts it has not genuinely learned. In NLP, large models trained predominantly on English-language internet text produced systematically degraded performance for other languages and dialects, but this was not always detectable from benchmark scores because the benchmarks themselves were drawn from the same distribution as the training data. The degradation only became visible in deployment.

Geospatial AI has an analogue that may be harder to detect than the NLP version. Satellite imagery is globally available, and models trained on it produce outputs everywhere, including regions where the training data was sparse. A land cover classification model trained predominantly on European or North American imagery produces geometrically valid, visually coherent outputs when applied to West African or Central Asian landscapes. Those outputs may be systematically wrong in ways that are invisible without field verification, because the model has learned to reproduce plausible-looking patterns rather than to identify what is actually present. The outputs do not look degraded. They look like maps.

The geospatial field has begun to recognize this problem in the context of geographic bias in training data. Zhu et al. (2026) explicitly identify geographic diversity as a prerequisite for Earth foundation models, noting that training data must be distributed across evenly sampled geographic regions to avoid data bias. But acknowledging the requirement is not the same as meeting it. The dominant benchmarks used to evaluate foundation model performance, SpaceNet (Van Etten et al., 2018), BigEarthNet (Sumbul et al., 2019), and fMoW (Christie et al., 2018), are geographically skewed toward well-imaged, data-rich regions. A model that performs well on these benchmarks may still have systematic blind spots in data-sparse areas, and those blind spots will not appear in the evaluation results.

The connection to model collapse is direct. If the labels used to fine-tune models in data-sparse regions are themselves AI-derived, and if those AI-derived labels inherit the distribution biases of the models that produced them, then successive generations of fine-tuning are not correcting for geographic blind spots; they are reinforcing them. The outputs remain coherent everywhere. The errors accumulate silently in the places where they are most consequential.

Evaluation debt

The third failure mode is the most diffuse and perhaps the most important. It concerns the structural tendency of a rapidly developing field to build benchmarks faster than it questions them, accumulating what might be called evaluation debt: a growing gap between what the benchmarks measure and what practitioners need to know.

NLP accumulated this debt visibly with GLUE. When Wang et al. (2018) introduced GLUE as a unified evaluation platform for natural language understanding, it was a genuine contribution that catalyzed multi-task learning research and gave the field a common frame of reference. It also became authoritative faster than it was understood. Models rapidly approached and then surpassed human performance on GLUE metrics, but subsequent analysis revealed that the apparent gains reflected exploitation of dataset-specific statistical artifacts rather than genuine linguistic understanding. SuperGLUE was introduced to address these limitations, and the cycle repeated. The benchmark was built, the benchmark was gamed, the benchmark was replaced, and the field spent years interpreting progress that was partly illusory.

Geospatial AI is currently in the benchmark-building phase of this cycle. SpaceNet, BigEarthNet, fMoW, and a growing range of successor datasets have become de facto standards for evaluating foundation model performance in Earth observation. They are used extensively, they are well-engineered, and they are not being systematically interrogated for the kinds of weaknesses that the NLP field found in GLUE only after significant investment had been made in optimizing for them. The spatial auto-correlation problem described above is one such weakness. Geographic skew is another. The presence of AI-derived labels in both training and validation splits may be a third.

What makes evaluation debt particularly costly in geospatial AI is the application context. Benchmarks that overstate accuracy in flood mapping, building detection, or land cover classification are not merely academically misleading: they inform decisions about where to deploy models in disaster response, urban planning, and climate monitoring. The gap between benchmark performance and real-world performance is not an abstract methodological concern. It is the distance between what a system claims to know about the world and what is actually true.

Geospatial Provenance in the AI Era

When NLP researchers began grappling with the transparency problems created by large models trained on opaque data, the tools they reached for were new and improvised. Mitchell et al. (2019) proposed model cards as short documents accompanying trained models that disclose their intended use, performance characteristics, and evaluation conditions across relevant demographic and geographic groups. Gebru et al. (2021) proposed datasheets for datasets, a parallel framework for documenting how training data was collected, what it contains, who it was designed for, and what its limitations are. Neither framework was mandatory. Neither had enforcement mechanisms. Both were largely voluntary disclosures that required goodwill and institutional pressure to produce.

They were also, for all their imperfections, more than nothing. They introduced a shared vocabulary for describing what a model or dataset is and is not, and they normalized the expectation that this vocabulary should be applied before a model is deployed rather than after its failures become visible. The geospatial AI community, by contrast, is deploying models into high-stakes applications without an equivalent vocabulary, and the absence is not for lack of prior work.

The geospatial field has had lineage standards for decades. ISO 19115-1:2014, the international standard for geographic information metadata, includes explicit lineage elements: structured fields for recording the sources from which a dataset was derived and the process steps by which it was produced. These are not afterthoughts. Lineage has been a prominent element in the ISO metadata standard since its early versions, reflecting a long-standing community recognition that knowing where geographic data comes from is essential to assessing its fitness for use. The standard also has a well-developed ecosystem of profiles, implementations, and cataloging tools.

The problem is that this infrastructure was designed for a different problem. ISO 19115 lineage describes processing history at the dataset level. It records, in structured or free-form text, that a raster product was derived from a particular satellite acquisition, processed through a particular atmospheric correction algorithm, and resampled to a particular resolution. It was built for a world in which geographic data was produced by human analysts, photogrammetric workflows, and physical sensors, and in which the primary transparency question was: what was done to the raw data to produce this product?

That question is not the same as: what fraction of this dataset’s features are themselves the outputs of a prior model? ISO 19115 does not have a field for synthetic fraction. It does not distinguish, at the feature level, between a building footprint that was hand-digitized by a local mapper and one that was inferred by a deep learning model trained on satellite imagery from a different continent. Closa et al. (2019), reviewing the state of geospatial provenance metadata, noted explicitly that there are still gaps in the description of provenance metadata that prevent the capture of comprehensive provenance useful for reuse and reproducibility. They were writing about gaps in the existing standard; the gaps created by AI-derived features did not yet have the scale to be a central concern.

They do now. The OGC November 2024 Metadata Code Sprint, which brought together implementers to work on the next generation of geospatial metadata standards including a planned JSON encoding for ISO 19115, explicitly discussed using the W3C PROV vocabulary for encoding lineage, noting that the current free-form text approach for lineage within DCAT contexts is unsatisfactory for automated use (OGC, 2025). This is meaningful progress. It also illustrates precisely the gap: the standards community is still working at the level of making lineage machine-readable and interoperable, which is a necessary precondition for addressing the synthetic fraction problem but is not itself a solution to it.

The model card and datasheet frameworks from NLP, imperfect as they are, addressed a specific and narrow question: before someone uses this model or this dataset, what do they need to know about its limitations? That question, applied to a geospatial foundation model fine-tuning dataset, would require disclosures that current standards do not support. What fraction of the training labels in this dataset are AI-derived? From which models? Trained on which geographies? With what known failure modes? What fraction of the validation labels share the same AI-derived provenance as the training labels? Without answers to these questions, the benchmarks described in Section 4 cannot be properly interpreted, and the model collapse dynamics described in Section 2 cannot be detected until their effects are already embedded in deployed systems.
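One possible shape for such a disclosure is nothing more than a structured record that makes those questions answerable. Every field name below is hypothetical; no standard currently defines this schema:

```python
# Hypothetical label-provenance record; all field names and values are
# invented for illustration.
label_provenance = {
    "dataset": "example-finetune-labels-v1",
    "train_labels_ai_derived_fraction": 0.62,           # estimated
    "source_models": ["example-building-detector-v3"],  # upstream model(s)
    "source_model_training_regions": ["example-region"],
    "known_failure_modes": ["merges adjacent structures",
                            "retains demolished buildings"],
    "validation_shares_train_provenance": True,
}

def validation_is_independent(record):
    """A benchmark is only interpretable if its validation labels do not
    inherit the training labels' AI-derived provenance."""
    return not record["validation_shares_train_provenance"]

print(validation_is_independent(label_provenance))  # → False
```

The record is trivial to produce at the point of dataset creation and nearly impossible to reconstruct afterwards, which is the whole argument for disclosure norms.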

There is a further structural problem that standards alone cannot solve. Even if ISO 19115 were amended tomorrow to include a mandatory synthetic fraction field, compliance would depend on the willingness of dataset producers to populate it accurately, and on the willingness of downstream consumers to propagate it through conflation and aggregation pipelines. Fila et al. (2025) found that provenance tags for AI-generated buildings in OpenStreetMap are inconsistently applied even when data providers explicitly recommend them. If voluntary tagging fails at the level of individual features in a community with strong data quality norms, mandatory fields in a metadata standard are unlikely to behave differently in the broader ecosystem of research datasets and commercial products where the norms are weaker and the compliance incentives are lower.

The honest summary of the situation is this: the geospatial community has more provenance infrastructure than it is given credit for, and less than it needs. ISO 19115 lineage is a serious and useful tool for the problem it was designed to solve. It was not designed to solve the problem of tracking AI-derived labels through multi-generation training pipelines. No standard currently exists that was. The gap is not simply technical: it is conceptual. The field does not yet have an agreed vocabulary for describing the synthetic composition of a training dataset or the AI-derived fraction of a benchmark, which means it does not yet have a basis for the kind of disclosure norms that, in NLP, eventually produced model cards and datasheets. Those norms emerged slowly, under pressure from researchers who named the problem clearly enough that the community had to respond. The final section argues that naming the problem clearly is also the most productive thing geospatial AI can do right now.

What the Field Can Do Now

The conversation about responsible geospatial foundation model development has begun. A Nature Machine Intelligence editorial published in August 2025 called for attention to the sustainable development of GeoFMs, noting challenges including resource efficiency and privacy in the collection of high-resolution imagery (Nature Machine Intelligence, 2025). The PANGAEA benchmark introduced a standardized evaluation protocol spanning diverse geographic regions, sensor types, and task domains, explicitly addressing the narrow and inconsistent evaluation practices that had characterized the field to that point (Marsocci et al., 2024). The OGC November 2024 Metadata Code Sprint progressed work toward machine-readable lineage encoding (OGC, 2025). These are real steps.

They do not, however, address the synthetic fraction problem, the recursive training pipeline risk, or the provenance gap at the fine-tuning stage. The responsible AI discourse in geospatial AI is currently oriented around computational cost, privacy, and geographic diversity in benchmarks. The NLP community had no such warning: the model collapse literature described a phenomenon already happening at scale before anyone had named its dynamics. Shumailov et al. (2024) appeared in Nature in July 2024, by which point the feedback loop of models training on model output was already established. The geospatial community is reading this literature before its own domain reaches the equivalent saturation point. That window should not be squandered.

Three concrete steps follow from the analysis in the preceding sections. First, papers introducing or fine-tuning geospatial foundation models should disclose the estimated fraction of AI-derived labels in their training and validation datasets. This is a norm change requiring no new standards or tooling, only an expectation from journals and conference program committees, which already require computational cost disclosures. Second, spatial cross-validation splits should be published alongside random splits as a standard part of benchmark releases. Kattenborn et al. (2022) documented a performance gap of up to 28 percentage points between the two approaches: benchmarks published with only random splits systematically overstate real-world performance, and the overstatement is measured, not hypothetical. Third, the OGC’s current work on ISO 19115-4 JSON encoding, already oriented toward machine-readable lineage using W3C PROV vocabulary, is a natural point at which to introduce a synthetic origin flag at the feature level. The specific requirement is modest: a propagatable attribute indicating whether a feature was derived from model inference rather than field observation or photogrammetric digitization. This would not resolve the provenance gap in existing datasets, but it would establish the vocabulary that future benchmark producers and model card authors need.
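To make the first and third steps concrete, here is a minimal Python sketch of what a propagatable feature-level origin flag could look like, and how a dataset producer might compute the synthetic fraction to disclose. The attribute name `synthetic_origin` and its category values are illustrative assumptions, not part of any existing OGC or ISO standard; the point is only that the flag must survive conflation.

```python
# Each feature carries a hypothetical provenance attribute indicating
# whether its geometry came from model inference or from observation.
# Attribute name and values are illustrative, not standardized.
features = [
    {"id": "bldg-001", "synthetic_origin": "model_inference"},
    {"id": "bldg-002", "synthetic_origin": "field_survey"},
    {"id": "bldg-003", "synthetic_origin": "model_inference"},
    {"id": "bldg-004", "synthetic_origin": "photogrammetric_digitization"},
]

def synthetic_fraction(features):
    """Fraction of features derived from model inference: this is the
    number a benchmark paper would disclose under the first step."""
    n_synthetic = sum(f["synthetic_origin"] == "model_inference" for f in features)
    return n_synthetic / len(features)

def conflate(a, b):
    """Merge two versions of a feature while propagating the flag:
    if either input was model-derived, the result is model-derived."""
    origins = (a["synthetic_origin"], b["synthetic_origin"])
    origin = "model_inference" if "model_inference" in origins else a["synthetic_origin"]
    return {"id": a["id"], "synthetic_origin": origin}

print(synthetic_fraction(features))  # 0.5
```

The conservative merge rule is the design choice that matters: once a model-derived geometry touches a feature, the flag sticks, so downstream aggregates cannot silently launder inference back into "observed" data.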

None of these steps requires certainty that geospatial model collapse is already underway. They are warranted by the existence of documented mechanisms and the absence of monitoring.
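The gap Kattenborn et al. (2022) measured arises because a random split places near-neighbors on both sides of the train/test boundary, letting spatial autocorrelation leak information. A minimal sketch of a blocked spatial holdout, assuming point samples with x/y coordinates and a hypothetical block size, looks like this; it is an illustration of the principle, not their exact protocol.

```python
import random

def spatial_block_split(samples, block_size=10.0, test_frac=0.25, seed=0):
    """Assign each sample to a grid block, then hold out whole blocks,
    so no test point sits in the same block as any training point."""
    blocks = {}
    for s in samples:
        key = (int(s["x"] // block_size), int(s["y"] // block_size))
        blocks.setdefault(key, []).append(s)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    n_test = max(1, int(len(keys) * test_frac))
    test = [s for k in keys[:n_test] for s in blocks[k]]
    train = [s for k in keys[n_test:] for s in blocks[k]]
    return train, test

# Synthetic point samples scattered over a 100 x 100 unit region.
rng = random.Random(42)
samples = [{"x": rng.uniform(0, 100), "y": rng.uniform(0, 100)} for _ in range(200)]
train, test = spatial_block_split(samples)
```

A model evaluated on `test` never sees a training point from the same block, which is what makes the resulting score an estimate of performance in genuinely unseen territory rather than interpolation between neighbors.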

The risk is not that future maps will look obviously broken. The more serious risk is that they will look right. They will validate cleanly, render convincingly, and agree with the models that came before them, while becoming harder to connect back to observation. That is the lesson geospatial AI should take from NLP’s mistakes: plausibility is not provenance, and coherence is not truth.


References

Christie, G., Fendley, N., Wilson, J., & Mukherjee, R. (2018). Functional Map of the World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6172–6180). IEEE. https://doi.org/10.1109/CVPR.2018.00646

Closa, G., Masó, J., Zabala, A., Pesquer, L., & Pons, X. (2019). A provenance metadata model integrating ISO geospatial lineage and the OGC WPS: Conceptual model and implementation. Transactions in GIS, 23(5), 1102–1124. https://doi.org/10.1111/tgis.12555

Development Seed. (2024). An open foundation model for Earth. Development Seed. https://developmentseed.org/projects/clay/

Fila, M., Štampach, R., & Herfort, B. (2025). AI-generated buildings in OpenStreetMap: frequency of use and differences from non-AI-generated buildings. International Journal of Digital Earth, 18(1), Article 2473637. https://doi.org/10.1080/17538947.2025.2473637

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé, H., III, & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723

Google Research. (2023). Open Buildings V3 Polygons [Dataset]. Google Earth Engine Data Catalog. https://developers.google.com/earth-engine/datasets/catalog/GOOGLE_Research_open-buildings_v3_polygons

IBM Research. (2024). IBM and NASA release a new version of Prithvi. IBM Research Blog. https://research.ibm.com/blog/prithvi2-geospatial

ISO. (2014). Geographic information — Metadata — Part 1: Fundamentals (ISO 19115-1:2014). International Organization for Standardization. https://www.iso.org/standard/53798.html

Kattenborn, T., Schiefer, F., Frey, J., Feilhauer, H., Mahecha, M. D., & Dormann, C. F. (2022). Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks. ISPRS Open Journal of Photogrammetry and Remote Sensing, 5, Article 100018. https://doi.org/10.1016/j.ophoto.2022.100018

Klemmer, K., Rolf, E., Robinson, C., Mackey, L., & Rußwurm, M. (2025). SatCLIP: Global, general-purpose location embeddings with satellite imagery. Proceedings of the AAAI Conference on Artificial Intelligence, 39(4), 4347–4355. https://doi.org/10.1609/aaai.v39i4.32457

Marsocci, V., Jia, Y., Le Bellier, G., Kerekes, D., Zeng, L., Hafner, S., Gerard, S., Brune, E., Yadav, R., Shibli, A., Fang, H., Ban, Y., Vergauwen, M., Audebert, N., & Nascetti, A. (2024). PANGAEA: A global and inclusive benchmark for geospatial foundation models (arXiv:2412.04204). arXiv. https://arxiv.org/abs/2412.04204

Microsoft. (2022). Global ML building footprints [Dataset]. GitHub. https://github.com/microsoft/GlobalMLBuildingFootprints

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 220–229). ACM. https://doi.org/10.1145/3287560.3287596

Nature Machine Intelligence. (2025). Towards responsible geospatial foundation models. Nature Machine Intelligence, 7, 1189. https://doi.org/10.1038/s42256-025-01106-7

OGC. (2025). November 2024 metadata code sprint summary report (OGC Document 24-063). Open Geospatial Consortium. https://docs.ogc.org/dp/24-063.html

Overture Maps Foundation. (2024). Overture 2024-04-16-beta.0 release notes. Overture Maps Foundation. https://overturemaps.org/overture-2024-april-beta-release-notes/

Ploton, P., Mortier, F., Réjou-Méchain, M., Barbier, N., Picard, N., Rossi, V., Dormann, C., Cornu, G., Viennois, G., Bayol, N., Lyapustin, A., Gourlet-Fleury, S., & Pélissier, R. (2020). Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nature Communications, 11, Article 4540. https://doi.org/10.1038/s41467-020-18321-y

Seddik, M. E. A., Chen, S.-W., Hayou, S., Youssef, P., & Debbah, M. (2024). How bad is training on synthetic data? A statistical analysis of language model collapse (arXiv:2404.05090). arXiv. https://doi.org/10.48550/arXiv.2404.05090

Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y

Sumbul, G., Charfuelan, M., Demir, B., & Markl, V. (2019). BigEarthNet: A large-scale benchmark archive for remote sensing image understanding. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 5901–5904). IEEE. https://doi.org/10.1109/IGARSS.2019.8900532

Van Etten, A., Lindenbaum, D., & Bacastow, T. M. (2018). SpaceNet: A remote sensing dataset and challenge series (arXiv:1807.01232). arXiv. https://arxiv.org/abs/1807.01232

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP (pp. 353–355). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5446

Zhu, X. X., Xiong, Z., Wang, Y., Stewart, A. J., Heidler, K., Wang, Y., Yuan, Z., Dujardin, T., Xu, Q., & Shi, Y. (2026). On the foundations of Earth foundation models. Communications Earth & Environment, 7, Article 103. https://doi.org/10.1038/s43247-025-03127-x

Header image: G. Edward Johnson, CC BY 4.0 https://creativecommons.org/licenses/by/4.0, via Wikimedia Commons