Data Is Hard

Where I work, we have developed a nuanced philosophy to describe the niceties of collecting data, managing it, validating it, and preparing it for use: “Data is hard.”

This was brought to light in a very public manner by the vandalism that was displayed on basemaps produced by Mapbox. The responses by Mapbox and their CEO, Eric Gundersen, are good examples of how a company should respond to such incidents. Kudos to him and the team at Mapbox for addressing and rectifying the situation quickly.

The Gordian Knot
By jmerelo [CC BY 2.0], via Wikimedia Commons
Speculation quickly ran to vandalism of OSM, which is one of the primary data sources used by Mapbox in their products. That speculation was backed up by the edit history in the New York area, but it is interesting to note that the vandalism was caught early in OSM and never came to light in OSM itself. In this case, the crowd worked as it was supposed to.

Because OSM is one of many data sources used by Mapbox, they pull the data at regular intervals and use automated means to perform QA/QC including, presumably, flagging and removal of potentially offensive content. According to them, the automated system correctly flagged the vandalism for review, but there seems to have been a failure in the system that tasked people to review the content. So, the vandalism made it through, leading to an inescapable conclusion: data is hard.
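To make that failure mode concrete, here is a minimal sketch of a "fail-closed" review gate, where a flagged edit stays out of the published data until a person explicitly approves it. This is not Mapbox's pipeline, which I have no inside knowledge of. The names `Edit`, `ReviewGate`, `is_suspicious`, and the `BLOCKLIST` terms are all hypothetical, and a real filter would be far more sophisticated than a word list.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder terms standing in for a real offensive-content filter.
BLOCKLIST = {"badword", "slur"}


@dataclass
class Edit:
    feature_id: int
    name: str


def is_suspicious(edit: Edit) -> bool:
    """Flag an edit if any word in its name appears in the blocklist."""
    return any(word in BLOCKLIST for word in edit.name.lower().split())


@dataclass
class ReviewGate:
    pending: dict = field(default_factory=dict)    # feature_id -> held Edit
    published: list = field(default_factory=list)  # edits cleared for release

    def ingest(self, edit: Edit) -> None:
        """Publish clean edits; hold flagged ones until a human decides."""
        if is_suspicious(edit):
            self.pending[edit.feature_id] = edit
        else:
            self.published.append(edit)

    def review(self, feature_id: int, approved: bool) -> None:
        """Record a human decision; rejected edits are simply dropped."""
        edit = self.pending.pop(feature_id)
        if approved:
            self.published.append(edit)


gate = ReviewGate()
gate.ingest(Edit(1, "Central Park"))
gate.ingest(Edit(2, "Badword City"))
gate.review(2, approved=False)
```

The design choice that matters is the default: if the human review step stalls or is never assigned, a flagged edit stays in `pending` instead of slipping into `published`.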

At my company, we support native OSM tiles as a backdrop in our software products, so we quickly inspected the New York area to ensure we weren’t displaying the offending content. We also do our own data collection for customers and sometimes use OSM as a reference (though OSM content is never used in our own data products). After a thorough look through the data itself, we realized we were clear on all counts.

The reality that data is hard shouldn’t be a surprise. As much as we like to fool ourselves that computers are a purely logical, engineered solution, everything about them (source code, UI/UX design, and, yes, data) reflects the biases and assumptions of the humans who created them. Computers and their behavior are as reflective of us, good and bad, as any poem or painting.

So the fact that what seems to have worked best is a crowd of humans poring over the data also isn’t surprising. We’ve still got some time before human-curated data becomes unnecessary, if that ever happens.

What happened with Mapbox is a wake-up call to everyone who makes downstream use of OSM or any other data source. Every time a data source is forked, every time one is fused with another, every step that is taken farther away from a data source’s native collection and quality control processes becomes a potential insertion point of risk and vulnerability. This risk is real, and it is not unique to open data, but it should not deter anyone from using data such as OSM or from attempting to make data a more accurate representation of our world. So, kudos to the OSM community and to Mapbox for closing ranks quickly to protect a valuable data source. Data is hard and it requires vigilance.