Balancing Organizational Controls and Technical Controls in Data

Technical Controls – The security controls (i.e., safeguards or countermeasures) for an information system that are primarily implemented and executed by the information system through mechanisms contained in the hardware, software, or firmware components of the system.

Organizational Controls – The security controls (i.e., safeguards or countermeasures) for an information system that primarily are implemented and executed by people (as opposed to systems).

NIST-800

The definitions above come from the glossary of the NIST-800 series of cybersecurity publications. While they are focused on cybersecurity, the broader concepts – automated controls versus manual controls – are applicable elsewhere. Over the last couple of weeks, and especially since I attended the TUgis conference, I have been thinking about these concepts in terms of data in general and schema in particular.

I find schema to be an interesting concept. The term “schema” has a fairly wide-ranging definition, but a good general one is “an underlying organizational pattern or structure; conceptual framework.”

This, along with the definitions of technical and organizational controls, is important because, in the realm of data and databases, “schema” has become conflated with a specific set of technical controls most often associated with relational database management systems (RDBMS). As a result, non-relational, document-based, “NoSQL” databases, data lakes, and similar implementations are often referred to as “schemaless.” This is, of course, not possible, because schema is a conceptual framework that brings order to the data environment so that we humans can understand the objects and information represented by the data. If we have a conceptual framework, we have a schema.

IBM offers a good discussion of schema, breaking it down into conceptual, logical, and physical schema. This comes directly from database theory and is not original to IBM. When a platform is described as “schemaless,” it usually refers to physical schema as typically implemented in an RDBMS – tables, indices, views, columns, data types, primary and foreign keys, triggers, and other physical structures implemented to enforce data quality and referential integrity and generally make the physical schema self-sustaining.
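To make the distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The `stores` table and its fields are hypothetical examples, not any particular county’s data model; the point is only the contrast between constraints the database enforces itself and a free-form document where no such constraints exist.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical schema: the database itself enforces types, keys, and constraints.
conn.execute("""
    CREATE TABLE stores (
        store_id   INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        store_type TEXT NOT NULL CHECK (store_type IN ('grocery', 'convenience')),
        zip_code   TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO stores (name, store_type, zip_code) VALUES (?, ?, ?)",
    ("Corner Market", "grocery", "20740"),
)

# "Schemaless": the same record as a document. Nothing stops a missing field
# or an unexpected value; any rules must live outside the database.
doc = {"name": "Corner Market", "store_type": "bodega"}  # no zip_code, new type
conn.execute("CREATE TABLE store_docs (doc TEXT)")
conn.execute("INSERT INTO store_docs (doc) VALUES (?)", (json.dumps(doc),))
```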

Physical schema can be thought of as technical controls for data. It is the collection of safeguards and constraints implemented and executed by the system through mechanisms contained in the database.

Conceptual schema and logical schema are the organizational controls that define the rules and framework for the data environment. They are the safeguards and constraints implemented and executed by the people who own, use, and manage the database.

The balance between “schema” and “schemaless” is essentially the level of technical control implemented by the database to enforce the conceptual and logical schemata. Modern RDBMSs have implemented document-like features that enable minimal physical schema definition. One can also enforce most data rules within application code, so the concept is largely independent of any particular database platform.
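As a rough illustration of that last point, here is a sketch of the same kinds of rules enforced in application code rather than by the database. The `validate_store` function and its rule set are hypothetical, standing in for whatever rules a team agrees on – an organizational control that happens to execute in code.

```python
# Hypothetical application-level rules standing in for constraints the
# database is no longer enforcing.
ALLOWED_STORE_TYPES = {"grocery", "convenience", "bodega"}
REQUIRED_FIELDS = {"name", "store_type", "zip_code"}

def validate_store(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS - record.keys():
        problems.append(f"missing required field: {field}")
    if record.get("store_type") not in ALLOWED_STORE_TYPES:
        problems.append(f"unexpected store_type: {record.get('store_type')!r}")
    return problems

# The rule set is the same either way; only where it executes changes.
print(validate_store({"name": "Corner Market", "store_type": "bodega"}))
# ['missing required field: zip_code']
```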

Why does any of this matter? Because when non-technical users hear “schemaless,” they don’t usually understand that it refers only to physical schema. Believing themselves freed from schema of any kind, they can create a data lake or some other structure that cannot be queried or analyzed efficiently. An overarching conceptual framework is still key to modeling information in a way that makes it useful. In my experience, the conceptual and logical schemata are always more important than the physical schema.

I started thinking about this again after seeing a presentation about the efforts of Prince George’s County, Maryland, to import its data into OpenStreetMap (OSM). The county has a well-defined physical schema in its data, but it took several passes to get the data prepped for import into OSM. OSM uses a key-value-pair (KVP) approach to its data structure and generally has fewer technical controls (though it does have them) than a typical RDBMS.

This makes sense for OSM as it is a worldwide geographic data set and enforcement of a rigid physical schema would most likely cause it to lose cultural richness. Think in terms of store/bodega/mahala, for example. All regions have the concept of a store, but each region applies a different set of characteristics to its variety of store. OSM’s approach allows that flavor to be preserved, even if the data may need some cleaning for any single use case. I think OSM mostly gets it right.
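A small sketch of what that KVP approach looks like, with two hypothetical elements: the tag choices below are illustrative rather than a statement of current OSM tagging conventions, but the shape – each element is just a bag of tags – is the point.

```python
# OSM-style key-value pairs: every element is a bag of tags, so each region
# can describe its own flavor of "store" without a fixed column list.
# (Tag values here are illustrative, not authoritative OSM conventions.)
corner_store_us = {
    "shop": "convenience",
    "name": "Corner Market",
    "opening_hours": "Mo-Su 07:00-22:00",
}

bodega_nyc = {
    "shop": "convenience",
    "name": "La Esquina Deli",
    "name:es": "La Esquina",
    "cuisine": "sandwich",   # a tag the first element never needed
}

# A consumer applies its own (organizational) rules when it reads the data.
for element in (corner_store_us, bodega_nyc):
    label = element.get("name", "(unnamed)")
    kind = element.get("shop", "unknown")
    print(f"{label}: {kind}")
```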

The experience of Prince George’s County with OSM is a great example of organizational controls. OSM is a fairly loose organization. Its approach is mostly convention over configuration or, as one would say today, “read the room”: look at how we do things and emulate that. It took the county a few tries to get it right, but it eventually got its data into OSM and now has a pretty good process for pushing updates.

To me, the experience of Prince George’s County drove home that all data has schema. Data is always a model of the real world, and models always have a conceptual framework, or schema. The real balance isn’t between “schema” and “schemaless.” It is between the right mix of organizational and technical controls needed to keep data organized and useful for the humans who are trying to derive value from it.