More often than not in my current role, opportunities to get my hands dirty come from the data side of our operation rather than the engineering side. These days, the data side involves corporate data rather than a lot of geospatial data. If I were to be guided by my own personal inertia, I’d drift toward traditional geospatial data 99% of the time, but working with other data stores and building pipelines involving them is good exposure.
Most recently, I’ve been working a lot with Salesforce and other data sources to support customer success operations. Customer success, as a discipline, is relatively new, having grown out of the SaaS market and the best practices for nurturing and growing customers post-sale as part of the land-and-expand model.
SaaS typically begets SaaS – meaning that you won’t often find SaaS companies using on-prem versions of business tools. This presents interesting challenges for data integration and analytics. Fifteen years ago, there’d most likely be a server room with installs of various vertical systems that had been configured to use whatever the organization’s blessed database platform was. In the event that an individual system didn’t support that database, there might be some nominal ETL performing a one-way sync so that the necessary charts and graphs could be made as needed.
In the SaaS world, everything is hosted by someone else, subject to terms of service, SLAs, and the like. That’s generally a good thing in terms of keeping systems up to date, but can present regulatory challenges in the current environment of personal data breaches. Trade-offs…
Because the back end is hidden, SaaS platforms often offer a dizzying array of official integrations with other SaaS platforms. The combinations and permutations of integrations to wire up subscription management and CRM and provisioning and marketing tech and ERP are nearly infinite. Shielded wires connecting black boxes.
That’s great until you want to look at the data from these various systems side-by-side, in one place. You can opt for a fragile daisy chain of API integrations or you can take some control over the process. I chose to do the latter.
Without getting into the details of our entire tech stack, Salesforce figures prominently for us, as it does for many other organizations. I haven’t been able to find any reliable numbers but, as my colleague Coleman has observed, Salesforce has quietly become a fairly large repository of data about people, their behaviors, and their interactions with products.
Extended by marketing technologies (“martech”) such as Marketo or Pendo, that behavior can be characterized well before a person fills out an inquiry form and identifies themselves. There’s no personal information at that stage, so it’s pretty much “a person spent a lot of time reading about this product feature.” Later, when that person fills out a form, that data is linked up so the sales team already has a picture of what specific features may be of interest.
But back to data integration. This data tends to be stovepiped between SaaS CRM, ERP, martech, and subscription systems. Integrations could push most of that data into any one of them, but they all typically have limited capability for analytics outside their core functions. The ERP can do financial projections all day long, but don’t ask it to report on the average time to first-time-value for a customer. All of this data needs to go someplace more suited to analytics.
Hosted data warehouses are all the rage lately, having supplanted “data lakes” as the repo-du-jour for business analytics. Data lakes never really sat well with me. (I recently heard them referred to as “data swamps.”) The idea of just taking data as it comes, regardless of schema or format, and putting it all in one place is seductively attractive, but throwing a bunch of raw ingredients into a bucket doesn’t make a meal.
Data warehouses, such as BigQuery, Snowflake, or Redshift (and many others), seem to strike a good balance between the rigidity of the RDBMS, the flexibility of NoSQL, and the abject chaos of the data lake. They are not completely schema-free, so there’s a little bit of design involved which requires some prior understanding of data content (gasp!), but they can be forgiving. If the incoming data stream starts to include a new column, they will take it for the most part. Hosted warehouses rely on the dynamic availability of compute to abstract away those pesky indices and such. The more complex the query or the bigger the data set, the more compute is allocated to process the query. This is great for those who don’t want to have to think about such things, but it can get pricey, so buyer beware.
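To make that “new column” point concrete, here’s a minimal sketch of a load job using the google-cloud-bigquery Python client. The dataset, table, and file names are placeholders I made up; the part that matters is the ALLOW_FIELD_ADDITION option, which is what lets an append bring along a column the table hasn’t seen before.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and file; the interesting bit is schema_update_options.
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

with open("events.json", "rb") as source:
    load_job = client.load_table_from_file(
        source, "analytics.events", job_config=job_config
    )
load_job.result()  # waits for completion; any new columns get added to the table schema
```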
I’ve been spending a lot of time in BigQuery, the data warehouse we chose to use. Its “standard SQL” interface is quite comfortable, being reasonably compatible with PGSQL. As such, I felt at home pretty quickly. I’ve already built some views and some functions and, since it is a Google platform, it hooks up easily to Google Data Studio for data visualization. To be very clear, you can have my PostgreSQL when you pry it from my cold, dead fingers, but BigQuery has been a comfortable stopover.
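I can’t share our actual views, but to give a flavor of the kind of thing I mean, here’s a toy sketch of a view for the time-to-first-value metric mentioned earlier, issued through the same Python client. The table names and columns (sfdc.accounts, product.usage) are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables and metric definition -- just a flavor of what a
# BigQuery standard SQL view looks like, not one of our production views.
client.query(
    """
    CREATE OR REPLACE VIEW analytics.time_to_first_value AS
    SELECT
      a.account_id,
      DATE_DIFF(MIN(u.first_login), a.close_date, DAY) AS days_to_first_value
    FROM sfdc.accounts AS a
    JOIN product.usage AS u USING (account_id)
    GROUP BY a.account_id, a.close_date
    """
).result()
```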
I haven’t done anything with BigQuery GIS yet, but a tool from the GIS world has figured prominently in our nascent data pipeline – Safe Software’s FME. FME started out as “spatial ETL” and became the GIS Swiss Army knife for many people but, to my way of thinking, it graduated from the GIS world a while ago. It is now a world-class, low-code data integration platform that matches up well with SSIS, MuleSoft, TIBCO, Oracle, and many others. Its native support for numerous spatial formats and deep understanding of spatial types and concepts such as coordinate systems puts it ahead of most other tools in the category, in my opinion.
It turns out that FME natively reads from Salesforce and natively writes to BigQuery. This was a huge advantage for us because we’ve been using FME for a few years in another context, so being able to capitalize on our existing knowledge base to support business operations gave us quite a jumpstart.
Of course, there are other vertical systems we need to integrate and FME doesn’t have a native reader for all of them. It is flexible enough, however, to connect to arbitrary HTTPS+JSON APIs, process the data, and ultimately write to BigQuery. If you were to browse the documentation on FME transformers, there’s a good chance we’re currently making use of most of the ones with names that start with “JSON.”
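In our pipeline, FME’s HTTPCaller and JSON transformers do this work without any code, but if you were to write the equivalent by hand in Python, it would boil down to something like the sketch below. The endpoint and payload shape here are entirely made up.

```python
import requests

# Hypothetical endpoint and response structure -- just the shape of the work
# that FME's GUI transformers handle for us.
resp = requests.get(
    "https://api.example.com/v1/subscriptions",
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()

rows = []
for item in resp.json()["results"]:
    # Flatten nested JSON into the tabular records a warehouse expects.
    rows.append({
        "subscription_id": item["id"],
        "account_id": item["account"]["id"],
        "plan": item["plan"]["name"],
        "mrr": item["billing"]["mrr"],
    })
```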
So far, I’ve integrated two systems into production – one is very well-defined with a native reader (Salesforce) and the other involves using an API that returns CSVs. In total, I’ve written seven lines of Python code, with FME’s GUI tools tackling the rest. We’ll finish integrating the arbitrary JSON API in the next two weeks or so, but I don’t expect to add any custom code to the mix.
The only real hiccup I have had so far is in interacting with the CSV API. FME smartly pre-fetches a bit of each response to attempt to infer data types from the CSV output, and it sometimes gets these inferences wrong. For example, an integer ID may be inferred as an 8-bit integer (uint8), which will not be sufficient for later rows. This can cause FME to write null values for rows that don’t fit in a uint8. I solved this by making the input feature types editable and sizing them correctly. This can be somewhat risky, but that risk can be mitigated by the aforementioned understanding of data content. If it were me, I would make FME default to larger data types and let the user tailor them down as desired, but that’s essentially a stylistic difference and I understand their choice.
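FME’s fix lives in its GUI, but the same idea expressed in Python/pandas terms (with a hypothetical CSV) is simply to declare column types up front rather than trusting whatever gets inferred from a sample of the file.

```python
import pandas as pd

# Hypothetical CSV pulled from the API. Declaring dtypes explicitly is the
# pandas analogue of editing FME's feature types: no sample-based inference,
# so an ID that grows beyond the sampled range can't end up null or truncated.
df = pd.read_csv(
    "subscription_export.csv",
    dtype={"id": "int64", "amount": "float64"},
)
```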
The Salesforce (SFDC) end of the process has actually been fairly interesting to me. This is because SFDC is highly customizable, enabling an organization to tailor objects to house the content that’s right for them, while the objects themselves are part of a standard sales workflow involving leads, opportunities, accounts, bookings, quotes, contacts, stages, and other concepts that are generally well understood in the sales world. In this way, SFDC is a semi-generalized tool for a specific vertical. Whereas a typical RDBMS is a green field, SFDC has laid the pavement for the road but lets you decide how many lanes you want. The prevalence of SFDC has created a bit of a lingua franca in sales tech, which offers interesting insights into how reasonably well-designed tech can support operations not merely at the organizational level, but at the market or industry level. SFDC is far from a perfect platform, but I have grown to respect what it does.
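We read SFDC through FME’s native reader rather than code, but if you’ve never poked at that object model, a quick SOQL query (here via the third-party simple_salesforce library, with placeholder credentials) shows how standardized it is.

```python
from simple_salesforce import Salesforce  # third-party client; not part of our FME pipeline

# Placeholder credentials.
sf = Salesforce(
    username="user@example.com",
    password="********",
    security_token="********",
)

# Opportunity, Account, and StageName are standard SFDC objects/fields; an
# org's custom fields show up right alongside them with a __c suffix.
results = sf.query(
    "SELECT Id, Name, StageName, Amount, Account.Name "
    "FROM Opportunity WHERE IsClosed = false"
)
for record in results["records"]:
    print(record["Name"], record["StageName"], record["Amount"])
```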
GIS has always been something of its own vertical. Diving more deeply into tech focused on a different vertical has been educational and also somewhat refreshing. Additionally, seeing the growth of a tried-and-true geospatial tool like FME into a technology that can be applied across multiple domains has also been encouraging.