Corey Sunwold
3 min read · Nov 22, 2021

If we agree that domain owners should have complete ownership of their data, or in other words that Data Mesh is a worthwhile goal, then our experience with DevSecOps shows us the potential obstacles we will face in this transition. It is no silver bullet, though.

> We had concepts like A/B tests to serve web requests, compare and harden our call paths. Not sure what the equivalent on data is.

Data quality is, as far as I'm concerned, an unsolved problem. A/B testing and automated service diffing (see Twitter's Diffy for an example of what I mean here) offer objective, measurable tests of your services to guarantee quality. Most engineers who spend their time building web services probably take this power for granted.

With these service measurement tools, we can make a change and objectively tell what the impact was. If I make a change to my data, how do I do the same?
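To make the service-side version of this concrete, here is a minimal sketch of the Diffy idea in Python. The "services" are plain functions and every name in it is invented for illustration, but it shows the shape of the feedback loop: send the same request to the current and candidate versions and report any field that differs.

```python
# Minimal sketch of Diffy-style service diffing.
# In practice the two "services" would be HTTP calls to a production
# instance and a candidate instance; here they are plain functions, and
# all names (current_service, candidate_service, payloads) are made up.

def current_service(request: dict) -> dict:
    return {"user_id": request["user_id"], "plan": "free", "credits": 10}

def candidate_service(request: dict) -> dict:
    # The change under test: renames a field.
    return {"user_id": request["user_id"], "plan": "free", "credit_balance": 10}

def diff_responses(old: dict, new: dict) -> dict:
    """Return the fields that differ between two responses."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

if __name__ == "__main__":
    for request in [{"user_id": 1}, {"user_id": 2}]:
        delta = diff_responses(current_service(request), candidate_service(request))
        if delta:
            print(f"request {request} differs: {delta}")
```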

First, there is the problem of knowing who is using my data. This is data lineage. Sometimes I think that solving data lineage is one of the foundational pillars of solving data quality, but then I look back at my experience with web services and remember that no one has really solved the problem of identifying and tracing service dependency graphs. Yes, software exists to solve this problem, but more often than not, making a major breaking change to a service requires some amount of "Scream Testing", in which you temporarily roll out a big breaking change and listen to see who complains. Fortune 500 companies work this way today.

Second, there is the problem of measuring the impact of a change to the data. Even assuming I know who is using my data, and which services, applications, or reports depend on it, how do I measure the impact of a change? For ML models that are infrequently re-trained, it might be weeks or months before I find out something went wrong. For infrequently viewed reports, it's easy to imagine a really terrible scenario where someone in finance is preparing a report for an upcoming board meeting, only to find out that the last quarter's worth of data in their report is broken.
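One partial answer, sketched below, is to do for data roughly what Diffy does for services: compute a few coarse summary statistics on the old and new versions of a table and flag whatever moved. The table and column names here are invented and real checks would be far richer, but the shape of the feedback loop is the same.

```python
# A sketch of measuring the impact of a data change: compare simple
# summary statistics of a table before and after the change.
# The data and column names are invented for illustration.
import pandas as pd

before = pd.DataFrame({"account_id": [1, 2, 3, 4], "balance": [10.0, 20.0, None, 40.0]})
after = pd.DataFrame({"account_id": [1, 2, 3], "balance": [10.0, 200.0, 30.0]})

def profile(df: pd.DataFrame) -> dict:
    """A few coarse signals that often catch breaking changes."""
    return {
        "row_count": len(df),
        "null_rate_balance": df["balance"].isna().mean(),
        "mean_balance": df["balance"].mean(),
    }

old_stats, new_stats = profile(before), profile(after)
for metric in old_stats:
    if old_stats[metric] != new_stats[metric]:
        print(f"{metric}: {old_stats[metric]} -> {new_stats[metric]}")
```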

One imperfect solution I've seen is to treat data as a versioned API. For example: this is the User Account table, V1. If I want to make a breaking change, I can create V2. If, and only if, I have a data lineage solution, I can contact whoever depends on V1 and migrate them to V2 so that I can deprecate V1. This approach has its own cost and performance considerations, as high-volume data may not be easily or efficiently copied to multiple versions.
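Here is a rough sketch of what that could look like, assuming a hypothetical lineage registry that records which consumers read which version of a table. The table and consumer names are made up; the point is only that deprecating V1 becomes a mechanical check rather than a scream test.

```python
# Sketch of "data as a versioned API" backed by a hypothetical lineage
# registry. Every breaking change creates a new versioned table, and V1
# can only be retired once no consumer still reads it. Names are made up.
VERSIONED_TABLES = {
    "user_account.v1": {"consumers": {"billing_report", "churn_model"}},
    "user_account.v2": {"consumers": set()},
}

def can_deprecate(table: str) -> bool:
    """A version can be dropped only when it has no remaining readers."""
    return not VERSIONED_TABLES[table]["consumers"]

def migrate(consumer: str, old: str, new: str) -> None:
    VERSIONED_TABLES[old]["consumers"].discard(consumer)
    VERSIONED_TABLES[new]["consumers"].add(consumer)

migrate("billing_report", "user_account.v1", "user_account.v2")
print(can_deprecate("user_account.v1"))  # False: churn_model still reads V1
```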

I watched a talk recently on the future of Hive Metastore: https://www.youtube.com/watch?v=7_Pt1g2x-XE. In it, Ryan Blue, creator of Apache Iceberg, mentioned the importance of allowing people to easily and safely make mistakes with data. When you think about it, that's what an A/B test does. It allows you to take a risk, measure the impact, and either move forward or roll back depending on what you learn. I think that is at the heart of what is missing with data: we lack effective ways to make cheap mistakes. Versioned data may help, depending on how it's implemented, but more likely it won't solve this on its own.

The advantage of DevSecOps was that it shortened feedback loops. Security was no longer a roundabout problem that required a game of telephone to solve; ownership was pushed to the people with the most control over the changes, and they got things done faster. The same goes for operations. Deployments move faster, outages are hopefully shorter, and software gets better, faster.

How do we bring this to data? We can't tolerate mistakes with data today because the cost of fixing them is so high and the time to learn that a mistake has been made is so long. To improve data quality, to unlock more potential from data, and to ease the transition to Data Mesh, we need tools and processes that let people make mistakes with data cheaply, get feedback on those mistakes quickly, and then move forward or backward as easily as dialing an A/B test up or down.
