Why Do Organizations Struggle To Have Good Data Quality?
Large organizations often use many tools to get their data from source to destination, and Josh highlights this as the root cause of poor data quality. Josh emphasizes that:
“There’s just a huge number of tools that are being used to do all the processing of data, and ultimately deliver it to the end location. Being able to tie together all the needed information across this wide variety of tooling is a challenge and a pain point.”
Josh also points out that this fragmentation of tools creates unnecessary complexity across all areas of operation, such as tracking down issues within a company’s data infrastructure. Many companies have hundreds or even thousands of data sources flowing through different levels of processing in many different data pipelines. Combing through all this data to find and fix an error can be time-consuming and sometimes very complex.
Josh highlights that “another reason why this is such a challenge for a lot of organizations is because each company is working in such a unique problem set when it comes to the analytics that they’re producing.”
For example, two marketing companies can share the same goal, yet handle their marketing data sources in entirely different ways. Josh states that “being able to come up with a generic approach to cover all the different data quality use cases and how different teams are looking at [their] problem and their own unique way [is definitely a big challenge] . . .”
What Should We Be Observing About Our Data?
When observing our data, we should be looking at both the data itself and the pipelines that move it. It is important to separate the two: while they work together, they serve different purposes.
Josh says that “on the data level, we’re concerned about what’s happening in the actual flow of data.” This means observing the successful pull of data, the completeness of a data set, how many records are being pulled at a particular time relative to history, and so on.
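The data-level checks Josh describes — record counts relative to history and data-set completeness — can be sketched in a few lines. This is a minimal illustration, not a tool from the interview; the thresholds, field names, and sample counts are assumptions.

```python
from statistics import mean, stdev

def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's record count if it deviates sharply from recent history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

def completeness(records: list[dict], required: list[str]) -> float:
    """Fraction of records that have every required field populated."""
    ok = sum(1 for r in records
             if all(r.get(f) not in (None, "") for f in required))
    return ok / len(records) if records else 0.0

# Illustrative use: record counts pulled per day over the last week.
history = [1020, 980, 1005, 995, 1010, 990, 1000]
print(volume_anomaly(history, today=450))    # a sharp drop relative to history
print(completeness([{"id": 1, "email": "a@x.com"},
                    {"id": 2, "email": None}], ["id", "email"]))
```

In practice these checks would run against a warehouse query rather than in-memory lists, but the idea is the same: compare what arrived against what history says should arrive.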
The second component we should be observing is the data pipelines. This means observing information such as “the statuses of the pipeline . . . did it succeed, or did it fail to run? [This is] important because failures on the pipeline are obviously going to be very associated with delays or misses of pulling data sets.”
Observing one can also surface issues in the other. For example:
“if you see that data’s incomplete within a data table in a warehouse downstream, being able to easily see that [it] was caused because pipelines failed upstream somewhere, that can help expedite root cause analysis. So, we really care about statuses.”
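The root-cause analysis Josh describes — tracing incomplete downstream data back to an upstream pipeline failure — amounts to walking a dependency graph and collecting failed runs. A minimal sketch, assuming a simple DAG of runs (the class and job names here are illustrative, not from any specific tool):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineRun:
    name: str
    status: str                                   # "success" or "failed"
    upstream: list["PipelineRun"] = field(default_factory=list)

def failed_upstream(run: PipelineRun) -> list[str]:
    """Walk upstream dependencies and collect any failed runs, to help
    explain why a downstream table is incomplete."""
    failures = []
    for dep in run.upstream:
        if dep.status == "failed":
            failures.append(dep.name)
        failures.extend(failed_upstream(dep))
    return failures

# Illustrative DAG: a warehouse-load job fed by an extract job that failed.
extract = PipelineRun("extract_orders", "failed")
load = PipelineRun("load_warehouse", "success", upstream=[extract])
print(failed_upstream(load))  # ['extract_orders']
```

Orchestrators like Airflow or Dagster already expose run statuses and dependencies; the point of the sketch is only that pipeline statuses plus lineage are what make this kind of triage fast.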
What Patterns from Organizations Contribute to High Data Quality and Observability?
Josh states that “the biggest pattern that [he sees] . . . [is] just having a really bad inciting incident that [motivates] the team to get something right.” For example:
“. . . an executive [goes] into a board meeting, [talking] about some marketing statistics [or] some business progress, and then [realizes] after the meeting that the data was completely off and needs to go back and [apologize] to some really important stakeholders that the data was wrong.”
Problems will inevitably occur in any business; the difference between a successful business and one destined to fail is how it handles them. When incidents like these occur, the business should focus not only on fixing the problem but also on learning from it.
According to Josh, another pattern that successful organizations have is: “. . . a clear SLA [or] service level agreements on just what data quality means . . .”
“When an organization can break that down and work with their stakeholders to define what quality means for them and what are the priorities, because you can’t focus on everything. You can have some level of contract [whether it’s] informal or formal that says we’re going to abide by these criteria [and that] we’re accountable to [it] . . . I think [is the] key ingredient for success today.”
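Such a contract can be made concrete by writing the agreed criteria down as explicit, checkable thresholds. The criteria and numbers below are invented examples of what an SLA might contain, not anything Josh specified:

```python
# Illustrative SLA: each criterion and threshold is an assumption for the sketch.
sla = {
    "freshness_hours": 24,      # data must be no older than 24 hours
    "completeness_min": 0.99,   # at least 99% of required fields populated
    "row_count_min": 1000,      # expect at least 1,000 records per load
}

def check_sla(metrics: dict, sla: dict) -> dict:
    """Return pass/fail for each SLA criterion given measured metrics."""
    return {
        "freshness_hours": metrics["age_hours"] <= sla["freshness_hours"],
        "completeness_min": metrics["completeness"] >= sla["completeness_min"],
        "row_count_min": metrics["row_count"] >= sla["row_count_min"],
    }

print(check_sla({"age_hours": 6, "completeness": 0.995, "row_count": 1200}, sla))
```

Even an informal version of this — a shared document listing the criteria — gives stakeholders and the data team the same definition of “quality” to be accountable to.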
What Steps Should a Newcomer Take?
If you are just starting out and don’t have all your data capabilities set up, Josh emphasizes setting aside the technology and “sit[ting] with your stakeholders and understand[ing] where the priorities are.” This is because many teams tunnel-vision on the “technologies and cool capabilities” and how they can benefit their company or client. Josh emphasizes:
“[that] we can get caught up in all the different capabilities of these systems [and the] danger there is you risk missing what’s most important to the business. . . [this causes you to waste] a lot of time on projects that may not have any significant impact.”
The best way to understand the priorities of your project or stakeholders is to hold a meeting and prepare questions beforehand. The most common question to ask is “Where is the biggest pain coming from?” This simple question can surface what a business is struggling with, and your team can build further questions on top of it.
Alternatively, Josh suggests that:
“you may not even need to really sit with them, but just sit with your team for a second and ask, where are we wasting most of our time? Is it from stray queries that are being run on a warehouse? Is it from a table transformation that’s clogging things up all the time or creating a lot of failures, or is it because there’s one or two REST API data sources where the source provider is just constantly changing schema? . . .”
Once these priorities are set and the pain points discussed, you can start focusing on “different approaches, different techniques, [and] different tools that you may use to help you.”
Recommended Learning Resources
Josh recommends the book “The Phoenix Project”. He explains, “it’s about a DevOps organization that is facing a giant backlog of requests and tasks from a big enterprise that they’re working in, and how [this] team breaks down big problems into smaller problems and starts developing a [cleaner] pipeline of pulling tasks through to completion. It describes different [CI/CD] processes that go on within that group [and] I think it is a good resource for data teams because I think this is a lot of what data teams are missing today: good practices around ops.”