Gleb Mezhanskiy is the CEO & Co-founder of Datafold – a data observability platform that helps companies reduce their data quality incidents and grow through effective use of their analytical data. Before this, he was a founding member of data teams at Lyft and Autodesk, where he built sophisticated data platforms and developed tooling to improve productivity and data quality.
Top 3 Value Bombs:
- The foundation of any data observability platform is the data catalog.
- Data observability becomes increasingly difficult the more data sets you have.
- Do not surprise your report consumers. With the right data observability process and regression testing, you can know how your metrics will change in prod before you deploy.
Why Do Many Organizations Struggle to Have Good Data Quality?
Gleb explains that there is no single solution because data quality is such a broad topic. Each company has its own infrastructure, tools, protocols, and processes. Each team also structures its data dashboards differently and has its own way of shipping and testing its work.
Because so many things can go wrong, it is tough to pinpoint the source of the error behind a given data quality problem.
Gleb adds, “when we’re working with customers that reach unicorn size, [with a] data team of maybe 10 to 30 people, they [are] already operating at the scale of tens of thousands of data sets. And so that complexity is impossible to manage and observe without dedicated tools.”
This is why dedicated tools matter to companies that handle lots of data: managing tens of thousands of data sets by hand is not feasible. This is where Datafold comes in; it helps companies reduce data quality incidents and grow by using their analytical data effectively.
What Is Data Observability?
For people who are new to the data observability space, Gleb explains what it’s all about:
“. . . the way I define it for myself and for our company is it’s really about understanding the data and processes and infrastructure around it.”
It is also about data lineage, which is:
“Understanding the sources and also the downstream [consumption] of data, and also things related to operational aspects, such as how often the data is updated and what are the specific processes, jobs, teams, and owners that work on these data sets, [so] more like SLA and operational monitoring.”
In summary, Gleb stresses the importance of understanding the data and the processes that happen around it. How can you use this data to build out your data dashboards and make better business decisions? What is the meaning of the data you collect? Is this data valuable to your teams? How can you present this data to your stakeholders?
All these questions are important to answer because data is information. Information is far more abundant than it used to be, so the real differentiator is knowing which data to ignore.
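The lineage idea Gleb describes, knowing what sits downstream of a data set, can be illustrated with a small graph traversal. This is a minimal sketch and not how Datafold implements lineage; the `downstream` function and the table names are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Dict, List


def downstream(lineage: Dict[str, List[str]], start: str) -> List[str]:
    """Return every data set that depends, directly or indirectly, on `start`.

    `lineage` maps a data set name to the data sets built directly from it.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(start)  # guard against a cycle leading back to the start
    return sorted(seen)


# Hypothetical lineage: raw events feed a sessions table, which feeds a
# daily-active-users table and an engagement dashboard.
lineage = {
    "raw.events": ["analytics.sessions"],
    "analytics.sessions": ["analytics.daily_active_users", "dashboards.engagement"],
    "analytics.daily_active_users": ["dashboards.engagement"],
}

for name in downstream(lineage, "raw.events"):
    print(name)
```

With a map like this, a team can see which tables and dashboards might be affected before touching an upstream source.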
Where Does Datafold Fit Into My Data Stack? Why Should I Use It?
Gleb explains that “. . . the mission of [Datafold] is, one, to enhance the understanding of data. So, to enhance [data] observability, we do this by providing [a fingertips reach on] what data you have and how it works, [and] how it’s distributed.”
Gleb adds, “The second part is more related to data quality. We help automate several workflows related to data testing. More specifically, we help companies implement proactive data testing within their data development workflow.”
Data engineers often must build, maintain, and observe data that is important to running a business. The pipelines behind this data can contain hundreds, if not thousands, of lines of SQL code. When dealing with code this complex, it is easy to make a change that negatively affects the business. These changes can be minor, such as changing the definition of what it means for a user to be in session.
The impact of these small changes can damage a company’s reputation or operations, and the damage may be irreversible. To guard against this, Gleb says Datafold “helps every data developer understand the impact of the changes that they’re making through [lineage and] also through a tool that we call data diff.”
Gleb adds that “The goal is also to show the impact on your metrics and the distribution side by side, so that you can answer questions, whether your metric on the dashboard will be impacted and how.”
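To make the idea of a data diff concrete, here is a minimal sketch that compares a production version of a table with the version produced by a code change. It is not Datafold's implementation; the function names (`data_diff`, `metric_totals`), the `user_id` key, and the sample rows are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def data_diff(prod_rows: List[Row], dev_rows: List[Row], key: str) -> Dict[str, Any]:
    """Compare two versions of a table row by row, matching on a primary key."""
    prod = {row[key]: row for row in prod_rows}
    dev = {row[key]: row for row in dev_rows}

    added = sorted(dev.keys() - prod.keys())    # keys only in the new version
    removed = sorted(prod.keys() - dev.keys())  # keys only in production

    changed = {}
    for k in sorted(prod.keys() & dev.keys()):
        columns = prod[k].keys() | dev[k].keys()
        differing = sorted(c for c in columns if prod[k].get(c) != dev[k].get(c))
        if differing:
            changed[k] = differing

    return {"added": added, "removed": removed, "changed": changed}


def metric_totals(prod_rows: List[Row], dev_rows: List[Row], column: str) -> Tuple[Any, Any]:
    """Return the (before, after) sum of a metric column, side by side."""
    before = sum(row[column] for row in prod_rows)
    after = sum(row[column] for row in dev_rows)
    return before, after


# Hypothetical sample: session counts per user before and after a change
# to the definition of a session.
prod = [
    {"user_id": 1, "sessions": 3},
    {"user_id": 2, "sessions": 5},
    {"user_id": 3, "sessions": 1},
]
dev = [
    {"user_id": 1, "sessions": 3},
    {"user_id": 2, "sessions": 7},
    {"user_id": 4, "sessions": 2},
]

print(data_diff(prod, dev, key="user_id"))
print(metric_totals(prod, dev, "sessions"))
```

Running a comparison like this before deployment shows which rows would appear, disappear, or change, and how a headline metric would shift. A production tool would run the comparison inside the warehouse rather than in memory.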
If You Were Tasked to Design A Data Architecture for A Fortune 500 Company, What Would It Look Like At A High Level?
Gleb answers that “one of the most important things [is] not necessarily about the technology, but about the approach to how to choose a technology. And I think one of the biggest principles that I would try to use is to try to outsource as much as possible while building a data platform to vendors. Because unless you see a certain piece of software as a key differentiator to your business, then you’re much better off solving that with money with an outsource vendor than doing it in-house.”
Gleb adds, “I think [what’s also] important is scalability.” This is because as a business grows, so do its data sources and their complexity. The more data you have, the harder it is to observe and keep organized. While having a dedicated data infrastructure built for your company is excellent, it is also essential to make sure that infrastructure can scale.
Favorite Data Book or Resources You Would Recommend To Listeners
Gleb suggests How To Measure Anything by Douglas Hubbard. Gleb adds, “It is about why we use data and how to think about the cost of doing analytics versus the value of doing analytics for companies.”
Gleb also recommends “97 Things Every Data Engineer Should Know,” to which he and many other data practitioners contributed. Gleb also suggests learning from other people’s mistakes and stories and figuring out what you can take away from them. This can be done by reading essays, blogs, newsletters, and more.