From consulting for various Fortune 500 organizations and interviewing data leaders on my podcast, I have come to recognize the foundational principles of modern data architectures. The following list is what I consider the ten most critical concepts and principles to keep in mind when designing a modern data architecture. While data architecture differs depending on business requirements, treat these principles as best practices when designing yours. Below is a quick highlight:
Culture is Everything: Only 24.4% of firms have established a data culture (1). You must have a data strategy in place and treat your data as an asset.
Metadata is King: 58% of companies struggle to understand the quality of their key data sources (2). Having a shared data catalog that everyone contributes to and capturing metadata across the ecosystem is key to taking advantage of modern development principles.
Data Democratization: Empower your business users to understand and access the data; self-service BI enables a shared understanding of that data.
Build for Flexibility – Anticipate Change: Create modular and abstracted architecture; this will allow you to take a best-of-breed approach and enable you to change out the various components based on requirements without affecting downstream or upstream dependencies.
Bulletproof Deployments with DataOps: Avoid designing architectures with only production in mind. Because systems are complicated, how they are deployed is just as important as the tools used, if not more so.
Streamline “Reverse ETL”: Getting data into a data warehouse or lake is only the first step; you must ultimately deliver that data back to the users. Streamline these access patterns.
Consistent Ingestion: 74% of organizations use six or more data integration tools (3), which results in high maintenance costs and lower agility. Standardize your ingestion patterns on as few tools as possible.
Reduce Data Copies and Movement: Regardless of the size or diversity of your data, organizations must be able to quickly understand and act on it.
Automate Data Monitoring: More than 80% of Datafold survey respondents said that they regularly run into data quality issues (4). Good data quality processes should alert you to anomalies in your metrics before your data consumers do.
Prevent Vendor Lock-In: When deciding on technologies for your data architecture, it’s essential to design it so that you can take a best-of-breed approach and swap out technologies in the future if requirements change, costs increase, or better technology emerges.
What are your data architectural patterns? Please let me know in the comments below.
Prefer to listen instead of reading this post? Here is the corresponding podcast…
Culture is Everything
No matter the principles and tools used to create a data architecture, they are all meaningless unless your organization has a strong data culture. This lays the groundwork for implementing the remaining nine guiding principles to modern data architecture.
Becoming data-driven means investing in data scientists, data engineers, data analysts, and also ensuring that the organization collects useful, accurate, and clean data in the right format, AND that your business users have access to and an understanding of how to use it.
It all starts from the top with having key executives responsible for enabling data across the organization and investing the time and resources to do it properly.
A data-driven culture is not a one-time project; it’s an ongoing process that adapts and grows as the organization changes.
The data architecture should be driven by the business, not IT.
You need to have a data strategy that aligns with your business strategy and treat your data as an asset. Check out this podcast episode on managing your data like an asset with Doug Laney and his book, Infonomics: How to Monetize, Manage, and Measure Information for Competitive Advantage, to learn more about monetizing your data.
Metadata is King
Having strong metadata is key in modern data platforms and supports much more than just “documentation.” Not only is it required to support regulations like CCPA and GDPR, but metadata is also central to the following areas:
- Data Quality & Governance: It’s critical to understand the context of the data across the organization: who owns it, where it originated, how sensitive and accessible it is, and, most importantly, how accurate it is. If all of this is captured, you can set up processes and alerts to ensure your data remains high quality. Data governance is vital in unstructured data storage (i.e., do not let your data lake turn into a data swamp). The data catalog should contain all data asset definitions, KPI definitions, data owners, and the source of the data. It should be maintained by all stakeholders of the data asset.
- Data Pipelines: Today’s advanced tools can make it very easy to create data pipelines manually, but this practice can grow into a difficult-to-manage web of transformation logic. Data pipelines should be dynamically created off metadata describing when, how, and where to load data.
- Deployments: CI/CD (Continuous Integration and Continuous Delivery) is not a new concept, but recently its patterns have been shifting from the application space into the data landscape. Strong metadata allows the creation of automatable tests along with environment-specific deployment settings.
- Data Monitoring: Are your KPI metrics shifting, and are users asking why? Is it a bug? Is it accurate? Capturing metadata on the raw data and the metrics can provide a clear picture of whether the change is expected or was introduced through a deployment.
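To make the idea of pipelines created from metadata a little more concrete, here is a minimal sketch. It generates load steps from a small metadata table instead of hand-writing one pipeline per source. The metadata fields (`source`, `target`, `schedule`, `mode`, `key`) and the `build_load_plan` helper are assumptions invented for this example, not the API of any particular tool, and the generated SQL is illustrative rather than tied to a specific warehouse dialect.

```python
from dataclasses import dataclass

# Hypothetical metadata describing when, how, and where to load data.
PIPELINE_METADATA = [
    {"source": "crm.accounts", "target": "raw.accounts",
     "schedule": "hourly", "mode": "incremental", "key": "updated_at"},
    {"source": "erp.orders", "target": "raw.orders",
     "schedule": "daily", "mode": "full"},
]


@dataclass(frozen=True)
class LoadStep:
    source: str
    target: str
    schedule: str
    sql: str


def build_load_plan(metadata):
    """Turn metadata rows into load steps; no per-table code is written by hand."""
    plan = []
    for entry in metadata:
        if entry["mode"] == "incremental":
            # Only pull rows newer than the last successful load
            # (:last_loaded would be bound at run time).
            sql = (
                f"INSERT INTO {entry['target']} "
                f"SELECT * FROM {entry['source']} "
                f"WHERE {entry['key']} > :last_loaded"
            )
        else:
            sql = (
                f"CREATE OR REPLACE TABLE {entry['target']} AS "
                f"SELECT * FROM {entry['source']}"
            )
        plan.append(LoadStep(entry["source"], entry["target"], entry["schedule"], sql))
    return plan


plan = build_load_plan(PIPELINE_METADATA)
for step in plan:
    print(step.schedule, "->", step.sql)
```

Under this approach, onboarding a new source means adding one metadata row rather than writing another pipeline.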
Data Democratization
We all know data is necessary, but why? At the end of the day, data empowers employees to be more efficient and make decisions based on evidence, which allows them to pivot more quickly than their competitors. Data is a shared asset, and everyone across the organization should have a common understanding of the various data assets.
Do your metrics have the same meaning across departments? For example, do monthly sales across business units mean the same? They should! All departments and data stakeholders need to share a common vocabulary. Before you can start to build out advanced analytics, you first need to build a strong foundation.
Regardless of your role or technical ability, you should have access to the data you need to perform your role within your organization. A data engineer should not be the gatekeeper or bottleneck to view organization data.
Eliminating regional, business unit, and departmental data silos is one of the more effective ways to ensure that organizational stakeholders have access to the data they need to drive insights and receive a 360-degree view of the business.
Check out this podcast with Tarush Aggarwal, the founder of 5x, where we discuss setting up a self-service BI tool and its importance.
Organizations that empower their employees to feel confident about data and know where to go will outperform their competitors every single time.
Build For Flexibility – Anticipate Change
Create an architecture that is modular and abstracted; this will allow you to take a best-of-breed approach and enable you to change out the various components based on requirements without affecting downstream or upstream dependencies. Consider the following points when designing with flexibility in mind:
Design solutions to be cloud-agnostic. Even if you plan to stay with a single CSP, don’t back yourself into a corner where the cost of switching is high.
Design solutions that can be easily deployed, for example, building solutions on top of Kubernetes and Docker.
Utilize APIs to standardize data access and create an abstraction layer, which will allow you to change out the underlying technologies without impacting its consumers.
Create data pipelines driven by metadata. For example, if a source column is added, it will automatically be ingested.
The machine learning and artificial intelligence space is evolving at an incredible rate. The patterns you deploy today may be obsolete in a year. Plan to switch components out in the future to avoid problems down the road.
Big data is relative – we do not know what the future holds.
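The point above about using APIs as an abstraction layer can be sketched in a few lines. In this illustration, consumers depend only on a small interface, so the storage technology behind it can be swapped without touching them. The `SalesStore` interface, both store classes, and the sample figures are hypothetical and stand in for whatever warehouse or lake technology you actually run.

```python
from typing import Protocol


class SalesStore(Protocol):
    """Hypothetical contract that consumers depend on."""

    def monthly_sales(self, month: str) -> float: ...


class WarehouseStore:
    """Stand-in for a warehouse-backed implementation."""

    def __init__(self, rows):
        self._rows = rows  # list of (month, amount) tuples

    def monthly_sales(self, month: str) -> float:
        return sum(amount for m, amount in self._rows if m == month)


class LakeStore:
    """Stand-in for a lake-backed implementation with a different internal layout."""

    def __init__(self, by_month):
        self._by_month = by_month  # dict of month -> list of amounts

    def monthly_sales(self, month: str) -> float:
        return sum(self._by_month.get(month, []))


def sales_report(store: SalesStore, month: str) -> str:
    # The consumer only knows the interface, not the technology behind it.
    return f"{month}: {store.monthly_sales(month):.2f}"


warehouse = WarehouseStore([("2021-06", 100.0), ("2021-06", 50.0), ("2021-07", 75.0)])
lake = LakeStore({"2021-06": [100.0, 50.0], "2021-07": [75.0]})

print(sales_report(warehouse, "2021-06"))
print(sales_report(lake, "2021-06"))
```

Both calls produce the same report, which is the property you want when replacing a component: downstream code does not change.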
Bulletproof Deployments With DataOps
Having a great data architecture is not enough; you also need strong DataOps.
DataOps is an application of the Agile methodology and DevOps principles to deliver high-quality data products in a short period of time.
DataOps principles help data teams deliver, deploy, and make decisions faster.
DevOps is now standard for application teams because it lets them release code changes quickly and safely. These same principles are now being applied to data deployments.
The methodology and philosophy of DataOps are just as important as the tools themselves.
Check out this podcast with Chris Bergh, the CEO of DataKitchen, a DataOps platform, where we discuss all things DataOps and how to get started with it. If you’re new to DataOps, I recommend starting with this free e-book from DataKitchen.
Streamline “Reverse ETL”
Reverse ETL is not a new concept, but it is a “new” term. Companies have been creating reverse ETL solutions for as long as the data warehouse has existed, typically by building custom APIs to connect it to target systems.
Reverse ETL allows companies to synchronize the data warehouse or data lake with third-party apps such as Salesforce, HubSpot, Marketo, Zendesk, and many more.
Typically, reverse ETL tools come with prebuilt API integrations, simplify maintenance, and empower business users by bringing the data they need to the applications they use. However, even if you do not use a “reverse ETL” tool, the critical concept is that you recognize the need to bring powerful insights from your DW (data warehouse) or data lake back into business applications in a streamlined fashion.
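As a rough sketch of what a reverse ETL sync does, the example below compares rows from a warehouse query with what the target application already holds and pushes only new or changed records. `FakeCrmClient` is a hypothetical in-memory stand-in, not the SDK of Salesforce, HubSpot, or any other real product, and the customer data is invented for illustration.

```python
class FakeCrmClient:
    """Hypothetical stand-in for a third-party app API (not a real SDK)."""

    def __init__(self, records=None):
        self.records = dict(records or {})
        self.calls = 0

    def upsert(self, record_id, fields):
        self.calls += 1
        self.records[record_id] = dict(fields)


def sync_to_app(warehouse_rows, client, key="customer_id"):
    """Push only rows that are new or changed compared with the target app."""
    synced = []
    for row in warehouse_rows:
        record_id = row[key]
        fields = {k: v for k, v in row.items() if k != key}
        # A real sync would usually track state itself rather than read it
        # back from the target; the fake client keeps this example short.
        if client.records.get(record_id) != fields:
            client.upsert(record_id, fields)
            synced.append(record_id)
    return synced


# Rows as they might come out of a warehouse query (illustrative data).
warehouse_rows = [
    {"customer_id": "c1", "lifetime_value": 1200, "churn_risk": "low"},
    {"customer_id": "c2", "lifetime_value": 300, "churn_risk": "high"},
]
crm = FakeCrmClient({"c1": {"lifetime_value": 1200, "churn_risk": "low"}})

changed = sync_to_app(warehouse_rows, crm)
print(changed)
```

Sending only the differences keeps API usage low, which matters because most third-party apps rate-limit their endpoints.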
Check out this podcast with Tejas Manohar, Co-Founder of Hightouch, a leading Reverse ETL platform that syncs data from your warehouse or lake back into tools your business teams rely on.
Consistent Data Ingestion
Almost three-quarters of organizations (74%) use six or more data integration tools (3), which makes it very difficult for them to be nimble and to quickly ingest, integrate, analyze, and share their data or incorporate new data sources.
Data and analytics leaders evaluating data integration tools should concentrate their efforts here. They must prioritize tools that orchestrate multiple modes of data delivery and are not biased toward a single style. As the number of source data formats continues to grow, try to develop ingestion patterns that can consume more than one format.
These days, integration tools are plentiful, and having more than a few ingestion tools adds unnecessary overhead and skill gaps.
Identify which ingestion patterns you need to support (e.g., real-time/streaming, API, batch, CDC, FTP) and then work with tools that cover most of those patterns and your data formats.
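One way to picture a standardized ingestion pattern is a single entry point that accepts more than one data format and always emits records in the same shape. The sketch below handles CSV and JSON Lines using only the standard library. The `READERS` registry, the `ingest` function, and the `_source` field are assumptions made for this example.

```python
import csv
import io
import json


def read_csv(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_jsonl(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Registry of supported formats; adding a format means adding one reader here.
READERS = {"csv": read_csv, "jsonl": read_jsonl}


def ingest(text, fmt, source):
    """Single ingestion entry point: parse, then tag each record with its source."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return [{**record, "_source": source} for record in reader(text)]


csv_batch = "id,amount\n1,10.5\n2,20.0\n"
jsonl_batch = '{"id": "3", "amount": "7.25"}\n'

records = (
    ingest(csv_batch, "csv", "erp_export")
    + ingest(jsonl_batch, "jsonl", "orders_api")
)
print(records)
```

Everything downstream of `ingest` sees one record shape regardless of how the data arrived, so supporting a new format does not mean adopting another tool.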
Check out this podcast with Raghu Murthy, founder of Datacoral, where we discuss all things data ingestion and why, if you’re migrating relational data from a source that supports CDC, you should use CDC to migrate it in the majority of use cases.
Reduce Data Copies and Movement
No matter what the industry is calling your data storage these days (e.g., data lake, lakehouse, data river), you should always try to reduce data copies and movement.
Simplify your data platform.
Not all data use cases require you to copy data, and doing so only creates more costs and overhead. Copying data can quickly turn into a game of trying to find the source of truth if it’s not well documented.
Two Methods to Reduce your Data Movement:
Many of the leading technologies (Snowflake, Azure Synapse, BigQuery) can run queries across all of your data and join data from your DW to data stored in your lake. You no longer need to copy data from one system to another.
Depending on the need, data virtualization can eliminate the need to copy all of your raw data to a central location by providing a logical data layer that integrates with your sources and allows consumers to join and query across the various data sources.
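The idea behind both methods, querying data where it lives instead of copying it first, can be shown at small scale. The sketch below uses SQLite purely as a stand-in: two separate database files play the roles of a warehouse and a lake, and `ATTACH DATABASE` lets one query join across them in place. Real platforms expose this through their own external-table or federated-query features, which work differently; the table names and data are invented for illustration.

```python
import os
import sqlite3
import tempfile


def cross_store_totals(warehouse_path, lake_path):
    """Join a 'warehouse' table to a 'lake' table in place, without copying either."""
    conn = sqlite3.connect(warehouse_path)
    try:
        # ATTACH exposes the second database under the alias "lake".
        conn.execute("ATTACH DATABASE ? AS lake", (lake_path,))
        return conn.execute(
            """
            SELECT c.name, SUM(e.amount)
            FROM customers AS c
            JOIN lake.events AS e ON e.customer_id = c.id
            GROUP BY c.name
            ORDER BY c.name
            """
        ).fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    warehouse_path = os.path.join(tmp, "warehouse.db")
    lake_path = os.path.join(tmp, "lake.db")

    # Curated dimension data lives in the "warehouse".
    wh = sqlite3.connect(warehouse_path)
    wh.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    wh.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
    wh.commit()
    wh.close()

    # Raw events live in the "lake".
    lk = sqlite3.connect(lake_path)
    lk.execute("CREATE TABLE events (customer_id INTEGER, amount REAL)")
    lk.executemany("INSERT INTO events VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])
    lk.commit()
    lk.close()

    totals = cross_store_totals(warehouse_path, lake_path)

print(totals)
```

Neither table was duplicated to answer the question, so there is still exactly one source of truth for each dataset.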
“Organizations do not have a data volume problem but rather data readiness problem.” – H.O. Maycotte, founder of Molecula. In this episode, we discuss creating a real-time feature store to support AI/ML needs.
Automate Data Monitoring
More than 80% of respondents in a survey from Datafold said that they regularly run into data quality issues (4). Yet, 45% of organizations are not using any data governance/quality tools.
We’ve all heard horror stories of presenting BI metrics to leadership and being asked why sales have decreased by 28%. Is it accurate? Did source data change? Was a bug introduced in a code change? How confident are you in that metric?
When executives question the metrics, you won’t have to worry if you have the right automated tests in place.
There are numerous automated tests that can be configured to support data products. Below are two main categories of automated tests:
Unit Tests: these verify that the actual code and logic work per requirements.
Metric Monitoring: this identifies whether your metrics or data are changing in unexpected ways, possibly caused by code changes or source data. Large deviations can trigger an alert.
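Both categories can be sketched briefly. The example below pairs a unit test on a small piece of transformation logic with a simple metric monitor that flags a value when it falls too many standard deviations from recent history. The `net_sales` and `check_metric` functions, the threshold of three standard deviations, and the sample numbers are assumptions chosen for illustration; dedicated monitoring tools typically use more robust methods.

```python
from statistics import mean, stdev


def net_sales(gross, refunds):
    """Example transformation logic covered by the unit test below."""
    return gross - refunds


def check_metric(history, latest, max_z=3.0):
    """Return True when `latest` is more than `max_z` standard deviations from history."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != center
    return abs(latest - center) / spread > max_z


# Unit test: the logic itself behaves per requirements.
assert net_sales(100.0, 20.0) == 80.0

# Metric monitoring: compare today's value with recent history (illustrative numbers).
daily_sales = [100.0, 102.0, 98.0, 101.0, 99.0]
print(check_metric(daily_sales, 100.5))  # close to the recent average
print(check_metric(daily_sales, 72.0))   # a 28% drop from the recent average
```

A check like this would run after each load or deployment, so an unexpected drop is flagged before it reaches a leadership dashboard.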
Monitoring your data is critical to achieving high availability and supporting disaster recovery.
To learn more about increasing the quality and reliability of your data check out this podcast episode with Lior Gavish, the co-founder of Monte Carlo.
Prevent Vendor Lock-In
Over the past few years, a plethora of new data tools and technologies has been introduced to the market. They typically fall into three broad categories: open source, cloud service providers (AWS, Azure, Google, etc.), or third-party vendors that run their tools on the CSPs.
When deciding on technologies for your data architecture, it’s important to design it in a way that allows you to create a best-of-breed approach where you can swap out technologies in the future if requirements change, costs increase, or if better technology is created.
You want to avoid technologies that “lock you in.”
Vendor lock-in occurs when it’s difficult for an organization to switch away from a vendor because of the cost, the time required, or a skill gap.
Many large enterprises follow a multi-cloud strategy and limit their use of any particular CSP’s products. This means they can take advantage of the pricing, features, and infrastructure components of each provider.
For example, if you’re using a business intelligence tool to contain all your transformation logic, it becomes that much more difficult to switch to another BI vendor. The alternative is to pull only transformed/cleansed data into the BI tool and minimize the calculations done on the data there.
Do you agree with these principles when designing a data architecture? If you have made it this far, I would love to hear your thoughts in the comments section below. I will personally respond to each comment. What are your design principles?
- New Vantage Partners. (2021, January 1). Big Data and AI Executive Survey 2021. https://www.newvantage.com/thoughtleadership.
- Dataversity. (2020). The 2020 State of Data Governance and Automation. https://content.dataversity.net/rs/656-WMW-918/images/DATAVERSITY_erwin_State_of_Data_Governance_2020_Final_012420.pdf.
- Talend. (2021, June 1). What is a data fabric? https://www.talend.com/resources/what-is-data-fabric/.
- Datafold. (2021). The state of data quality in 2021. https://www.datafold.com/blog/the-state-of-data-quality-in-2021.