I’m really excited about this episode, where we speak with Kapil Surlaker, the vice president of engineering at LinkedIn. Kapil has been with LinkedIn for over 10 years and has played an instrumental role in shaping the data architecture that LinkedIn is built on top of. In this episode, we cover a wide range of topics surrounding data architecture, including:
How metadata is captured and served up
Future-proofing the data architecture
The shift from on-prem to Azure
How LinkedIn monitors the quality of their data in real-time
Top 3 Value Bombs:
The importance of future-proofing your architecture
Data providers push metadata to a centrally stored location
Building in mechanisms to explain variance in key metrics
Below are the main data technologies discussed throughout the conversation:
Kafka – Open-source distributed event streaming platform. Kafka is used to process the telemetry data generated across LinkedIn.
Espresso – Home-grown document-oriented NoSQL data storage system created about 10 years ago. Espresso serves the majority of traffic on LinkedIn.
Apache Gobblin – Home-grown universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources. LinkedIn has open-sourced this.
Hadoop HDFS – Data lake; as LinkedIn transitions to Azure, it will shift off of HDFS to Azure Data Lake Storage.
Apache Spark – Used for the majority of data transformation and machine learning. For deep learning, TensorFlow is used.
Apache Pinot – Home-grown real-time distributed OLAP datastore that LinkedIn has open-sourced. Pinot is used to support searching on LinkedIn.
Graph DBs – We did not dive into much detail here.
Presto – Distributed SQL query engine used to run queries on top of the HDFS data lake.
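To make the event-streaming pattern behind Kafka concrete, here is a minimal sketch of producing and consuming a telemetry event. The topic name, event fields, and the in-memory stand-in for a broker are all illustrative assumptions, not LinkedIn's actual schema or infrastructure:

```python
import json
import time
from collections import deque

# Minimal in-memory stand-in for a Kafka topic, to illustrate the
# produce/consume pattern without a running broker.
class InMemoryTopic:
    def __init__(self, name):
        self.name = name
        self.log = deque()  # append-only event log, like a Kafka partition

    def produce(self, event):
        self.log.append(json.dumps(event))

    def consume_all(self):
        return [json.loads(e) for e in self.log]

# Hypothetical tracking event; field names are illustrative.
topic = InMemoryTopic("PageViewEvent")
topic.produce({
    "member_id": 12345,
    "page": "/feed",
    "timestamp_ms": int(time.time() * 1000),
})

events = topic.consume_all()
print(len(events), events[0]["page"])  # 1 /feed
```

In a real deployment the producer would write to a broker (e.g., via a Kafka client library) and downstream consumers would read the same log independently at their own pace.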
The below is a very high-level data architecture based on our conversation and the components discussed. There are many more layers and technologies.
Currently, this infrastructure is on-prem in data centers throughout the world, and LinkedIn is in the process of shifting to Azure. During the migration, they must pay closer attention to the performance of the various components, since the current design was optimized for on-prem deployment, which allowed a high degree of flexibility and proximity between components.
How is LinkedIn future-proofing the architecture?
One method: LinkedIn created an abstracted data access layer called DALI (Data Access at LinkedIn), which allows underlying datasets to change without affecting consumers.
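To illustrate the idea behind an abstracted data access layer, here is a minimal sketch, assuming a catalog that maps a logical dataset name to its physical location. The class and dataset names are hypothetical, not DALI's actual API:

```python
# Consumers read by logical name; owners can move or reformat the
# physical data underneath without consumers changing a line of code.
class DatasetCatalog:
    def __init__(self):
        self._views = {}  # logical name -> (physical path, reader fn)

    def register(self, logical_name, physical_path, reader):
        self._views[logical_name] = (physical_path, reader)

    def read(self, logical_name):
        physical_path, reader = self._views[logical_name]
        return reader(physical_path)

catalog = DatasetCatalog()

# v1: the dataset lives in one location/format.
catalog.register("tracking.page_views", "/data/v1/page_views",
                 lambda path: [{"page": "/feed", "source": path}])
rows_v1 = catalog.read("tracking.page_views")

# The owner migrates the dataset; consumers keep calling read() unchanged.
catalog.register("tracking.page_views", "/data/v2/page_views_parquet",
                 lambda path: [{"page": "/feed", "source": path}])
rows_v2 = catalog.read("tracking.page_views")

print(rows_v2[0]["source"])  # /data/v2/page_views_parquet
```

The key design choice is the level of indirection: consumers depend only on the logical name and schema, so storage migrations (like the on-prem-to-Azure shift discussed above) do not ripple out to every downstream job.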
What’s the general architecture or approach to capturing and making metadata available?
LinkedIn open-sourced a homegrown system called DataHub in 2020. DataHub is a generalized metadata search and discovery tool. DataHub provides key information for more than a million datasets at LinkedIn and is accessed by more than 1,500 employees every week. Key features of DataHub include:
All metadata is centrally stored and served
Metadata capture follows a push model: dataset owners/producers send metadata to DataHub through an API or a Kafka stream. Publishing of the metadata is decentralized, and the responsibility falls on the dataset owners.
Consumers are able to identify who owns the data, where a dataset originates, the underlying sources of derived datasets, lineage, and many more attributes.
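The push model described above can be sketched as follows. This is an illustrative simplification, not DataHub's real API: the event shape, dataset names, and method names are assumptions:

```python
# Owners push metadata change events; storage and serving are central,
# so any consumer can ask who owns a dataset or what feeds it.
class MetadataStore:
    def __init__(self):
        self._metadata = {}  # dataset name -> latest metadata document

    def push(self, event):
        # Decentralized publishing: each owner pushes its own events
        # (via API or a Kafka stream in the real system), but the
        # resulting metadata is stored and served in one place.
        doc = self._metadata.setdefault(event["dataset"], {})
        doc.update(event["aspects"])

    def owner(self, dataset):
        return self._metadata[dataset].get("owner")

    def lineage(self, dataset):
        return self._metadata[dataset].get("upstreams", [])

store = MetadataStore()
store.push({"dataset": "metrics.daily_signups",
            "aspects": {"owner": "growth-team",
                        "upstreams": ["tracking.page_views"]}})

print(store.owner("metrics.daily_signups"))    # growth-team
print(store.lineage("metrics.daily_signups"))  # ['tracking.page_views']
```

The trade-off versus a crawl/pull model is that owners must remember to publish, but metadata is fresher and the central system never needs credentials to scrape every source.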
What are your top methods/processes to ensure high-quality data?
LinkedIn is very metric-driven and captures data about its data to identify potential issues and their causes. LinkedIn takes a three-pronged approach to ensure the data being produced is high quality and remains that way:
Data Observability (ability to detect when issues arise) – LinkedIn developed a tool called Data Sentinel, which automatically validates the quality of large-scale data in production environments. It checks that schemas match, data values meet expectations, and arrival times meet expectations.
Root Cause Analysis (ability to understand why the issue occurred) – When changes are found in metrics or datasets, can you explain them? Could it be seasonality, a recent deployment, or something else? LinkedIn developed a tool called ThirdEye that aims to solve that exact question. ThirdEye connects to a large number of data sources, learns over time to detect anomalies, and tries to correlate and identify the root cause. ThirdEye was built on top of Pinot, a distributed OLAP datastore that LinkedIn developed and recently open-sourced.
Prevention – If the root cause identified was a bug, create a fix to mitigate the issue going forward.
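The observability checks described above (schema match, value expectations, arrival times) can be sketched as a simple batch validator. This is an illustrative example in the spirit of such checks, not Data Sentinel's actual rules; the field names and the one-hour lateness threshold are assumptions:

```python
import time

# Expected shape of each row: field name -> expected Python type.
EXPECTED_SCHEMA = {"member_id": int, "page": str, "timestamp_ms": int}

def validate_batch(rows, now_ms, max_lateness_ms=3_600_000):
    """Return a list of (row index, violation message) pairs."""
    violations = []
    for i, row in enumerate(rows):
        # 1) Schema check: all expected fields present with expected types.
        for field, ftype in EXPECTED_SCHEMA.items():
            if field not in row:
                violations.append((i, f"missing field {field}"))
            elif not isinstance(row[field], ftype):
                violations.append((i, f"bad type for {field}"))
        # 2) Value check: member_id must be a positive integer.
        if isinstance(row.get("member_id"), int) and row["member_id"] <= 0:
            violations.append((i, "member_id out of range"))
        # 3) Arrival-time check: the event must not be too stale.
        ts = row.get("timestamp_ms")
        if isinstance(ts, int) and now_ms - ts > max_lateness_ms:
            violations.append((i, "event arrived too late"))
    return violations

now_ms = int(time.time() * 1000)
rows = [
    {"member_id": 42, "page": "/feed", "timestamp_ms": now_ms},   # clean
    {"member_id": -1, "page": "/jobs", "timestamp_ms": now_ms},   # bad id
    {"member_id": 7, "timestamp_ms": now_ms - 7_200_000},         # missing field, late
]
print(validate_batch(rows, now_ms))
```

Running such checks in production as data lands (rather than when a consumer complains) is what turns data quality from a reactive firefight into the detect/explain/prevent loop described above.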
How does LinkedIn enable strong collaboration among the various data stakeholders (producers and consumers)?
One of LinkedIn’s core principles is called #OneLinkedIn, where they “make every decision, big or small, with care.” This is the foundation that has enabled strong collaboration among the various teams.
To facilitate change management of the various data changes across the organization, LinkedIn created a data user council. The council met on a regular cadence, giving IT an opportunity to share changes ahead of time and receive feedback from the various stakeholders, and giving those stakeholders an opportunity to contribute.
What were the biggest lessons learned through the development of your current architecture?
While technology is a key component of their success, it was only made possible by broad executive buy-in, which facilitated strong collaboration between the different LOBs.