Listen on iTunes  |   Listen on Spotify  | more directories coming soon 🙂

Episode Summary:

In today’s episode, we speak with Jesse Anderson to learn how to run successful big data projects and how to resource your teams. Jesse is a big data expert at Big Data Institute who has worked with companies ranging from startups to the Fortune 100. He has taught over 30,000 people the skills to become data engineers and has been published in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.

Top 3 Value Bombs:

  1. It’s not easy to retool your traditional DW team to support big data technologies
  2. As a general ratio, you should have 2-5 data engineers for every data scientist
  3. It’s important to have three properly staffed data teams: operations, data engineering, and data science

You’ll Learn:

One reason why so many organizations fail when shifting from a traditional BI/DW to more of a modern data architecture:

  • When shifting from a traditional data architecture to a more modern one, many organizations fail to recognize that it requires a different set of skills. If you just tell your DW team to go build out a data lake, they will most likely not have the deep understanding and programming skills required to build a strong architecture. Jesse summed it up nicely with “Retooling not just of the new technology, but it’s a retooling of the people.”

In Jesse’s book Data Teams, one of the core principles is that there should be three main teams to support data assets:

  1. Operations: Responsible for ensuring the infrastructure is running smoothly. The size of the team varies with the complexity of the technology and the impact a disruption would have.
  2. Data Engineering: Responsible for building out the data pipelines that ingest and transform data to make it available across the organization. As a rule of thumb, you may have 2-5 data engineers for every one data scientist.
  3. Data Science: Responsible for building out models and AI solutions. The size of this team follows the same rule of thumb: one data scientist for every 2-5 data engineers. Google has a paper called “Hidden Technical Debt in Machine Learning Systems” that highlights the potential debt you take on if data scientists are building the data engineering pipelines.
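To make the 2-5 to 1 rule of thumb concrete, here is a minimal Python sketch that turns a data scientist headcount into a suggested range of data engineers. The function name `data_engineer_range` and its parameters are hypothetical, made up for illustration; only the 2-5 ratio comes from the episode, and it is a general guideline rather than a formula.

```python
def data_engineer_range(data_scientists: int, low: int = 2, high: int = 5) -> tuple[int, int]:
    """Return the (minimum, maximum) number of data engineers suggested
    by the 2-5 data engineers per data scientist rule of thumb."""
    if data_scientists < 0:
        raise ValueError("data_scientists must be non-negative")
    # Scale both ends of the ratio by the data scientist headcount.
    return data_scientists * low, data_scientists * high


# Example: a team with 3 data scientists.
minimum, maximum = data_engineer_range(3)
print(f"Suggested data engineers: {minimum} to {maximum}")
```

Where a team lands inside that range will depend on how complex the pipelines are, so treat the output as a starting point for the staffing conversation.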

These teams each have their own strengths and purpose, and you really need all three to be successful. You want to avoid having your data scientists perform engineering work and vice versa. It’s not where their strengths lie, and most likely not work they enjoy, so it will lower job satisfaction. Jesse has created a great visual that shows the skills required across the three teams and where they overlap; you can view it through the Technology Tree link on his blog.

Three guiding principles when building out a data architecture:

  • Keep it simple; don’t try to make things more complicated than they need to be.
  • Generally speaking, though it depends on the business scenario, centralize your data in a data lake.
  • Understand the intent of the data products on the market by reading their original papers. What is their bread and butter? Many vendors will oversell their products and the technologies bolted on to them.

Where do you see data architectures heading over the next 2-5 years?

  • An increase in the programming required to build out the infrastructure, but also more desire to be able to use SQL throughout the data landscape.