Learning Data Engineering in 2023
What core skills do you need to succeed as a Data Engineer? What is the state of data engineering today?
In 2023 the demand for data engineers is showing no sign of slowing down. Here we’ll be going over the core skills a data engineer should have and the concepts that have survived the years, with some excellent resources to look at. First let’s take a step back and understand the state of data engineering today - what has changed and what has remained - from the early days of data practices to the modern data stack.
The State of Data Engineering
Data has always been part of human life; what has changed is how we store and transform it.
Before the invention of relational databases in the 1970s, the world relied on punch cards and other older technology to store data. With the tech boom of the early 2000s, the term Big Data was coined by Roger Mougalas, and Hadoop emerged (and is still used today) to handle large datasets. Big data processing proved useful for giving organisations key insights into their customers for marketing purposes, increasing conversion rates and boosting customer engagement. It also benefited other industries such as healthcare, where disease risk factors could be identified. Hadoop can be summarised as follows:
Open source tool for storing data and running applications on clusters of hardware
Based on the distributed computing concepts of MapReduce for parallel processing and HDFS for distributed storage
Scalable infrastructure that is built with fault tolerance, reliability and cloud capabilities
It’s useful to understand the foundations of Hadoop even if it won’t be part of your tech stack. It teaches some of the fundamentals of computer science and brings to light the design of the distributed compute technologies that are prevalent today.
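To make the MapReduce idea concrete, here is a minimal sketch in plain Python: a toy word count that mimics the map, shuffle and reduce phases in a single process. This is an illustration of the concept only, not Hadoop’s actual API, and the text chunks are made up.

```python
from collections import defaultdict

# Toy illustration of the MapReduce idea: word count over text "chunks".
# In Hadoop, chunks would live on HDFS and map/reduce tasks would run on
# different nodes; here everything runs in one process.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_phase(chunk):
    """Emit (word, 1) pairs for every word in a chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))  # {'the': 3, 'quick': 2, ...}
```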
Data engineering is still an infant
It’s already clear that in 2023 data engineering is still a young field trying to define itself. We’ve seen companies like Airbyte and Snowflake grow tremendously, some of which were in their infancy just a few years ago. The hype around how far cloud-based offerings and other frameworks have come can sometimes mask the young age of the field.
New frameworks like Mage have launched to battle with beasts like Airflow, and we’ve also heard a lot about Rust and how it might be the next de facto language for data engineering after Python. Only time will tell if that happens, although I would expect Rust to be adopted alongside Python, since the data engineering ecosystem is predominantly Python.
A welcome trend resulting from this increased complexity is the greater adoption of software development principles and processes. These are key to creating solid foundations as data teams grow, and ease the pain of dealing with multiple pipelines, products and programming languages.
Data governance, observability and quality are still a challenge, but it’s getting better
With all of this comes a greater emphasis on data quality checking, and companies are taking multiple approaches: some apply governance on the data warehouse itself, others at the point of ingestion. Instead of repeatedly addressing data quality issues on the fly, new data observability tools allow for automated monitoring and greater insight into the issues that cause downtime.
Open is better for innovation
When it comes to other data tools, we’ve seen the trend of open source being the answer. Companies are starting many open source projects so that innovation can be fostered and the community can contribute. This is a key piece of the success of the Modern Data Stack, as we see new tools like DuckDB for fast analytical querying and Metabase for visualising your data.
What do data engineers need to know?
The kind of work a data engineer does can vary hugely from company to company, depending on the tech stack, size and business. Some may argue that data engineering is closer to software engineering than data science, as it focuses on infrastructure and building tools and services. These data engineers might be fluent in languages such as Java or Scala, and companies like Airbnb and Spotify are known to hire with this skillset in mind.
Others will say that data engineering is the younger (but now stronger) sibling of data science, carrying the weight of all its issues, with a focus on building data products for analytics.
I started my career in data analytics when being a data scientist was the coolest job, and at some point I became a recovering data scientist.
Lately we have also seen the rise of other roles such as the analytics engineer, which might still be considered a data engineering role but focuses on the analytical side of things, working mostly with languages and tools such as SQL and dbt, and with data modelling.
To be a well-rounded data engineer, here are the key concepts that will give you a great foundation to work from, whichever kind of data engineer you want to be:
Data Modelling - Understanding how data can be stored in a data warehouse is a must. You should know about Kimball dimensional modelling versus the Inmon approach, and what normalisation is. Read about third normal form and what efficient table design looks like for your data, such as indexing and primary keys (there’s a small star-schema sketch after this list)
Databases - Know the difference between an OLAP and an OLTP database, and why you would use a data lake. I highly recommend reading Fundamentals of Data Engineering to understand databases in more detail
Batch Processing - ETL vs ELT. Batch processing is the core of data engineering, moving data from a source to a destination. Airflow, Pandas and dbt are examples of tools used (see the pipeline sketch after this list)
Stream Processing - For real-time processing of events with technologies like Apache Flink, Spark and Kafka (a small consumer sketch follows this list)
Data Formats - Knowing the difference between text-based formats like CSV and JSON and columnar formats like Parquet (compared in a sketch after this list). Designing Data-Intensive Applications goes into great detail on all types of storage formats and is a good read.
Architecture - A high-level understanding of the platforms, frameworks and tools at your disposal, so you can make better architectural decisions and save you and your team time.
Distributed Compute - As mentioned at the beginning, familiarise yourself with distributed computing principles such as partitioning, as you may already be using tools like Spark, BigQuery or Snowflake, which share the same technological foundations.
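As a small illustration of dimensional modelling, here is a sketch using Python’s built-in sqlite3 as a stand-in for a data warehouse: one dimension table, one fact table, a primary key and an index. The table and column names are made up for illustration; a real design depends on your data.

```python
import sqlite3

# Minimal star-schema sketch: one dimension table and one fact table.
# Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    country     TEXT
);

CREATE TABLE fact_orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    order_date  TEXT,
    amount      REAL
);

-- Index the foreign key that analytical queries will join and filter on.
CREATE INDEX idx_fact_orders_customer ON fact_orders(customer_id);
""")
conn.close()
```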
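For batch processing, a minimal pipeline sketch with Pandas might look like the following. The file names, columns and transformation are assumptions for illustration, and SQLite stands in for a warehouse.

```python
import sqlite3

import pandas as pd

# Extract: read a source file (the path and columns are assumptions).
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Transform: a simple cleaning and aggregation step.
orders["amount"] = orders["amount"].fillna(0)
daily_revenue = (
    orders.groupby(orders["order_date"].dt.date)["amount"]
    .sum()
    .reset_index(name="revenue")
)

# Load: write to a destination, here a local SQLite database standing in
# for a data warehouse.
with sqlite3.connect("warehouse.db") as conn:
    daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```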
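For stream processing, here is a small consumer sketch using the kafka-python client. It assumes a Kafka broker running locally and a topic called "events"; both are assumptions for illustration.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Sketch of a streaming consumer. Assumes a local Kafka broker and a topic
# named "events"; both are illustrative assumptions.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # Process each event as it arrives, e.g. filter, enrich or forward it.
    print(event)
```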
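To compare a text-based format with a columnar one in practice, this sketch writes the same (made-up) DataFrame to CSV and Parquet; the Parquet calls assume pyarrow or fastparquet is installed.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "user_id": range(1_000),
    "country": ["GB", "DE", "FR", "US"] * 250,
    "amount": [19.99] * 1_000,
})

# Row-oriented, text-based format: human readable, but larger and slower to scan.
df.to_csv("events.csv", index=False)

# Columnar format: compressed, typed and efficient for analytical queries
# that only touch a few columns. Requires pyarrow or fastparquet.
df.to_parquet("events.parquet", index=False)

# Reading back only one column is where columnar storage shines.
amounts = pd.read_parquet("events.parquet", columns=["amount"])
print(amounts.head())
```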
SQL and Python are still your bread and butter, but there’s more to learn for a solid foundation
There will always be a set of skills that are foundational in data engineering, enabling you to query data or write basic data pipelines. The following skills will make you a more versatile data engineer and add some breadth to your knowledge base:
SQL - The common denominator across all data disciplines; a data engineer is expected to have a solid grasp of aggregating data, joining data, using window functions and writing common table expressions (see the sketch after this list).
Python - You don’t need to master all of Python to create data pipelines, but you’ll need to understand the basics of data types, especially working with dates if you’re going to be working with time series data. Know how to write loops, if statements, functions and classes, and how to read and extract data. Understand how to transform data with Pandas and NumPy, and explore other key aspects like logging and working with JSON (a small date-handling example follows this list).
Linux Bash - Being able to navigate and get things done on the command line will not only impress others but make your life easier. Many other tools are accessible via the command line, and things like ssh-ing into virtual machines, generating certificates, editing files with vim, zipping files or installing software will be a breeze.
Docker & Docker Compose - Containers are essential for making packages shareable between environments; they speed up development and remove complex dependency issues. For example, I used Docker to create a multi-service deployment with Airflow, Postgres and Metabase in a project I wrote on personal finances
Documentation - Docstrings in your Python code are the first step to writing good documentation, and they can be partly automated with extensions like autoDocstring if you use Visual Studio Code (a short example follows this list). Producing written documentation on your workflow makes repeating steps in other environments easier and gives your work some context. Automated documentation tools like mkdocs are extremely useful for generating API reference documentation. Here is an example which I used for an API wrapper I wrote in Python.
DevOps - Knowing how to use Git within your team is important for collaboration. Familiarising yourself with tools like Terraform can help you understand how to set up services using code and automate your infrastructure setup.
Testing - Catching bugs before deploying your code to a production environment and reducing complexity in your codebase. Unit testing is one of those skills to learn early on to set yourself up well, as it’s often left behind (see the pytest sketch after this list).
Cloud - It’s a no-brainer that cloud skills will help in 2023, as many tools are cloud based, and understanding the common concepts between the major cloud providers will give you enough to begin. For example, using S3 for data storage (sketched after this list) or setting up a Postgres database. I would recommend going for one basic cloud certification such as AWS Cloud Practitioner.
Orchestration - It’s likely that you’ll encounter some sort of orchestration tool like Airflow, Prefect or Dagster (a minimal DAG sketch follows this list). Being familiar with cron is a good first step to automating scripts locally. I found that setting up Airflow was a good learning experience, and I’m looking to explore newer tools like Mage in 2023 to see how they compare.
Data Quality - Implementing data quality checks throughout the stages of your data pipeline will prevent downstream data issues (a simple example follows this list). Great Expectations is a popular open source tool of choice which integrates well with other tools and is highly customisable.
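Here is a small illustration of a common table expression and a window function, run against an in-memory SQLite database (which supports window functions from version 3.25). The table and data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
    ('alice', '2023-01-01', 10.0),
    ('alice', '2023-01-05', 25.0),
    ('bob',   '2023-01-02', 40.0);
""")

# A CTE plus a window function: rank each customer's orders by recency.
query = """
WITH ranked AS (
    SELECT
        customer,
        order_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer ORDER BY order_date DESC
        ) AS recency_rank
    FROM orders
)
SELECT customer, order_date, amount
FROM ranked
WHERE recency_rank = 1;  -- latest order per customer
"""
for row in conn.execute(query):
    print(row)
conn.close()
```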
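On the Python side, this is the kind of date handling and Pandas transformation you will do constantly; the data is made up for illustration.

```python
import pandas as pd

# Made-up time series data for illustration.
df = pd.DataFrame({
    "timestamp": ["2023-01-01 09:30", "2023-01-01 17:45", "2023-01-02 08:15"],
    "value": [10, 20, 15],
})

# Parse strings into proper datetimes before doing anything time-based.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["date"] = df["timestamp"].dt.date

# A simple daily aggregation with Pandas.
daily = df.groupby("date")["value"].sum().reset_index()
print(daily)
```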
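A docstring can be as simple as the sketch below (Google style, with a made-up function and Python 3.9+ type hints); tools like mkdocs can then turn it into reference documentation.

```python
from datetime import datetime, timedelta

def filter_recent(records: list[dict], days: int = 7) -> list[dict]:
    """Return only the records created in the last `days` days.

    Args:
        records: Rows containing an ISO-formatted "created_at" field.
        days: Size of the lookback window in days.

    Returns:
        The subset of `records` created within the window.
    """
    cutoff = datetime.now() - timedelta(days=days)
    return [r for r in records if datetime.fromisoformat(r["created_at"]) >= cutoff]
```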
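A minimal unit testing sketch with pytest, covering a small made-up transformation function, might look like this.

```python
# test_transform.py -- run with `pytest` (illustrative example)
import pytest

def normalise_country(code: str) -> str:
    """Upper-case and strip a country code, e.g. ' gb ' -> 'GB'."""
    cleaned = code.strip().upper()
    if not cleaned:
        raise ValueError("country code must not be empty")
    return cleaned

def test_normalise_country_strips_and_uppercases():
    assert normalise_country(" gb ") == "GB"

def test_normalise_country_rejects_empty_string():
    with pytest.raises(ValueError):
        normalise_country("")
```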
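Uploading a file to S3 with boto3 looks roughly like the sketch below; the bucket name and paths are assumptions, and it presumes your AWS credentials are already configured.

```python
import boto3  # pip install boto3

# Assumes AWS credentials are configured (e.g. via `aws configure`).
# The bucket name and paths are made up for illustration.
s3 = boto3.client("s3")

s3.upload_file(
    Filename="daily_revenue.parquet",
    Bucket="my-example-data-bucket",
    Key="analytics/daily_revenue/2023-01-01.parquet",
)
```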
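A minimal Airflow DAG sketch, assuming Airflow 2.x; the task itself is a made-up placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    # Placeholder for a real extract/load step.
    print("running daily pipeline")

# A single-task DAG scheduled to run once a day.
with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_and_load,
    )
```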
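As a plain-Pandas stand-in for the kind of checks a tool like Great Expectations formalises (this is not its API), the sketch below validates a made-up orders DataFrame before loading it.

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality failures for an orders DataFrame."""
    failures = []
    if df["order_id"].isnull().any():
        failures.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")
    return failures

orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 3.0]})
problems = check_orders(orders)
if problems:
    # In a pipeline you would fail the run (or quarantine the bad rows) here.
    print("Data quality checks failed:", problems)
```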
Some data engineers will endeavour to specialise in certain topics or cloud services depending on their interests or the company they work for, which involves a stronger focus on some of the core skills above. Being able to choose whether you want to become an expert at building infrastructure or at creating data products is the great thing about learning the fundamentals, as you can then refine certain skills.
Closing Thoughts
Despite the flurry of new tools and frameworks we’ve seen in the past couple of years, the fundamentals remain the same and are not going anywhere for now. We can expect to see continued interest in things like data contracts, as data quality and governance continue to be hot topics.
We’ll also continue to see companies push for streaming, along with the development of tools like Quix to facilitate it. The competition between the different orchestration tools will continue, as newer tools like Mage aim to integrate other aspects of a data pipeline such as data quality, potentially making the architecture of a pipeline simpler.