Data Engineering Design Principles You Should Follow
Design principles to follow that will take your data engineering skills to the next level.
Software engineering is far more mature than the current state of data engineering, particularly when it comes to principles. For example, in software engineering, SOLID principles are a set of five design principles that help in writing maintainable and scalable code. These are:
Single Responsibility Principle: A class should have only one job, and any changes to that job should only affect the class.
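Open-Closed Principle: Software entities should be open for extension but closed for modification, so new behaviour can be added without changing the existing code.
Liskov Substitution Principle: Programs should be able to use any subtype in place of its base type without knowing it and without breaking.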
Interface Segregation Principle: Interfaces should only contain methods that are needed for their specific use case.
Dependency Inversion Principle: High-level modules should not depend on low-level modules; both should depend on abstractions.
What do these really mean and how do they relate to data engineering? If you’re curious about how they can be applied to software written in Python, I recommend the following article on how to design robust, maintainable, and extensible code in Python.
These principles guide software engineers to write maintainable, testable, and extendable code. From functional to object-oriented programming, software engineering has many design principles that help software developers be more efficient.
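To make two of these principles concrete, here is a minimal sketch in Python of Single Responsibility and Dependency Inversion working together. The names RowSink, InMemorySink and publish_active_users are hypothetical and chosen only for illustration; they are not taken from any library.

```python
from typing import Dict, Iterable, List, Protocol


class RowSink(Protocol):
    """Abstraction that the high-level code depends on (hypothetical name)."""

    def write(self, rows: Iterable[Dict]) -> int:
        ...


class InMemorySink:
    """Low-level detail with a single job: keep written rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def write(self, rows: Iterable[Dict]) -> int:
        batch = list(rows)
        self.rows.extend(batch)
        return len(batch)


def publish_active_users(users: Iterable[Dict], sink: RowSink) -> int:
    """High-level policy: filters users and writes them to any RowSink."""
    active = (user for user in users if user.get("active"))
    return sink.write(active)
```

Because publish_active_users only knows about the RowSink abstraction, the in-memory sink could be swapped for a database or file writer without touching the filtering logic.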
As data engineering has evolved in the past decade, many data engineers have come from different backgrounds. Some will have transitioned from software engineering to follow the hype, bringing some of its principles with them. Others may have come from a data analysis or data science background (potentially due to being tired of prevalent data issues and wanting to learn the fundamentals of the data engineering lifecycle).
As a data engineer, you will also need standards, design patterns and principles to follow. However, you might not need all that apply to software engineers. Let's explore which principles are applicable to data engineering (with simpler language than used in SOLID).
Precision and Formality: The importance of accuracy and conformance to standards and documentation.
Separation of Concerns: The idea of breaking down a system into smaller, more manageable parts that can be developed and maintained independently.
Modularity: Focuses on the importance of reusability and composability.
Abstraction: Hiding implementation details and exposing only the relevant information.
Anticipation of Change: Designing data pipelines that are flexible and can handle changes in data volume, data format, or data sources.
Incremental Development: Building data pipelines incrementally, with each iteration adding new data sources or improving existing data processing logic.
Reliability: Designing systems that are reliable and can handle failures gracefully, and detect data quality issues or missing data.
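As one illustration of the Reliability principle, the sketch below separates bad rows from good ones instead of failing the whole batch. The field names order_id and amount, and the names QualityReport and check_rows, are assumptions made up for this example.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the fields every row is expected to carry.
REQUIRED_FIELDS = ("order_id", "amount")


@dataclass
class QualityReport:
    valid: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (row, reason) pairs


def check_rows(rows, required=REQUIRED_FIELDS):
    """Split rows into valid and rejected rather than raising on the first bad row."""
    report = QualityReport()
    for row in rows:
        # A field counts as missing if the key is absent or its value is None.
        missing = [name for name in required if row.get(name) is None]
        if missing:
            report.rejected.append((row, "missing: " + ", ".join(missing)))
        else:
            report.valid.append(row)
    return report
```

The rejected rows keep their reason, so they can be logged or routed to a quarantine location while the valid rows continue through the pipeline.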
How can these principles be implemented? There are several acronyms to describe ways of working that have appeared across the data and software industry and serve as a quick reminder of the above principles.
When it comes to following the concepts of Precision and Formality and Separation of Concerns, KISS (Keep It Simple, Stupid) is a framework that encourages simplicity: build things that are easy for a team to understand, change and maintain. For example, by dividing systems into smaller parts, data engineers can reduce the risk of errors and ensure more efficient workflows between those parts.
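A small sketch of what dividing a system into smaller parts can look like: each step of the pipeline is a short function with one concern. The column names customer and amount are assumptions for the example, and a plain list stands in for a real destination.

```python
import csv
import io


def extract(csv_text):
    """Read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Keep rows with a positive amount, tidy the name, and convert the amount."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount > 0:
            cleaned.append({"customer": row["customer"].strip(), "amount": amount})
    return cleaned


def load(rows, target):
    """Append rows to a target list (a stand-in for a table or file)."""
    target.extend(rows)
    return len(rows)


def run_pipeline(csv_text, target):
    """Compose the three steps; each one can be tested and changed on its own."""
    return load(transform(extract(csv_text)), target)
```

Each function is simple enough to read at a glance, which is the point of KISS: a change to the cleaning rules touches only transform.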
Modularity and Abstraction are also key principles, and they align with the DRY (Don’t Repeat Yourself) framework, which not only makes code easier to maintain and extend but also helps prevent bugs in data through data normalisation techniques. Creating generic applications has several advantages for data engineers. It saves time and effort because you don't have to start from scratch for each project. One way to achieve this is by creating a library of reusable components that can be shared across projects, for example tools that interact with cloud services by uploading data to Amazon S3, or by zipping up repositories and uploading them to a Lambda function. This approach also promotes consistency and reduces errors, because a common codebase can be extended and applied to similar projects.
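As a rough sketch of such a reusable component, the helper below zips a directory once and hands the bytes to whatever upload function the project supplies. The names zip_directory and package_and_upload are hypothetical, and the upload step is deliberately left as a plain callable rather than tied to a specific cloud SDK.

```python
import io
import zipfile
from pathlib import Path
from typing import Callable


def zip_directory(source_dir) -> bytes:
    """Zip every file under source_dir and return the archive as bytes."""
    root = Path(source_dir)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the root so the archive is portable.
                archive.write(path, path.relative_to(root).as_posix())
    return buffer.getvalue()


def package_and_upload(source_dir, upload: Callable[[bytes], None]) -> int:
    """Zip a directory and pass the bytes to any upload callable."""
    payload = zip_directory(source_dir)
    upload(payload)
    return len(payload)
```

In a real project the upload callable might wrap a cloud SDK call, such as an S3 upload or a Lambda code update, while tests can pass a simple function that collects the bytes. The zipping logic is written once and shared.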
These principles matter to data engineering because they provide a framework for designing data pipelines that are modular, reusable, and adaptable to changing requirements and environments. By following them, data engineers can create more maintainable, scalable, and reliable data pipelines that deliver value to their organisation. Applying good principles is a habit that helps data engineers excel in their careers. From creating generic applications to understanding your infrastructure, following these principles can lead to more efficient workflows and better quality data.