A Guide To Creating Your First Data Engineering Project
Data engineering projects that stand out from the crowd.
Knowing The Basics
Getting your first data engineering role can be difficult, depending on the experience and skill level a company is looking for. Transferable skills from software engineering and data science make the transition easier than starting from scratch, so more often than not, data engineering roles require a few years of experience with programming languages like Python and SQL, data warehouse knowledge, or similar experience from other data-related roles.
Starting data engineering projects that solve small problems or add value to your life in some way is a great method of familiarising yourself with the many tools data engineers use, and it will boost your chances of landing a role. This is especially helpful if you are early in your data career and don’t have many years of experience under your belt.
Learning all of data engineering at once isn’t doable - it will be overwhelming and might even hinder your progress with the basics. A good project utilises the foundational tools and concepts and is a chance to hone those skills. To sum up my previous post on learning data engineering in 2023, start with:
SQL - Be comfortable with the commands SELECT, WHERE, GROUP BY and JOIN, aggregate functions like SUM, and window functions using OVER (PARTITION BY ...)
Python - Learn how to write loops, classes, dictionaries, tuples and lists. Familiarise yourself with the most common packages like pandas and NumPy.
Orchestration - Look at how you can use Mage, Airflow, Prefect or Dagster to automate scripts. Docker will also help with scaling and running your projects on any machine.
Cloud - Understand the cloud concepts and how to create resources like databases and virtual machines.
Distributed Compute - Will Snowflake, BigQuery or Spark be useful in your project? Knowing the pros and cons of these technologies will help.
Data Modelling - A side project is a great way of showing your data modelling skills, using the right data modelling techniques depending on your data and being able to explain why it works best.
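For the SQL bullet above, Python’s built-in sqlite3 module is enough to practise both aggregates and window functions without installing anything (SQLite 3.25+ is needed for window functions). The sales table here is made-up illustrative data:

```python
import sqlite3

# In-memory database with a small, made-up sales table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("north", 200), ("south", 50), ("south", 150)],
)

# GROUP BY with an aggregate function.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Window function: a running total per region via SUM ... OVER (PARTITION BY ...).
running = conn.execute(
    """
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running_total
    FROM sales
    ORDER BY region, amount
    """
).fetchall()

print(totals)   # [('north', 300), ('south', 200)]
print(running)
```

The same SELECT, GROUP BY and PARTITION BY patterns carry over directly to warehouses like Snowflake and BigQuery.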
With a firm understanding of the basics above, you should be comfortable beginning a project. The next challenge is usually deciding what you should actually create!
As well as the basics above, you could take a look at what tools are currently trending and have a go at using them as part of your project. You can even look at job descriptions to get an idea of what a particular company is looking for.
Writing imperfect code is how you start - this is what got me past a self-limiting mindset. Write now, refactor later. In my experience, creating a project is an iterative process and you will learn new things along the way. With that said, don’t wait until you have mastered everything you might want to use.
Planning A Project
These are some of the key questions you will want to think about when planning a new project:
What have you always wanted to automate that is a part of your daily life?
How will you extract the source data - is there an API or will you need to scrape the data?
Where will it be stored? Can you use cloud storage like S3?
Is there a Docker image you can use, e.g. for orchestration tools or visualisation tools like Superset?
What third party tools can you use for data replication?
What will you use the data for once you store it somewhere?
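On the storage question, a common pattern is to land each raw API response as a date-partitioned file; the same directory layout maps directly onto object-store keys like S3. A minimal standard-library sketch, where the paths, source name and payload are all made up for illustration:

```python
import json
from datetime import date
from pathlib import Path

def land_raw(payload: dict, source: str, base_dir: str = "datalake/raw") -> Path:
    """Write a raw API payload to a date-partitioned path.

    The layout mirrors typical object-store keys, e.g.
    s3://my-bucket/raw/<source>/dt=<YYYY-MM-DD>/data.json
    """
    partition = Path(base_dir) / source / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "data.json"
    out.write_text(json.dumps(payload))
    return out

# Example with made-up data:
path = land_raw({"transactions": [{"amount": -4.5}]}, source="bank_api")
print(path)
```

Keeping the raw data untouched like this means you can always re-run downstream transformations without calling the API again.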
Let’s go into some detail on these points, first thinking about the motivation for a project.
If you’re looking to work in a specific industry, you could create a project that involves data analysis and creating visualisations that provide insights. For example, using Twitter data to gauge sentiment on a product, or market data to show how inflation has changed the prices of goods and services. If you have come across similar data engineering projects, try to recreate them in your own way.
A great example is Start Data Engineering’s sample batch project, which utilises an ETL workflow and open-source tools like Dagster to create a dashboard from fake data. I recommend taking a look at how they have set up the infrastructure for automation, deploying to the cloud and designing the project. You can then try using it with your own data and changing the code to write your own transformation pipelines.
You could also make something personal related to your interests. Maybe you want to automate an alert when an item is back in stock, or you want to create a dashboard of your own movie reviews and ratings. A project I started relates to managing my personal finances, as my bank has a public API. At first I began writing a Python script to extract my bank account data using a wrapper I found on GitHub that makes calling the API from Python easy. I then wanted to automate this and put my data somewhere, so I had to think about which tools to use.
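As a rough sketch of what that first extraction script can look like - the endpoint, the token handling and the payload shape below are all assumptions for illustration, not any real bank’s API:

```python
import json
from urllib.request import Request, urlopen

API_URL = "https://api.example-bank.com/transactions"  # hypothetical endpoint

def fetch_transactions(token: str) -> dict:
    """Call the (hypothetical) bank API and return the parsed JSON body."""
    req = Request(API_URL, headers={"Authorization": f"Bearer {token}"})
    with urlopen(req) as resp:
        return json.load(resp)

def flatten(payload: dict) -> list[dict]:
    """Keep only the fields worth storing from each transaction."""
    return [
        {
            "id": t["id"],
            "created": t["created"],
            "amount": t["amount"],
            "category": t.get("category", "uncategorised"),
        }
        for t in payload.get("transactions", [])
    ]

# Flatten a sample payload (the shape is an assumed schema, not a real bank's):
sample = {"transactions": [
    {"id": "tx_1", "created": "2023-05-01", "amount": -250, "category": "groceries"},
    {"id": "tx_2", "created": "2023-05-02", "amount": 1000},
]}
rows = flatten(sample)
print(rows)
```

Separating the fetch from the flatten step makes the transformation easy to test on sample data and easy to drop into an orchestrator later.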
Tools And Frameworks
Over time I expanded on my project by integrating an orchestration tool - I wanted to learn how to set up Airflow with Docker and use the TaskFlow API style to write simple data pipelines. Docker Compose made deploying Airflow easy by using Airflow’s official Docker image, which handles the metadata database and connects all the services on the same network.
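For a rough idea of what that setup involves, here is a stripped-down Compose sketch; the image tag, credentials and settings are illustrative assumptions, and the full docker-compose.yaml in the official Airflow docs is the complete reference:

```yaml
# Minimal single-node sketch, not the official compose file
# (which also runs Redis and separate scheduler/webserver services).
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    volumes:
      - ./dags:/opt/airflow/dags   # your TaskFlow DAG files live here
    ports:
      - "8080:8080"
    command: standalone            # dev-only: migrates the db, then runs scheduler and webserver
```

With something like this, dropping a Python file into ./dags is all it takes to get a pipeline scheduled.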
As I wanted my data to eventually reach my Notion page, I decided to set up a database in the cloud and deployed the project on a virtual machine on AWS. I would always recommend getting comfortable with cloud resources, as most work is now done in the cloud anyway and it will help your job prospects. I was then able to use a third-party tool to replicate the data onto my Notion page, which made my personal finances super easy to track and automate!
I hope this article has inspired you to begin the project you’ve always wanted to do, with an idea of the tools to use and a direction to go in. A simple project can evolve into something greater and the learning will be worth the time.
Thanks for reading Modern Data Engineering! Subscribe for free to receive new posts and support my work.