Breaking into Data Engineering as a Self-Taught Developer

From data analyst to data engineer: why I made the transition

May 30, 2023

Source: https://unsplash.com/photos/tPKQwYHy8q4

The growing need for technical skills

In the ever-evolving field of data, many analytics roles offer growth, ownership and innovation opportunities. Data engineers play a pivotal role in designing, building, and maintaining systems that enable organisations to effectively leverage their data. They work closely with data scientists to ensure that data is properly stored, secured, and made available for analysis. They possess an extensive skill set which includes knowledge of databases, data modelling, ETL processes and programming languages like SQL and Python. Aside from the core concepts that a data engineer should be familiar with, there is also a plethora of tools out there to work with.

The trend of increasing demand for technical expertise is paralleled by a surge in the demand for data engineers. Companies recognise the value of data and its role in decision-making. They are seeking data engineers to handle the growing complexity of data infrastructure. Over the past few years, I transitioned into a data engineering role as I became curious about this growth.

Starting as a data analyst

At university I studied core Physics and in my final year I completed a data analysis module covering database fundamentals. After graduating, I came across junior data analyst roles which required little to no experience. I landed my first role in an information services company where I mostly generated reports and dashboards. It was a great first job to see how businesses operate, where I learned the art of storytelling and the basics of data warehousing.

Collaborating with data engineers allowed me to gain a deeper understanding of the data pipelines and the importance of data quality. Working with data scientists provided me with exposure to advanced analytical techniques and models. I assisted them in the preparation and preprocessing of data, and in the evaluation of model results. This experience helped me expand my knowledge in statistical analysis and data visualisation, enabling me to present findings in a clear and meaningful way to stakeholders.

This role served as a valuable stepping stone in my career, providing me with practical experience in a real business setting. The exposure to both data engineering and data science aspects broadened my skill set and deepened my understanding of the end-to-end data lifecycle. This left me eager to further develop my technical skills and take on more challenging opportunities.

At the time, companies sought to develop their predictive capabilities and the demand for data-driven decision-making in businesses paved the road to hiring data scientists en masse. There were a lot of people from many different educational backgrounds looking to become a data scientist and honestly, I also wanted to have the sexiest job title of the time.

My decision to go down the route of data science stemmed from a desire to automate and improve my analytical skills. This realisation led me to transition into a consultancy role as a data scientist, giving me an opportunity to gain proficiency in Python and further explore its applications in data analysis.

My experience in data science

For the majority of the time in my consultancy role I was placed on a client project with a multinational conglomerate to support their campaign deliveries. I had great mentors who assisted me with obtaining my cloud certifications and developing interesting proofs of concept. They also helped me with understanding the clients we worked with to better tailor our model building to the domain of our client. For example, one of the clients we worked with wanted to gauge the sentiment of people’s views on the environment. Seasonality was a key driver in how people talked about the environment and so it was important to consider during the model building.

I spent my development time working on my Python and R skills, participating in machine learning working groups and learning about robust coding best practices. However, most of my project work wasn’t very stimulating, in the sense that the work had little impact. Most of the time I didn’t feel like a data scientist, but more of a glorified data analyst. It is true that a lot of the time will always need to be spent on cleaning data and preparing it, but I was expecting to also work on some real-world machine learning problems. The lack of stimulating projects led to me yearning for more substantial technical challenges.

We were often seen as the magicians by our colleagues who frequently requested assistance with their data collection, usually to do some simple regex cleaning in VBA on a free-text field that was used in a form. This was awesome for them, but not very interesting for me. Well, now I’m currently working on moving some data pipelines I’ve built from Airflow onto Mage, and that kind of makes me a magician now.

At the time, machine learning wasn’t always the solution to every problem. I often struggled to see how to add real business value. The push to create proofs of concept for forecasting or anomaly detection applications sometimes went nowhere because of one of the following:

Lack of scalable infrastructure
Wrong expectations
Poor data quality

When it comes to scalable infrastructure, there are multiple factors at play. One of the common issues was data scalability and the costs that come with expensive querying. This was partly due to repeated queries, often derived from questions that were business oriented. Once performance suffered to the point where the team couldn’t work effectively, it became clear that the data infrastructure was not scalable.

With the hype around the potential of machine learning, I often encountered ambitious requests from stakeholders or those less familiar with the details of the technical implementation. This misalignment of expectations wasted time that could have been better spent on other tasks. Being relatively new to the field myself, I was also still learning to make decisions about what’s worth exploring and what’s actually possible. With these key issues, my focus naturally moved to wanting to understand them better and improve them.

Moving to data engineering

As my programming skills improved I moved to an organisation where I joined a small data science team, with a focus on building internal tools and productionising models. I designed and implemented tools that streamlined report building processes to allow end users to generate and access report data more quickly.

My primary responsibility was to identify and remove bottlenecks faced by analysts when accessing and working with data. This empowered analysts to spend less time on manual data retrieval and processing tasks and dedicate more of their time to data analysis, generating market insights, and making informed business decisions. I also focused on data quality, integrating tools like Great Expectations into our batch data pipelines so that we could catch data issues before they hit downstream processes. It also allowed us to confidently extract data for use by the data scientists to build machine learning models. I then built processes to take these models and apply them to our data. My work began to feel more valued, as the tools I built significantly improved the productivity of data analysts.
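
To make the idea of catching data issues before they reach downstream processes a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations API, and it is not the code I wrote at the time: the function names validate_batch and load_if_valid, the column names, and the rules themselves are all hypothetical and only illustrate the pattern of validating a batch and refusing to load it when a check fails.

```python
from typing import Any

# Hypothetical required columns, chosen only for illustration.
REQUIRED_COLUMNS = ("order_id", "amount")


def validate_batch(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable failures (empty if the batch is clean)."""
    failures: list[str] = []
    seen_ids: set[Any] = set()
    for index, row in enumerate(rows):
        # Check 1: required values must be present.
        for column in REQUIRED_COLUMNS:
            if row.get(column) is None:
                failures.append(f"row {index}: missing {column}")
        # Check 2: order_id must be unique within the batch.
        order_id = row.get("order_id")
        if order_id is not None:
            if order_id in seen_ids:
                failures.append(f"row {index}: duplicate order_id {order_id}")
            seen_ids.add(order_id)
        # Check 3: amounts must not be negative.
        amount = row.get("amount")
        if isinstance(amount, (int, float)) and amount < 0:
            failures.append(f"row {index}: negative amount {amount}")
    return failures


def load_if_valid(rows: list[dict[str, Any]], sink: list[dict[str, Any]]) -> None:
    """Append the batch to `sink` only when every check passes."""
    failures = validate_batch(rows)
    if failures:
        raise ValueError("batch rejected: " + "; ".join(failures))
    sink.extend(rows)
```

The point of the design is that the whole batch is rejected up front, so a bad file never partially lands in a table that analysts or models read from. A dedicated tool adds reporting and reusable rule definitions on top of the same basic idea.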

Having been an analyst myself first, I was aware of the common data issues and struggles faced when working with data. Becoming proficient in how to build pipelines and extract data gave me much more independence to tackle these issues. With this opportunity, I learned many key skills:

Building idempotent data pipelines
Data quality testing
Creating Python packages with tested code
Docker and Docker Compose to run things anywhere
Using the Linux terminal
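
As an illustration of the first item in the list above, here is a minimal sketch of an idempotent load step using Python's built-in sqlite3 module. It is one common way to get idempotency (delete and rewrite a whole partition inside a single transaction), not necessarily how my own pipelines were built. The table daily_sales, its columns, and the function load_partition are hypothetical names made up for the example.

```python
import sqlite3


def load_partition(
    conn: sqlite3.Connection, run_date: str, rows: list[tuple[str, float]]
) -> None:
    """Replace all rows for `run_date`, so rerunning the same day changes nothing."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (run_date, product, amount) VALUES (?, ?, ?)",
            [(run_date, product, amount) for product, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (run_date TEXT, product TEXT, amount REAL)")

batch = [("widget", 9.5), ("gadget", 20.0)]
load_partition(conn, "2023-05-30", batch)
load_partition(conn, "2023-05-30", batch)  # a retry or backfill of the same day

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(row_count)  # 2: the rerun did not duplicate any rows
```

Because the delete and the insert share a transaction, a failed run leaves the previous state untouched, and a scheduler can safely retry the task without anyone having to clean up duplicates by hand.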

Finding the right team

My career plan at this point was clearer to me as I started to realise that there is so much more to learn in data engineering:

I started looking at many projects to unpack common design patterns
I wanted to understand how to build infrastructure as code, with tools like Terraform
I wanted to learn how to build and interact with REST APIs
I wanted to build things on a larger scale
and much more.

Now that I had expanded my skillset, I wanted to be part of a larger team of developers, where I would be working to a high standard. I moved to my current organisation in the financial sector, which has a mature technology department. Because we work in a highly regulated department, embracing best practices is paramount when building maintainable and testable systems:

Clear documentation throughout the development process, including design documents, data flow diagrams, and code comments.
Rigorous testing, including unit testing, integration testing, and end-to-end testing, to ensure the system functions as intended and meets regulatory requirements.
Implementation of appropriate security measures, including encryption and access controls, to protect sensitive data and ensure compliance with relevant regulations.
Establishment of clear processes and procedures for maintaining and updating the system over time, including version control, change management, and regular auditing.

Being part of a team building critical products from scratch, I’m constantly learning as I deep dive into the many undercurrents of the data engineering lifecycle: security, software engineering, data architecture and more.

Closing Thoughts

Transitioning from data analyst to data engineer consisted of acquiring technical skills and finding the right organisation that fostered continuous learning and opportunities. As part of my journey, I dedicated time outside of work to learn engineering design principles and concepts that are applicable to data engineering.

Aspiring data engineers can find further insights by exploring resources like Joe Reis and Matt Housley’s book, “Fundamentals of Data Engineering,” which gave me a comprehensive understanding of the data engineering role and its intricacies. The authors divide the data engineering lifecycle into five stages:

Generation
Storage
Ingestion
Transformation
Serving

With a deep dive into each stage, you will learn how they connect and gain a holistic overview of what data engineering encompasses. I find it really helpful to refer to it when I am working on projects and need a reminder of best practices. As data engineering continues to evolve, it is crucial to stay curious, embrace new technologies, and never stop learning.

Thank you for reading Modern Data Engineering. This post is public so feel free to share it.
