Hey there!👋 It’s Elias - let’s talk about testing in data engineering. This will be fun and quick.
I believe we can all agree that writing tests can feel as mundane as doing the laundry; most of us would rather be coding new features or learning something new. We would also agree that testing is undeniably a critical component of data engineering, and doing it well is a skill that sets engineers apart.
Many data teams these days lack the basic ability to test their data pipelines, which includes both unit testing code and data quality testing.
Given the importance, it’s easy to fall into the trap of writing tests just for the sake of it, especially if we don’t see its true value.
There are some great frameworks out there to make data quality testing easy and you can read more about them in my previous post below.
Testing Frameworks
When it comes to building a data pipeline, we often have complex or dynamic requirements. Adopting the software engineer’s time-tested technique of Test Driven Development (TDD) can be challenging: it demands writing tests for specific functionality first and then adjusting the code until it is functional and dependable, which means you need to know the individual components you want to test beforehand.
There is also Behaviour Driven Development (BDD) which encourages more teamwork and starts with a focus on the user story to bridge the gap between business and technical team members.
In an ideal world our JIRA tickets would have low-level requirements that translate easily into individual unit tests. In reality, however, you could have high-level requirements to build a solution where you don’t know all the individual technical components needed until you start.
You might start writing some code and then refine it over time as the shape of your data influences the design or the requirements change. This is where the idea of writing clean, performant code comes in, so that adding tests afterwards isn’t a refactoring nightmare: modular functions that do one thing and avoid side effects.
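To make "modular functions that do one thing" a bit more concrete, here is a minimal sketch. The function names and the record shape are hypothetical, made up purely for illustration; the point is only that each step takes data in and returns new data out, so it can be tested without touching files or databases.

```python
from typing import Dict, List

Record = Dict[str, float]


def filter_valid_prices(records: List[Record]) -> List[Record]:
    """Keep only records with a positive price. No I/O, no mutation."""
    return [r for r in records if r["price"] > 0]


def add_vat(records: List[Record], vat_rate: float) -> List[Record]:
    """Return new records with a price_with_vat field; inputs are left untouched."""
    return [
        {**r, "price_with_vat": round(r["price"] * (1 + vat_rate), 2)}
        for r in records
    ]
```

Because neither function reads from or writes to anything outside its arguments, a test only needs a small list of dictionaries as input, and tests can be added later without restructuring the pipeline.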
Unit Testing
Why do we need unit tests?
Writing testable code is crucial to ensure that your code is reliably executing as expected
It reduces the time spent fixing bugs as they’re caught sooner through automation
Data pipelines can contain complex business logic that will need to be checked against different inputs
Unit testing scrutinises the individual components of your code to identify issues at the most granular level
Embracing unit tests boosts developer performance and benefits the team too
pytest
For a while, I've been using pytest for unit tests in Python, but only recently took a deep dive into how it really works. I noticed how un-pythonic pytest is and how this makes testing a breeze.
It has a lot of built-in features to make testing frictionless. When an assertion fails, the feedback on the failure is expressive and helps you iterate faster. You can also create fixtures: separate functions that pytest runs for every test that requests them, providing more flexibility and speed compared to the unittest module. There is also much less boilerplate code to deal with.
One of the common use cases of fixtures is to create and destroy environments which are required as part of the test being run. For example, before the test is run, a fixture can create the input data required for the test or mock up an S3 bucket.
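A rough sketch of that create-and-destroy pattern is shown here, using a temporary CSV file rather than a mocked S3 bucket to keep it dependency-free. The helper write_sample_csv, the fixture name input_file and the file contents are all my own assumptions for illustration; tmp_path is pytest's built-in temporary directory fixture.

```python
import csv
from pathlib import Path

import pytest


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny, made-up input file and return its path."""
    path = directory / "prices.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "price"])
        writer.writerow(["widget", "100.0"])
    return path


@pytest.fixture
def input_file(tmp_path):
    path = write_sample_csv(tmp_path)  # setup: runs before the test
    yield path                         # the test body runs at this point
    path.unlink(missing_ok=True)       # teardown: runs after the test


def test_input_file_has_header(input_file):
    with input_file.open(newline="") as f:
        rows = list(csv.reader(f))
    assert rows[0] == ["product", "price"]
```

Everything before the yield is setup and everything after it is teardown, so the test itself stays focused on the assertion.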
This is an improvement over the unittest method as you now only have to define the fixture in one place, the conftest.py file which is stored in your ‘tests’ directory, and it can be used against any of your test functions.
It’s also very easy to filter which tests you want to run with the commands that come with pytest.
Let’s see an example where we define a simple function to work out a new discounted price and use fixtures to hold input data.
In our conftest.py file, we add a fixture by using the decorator @pytest.fixture before defining the function.
In our test file, we can now write test functions that make use of the product_prices fixture. This is where pytest deviates from regular Python style and it somehow just works!
By including product_prices as an argument of the test_calculate_discount function, pytest will first run the fixture and pass its return value, the product_prices dictionary, into the test. This is then used in the calculate_discount function to provide a price and discount rate.
I can run this test from the command line by simply running “pytest”, or I can narrow it down by including the name of the test file and the function name: “pytest test_pytest_functions.py -k test_calculate_discount”.
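Putting the pieces described above together, a minimal single-file sketch might look like this. The body of calculate_discount, the dictionary keys and the sample values are my assumptions for illustration, not a definitive implementation. In the layout described above the fixture would sit in conftest.py and the function in its own module; they are kept in one file here only so the sketch runs on its own.

```python
import pytest


def calculate_discount(price: float, discount_rate: float) -> float:
    """Return the price after a fractional discount (0.2 means 20% off)."""
    if not 0 <= discount_rate <= 1:
        raise ValueError("discount_rate must be between 0 and 1")
    return round(price * (1 - discount_rate), 2)


@pytest.fixture
def product_prices():
    """Hypothetical input data; pytest builds a fresh dict for each test."""
    return {"price": 100.0, "discount_rate": 0.2}


def test_calculate_discount(product_prices):
    # pytest matches the argument name to the fixture of the same name,
    # runs the fixture, and passes its return value in here.
    discounted = calculate_discount(
        product_prices["price"], product_prices["discount_rate"]
    )
    assert discounted == 80.0
```

Notice that the test never calls product_prices() itself; the link between the test and the fixture is only the matching argument name.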
Another way of providing input data to a test would be to use the @pytest.mark.parametrize decorator. This means you can hardcode specific inputs for test functions that should not use the product_prices fixture, and pytest will run the test once for each set of inputs.
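Here is a small sketch of how that could look. The calculate_discount body and the input values are assumptions of mine, and the function is repeated only so the file runs on its own.

```python
import pytest


def calculate_discount(price, discount_rate):
    """Hypothetical implementation, repeated so this file is self-contained."""
    return round(price * (1 - discount_rate), 2)


@pytest.mark.parametrize(
    "price, discount_rate, expected",
    [
        (100.0, 0.2, 80.0),    # 20% off
        (200.0, 0.25, 150.0),  # 25% off
        (50.0, 0.0, 50.0),     # no discount
    ],
)
def test_calculate_discount_parametrized(price, discount_rate, expected):
    assert calculate_discount(price, discount_rate) == expected
```

pytest reports each tuple as its own test case, so a failure points at the exact combination of inputs that broke.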
Closing Thoughts
I hope this gives a quick feel for the power of pytest and why, with its time-saving benefits and flexibility, it might be a better choice than other tools out there.
There is no one-size-fits-all solution when it comes to testing and I’m sure there’s much more to explore with pytest, maybe for another post.
Whether you embrace the frameworks for testing or not, the more important thing is that there are some tests. This should be a good starting place if you’ve not yet started testing your code, or just a refresher!