How strongly do I recommend Data Pipelines Pocket Reference?
5 / 10
With so much industry momentum toward data engineering and downstream uses for AI, I picked up this book hoping to build skills complementary to my software engineering background.
What I found was that tooling has really changed over the past decade with the emergence of AWS tools for data engineers and providers like Snowflake and Fivetran. But the basics are all the same and pretty straightforward for most companies – export from one system, import into another, then use SQL for data analysis.
The code examples are simple and illustrate the mechanics of setting up pipelines using SQL, Python, AWS, and various Apache tools. The SQL is very straightforward and mostly demonstrates the usefulness of GROUP BY. The Python is only slightly more complex, demonstrating connections to services like Snowflake and AWS S3.
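The mechanics are roughly this simple. Here's a minimal sketch of the extract-and-load steps, using sqlite3 as a stand-in for both the source system and the warehouse (my substitution – the book's examples target services like Snowflake and S3, which need credentials):

```python
import sqlite3

# Stand-ins for a production source database and a warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Extract: pull raw rows from the source system.
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, 24.5), (3, 5.5)])
rows = source.execute("SELECT id, amount FROM orders").fetchall()

# Load: bulk-insert into the warehouse without transforming first.
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

# Analyze: plain SQL over the loaded data.
total = warehouse.execute("SELECT SUM(amount) FROM raw_orders").fetchone()[0]
print(total)  # prints 40.0
```

Swap the sqlite3 connections for a database driver and a warehouse client and the shape of the code barely changes.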
Overall this book gave me confidence that I’m not missing out on much and could set up a reasonable data pipeline without too much effort.
Top Ideas in This Book
The benefit of moving data transformations from pre-load to post-load is that data engineers no longer need to guess what data analysts want to query. Likewise, data engineers don’t need to get everything correct upfront, which reduces the barrier to building a robust data warehouse.
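That shift can be sketched in a few lines, again using sqlite3 as a stand-in warehouse (my assumption for the example): the raw data lands first, and the transformation is just a SQL view that analysts can add or change after the fact:

```python
import sqlite3

wh = sqlite3.connect(":memory:")

# Load step: raw events land in the warehouse untransformed.
wh.execute("CREATE TABLE raw_events (user_id INTEGER, event TEXT, ts TEXT)")
wh.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", [
    (1, "login", "2024-01-01"),
    (1, "click", "2024-01-01"),
    (2, "login", "2024-01-02"),
])

# Transform step happens post-load: it's defined (and redefined) in SQL,
# so the data engineer never had to guess this shape upfront.
wh.execute("""
    CREATE VIEW daily_logins AS
    SELECT ts, COUNT(*) AS logins
    FROM raw_events
    WHERE event = 'login'
    GROUP BY ts
""")

print(wh.execute("SELECT * FROM daily_logins ORDER BY ts").fetchall())
# prints [('2024-01-01', 1), ('2024-01-02', 1)]
```

Getting the view wrong costs almost nothing: drop it, rewrite it, and the raw data is still there.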
Data analysts need strong SQL skills. I would also argue that software engineers all benefit from learning SQL because SQL teaches you to think about data in sets, not just individual records.
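A toy illustration of that set-oriented mindset (my example, not the book's): the row-by-row loop and the single GROUP BY compute the same totals, but the SQL states the result as a property of the whole set rather than an accumulation over individual records:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("east", 10), ("west", 20), ("east", 5)])

# Record-at-a-time thinking: iterate and accumulate.
totals = {}
for region, amount in db.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0) + amount

# Set thinking: one declarative statement over the whole table.
grouped = dict(db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

print(totals == grouped)  # prints True
```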
The author provides a framework for testing data pipelines, but the framework amounts to mostly smoke testing – coarse checks that each stage ran and produced plausible-looking data.
I don’t know what a more formal approach would look like, but it seems like data engineering could learn from software engineering’s approach to unit and integration tests.
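In that spirit, here's what a smoke-test pass might look like against a freshly loaded table (my guesses at reasonable checks – row count, duplicate keys, NULLs – not the author's exact list), again with sqlite3 standing in for a warehouse:

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [(1, 9.5), (2, 12.0), (3, 7.25)])

def smoke_test(conn, table, expected_rows):
    """Coarse post-load checks: the load ran and the result looks sane."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    assert n == expected_rows, f"expected {expected_rows} rows, got {n}"

    dupes = conn.execute(
        f"SELECT id, COUNT(*) FROM {table} GROUP BY id HAVING COUNT(*) > 1"
    ).fetchall()
    assert not dupes, f"duplicate ids: {dupes}"

    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE amount IS NULL").fetchone()[0]
    assert nulls == 0, f"{nulls} NULL amounts"

smoke_test(wh, "raw_orders", expected_rows=3)
print("smoke test passed")
```

Useful, but notice these checks only tell you the load looks plausible – nothing here verifies the values themselves against the source, which is where unit- and integration-style tests would come in.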
Applications have bugs, and sometimes our source data has errors. So even beyond potential issues with the ELT process itself, you shouldn’t assume that the data in your warehouse is free of quality issues.
Because data pipelines move in only one direction, error correction can be tricky and manual – assuming you even have monitoring and validation steps in place to know when an error occurs.
Although the common data ingestion pattern has shifted from ETL to ELT, thereby deferring the transformation step, fixing a data quality issue early is still the best approach when you can. Don’t burden your analytics people with unnecessary transformations if you can clean up the data earlier in the pipeline.
Encountering data quality issues at the end of the pipeline or during the analysis stage is a worst-case scenario.
Even if data in a warehouse exactly matches the source data, don’t assume that the data is clean. The source data itself may have quality issues as a result of software bugs or data corruption, which then cascade their way into your data warehouse.