Data Pipelines Pocket Reference

Moving and Processing Data for Analytics

by James Densmore

Categories: Tech Lead, Star Engineer

How strongly do I recommend Data Pipelines Pocket Reference?
5 / 10

Review of Data Pipelines Pocket Reference

With so much industry momentum toward data engineering and downstream uses for AI, I picked up this book hoping to gain skills complementary to my software engineering knowledge.

What I found was that tooling has really changed over the past decade with the emergence of AWS tools for data engineers and providers like Snowflake and Fivetran. But the basics are all the same and pretty straightforward for most companies – export from one system, import into another, then use SQL for data analysis.

The code examples are simple and illustrate the mechanics of setting up pipelines using SQL, Python, AWS, and various Apache tools. The SQL is very straightforward and mostly demonstrates the usefulness of GROUP BY. The Python is only slightly more complex, demonstrating connections to services like Snowflake and AWS S3.
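
To give a sense of what those examples look like, here is a minimal sketch of an extract-and-load step in that style. It is not the book's actual code: the driver choice (psycopg2), the connection details, the table, and the bucket name are all placeholders I made up.

    import csv

    import boto3      # AWS SDK for Python; credentials are assumed to come from the environment
    import psycopg2   # Postgres driver; any DB-API source database works the same way

    # Extract: dump a query result to a flat file. Connection details, table,
    # and bucket names below are placeholders, not the book's examples.
    conn = psycopg2.connect(host="source-db.example.com", dbname="app", user="etl", password="...")
    cursor = conn.cursor()
    cursor.execute("SELECT order_id, customer_id, total FROM orders;")

    with open("orders_extract.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor.fetchall())

    # Load: push the file to an S3 bucket that the warehouse can ingest from.
    s3 = boto3.client("s3")
    s3.upload_file("orders_extract.csv", "my-data-lake-bucket", "extracts/orders_extract.csv")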

Overall this book gave me confidence that I’m not missing out on too much and could set up a reasonable data pipeline without too much effort.



In modern data pipelines, the common ETL pattern has been reordered to ELT, which reduces guessing about what data analysts will use the data for

The benefit of moving data transformations from pre-load to post-load is that data engineers no longer need to guess what data analysts want to query. Likewise, data engineers don’t need to get everything correct upfront, which reduces the barrier to building a robust data warehouse.
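
To make the pre-load versus post-load distinction concrete, here is a rough sketch of a post-load transform, assuming a Snowflake warehouse; the connection parameters and table names are invented for illustration.

    import snowflake.connector  # warehouse client; any SQL warehouse works the same way

    # Connection parameters are placeholders.
    conn = snowflake.connector.connect(
        account="my_account",
        user="etl_user",
        password="...",
        warehouse="TRANSFORM_WH",
        database="ANALYTICS",
    )

    # The "T" in ELT happens here, after the raw data has already landed in the
    # warehouse: the analyst-facing model is built later, entirely in SQL.
    conn.cursor().execute("""
        CREATE OR REPLACE TABLE daily_revenue AS
        SELECT order_date, SUM(total) AS order_revenue
        FROM raw.orders
        GROUP BY order_date;
    """)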

SQL is still the most important skill for data analysis

Data analysts need strong SQL skills. I would also argue that all software engineers benefit from learning SQL, because SQL teaches you to think about data in sets, not just individual records.
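
The "sets, not records" point is easiest to see side by side. This toy comparison is my own, not the book's: the same per-customer aggregation expressed as a row-by-row loop and as a single set-based query.

    # Toy rows standing in for an orders table: (customer_id, total).
    orders = [("alice", 40.0), ("bob", 25.0), ("alice", 15.0)]

    # Record-at-a-time thinking: loop over individual rows in application code.
    totals = {}
    for customer_id, total in orders:
        totals[customer_id] = totals.get(customer_id, 0.0) + total
    print(totals)  # {'alice': 55.0, 'bob': 25.0}

    # Set-based thinking: describe the whole result and let the database compute it.
    query = """
        SELECT customer_id, SUM(total) AS lifetime_value
        FROM orders
        GROUP BY customer_id;
    """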

Testing data pipelines feels like smoke testing software - mostly informal sanity checks for the presence of data and some accuracy

The author provides a framework for testing data pipelines, but the framework amounts to mostly smoke testing (sketched in code after the list). You’re testing for:

  1. The existence of data
  2. Duplicate rows
  3. Values beyond expected boundaries
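
For concreteness, here is what those three checks might look like as validation queries. The table, the thresholds, and the use of sqlite (so the snippet runs standalone) are my own choices, not the author's framework.

    import sqlite3  # stand-in warehouse so the sketch runs standalone; any DB-API connection works

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 40.0), (2, 25.0)")

    def check(sql, passes, description):
        """Run one validation query and report whether its result passes."""
        value = conn.execute(sql).fetchone()[0]
        print(f"{'PASS' if passes(value) else 'FAIL'}: {description} (got {value})")

    # 1. Existence of data: the load actually produced rows.
    check("SELECT COUNT(*) FROM orders", lambda n: n > 0, "orders table is non-empty")

    # 2. Duplicate rows: no order_id appears more than once.
    check("SELECT COUNT(*) FROM (SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)",
          lambda n: n == 0, "no duplicate order_ids")

    # 3. Values beyond expected boundaries: totals fall inside a sane range.
    check("SELECT COUNT(*) FROM orders WHERE total < 0 OR total > 100000",
          lambda n: n == 0, "order totals within expected bounds")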

I don’t know what a more formal approach would look like, but it seems like data engineering could learn from software engineering’s approach to unit and integration tests.

Never assume the data is free of quality issues, even if it exactly matches the source

Applications have bugs and sometimes our source data has errors. So even beyond potential issues with the ELT process, you don’t want to assume that the data in your warehouse is free of quality issues.

Pipelines are directed and acyclic, meaning they have a guaranteed execution path and only flow one way

Because data pipelines only move in one direction, error correction can be tricky and manual – and that’s assuming you even have monitoring and validation steps in place to know when an error occurs.
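
The directed, acyclic shape is exactly what workflow orchestrators model as a DAG. Here is a minimal sketch assuming Apache Airflow 2.x as the orchestrator; the tool choice is mine for illustration, the task bodies are stubs, and parameter names drift a bit across Airflow versions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Stub task bodies; a real pipeline would call the actual extract/load/transform code.
    def extract():
        print("pull data from the source system")

    def load():
        print("load the raw data into the warehouse")

    def transform():
        print("build analyst-facing tables inside the warehouse")

    with DAG(dag_id="orders_pipeline", start_date=datetime(2024, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_load = PythonOperator(task_id="load", python_callable=load)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)

        # Directed and acyclic: extract runs before load, load before transform,
        # and nothing ever flows backward.
        t_extract >> t_load >> t_transform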

Address data quality issues as early in the pipeline as possible

Although the common data ingestion pattern has shifted from ETL to ELT, thereby deferring the transformation component, if you identify a data quality issue and can address it early, that is still the best approach. Don’t burden your analytics people with unnecessary transformations if you can clean up the data earlier on.
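
As a small illustration (my own, with hypothetical field names), an early cleanup step during ingestion might look like this: normalize formatting and drop obviously corrupt records before they ever reach the warehouse.

    # Toy extracted records; field names and values are illustrative only.
    extracted_rows = [
        {"email": "  Alice@Example.com ", "total": "40.0"},
        {"email": "bob@example.com", "total": None},  # clearly corrupt record
    ]

    def clean_row(row):
        """Fix obvious issues before the data ever reaches the warehouse."""
        row["email"] = row["email"].strip().lower()
        if row["total"] is None or float(row["total"]) < 0:
            return None  # drop rows that downstream analysts could never use
        return row

    cleaned = [r for r in (clean_row(r) for r in extracted_rows) if r is not None]
    print(cleaned)  # only the valid, normalized record survives to the load step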

Encountering data quality issues at the end of the pipeline or during the analysis stage is a worst case scenario.

Data engineers should never assume the data is clean and free of quality issues

Even if data in a warehouse exactly matches the source data, don’t assume that the data is clean. The source data itself may have quality issues as a result of software bugs or data corruption, which then cascade their way into your data warehouse.
