Books for software engineers and managers

Data Pipelines Pocket Reference

Moving and Processing Data for Analytics

by James Densmore

Categories:
Star Engineer

In modern data pipelines, the common ETL pattern has been reordered to ELT, which reduces guessing about what data analysts will use the data for

The benefit of moving data transformations from pre-load to post-load is that data engineers no longer need to guess what data analysts want to query. Likewise, data engineers don’t need to get everything correct upfront, which reduces the barrier to building a robust data warehouse.
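A minimal sketch of the ELT idea, using an in-memory SQLite database as a stand-in warehouse (the table and column names here are hypothetical, not from the book): raw rows are landed as-is, and the shaping happens afterward in SQL, once analysts know what they actually need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched, no upfront modeling.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 900, "refunded"), (3, 4300, "paid")],
)

# Transform: shape the data post-load, inside the warehouse, with SQL.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
""")
rows = conn.execute("SELECT id, amount_dollars FROM orders_clean ORDER BY id").fetchall()
print(rows)  # [(1, 12.5), (3, 43.0)]
```

If the analysts later need refunded orders too, nothing was thrown away pre-load; they just write another transformation over `raw_orders`.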

SQL is still the most important skill for data analysis

Data analysts need strong SQL skills. I would also argue that software engineers all benefit from learning SQL because SQL teaches you to think about data in sets, not just individual records.
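One hypothetical illustration of set-based thinking: instead of looping over users and checking each one's orders individually, a single anti-join answers "which users have no orders?" as one set operation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin"), (3, "mo")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (3, 5.0)])

# Anti-join: keep the users for whom the LEFT JOIN found no matching order.
no_orders = conn.execute("""
    SELECT u.name
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE o.user_id IS NULL
""").fetchall()
print(no_orders)  # [('lin',)]
```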

Testing data pipelines feels like smoke testing software - mostly informal sanity checks for the presence of data and some accuracy

The author provides a framework for testing data pipelines, but the framework amounts to mostly smoke testing. You’re testing for:

  1. The existence of data
  2. Duplicate rows
  3. Values beyond expected boundaries

I don’t know what a more formal approach would look like, but it seems like data engineering could learn from software engineering’s approach to unit and integration tests.
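The three checks above can be sketched as SQL smoke tests against a loaded table; the table, columns, and boundaries here are illustrative, not the author's exact framework.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5), (2, 25.5)])

# 1. Existence: did any rows load at all?
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
assert row_count > 0, "no rows loaded"

# 2. Duplicates: does any id appear more than once?
dupes = conn.execute(
    "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [(2,)] - this load contains a duplicate

# 3. Boundaries: count values outside the expected range.
out_of_range = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount < 0 OR amount > 1000"
).fetchone()[0]
assert out_of_range == 0, "values beyond expected boundaries"
```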

Never assume the data is free of quality issues, even when it exactly matches the source

Applications have bugs and sometimes our source data has errors. So even beyond potential issues with the ELT process, you don’t want to assume that the data in your warehouse is free of quality issues.

Pipelines are directed and acyclic, meaning they have a guaranteed execution path and only flow one way

Because data pipelines move in only one direction, error correction can be tricky and manual, and that assumes you even have monitoring and validation steps in place to know when an error occurs.
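A toy sketch of the directed, acyclic property: each task runs only after its upstream dependencies finish, and data flows strictly forward. The task names are hypothetical, and real orchestrators like Apache Airflow express the same idea with far more machinery.

```python
# Each task lists the upstream tasks it depends on; no task ever points
# back upstream, so the graph is acyclic.
tasks = {
    "extract": [],
    "load": ["extract"],
    "validate": ["load"],
    "transform": ["validate"],
}

def run_order(tasks):
    """Return an execution order where every task follows its dependencies."""
    done, order = set(), []
    while len(order) < len(tasks):
        for name, deps in tasks.items():
            if name not in done and all(d in done for d in deps):
                done.add(name)
                order.append(name)
    return order

print(run_order(tasks))  # ['extract', 'load', 'validate', 'transform']
```

Note that the `validate` step sits between loading and transforming; without a step like it, a one-way pipeline has no built-in way to notice an upstream error before it propagates.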


How strongly do I recommend Data Pipelines Pocket Reference?
5 / 10

With so much industry momentum toward data engineering and downstream uses for AI, I picked up this book hoping to gain skills complementary to my software engineering knowledge.

What I found was that tooling has really changed over the past decade with the emergence of AWS and providers like Snowflake and Fivetran. But the basics are all the same and pretty straightforward for most companies – export from one system, import into another, then use SQL for data analysis.

The code examples are simple and illustrate the mechanics of setting up pipelines using SQL, Python, AWS, and various Apache tools. The SQL is very straightforward and mostly demonstrates the usefulness of GROUP BY. The Python is only slightly more complex, demonstrating connections to services like Snowflake and AWS S3.
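For flavor, this is roughly the kind of GROUP BY the book leans on: rolling individual event rows up into per-category aggregates. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (category TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("signup", 1), ("signup", 1), ("purchase", 30), ("purchase", 45)],
)

# Collapse event rows into one summary row per category.
totals = conn.execute(
    "SELECT category, COUNT(*), SUM(value) FROM events "
    "GROUP BY category ORDER BY category"
).fetchall()
print(totals)  # [('purchase', 2, 75), ('signup', 2, 2)]
```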

Overall, this book gave me confidence that I'm not missing out on too much and could set up a reasonable data pipeline without too much effort.