The Grouparoo Blog
Last month, we decided that we should all read a book and talk about it as a company. It was a fun experience and I think we made a good choice by picking 97 Things Every Data Engineer Should Know.
This was the first book I have read in this series and I liked the format. It is made up of 97 small vignettes that are 2-3 pages each. This provided a nice overview of the breadth of topics that are relevant to data engineering including data warehouses/lakes, pipelines, metadata, security, compliance, quality, and working with other teams.
Themes
I was drawn to the articles that speak to a theme in the data world that I am passionate about: how data pipelines and data team practices are evolving to be more like traditional product development.
Reproducible pipelines
- Automate Your Infrastructure by Christiano Anderson
- Data Pipeline Design Patterns for Reusability and Extensibility by Mukul Sood
- Engineering Reproducible Data Science Projects by Dr. Tianhui Michael Li
- The Three Rs of Data Engineering by Tobias Macey
Data testing and quality
- Automate Your Pipeline Tests by Tom White
- Data Quality for Data Engineers by Katharine Jarmul
- Data Validation Is More Than Summary Statistics by Emily Riederer
- The Six Words That Will Destroy Your Career by Bartosz Mikulski
- Your Data Tests Failed! Now What? by Sam Bail, PhD
Agile development and product management
- Caution: Data Science Projects Can Turn into the Emperor’s New Clothes by Shweta Katre
- Cultivate Good Working Relationships with Data Consumers by Ido Shlomo
- Demystify the Source and Illuminate the Data Pipeline by Meghan Kwartler
- How to Build Your Data Platform like a Product by Barr Moses and Atul Gupte
- Listen to Your Users—but Not Too Much by Amanda Tomlinson
- Tech Should Take a Back Seat for Data Project Success by Andrew Stevenson
- Ten Must-Ask Questions for Data-Engineering Projects by Haidar Hadi
- What to Do When You Don’t Get Any Credit by Jesse Anderson
- When to Talk and When to Listen by Steven Finkelstein
Feedback
We noticed a few things that could be improved.
The articles are in alphabetical order. I believe the book would have been better if they had been grouped or had taken the reader on an arc of some sort. For example, grouping the ones about metadata, discoverability, and column naming might have made a lot of sense.
I read the old-fashioned hard copy, but I was told by people using the Kindle version that the author pictures were random sizes. I assume each was the size the author sent in. They varied from very small to taking over the whole page, creating a disjointed experience. I don't know how such things work, but I feel like LaTeX might be involved.
Notes
I took short notes at the top of each article and then copied them to a spreadsheet, like any good data engineer.
# | Title | Notes |
---|---|---|
1 | A (Book) Case for Eventual Consistency | Strong vs eventual consistency |
2 | A/B and How to Be | Most are wrong. If it's working, be skeptical. Test system with A/A test. |
3 | About the Storage Layer | Efficiency details for queries |
4 | Analytics as the Secret Glue for Microservice Architectures | What to measure: company metrics, team metrics, experiment metrics |
5 | Automate Your Infrastructure | DevOps is good |
6 | Automate Your Pipeline Tests | Treating data engineering like software engineering. Open question: how to seed data in a staging environment? |
7 | Be Intentional About the Batching Model in Your Data Pipelines | Different batching models. Could we do better for Grouparoo? |
8 | Beware of Silver-Bullet Syndrome | Do not build your professional identity on a specific toolset. Be adaptable. |
9 | Building a Career as a Data Engineer | Skills: experience with the software lifecycle, SQL, open source |
10 | Business Dashboards for Data Pipelines | Dashboard and graphics help data quality |
11 | Caution: Data Science Projects Can Turn into the Emperor’s New Clothes | Projects: iterate, provide visibility, env for rapid changes, share scripts |
12 | Change Data Capture | Should Grouparoo use the WAL or other native CDC approaches? We handle the "_deleted" table approach already. |
13 | Column Names as Contracts | Standardize column names to minimize confusion |
14 | Consensual, Privacy-Aware Data Collection | At some point, should Grouparoo let properties be marked as PII? And what would it mean for a profile to opt out? What would that do? |
15 | Cultivate Good Working Relationships with Data Consumers | Practice empathy |
16 | Data Engineering != Spark | Data eng = Computation + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases |
17 | Data Engineering for Autonomy and Rapid Innovation | Sounds like the case of ELT |
18 | Data Engineering from a Data Scientist’s Perspective | Data engineering has gotten more complex recently |
19 | Data Pipeline Design Patterns for Reusability and Extensibility | Software design patterns apply to data engineering |
20 | Data Quality for Data Engineers | Implement common sense tests for data quality. What would that look like? |
21 | Data Security for Data Engineers | Think about the security of data |
22 | Data Validation Is More Than Summary Statistics | Quality testing requires context |
23 | Data Warehouses Are the Past, Present, and Future | Warehouses keep evolving to meet users' needs |
24 | Defining and Managing Messages in Log-Centric Architectures | Standardize message definitions in an evented system |
25 | Demystify the Source and Illuminate the Data Pipeline | Learn more about the sources of your data |
26 | Develop Communities, Not Just Code | Think about creating a data culture, not just a pipeline |
27 | Effective Data Engineering in the Cloud World | There are lots of pieces to work with these days |
28 | Embrace the Data Lake Architecture | Data lakes are scalable |
29 | Embracing Data Silos | Maybe it's not always right to get your data into one place. If so, find a way to abstract the silos to have one way to access it all. |
30 | Engineering Reproducible Data Science Projects | Follow engineering practices to have more dependable projects |
31 | Five Best Practices for Stable Data Processing | Rollback on error, keep data consistent |
32 | Focus on Maintainability and Break Up Those ETL Tasks | Do one step per transform to maintain simplicity |
33 | Friends Don’t Let Friends Do Dual-Writes | Use CDC events to write once and then chain to dependencies. |
34 | Fundamental Knowledge | Knowledge of fundamental concepts allows you to embrace change |
35 | Getting the “Structured” Back into SQL | Tips on writing SQL. |
36 | Give Data Products a Frontend with Latent Documentation | Document more to help everyone |
37 | How Data Pipelines Evolve | Build ELT at mid-range and move to data lakes when you need scale |
38 | How to Build Your Data Platform like a Product | PM your data with business. Increase visibility. |
39 | How to Prevent a Data Mutiny | Key trends: modular architecture, declarative configuration, automated systems |
40 | Know the Value per Byte of Your Data | Check if you are actually using your data |
41 | Know Your Latencies | Key questions: How old is the data? How fast are queries? How many concurrent queries can we handle? |
42 | Learn to Use a NoSQL Database, but Not like an RDBMS | Write answers to questions in NoSQL databases for fast access |
43 | Let the Robots Enforce the Rules | Work with people to standardize and use code to enforce rules |
44 | Listen to Your Users—but Not Too Much | Create a data team vision and strategy. Take requests and see how they fit into that. |
45 | Low-Cost Sensors and the Quality of Data | Order redundant equipment |
46 | Maintain Your Mechanical Sympathy | Sometimes it helps to understand underlying physics |
47 | Metadata ≥ Data | Plan your data strategy early and make discovery easy |
48 | Metadata Services as a Core Component of the Data Platform | Metadata helps discovery, security, and agility |
49 | Mind the Gap: Your Data Lake Provides No ACID Guarantees | Lakes are not databases |
50 | Modern Metadata for the Modern Data Stack | Metadata helps collaboration |
51 | Most Data Problems Are Not Big Data Problems | Most problems are best solved with a relational database |
52 | Moving from Software Engineering to Data Engineering | Switching from product eng to data eng can be fun and exciting |
53 | Observability for Data Engineers | Pillars of observability: freshness, distribution, volume, schema, lineage. "Lineage" sounds useful for Grouparoo. |
54 | Perfect Is the Enemy of Good | Make MVPs and iterate. |
55 | Pipe Dreams | Kafka was good because it had replaying of messages. |
56 | Preventing the Data Lake Abyss | Use data contracts and tools (Apache Avro or Google Protocol Buffers) to keep lakes under control |
57 | Prioritizing User Experience in Messaging Systems | Realtime data messaging creates better experiences |
58 | Privacy Is Your Problem | You can often still identify people even when PII is removed |
59 | QA and All Its Sexiness | Testing and QA are good. There are two types: practical and logical. |
60 | Seven Things Data Engineers Need to Watch Out for in ML Projects | Top issue: misunderstanding what a data attribute means. |
61 | Six Dimensions for Picking an Analytical Data Warehouse | Think about scalability, how it's priced, maintenance, and speed. |
62 | Small Files in a Big Data World | Having many small files on a system leads to wacky errors |
63 | Streaming Is Different from Batch | You have to think about things differently when streaming instead of batching. |
64 | Tardy Data | Consider storing a metadata column, such as the arrival_time of the data, so you know when to "go back" and process it. |
65 | Tech Should Take a Back Seat for Data Project Success | Focus on self-service and engaging business users to drive successful projects |
66 | Ten Must-Ask Questions for Data-Engineering Projects | Understand project parameters before you code |
67 | The Data Pipeline Is Not About Speed | Parallelization is now more important because of cloud horizontal scaling |
68 | The Dos and Don’ts of Data Engineering | Do DataOps to make things more reliable and agile, less heroic. |
69 | The End of ETL as We Know It | Use events from the product to notify data systems of changes. |
70 | The Haiku Approach to Writing Software | Understand constraints, start strong, keep it simple, and be creative. |
71 | The Hidden Cost of Data Input/Output | Storage choices impact performance. |
72 | The Holy War Between Proprietary and Open Source Is a Lie | Use tools that are best for your project and stay out of cargo cults. |
73 | The Implications of the CAP Theorem | Most common trade-off: Speed vs. consistency across nodes. |
74 | The Importance of Data Lineage | Tracking lineage helps answer questions when things go wrong. |
75 | The Many Meanings of Missingness | There are several reasons for a null value. It could be "correct" or an error. |
76 | The Six Words That Will Destroy Your Career | You lose credibility when the data is wrong. Test and monitor to keep it right. |
77 | The Three Invaluable Benefits of Open Source for Testing Data Quality | Use open source tools to maintain data quality |
78 | The Three Rs of Data Engineering | Data needs to be reliable. Other engineers must be able to reproduce your results. Build repeatable infrastructure. |
79 | The Two Types of Data Engineering and Data Engineers | Two types of data engineers: SQL (relational databases) and big data (Python, Hadoop) |
80 | The Yin and Yang of Big Data Scalability | Complex systems have many knobs to be tuned to maximize throughput. |
81 | Threading and Concurrency in Data Processing | You might hit OS limits when scaling servers |
82 | Three Important Distributed Programming Concepts | Concepts: Map/Reduce (Spark, Hadoop), shared memory (Redis), message passing (Kafka) |
83 | Time (Semantics) Won’t Wait | In event stream processing, there are tradeoffs between completeness and latency. Look into watermarks to control. |
84 | Tools Don’t Matter, Patterns and Practices Do | Focus on concepts, not tools. Ask "why" questions about new concepts to learn. |
85 | Total Opportunity Cost of Ownership | Going all in on a tool or paradigm might create problems as tech evolves. |
86 | Understanding the Ways Different Data Domains Solve Problems | Data science, infra, and eng teams have different goals and mindsets that influence their approach. |
87 | What Is a Data Engineer? Clue: We’re Data Science Enablers | Data engineers and scientists can work together to produce better results |
88 | What Is a Data Mesh, and How Not to Mesh It Up | You can have a data lake and many pipelines used by different business domains. |
89 | What Is Big Data? | Stay away from hype. Just get the job done. |
90 | What to Do When You Don’t Get Any Credit | To get credit, talk in terms of business value, not technology |
91 | When Our Data Science Team Didn’t Produce Value | Balance long-term solutions with short-term needs |
92 | When to Avoid the Naive Approach | Storage format and schema are good to get right from the beginning. |
93 | When to Be Cautious About Sharing Data | Maybe everyone shouldn't have access to data that requires expertise to interpret. |
94 | When to Talk and When to Listen | Smaller scope helps get things shipped more quickly. |
95 | Why Data Science Teams Need Generalists, Not Specialists | Specialization can slow things down. Lean towards full stack ownership. |
96 | With Great Data Comes Great Responsibility | Consider ethics while building data pipelines. |
97 | Your Data Tests Failed! Now What? | There are many possible reasons for a failed test. |
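My note on article 20 asks what common-sense data quality tests would look like. Here is a minimal sketch of one possible answer in Python. Everything in it is an assumption for illustration, not something from the book or from Grouparoo: the `check_rows` function is hypothetical, and so are the `id`, `email`, and `created_at` fields. It only checks a few obvious things: there is some data, keys are unique, a required field looks sane, and timestamps are not in the future.

```python
from datetime import datetime, timezone


def check_rows(rows, now=None):
    """Return a list of human-readable problems found in the rows.

    `rows` is a list of dicts with hypothetical `id`, `email`, and
    `created_at` keys. An empty result means every check passed.
    """
    now = now or datetime.now(timezone.utc)
    problems = []
    seen_ids = set()

    # An empty batch is usually a sign that something upstream broke.
    if not rows:
        problems.append("no rows: expected at least one")

    for index, row in enumerate(rows):
        # Primary keys should be present and unique.
        row_id = row.get("id")
        if row_id is None:
            problems.append(f"row {index}: missing id")
        elif row_id in seen_ids:
            problems.append(f"row {index}: duplicate id {row_id}")
        else:
            seen_ids.add(row_id)

        # A required field should not be null, blank, or obviously malformed.
        email = row.get("email")
        if not email or "@" not in email:
            problems.append(f"row {index}: invalid email {email!r}")

        # Timestamps should not be in the future.
        created_at = row.get("created_at")
        if created_at is not None and created_at > now:
            problems.append(f"row {index}: created_at is in the future")

    return problems
```

A pipeline step could call `check_rows(batch)` before loading and stop, or at least alert, when the returned list is not empty. Returning every problem instead of raising on the first one is a deliberate choice in this sketch: it makes it easier to see the whole picture when a test fails.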
See all of Brian Leonard's posts.
Brian is the CEO and co-founder of Grouparoo, an open source data framework that easily connects your data to business tools. Brian is a leader and technologist who enjoys hanging out with his family, traveling, learning new things, and building software that makes people's lives easier.
Learn more about Brian @ https://www.linkedin.com/in/brianl429