97 things every data engineer should know

Tagged in Company
By Brian Leonard on 2021-10-07

Last month, we decided that we should all read a book and talk about it as a company. It was a fun experience and I think we made a good choice by picking 97 Things Every Data Engineer Should Know.

This is the first book I have read in this series, and I liked the format. It is made up of 97 short vignettes of 2-3 pages each. Together they give a nice overview of the breadth of topics relevant to data engineering, including data warehouses/lakes, pipelines, metadata, security, compliance, quality, and working with other teams.

Themes

I was drawn to the articles that speak to a theme in the data world that I am passionate about: how data pipelines and data team practices are evolving to be more like traditional product development.

Reproducible pipelines

Automate Your Infrastructure by Christiano Anderson
Data Pipeline Design Patterns for Reusability and Extensibility by Mukul Sood
Engineering Reproducible Data Science Projects by Dr. Tianhui Michael Li
The Three Rs of Data Engineering by Tobias Macey

Data testing and quality

Automate Your Pipeline Tests by Tom White
Data Quality for Data Engineers by Katharine Jarmul
Data Validation Is More Than Summary Statistics by Emily Riederer
The Six Words That Will Destroy Your Career by Bartosz Mikulski
Your Data Tests Failed! Now What? by Sam Bail, PhD

Agile development and product management

Caution: Data Science Projects Can Turn into the Emperor’s New Clothes by Shweta Katre
Cultivate Good Working Relationships with Data Consumers by Ido Shlomo
Demystify the Source and Illuminate the Data Pipeline by Meghan Kwartler
How to Build Your Data Platform like a Product by Barr Moses and Atul Gupte
Listen to Your Users—but Not Too Much by Amanda Tomlinson
Tech Should Take a Back Seat for Data Project Success by Andrew Stevenson
Ten Must-Ask Questions for Data-Engineering Projects by Haidar Hadi
What to Do When You Don’t Get Any Credit by Jesse Anderson
When to Talk and When to Listen by Steven Finkelstein

Feedback

We noticed a few things that could be improved.

The articles are in alphabetical order. I believe the book would have been better if they had been grouped or had taken the reader on an arc of some sort. For example, grouping the ones about metadata, discoverability, and column naming might have made a lot of sense.

I read the old-fashioned hard copy, but people using the Kindle version told me that the author pictures were of random sizes. I assume each was the size the author sent in. They varied from very small to taking over the whole page, creating a disjointed experience. I don't know how such things work, but I feel like LaTeX might be involved.

Notes

I took short notes at the top of each article and then copied them to a spreadsheet, like any good data engineer.

#	Title	Notes
1	A (Book) Case for Eventual Consistency	Strong vs eventual consistency
2	A/B and How to Be	Most are wrong. If it's working, be skeptical. Test system with A/A test.
3	About the Storage Layer	Efficiency details for queries
4	Analytics as the Secret Glue for Microservice Architectures	What to measure: company metrics, team metrics, experiment metrics
5	Automate Your Infrastructure	DevOps is good
6	Automate Your Pipeline Tests	Treating data engineering like software engineering. Open question: how to seed data in a staging environment?
7	Be Intentional About the Batching Model in Your Data Pipelines	Different batching models. Could we do better for Grouparoo?
8	Beware of Silver-Bullet Syndrome	Do not build your professional identity on a specific toolset. Be adaptable.
9	Building a Career as a Data Engineer	Skills: experience with the software lifecycle, SQL, open source
10	Business Dashboards for Data Pipelines	Dashboard and graphics help data quality
11	Caution: Data Science Projects Can Turn into the Emperor’s New Clothes	Projects: iterate, provide visibility, env for rapid changes, share scripts
12	Change Data Capture	Should Grouparoo use the WAL or other native CDC approaches? We handle the "_deleted" table approach already.
13	Column Names as Contracts	Standardize column names to minimize confusion
14	Consensual, Privacy-Aware Data Collection	At some point, should Grouparoo let properties be marked as PII and define what it means for a profile to opt out? What would that do?
15	Cultivate Good Working Relationships with Data Consumers	Practice empathy
16	Data Engineering != Spark	Data eng = Computation + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases
17	Data Engineering for Autonomy and Rapid Innovation	Sounds like the case of ELT
18	Data Engineering from a Data Scientist’s Perspective	Data engineering has gotten more complex recently
19	Data Pipeline Design Patterns for Reusability and Extensibility	Software design patterns apply to data engineering
20	Data Quality for Data Engineers	Implement common sense tests for data quality. What would that look like?
21	Data Security for Data Engineers	Think about the security of data
22	Data Validation Is More Than Summary Statistics	Quality testing requires context
23	Data Warehouses Are the Past, Present, and Future	Warehouses keep evolving to meet users' needs
24	Defining and Managing Messages in Log-Centric Architectures	Standardize message definitions in an evented system
25	Demystify the Source and Illuminate the Data Pipeline	Learn more about the sources of your data
26	Develop Communities, Not Just Code	Think about creating a data culture, not just a pipeline
27	Effective Data Engineering in the Cloud World	There are lots of pieces to work with these days
28	Embrace the Data Lake Architecture	Data lakes are scalable
29	Embracing Data Silos	Maybe it's not always right to get your data into one place. If so, find a way to abstract the silos to have one way to access it all.
30	Engineering Reproducible Data Science Projects	Follow engineering practices to have more dependable projects
31	Five Best Practices for Stable Data Processing	Rollback on error, keep data consistent
32	Focus on Maintainability and Break Up Those ETL Tasks	Do one step per transform to maintain simplicity
33	Friends Don’t Let Friends Do Dual-Writes	Use CDC events to write once and then chain to dependencies.
34	Fundamental Knowledge	Knowledge of fundamental concepts allows you to embrace change
35	Getting the “Structured” Back into SQL	Tips on writing SQL.
36	Give Data Products a Frontend with Latent Documentation	Document more to help everyone
37	How Data Pipelines Evolve	Build ELT at mid-range and move to data lakes when you need scale
38	How to Build Your Data Platform like a Product	PM your data with business. Increase visibility.
39	How to Prevent a Data Mutiny	Key trends: modular architecture, declarative configuration, automated systems
40	Know the Value per Byte of Your Data	Check if you are actually using your data
41	Know Your Latencies	Key questions: How old is the data? How fast are queries? How many concurrent queries can we handle?
42	Learn to Use a NoSQL Database, but Not like an RDBMS	Write answers to questions in NoSQL databases for fast access
43	Let the Robots Enforce the Rules	Work with people to standardize and use code to enforce rules
44	Listen to Your Users—but Not Too Much	Create a data team vision and strategy. Take requests and see how they fit into that.
45	Low-Cost Sensors and the Quality of Data	Order redundant equipment
46	Maintain Your Mechanical Sympathy	Sometimes it helps to understand underlying physics
47	Metadata ≥ Data	Plan your data strategy early and make discovery easy
48	Metadata Services as a Core Component of the Data Platform	Metadata helps discovery, security, and agility
49	Mind the Gap: Your Data Lake Provides No ACID Guarantees	Lakes are not databases
50	Modern Metadata for the Modern Data Stack	Metadata helps collaboration
51	Most Data Problems Are Not Big Data Problems	Most problems are best solved with a relational database
52	Moving from Software Engineering to Data Engineering	Switching from product eng to data eng can be fun and exciting
53	Observability for Data Engineers	Pillars of observability: freshness, distribution, volume, schema, lineage. "Lineage" sounds useful for Grouparoo.
54	Perfect Is the Enemy of Good	Make MVPs and iterate.
55	Pipe Dreams	Kafka was good because it could replay messages.
56	Preventing the Data Lake Abyss	Use data contracts and tools (Apache Avro or Google Protocol Buffers) to keep lakes under control
57	Prioritizing User Experience in Messaging Systems	Realtime data messaging creates better experiences
58	Privacy Is Your Problem	You can often still identify people even when PII is removed
59	QA and All Its Sexiness	Testing and QA are good. There are two types: practical and logical.
60	Seven Things Data Engineers Need to Watch Out for in ML Projects	Top issue: misunderstanding what a data attribute means.
61	Six Dimensions for Picking an Analytical Data Warehouse	Think about scalability, how it's priced, maintenance, and speed.
62	Small Files in a Big Data World	Having many small files on a system leads to wacky errors
63	Streaming Is Different from Batch	You have to think about things differently when streaming instead of batching.
64	Tardy Data	Consider storing a metadata column with the data's arrival_time so you know when to "go back" and process it.
65	Tech Should Take a Back Seat for Data Project Success	Focus on self-service and engaging business users to drive successful projects
66	Ten Must-Ask Questions for Data-Engineering Projects	Understand project parameters before you code
67	The Data Pipeline Is Not About Speed	Parallelization is now more important because of cloud horizontal scaling
68	The Dos and Don’ts of Data Engineering	Do DataOps to make things more reliable and agile, less heroic.
69	The End of ETL as We Know It	Use events from the product to notify data systems of changes.
70	The Haiku Approach to Writing Software	Understand constraints, start strong, keep it simple, and be creative.
71	The Hidden Cost of Data Input/Output	Storage choices impact performance.
72	The Holy War Between Proprietary and Open Source Is a Lie	Use tools that are best for your project and stay out of cargo cults.
73	The Implications of the CAP Theorem	Most common trade-off: Speed vs. consistency across nodes.
74	The Importance of Data Lineage	Tracking lineage helps answer questions when things go wrong.
75	The Many Meanings of Missingness	There are several reasons for a null value. It could be "correct" or an error.
76	The Six Words That Will Destroy Your Career	You lose credibility when the data is wrong. Test and monitor to keep it right.
77	The Three Invaluable Benefits of Open Source for Testing Data Quality	Use open source tools to maintain data quality
78	The Three Rs of Data Engineering	Data needs to be reliable. Other engineers must be able to reproduce your results. Build repeatable infrastructure.
79	The Two Types of Data Engineering and Data Engineers	Two types of data engineers: SQL (relational databases) and big data (Python, Hadoop)
80	The Yin and Yang of Big Data Scalability	Complex systems have many knobs to be tuned to maximize throughput.
81	Threading and Concurrency in Data Processing	You might hit OS limits when scaling servers
82	Three Important Distributed Programming Concepts	Concepts: Map/Reduce (Spark, Hadoop), shared memory (Redis), message passing (Kafka)
83	Time (Semantics) Won’t Wait	In event stream processing, there are tradeoffs between completeness and latency. Look into watermarks to control.
84	Tools Don’t Matter, Patterns and Practices Do	Focus on concepts, not tools. Ask "why" questions about new concepts to learn.
85	Total Opportunity Cost of Ownership	Going all in on a tool or paradigm might create problems as tech evolves.
86	Understanding the Ways Different Data Domains Solve Problems	Data science, infra, and eng teams have different goals and mindsets that influence their approach.
87	What Is a Data Engineer? Clue: We’re Data Science Enablers	Data engineers and scientists can work together to produce better results
88	What Is a Data Mesh, and How Not to Mesh It Up	You can have a data lake and many pipelines used by different business domains.
89	What Is Big Data?	Stay away from hype. Just get the job done.
90	What to Do When You Don’t Get Any Credit	To get credit, talk in terms of business value, not technology
91	When Our Data Science Team Didn’t Produce Value	Balance long-term solutions with short-term needs
92	When to Avoid the Naive Approach	Storage format and schema are good to get right from the beginning.
93	When to Be Cautious About Sharing Data	Maybe everyone shouldn't have access to data that requires expertise to interpret.
94	When to Talk and When to Listen	Smaller scope helps get things shipped more quickly.
95	Why Data Science Teams Need Generalists, Not Specialists	Specialization can slow things down. Lean towards full stack ownership.
96	With Great Data Comes Great Responsibility	Consider ethics while building data pipelines.
97	Your Data Tests Failed! Now What?	There are many possible reasons for a failed test.

Brian is the CEO and co-founder of Grouparoo, an open source data framework that easily connects your data to business tools. Brian is a leader and technologist who enjoys hanging out with his family, traveling, learning new things, and building software that makes people's lives easier.

Learn more about Brian @ https://www.linkedin.com/in/brianl429