Book: Designing Data-Intensive Applications

December’s Book: Designing Data-Intensive Applications by Martin Kleppmann

Why this book: I want a deep dive into databases, and to think more along the lines of a software/systems engineer. 

Takeaway: this is a huge, detailed textbook. I used a highlighter liberally. For this post, I want to create a quick list of common, critical questions & topics (basically, an outline!).

  1. Reliability, Scalability, Maintainability
  2. What is the load on the system (think bottleneck)?
  3. Relational Model vs. Document Model (RDBMS vs. NoSQL)
  4. What to index?
  5. How to log?
  6. Compatibility – Backward, Forward
  7. Encoding… JSON, XML, Protocol Buffers, Avro, etc.
  8. Distributed Data – Scalability, Fault Tolerance/Availability, Latency
  9. Replication: single-leader, multi-leader, leaderless
  10. Replication – synchronous vs. async
  11. Partitioning (Sharding) combined with Replication
  12. ACID: Atomicity, Consistency, Isolation, Durability
  13. Dirty Reads, Dirty Writes
  14. Serializability
  15. Distributed Systems problems: unreliable networks, faults/partial failures, timeouts, unreliable clocks, process pauses
  16. Consistency through Linearizability
  17. Systems of Record vs. Derived Data Systems
  18. Batch Processing vs. Stream Processing

Pattern: Iteration within Iteration (pt. 2)

Tech: Ruby

Challenge: returning to iterating! Last time, I wanted the sub-iteration to skip records already looked at, as well as skip the index forward. Let’s gooo!


i = 0

patients.each_with_index do |row, index|
  next if i > index # skip rows already processed in the while loop

  # using while-equal makes it a little easier to understand than part 1’s solution
  while patients[i][:id] == patients[index][:id]
    # i increases; as long as its :id keeps matching the entry index’s :id, we are within the same patient’s records

    # do logic

    i += 1
    break unless patients[i] # no more records
  end
end




Takeaway: I have used both part 1’s solution and this one in production code. I wonder whether there is any scenario where part 1 would be the better choice (I think not, because it processes rows redundantly).
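To make the skip-ahead behavior concrete, here is a minimal runnable sketch of the loop above on made-up data. The sample rows, the `:location` key, and the `visits` and `last_location` hashes are assumptions for illustration, not the production code. It also assumes the rows are sorted so that each patient's records sit next to each other.

```ruby
# Hypothetical sample data: each patient's records are adjacent.
patients = [
  { id: 1, location: 'ER' },
  { id: 1, location: 'Ward A' },
  { id: 2, location: 'ICU' },
  { id: 3, location: 'ER' },
  { id: 3, location: 'Ward B' },
  { id: 3, location: 'Discharge' }
]

visits = Hash.new(0) # row index => number of times that row was processed
last_location = {}   # patient id => most recent location seen

i = 0
patients.each_with_index do |_row, index|
  next if i > index # row was already handled by an earlier sub-iteration

  # Guarding on patients[i] replaces the explicit break at the end of the data.
  while patients[i] && patients[i][:id] == patients[index][:id]
    visits[i] += 1
    last_location[patients[i][:id]] = patients[i][:location]
    i += 1
  end
end
```

With this data, every row index ends up in `visits` exactly once, which is the property part 1 lacks.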


Pattern: Iteration within Iteration (pt. 1)

Tech: Ruby

Challenge: need to iterate over a set of patient records…within the set, each patient has several records… so when the initial iteration gets to a new patient, the code needs to include a sub-iteration specific to the patient. For example, a patient’s subset of data includes their location & treatment history. We want to capture the patient’s last location, but their first instance of treatment. The data is extracted into a patient-specific data hash.


patients.each_with_index do |row, index|
  i = index
  until patients[i]['ID'] != patients[index]['ID']
    # note: this reads from row (the outer record), not patients[i]
    update_patient_hash(row['ID'], data)

    i += 1
    break unless patients[i] # no more records
  end
end




Takeaway: this could benefit from refactoring…

  1. The sub-iteration reprocesses the same record because it reads from row for data capture instead of patients[i]
  2. Could the initial iteration’s index be reset/moved to the new patient once the sub-iteration is finished?
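One possible way to address both points is to let Ruby group adjacent rows, so each row is visited once and there is no index to move by hand. The sketch below uses `Enumerable#chunk_while`. The sample rows, the `'Location'` and `'Treatment'` keys, and the `summarize_patients` helper are hypothetical stand-ins, not the original code. It assumes rows are already sorted so that a patient's records are adjacent.

```ruby
# Hypothetical helper: builds one summary hash per patient.
def summarize_patients(patients)
  patients
    .chunk_while { |a, b| a['ID'] == b['ID'] } # group consecutive rows with the same ID
    .each_with_object({}) do |records, summary|
      # First record that actually has a treatment, if any.
      first_treated = records.find { |r| r['Treatment'] }
      summary[records.first['ID']] = {
        last_location: records.last['Location'],
        first_treatment: first_treated && first_treated['Treatment']
      }
    end
end

# Made-up rows for illustration only.
patients = [
  { 'ID' => 1, 'Location' => 'ER',     'Treatment' => nil },
  { 'ID' => 1, 'Location' => 'Ward A', 'Treatment' => 'IV fluids' },
  { 'ID' => 1, 'Location' => 'Ward B', 'Treatment' => 'Antibiotics' },
  { 'ID' => 2, 'Location' => 'ICU',    'Treatment' => 'Ventilation' }
]

summary = summarize_patients(patients)
```

Here `summary[1]` holds the last location (`'Ward B'`) and the first treatment (`'IV fluids'`) for patient 1. The trade-off is that each patient's rows are gathered into an array before being summarized, which is fine for small groups but worth keeping in mind for very large ones.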