Book: Designing Data-Intensive Applications

December’s Book: Designing Data-Intensive Applications by Martin Kleppmann

Why this book: I want a deep dive into databases, and to think more along the lines of a software/systems engineer. 

Takeaway: this a huge, detailed textbook. I used a highlighter liberally. For this post, I want to create a quick list of common, critical questions & topics (basically, an outline!).

  1. Reliability, Scalability, Maintainability
  2. What is the load on the system (think bottleneck)?
  3. Relational Model vs. Document Model (RDMS vs. NoSQL)
  4.  What to index? 
  5. How to log?
  6. Compatibility – Backward, Forward
  7. Encoding… JSON, XML, Protocol Buffers, Avro, etc.
  8. Distributed Data – Scalability, Fault Tolerance/Availability, Latency
  9. Replication: single-leader, multi-leader, leaderless
  10. Replication – synchronous vs. async
  11. Partitioning (Sharding) combined with Replication
  12. ACID: Atomicity, Consistency, Isolation, Durability
  13. Dirty Reads, Dirty Writes
  14. Serialization 
  15. Distributed Systems problems: unreliable networks, faults/partial failures, timeouts, unreliable clocks, process pauses
  16. Consistency through Linearization
  17. Systems of Record vs. Derived Data Systems
  18. Batch Processing vs. Stream Processing

What kind of SQL … ??

Tech: Relational Database Management Systems

Challenge: I wanted to take a few moments to read, research, and write down the differences in relational databases. I have experienced this many times – when asking a colleague a high-level conceptual database question, the initial response is “what kind of database is it?”… I don’t always think that is relevant considering the question, but nonetheless I’d like to know the key advantages & disadvantages among these popular systems.


  • Advantages
    • light-weight, easy to embed in software
    • very fast read/write
    • no install/configure
    • minimal bugs
    • serverless
    • open-source
  • Disadvantages
    • restricted to 2GB in most cases
    • can only handle low volume transactions
    • no concurrent writes (bad for write-intensive)
    • not multi-user
    • no data type checking (can insert string into integer field)

MySQL (side note: LAMP-stack… Linux, Apache, MySQL, PHP)

  • Advantages
    • open source
    • most secure & reliable – used by WordPress, Facebook, Twitter
    • great data security & support for transactional processing (eCommerce)
    • configurable for flawless performance
    • optimized for web applications
    • can run on all major platforms, and supports apps in all popular languages
    • easy to learn
  • Disadvantages
    • might not be great for high concurrency levels
    • helmed by Oracle now… progress has slowed/halted
    • key features are applications and add-ons (text search, ACID compliance)
    • can be limited in areas such as warehousing, fault tolerance, performance diagnostics


  • Advantages
    • open source
    • exhaustive library & framework support
    • superior query optimizer (great for complex data models)
    • built-in NoSQL key-value store
    • can practically be used for any data problem situation
    • integration with Heroku
    • very reliable
  • Disadvantages
    • so expansive, can be tough to learn
    • slower than MySQL


  • Advantages
    • all instances backward compatible
    • high functionality for large data sets – many large international banks utilize Oracle
    • high data integrity (aces ACID test)
    • efficient data-recovery tech (Flashback)
    • supports cursors (making programming easier)
  • Disadvantages
    • expensive
    • complex
    • high level of expertise typically needed to properly administer

Microsoft SQL Server (MS SQL)

  • Advantages
    • enterprise level 
    • excellent data recovery
    • easy to install
    • great security features (security audits, events can be automatically written to log)
    • great scalability
    • integration w/ .NET framework
  • Disadvantages
    • expensive
    • limited compatibility (Windows-based)
    • uses custom core language



Wikipedia: RDMSs



The Power of… Backups!

Code: Heroku, ruby, activerecord

Challenge: database restoration! During my almost-daily updating of during the NBA season, I accidentally updated ALL the records instead of a specific record. Whoops. I was instantly crushed, angry, and rueful. But much like this blog, present-Luke is very thankful for past-Luke…. say hello to Heroku backups! Luckily, I had taken advantage of Heroku’s daily backup feature and just a few commands would restore the DB to its previous glory. (BTW, I realize consoling into the production database for updates is a bad idea. I have a backlog item to better automate this with a script & endpoint, but personal projects typically take a slower path.)

First, the infraction:



Let’s take a look at what’s available:



And restore!



Takeaway: although not a DB administrator, I will always take a moment to ensure a backup database plan is discussed & implemented for any project! Too dangerous otherwise. And always… Trust the Process.




Heroku: Scheduling DB Backups

Tech: Heroku, PostgreSQL, Rails, Ruby

Challenge: setup and verify production database backup


brew install heroku (if not already installed)

git remote add heroku

(you can find the remote URL in Settings, then Info)


run a one-time backup => heroku pg:backups:capture –app app_name

check last backup => heroku pg:backups:info


and scheduling a regular backup….

heroku pg:backups:schedule DATABASE_URL –at ’05:00 America/New_York’ –app app_name

(Note: DATABASE_URL is a Heroku config variable. If your CLI is properly configured to connect to Heroku, then the literal string ‘DATABASE_URL’ is the command. If you want to see the DB URL to back up, head to Settings, then Config Variables.)




view your backup info:

heroku pg:backups –app app_name

(A few days after writing this post, I checked back in on the progress…. )



bonus, set the timezone of the Heroku PG instance:

heroku config:add TZ=”America/New_York”