Book: Designing Data-Intensive Applications

December’s Book: Designing Data-Intensive Applications by Martin Kleppmann

Why this book: I want a deep dive into databases, and to think more along the lines of a software/systems engineer. 

Takeaway: this a huge, detailed textbook. I used a highlighter liberally. For this post, I want to create a quick list of common, critical questions & topics (basically, an outline!).

  1. Reliability, Scalability, Maintainability
  2. What is the load on the system (think bottleneck)?
  3. Relational Model vs. Document Model (RDMS vs. NoSQL)
  4.  What to index? 
  5. How to log?
  6. Compatibility – Backward, Forward
  7. Encoding… JSON, XML, Protocol Buffers, Avro, etc.
  8. Distributed Data – Scalability, Fault Tolerance/Availability, Latency
  9. Replication: single-leader, multi-leader, leaderless
  10. Replication – synchronous vs. async
  11. Partitioning (Sharding) combined with Replication
  12. ACID: Atomicity, Consistency, Isolation, Durability
  13. Dirty Reads, Dirty Writes
  14. Serialization 
  15. Distributed Systems problems: unreliable networks, faults/partial failures, timeouts, unreliable clocks, process pauses
  16. Consistency through Linearization
  17. Systems of Record vs. Derived Data Systems
  18. Batch Processing vs. Stream Processing

Book: Computer Science Distilled

November’s Book: Computer Science Distilled by Ferreira Filho

Why this book: I do not have a degree in computer science and this looks to be a good primer to get my mind grapes flowing. My college major, Operations & Information Systems Management, did explore some of these concepts but not in depth. I think it is time for a refresh and a deeper dive.

Final, Final Takeaways:

  • Each chapter ends with a reference list of books to further explore the chapter’s topics. 
  • The colophon states the cover image is from an 1845 schematic by Charles Babbage; the first programmable computer!


Chapter 1: Basics

  • Flowchart
    • states & instructions = rectangle
    • decision step = triangle
  • Factorials get BIG FAST. 10! = 3,628,800
  • Fun fact: human DNA has about 3 million base pairs, replicated in each of the 3 trillion cells of the human body
  • Final Takeaways
    • Great refresher on mathematical terms/concepts: truth tables, permutations (with/without identical items), combinations, probabilities: counting and independent, complementary, and mutually exclusive events. Definitely can use this chapter as a reference!
    • XKCD is a great comic
    • The Zebra Puzzle is fun

Chapter 2: Complexity

  • Two main components when considering what an algorithm will “cost”: time and memory (ongoing calculations take up space)
  • Exponential algorithms (2^n) are much more prohibitively expensive than quadratic algorithms (n^2)
  • Final Takeaways
    • I want to learn more about Big O notation – I recognize the term & idea, and feel it is the most vital content of the chapter 
    • I have a book in mind for December (Designing Data-Intensive Applications), but Big O would be a good topic for another month. Eventually, I’d like to be a software or systems engineer and obviously knowing more in this realm will be key!

Chapter 3: Strategy

  • merge algorithm O(n)
    • fixed number of operations
    • think 2 lists of fish, sorting alphabetically
  • power set algorithm O(2^n)
    • double the operations when input size increases by 1
    • power set = truth table
    • think fragrance combinations from a set of flowers
  • recursion
    • remember the base case
    • think palindrome or fibonacci function
    • recursion vs. iteration => it’s a trade-off
      • more expensive: recursion
      • faster speed: iteration
      • higher complexity: iteration
  • brute force (exhaustive search) O(n^2)
    • number of pairs in an interval increases quadratically as interval increases
    • think best trade: in one month, find the day where buying & selling nets the most profit. OR find the optimal buy/sell timeframe
  • backtracing
    • think 8 Queens Puzzle
    • use recursion… once you get a false value, rollback to the last true value and try again
    • “fail early, fail often”
  • heuristics
    • method that leads to a solution that is good enough
    • think chess (man vs. computer): after your first 4 moves, 288 billion possible positions, wow! find a move that is good enough.
    • Greed Approach
      • make best choice at each step, and don’t look back
      • think burglar filling a knapsack. No time to remove or consider items already in the knapsack
  • divide and conquer
    • look for optimal substructure and use recursion
  • memoization
    • think about knapsack… some calculations done repeatedly
    • store the result of the repeated calc
  • branch and bound
    • divide into subproblems
    • find upper & lower bounds of each subproblem
    • compare subproblem bounds w/ all branches
  • Final Takeaway
    • will be very helpful to return to this chapter when tackling a data-parsing problem

Chapter 4: Data

  • Abstract Data Types: how variables of a given data type are operated
    • stack – LI,FO
    • queue – FI,FO
    • list – more flexible than stack or queue; many available data operations
    • sorted list – fewer operators than list, but items always ordered
    • map – stores mappings with a key and value (kinda like a Hash?)
    • set – unordered group of unique items
  • Structures: how variables/data organized & accessed
    • array – instant access, sequential memory space, but can be impractical with large data sets
    • linked list – each item has a pointer to memory location of next item
    • double linked list – each item has pointers in both directions
      • for either linked list… can not find specific nth item in list
    • array vs. list => it’s a trade-off
      • faster insert/delete: list
      • insert/delete at random points: list
      • random, unordered data access: array
      • extreme performance access: array
    • tree
      • think about traversing HTML
      • nodes & edges
      • root node – no parent
      • leaf node – no children
      • height – level of deepest node
    • binary search tree
      • at most, each node can have only 2 children
      • left node < parent, right node > parent
      • binary heap – tree where parent must be greater (or smaller) than both child nodes
    • graph – no child, parent, or root node! most flexible structure!
    • hash – each item is given a memory position; still needs a large chuck of memory set aside
  • Final Takeaway
    • How is a hash different from a map? I know one falls under how data is operated, and the other organized/accessed, but the definitions feel very similar.

Chapter 5: Algorithms

  • Most important, an efficient algorithm likely exists already to solve your issue!
  • Lots of sorting algos: Selection, Insertion, Merge, Quick
  • Searching: Sequential, Binary, or use a Hash!
  • Graphs: Depth First Search vs. Breath First Search => it’s a trade-off!
    • DFS – down, using a Stack
    • BFS – across, using a Queue
    • Simple, less memory: DFS
    • DFS if need to explore all graph nodes
    • BFS if expected location is close to the root
  • classic problem: find the shortest path between nodes
    • try Djkistra Algorithm
    • uses a priority queue (as opposed to BFS, which is an auxiliary queue)
    • huge area? try Bidirectional Search
  • Google, PageRank Algorithm
    • modeled the web as a graph
    • web page = node, links = edges
    • the more edges a page has, the higher the rank! 
  • network, workflow, cost issues? Linear optimization problems are likely best solved with Simplex Method
  • Final Takeaways
    • I think the key here is first to see how your data is modeled, and then look for a algorithm that matches the problem at hand
    • Beware of choice! Successful algorithms can have pitfalls, drawbacks

Chapter 6: Databases

  • The definition of Normalize, Normalization is not what I have used/heard
    • book: The process of transforming a database with replicated data to one without
    • familiar: prepare and/or flatten data for ease of consumption
    • I think I have been wrong here (the very reason I am writing this blog & reading these books lol)!
  • An index is basically a self-balancing binary search tree
  • NoSQL (btw, it can be pronounced either way)
    • most widely known type: document store
      • a data entry contains all info app needs 
      • data entry: document
      • group of documents: collection
  • Graph Databases
    • data entry = node, relationship = edge
    • the most flexible type of DB!
    • data modeled like a network? Graph is likely best
  • Big Data: Volume, Velocity, Variety, (Variability, Veracity)
  • SQL vs. NoSQL => it’s a trade-off!
    • data-centered: SQL
    • maximize structure & eliminate duplication: SQL
    • application-centered: NoSQL
    • faster development: NoSQL
    • time spent on consistency: SQL
  • Distributed
    • Single-Master Replication
      • master receives all queries
      • forwards to slave (which contains a replica of DB)
    • Multi-Master Replication
      • load balancer distributes queries
      • all masters connected
      • write queries propagated amongst all masters
    • Sharding
      • each computer has portion of DB
      • query router send query to correct computer
      • use with replication to avoid one shard failing and having portion of DB unavailable
    • be wary of Data Consistency; Eventual Consistency (many writes among distribution take time to catch up & sync) might not be good enough
  • Serialization: SQL, XML, JSON, CSV

Chapter 7: Computers

  • bus: group of wires for transmitting data (think individual RAM component)
    • address bus: transmit address/location data (unidirectional)
    • data bus: transmit data to and from
  • register: internal memory cell in CPU
  • instruction set: collection of all operations
  • at core of never-ending CPU cycle is Program Counter (PC or BIOS)
    • stores the memory address of next instruction to be executed
    • special register
    • will hold immutable core logic for computer startup
  • CPU clock: number of basic operations per second
    • 2 MHz = two million operations/second
    • quad-core 2 GHz = close to a billion operations/second
  • bit architecture
    • 4 bit = processing binary number instructions up to 4 digits 
    • 8 bit = up to 8, 32 bit = 32 digits, etc
    • thus a 64-bit program can’t run on 32-bit
    • 64-bit register = 2^64 = over 17 billion gigabytes
  • endian
    • little-endian = store numbers left-to-right
    • big-endian = right-to-left
  • compiler
    • converts programming language into machine instructions
    • turing-complete: read/write data, performs conditional branching
    • scripting languages (JS, Ruby, Python) use an interpreter to skip compiling (much slower! but immediate code execution)
    • once compiled, original code is impossible to recover
    • BUT is possible to decode the binary (disassembly). this is reverse-engineering, and frequently how programs are hacked (think pirated software that bypasses auth/download code)
  • memory hierarchy
    • Processor-Memory Gap: RAM is slower
    • the following help bridge the gap
      • Temporal Locality: if used once, likely to be used again
      • Spatial Locality: when address used, near-by addresses likely to be used shortly
      • L1/L2/L3 Cache: contents of memory with high probability of being accessed
    • Main Memory (RAM) – primary
    • Main Storage (DISK) – secondary, could be tape or SSD
  • Final Takeaway

Chapter 8: Programming

  • Values are first-class citizens. I never thought of it this way; I always associated the term “first-class” with functions & JavaScript
  • Paradigms
    • Imperative
      • first!
      • exact instructions at each step with specific commands
      • Machine Code: mnemonic instructions (ex – CP, MOV using Assembly ASM)
      • StructuredGOTO, JUMP to control execution flow, eventually conditionals (for much better control)
      • Procedural: an advancement, allowing dry code & reusability
    • Declarative
      • what you want, not how
      • Functional:
        • functions are first-class, & thus higher-order functions
        • closures 
          • can “remember” a var, and access it in future 
          • allow for clean approach to global var
    • Logic: best for AI, natural language processing
  • Final Takeaway
    • Once, I failed an interview code challenge. I now know why because I did not understand or know about closures!