
Book: Designing Data-Intensive Applications

December’s Book: Designing Data-Intensive Applications by Martin Kleppmann

Why this book: I want a deep dive into databases, and to think more along the lines of a software/systems engineer. 

Takeaway: this is a huge, detailed textbook. I used a highlighter liberally. For this post, I want to create a quick list of common, critical questions & topics (basically, an outline!).

  1. Reliability, Scalability, Maintainability
  2. What is the load on the system (think bottleneck)?
  3. Relational Model vs. Document Model (RDBMS vs. NoSQL)
  4. What to index?
  5. How to log?
  6. Compatibility – Backward, Forward
  7. Encoding… JSON, XML, Protocol Buffers, Avro, etc.
  8. Distributed Data – Scalability, Fault Tolerance/Availability, Latency
  9. Replication: single-leader, multi-leader, leaderless
  10. Replication – synchronous vs. async
  11. Partitioning (Sharding) combined with Replication
  12. ACID: Atomicity, Consistency, Isolation, Durability
  13. Dirty Reads, Dirty Writes
  14. Serializability
  15. Distributed Systems problems: unreliable networks, faults/partial failures, timeouts, unreliable clocks, process pauses
  16. Consistency through Linearizability
  17. Systems of Record vs. Derived Data Systems
  18. Batch Processing vs. Stream Processing

What kind of SQL … ??

Tech: Relational Database Management Systems

Challenge: I wanted to take a few moments to read, research, and write down the differences among relational databases. I have experienced this many times: when I ask a colleague a high-level conceptual database question, the initial response is “what kind of database is it?”… I don’t always think that is relevant to the question, but nonetheless I’d like to know the key advantages & disadvantages among these popular systems.

SQLite

  • Advantages
    • light-weight, easy to embed in software
    • very fast read/write
    • no install/configure
    • minimal bugs
    • serverless
    • open-source
  • Disadvantages
    • practical size limits (often cited as ~2 GB, though modern versions support far larger databases)
    • can only handle low volume transactions
    • no concurrent writes (bad for write-intensive)
    • not multi-user
    • no data type checking (can insert a string into an integer field; sketched below)
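
For the type-checking point, a minimal sketch in Ruby, assuming the sqlite3 gem is installed (the table and column names are my own illustration):

require 'sqlite3'

db = SQLite3::Database.new(':memory:') # serverless: nothing to install or configure
db.execute('CREATE TABLE readings (id INTEGER, value INTEGER)')

# SQLite's flexible typing happily accepts a string in an INTEGER column
db.execute('INSERT INTO readings (id, value) VALUES (?, ?)', [1, 'not a number'])
puts db.execute('SELECT value FROM readings').inspect # => [["not a number"]]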

 
MySQL (side note: LAMP-stack… Linux, Apache, MySQL, PHP)

  • Advantages
    • open source
    • strong reputation for security & reliability: used by WordPress, Facebook, Twitter
    • great data security & support for transactional processing (eCommerce)
    • highly configurable for performance tuning
    • optimized for web applications
    • can run on all major platforms, and supports apps in all popular languages
    • easy to learn
  • Disadvantages
    • might not be great for high concurrency levels
    • helmed by Oracle now… progress has slowed/halted
    • some key features depend on specific storage engines or add-ons (e.g., full-text search, full ACID compliance)
    • can be limited in areas such as warehousing, fault tolerance, performance diagnostics

 
PostgreSQL

  • Advantages
    • open source
    • exhaustive library & framework support
    • superior query optimizer (great for complex data models)
    • built-in NoSQL support (key-value hstore, JSON documents); see the sketch after this list
    • can be used for practically any data problem
    • integration with Heroku
    • very reliable
  • Disadvantages
    • so expansive, can be tough to learn
    • historically slower than MySQL on simple, read-heavy workloads
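
A minimal sketch of the NoSQL-in-Postgres point from the list above, assuming the pg gem and a local database named demo (both are my assumptions, not from the original post):

require 'pg'

conn = PG.connect(dbname: 'demo')
conn.exec('CREATE TABLE IF NOT EXISTS docs (id serial, data jsonb)')
conn.exec_params('INSERT INTO docs (data) VALUES ($1)',
                 ['{"name": "Ada", "role": "engineer"}'])

# query inside the JSON document with PostgreSQL's ->> operator
res = conn.exec("SELECT data->>'name' FROM docs")
puts res.getvalue(0, 0) # => "Ada"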

 
Oracle

  • Advantages
    • all instances backward compatible
    • high functionality for large data sets – many large international banks utilize Oracle
    • high data integrity (aces ACID test)
    • efficient data-recovery tech (Flashback)
    • supports cursors (making programming easier)
  • Disadvantages
    • expensive
    • complex
    • high level of expertise typically needed to properly administer

 
Microsoft SQL Server (MS SQL)

  • Advantages
    • enterprise level 
    • excellent data recovery
    • easy to install
    • great security features (security audits, events can be automatically written to log)
    • great scalability
    • integration w/ .NET framework
  • Disadvantages
    • expensive
    • limited compatibility (historically Windows-only; Linux support arrived with SQL Server 2017)
    • uses its own SQL dialect (T-SQL)

 

Reference:

Wikipedia: RDBMS

 


Book: Computer Science Distilled

November’s Book: Computer Science Distilled by Ferreira Filho

Why this book: I do not have a degree in computer science and this looks to be a good primer to get my mind grapes flowing. My college major, Operations & Information Systems Management, did explore some of these concepts but not in depth. I think it is time for a refresh and a deeper dive.

Final, Final Takeaways:

  • Each chapter ends with a reference list of books to further explore the chapter’s topics. 
  • The colophon states the cover image is from an 1845 schematic by Charles Babbage for the first programmable computer!

Notes/Thoughts:

Chapter 1: Basics

  • Flowchart
    • states & instructions = rectangle
    • decision step = diamond
  • Factorials get BIG FAST. 10! = 3,628,800
  • Fun fact: human DNA has about 3 billion base pairs, replicated in each of the trillions of cells of the human body
  • Final Takeaways
    • Great refresher on mathematical terms/concepts: truth tables, permutations (with/without identical items), combinations, probabilities: counting and independent, complementary, and mutually exclusive events. Definitely can use this chapter as a reference; see the quick sketch after this list!
    • XKCD is a great comic
    • The Zebra Puzzle is fun
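
A quick Ruby sketch of the counting concepts, using the built-in Array#permutation and Array#combination (the flower list is my own toy example):

factorial = ->(n) { (1..n).reduce(1, :*) }
puts factorial.call(10)           # => 3628800 (BIG FAST indeed)

flowers = %w[rose lily tulip daisy]
puts flowers.permutation(2).count # ordered pairs: 4!/2! = 12
puts flowers.combination(2).count # unordered pairs: C(4,2) = 6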

Chapter 2: Complexity

  • Two main components when considering what an algorithm will “cost”: time and memory (ongoing calculations take up space)
  • Exponential algorithms (2^n) become prohibitively expensive far sooner than quadratic algorithms (n^2); see the quick sketch after this list
  • Final Takeaways
    • I want to learn more about Big O notation – I recognize the term & idea, and feel it is the most vital content of the chapter 
    • I have a book in mind for December (Designing Data-Intensive Applications), but Big O would be a good topic for another month. Eventually, I’d like to be a software or systems engineer and obviously knowing more in this realm will be key!
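
A tiny sketch of that trade-off, printing how n^2 and 2^n diverge as n grows:

(1..5).each do |i|
  n = i * 10
  puts "n=#{n}: n^2 = #{n**2}, 2^n = #{2**n}"
end
# n=10: n^2 = 100,  2^n = 1024
# n=50: n^2 = 2500, 2^n = 1125899906842624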

Chapter 3: Strategy

  • merge algorithm O(n)
    • a fixed number of operations per input item
    • think 2 lists of fish, sorting alphabetically
  • power set algorithm O(2^n)
    • double the operations when input size increases by 1
    • power set = truth table
    • think fragrance combinations from a set of flowers
  • recursion
    • remember the base case
    • think palindrome or fibonacci function (sketched after this list)
    • recursion vs. iteration => it’s a trade-off
      • more expensive: recursion
      • faster: iteration
      • more complex code: iteration
  • brute force (exhaustive search) O(n^2)
    • number of pairs in an interval increases quadratically as interval increases
    • think best trade: in one month, find the day where buying & selling nets the most profit. OR find the optimal buy/sell timeframe
  • backtracking
    • think 8 Queens Puzzle
    • use recursion… once you get a false value, rollback to the last true value and try again
    • “fail early, fail often”
  • heuristics
    • method that leads to a solution that is good enough
    • think chess (man vs. computer): after your first 4 moves, 288 billion possible positions, wow! find a move that is good enough.
    • Greedy Approach
      • make best choice at each step, and don’t look back
      • think burglar filling a knapsack. No time to remove or consider items already in the knapsack
  • divide and conquer
    • look for optimal substructure and use recursion
  • memoization
    • think about knapsack… some calculations done repeatedly
    • store the result of the repeated calc
  • branch and bound
    • divide into subproblems
    • find upper & lower bounds of each subproblem
    • compare subproblem bounds w/ all branches
  • Final Takeaway
    • will be very helpful to return to this chapter when tackling a data-parsing problem
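
A Ruby sketch tying a few of these together: naive recursion versus memoization, using the classic Fibonacci function:

def fib(n)
  return n if n < 2 # remember the base case!
  fib(n - 1) + fib(n - 2) # exponential: recomputes the same values over and over
end

def fib_memo(n, memo = {})
  return n if n < 2
  memo[n] ||= fib_memo(n - 1, memo) + fib_memo(n - 2, memo) # store the repeated calc
end

puts fib_memo(40) # => 102334155, effectively instant; plain fib(40) takes noticeably longer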

Chapter 4: Data

  • Abstract Data Types: how variables of a given data type are operated
    • stack – LIFO (last in, first out)
    • queue – FIFO (first in, first out); see the quick sketch after this list
    • list – more flexible than stack or queue; many available data operations
    • sorted list – fewer operators than list, but items always ordered
    • map – stores mappings with a key and value (kinda like a Hash?)
    • set – unordered group of unique items
  • Structures: how variables/data organized & accessed
    • array – instant access, sequential memory space, but can be impractical with large data sets
    • linked list – each item has a pointer to memory location of next item
    • doubly linked list – each item has pointers in both directions
      • for either linked list… cannot jump straight to the nth item (must walk the pointers)
    • array vs. list => it’s a trade-off
      • faster insert/delete: list
      • insert/delete at random points: list
      • random, unordered data access: array
      • extreme performance access: array
    • tree
      • think about traversing HTML
      • nodes & edges
      • root node – no parent
      • leaf node – no children
      • height – level of deepest node
    • binary search tree
      • at most, each node can have only 2 children
      • left node < parent, right node > parent
      • binary heap – tree where parent must be greater (or smaller) than both child nodes
    • graph – no child, parent, or root node! most flexible structure!
    • hash – each item is given a memory position; still needs a large chunk of memory set aside
  • Final Takeaway
    • How is a hash different from a map? I know one falls under how data is operated, and the other organized/accessed, but the definitions feel very similar. (My best answer so far: a map is the abstract type defining key-value operations; a hash table is one concrete structure commonly used to implement it.)
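
The quick sketch promised above: in Ruby, a plain Array can play the part of several of these abstract data types:

stack = []
stack.push(1)
stack.push(2)
stack.pop    # => 2 (LIFO: last in, first out)

queue = []
queue.push(1)
queue.push(2)
queue.shift  # => 1 (FIFO: first in, first out)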

Chapter 5: Algorithms

  • Most importantly, an efficient algorithm likely exists already to solve your issue!
  • Lots of sorting algos: Selection, Insertion, Merge, Quick
  • Searching: Sequential, Binary, or use a Hash!
  • Graphs: Depth-First Search vs. Breadth-First Search => it’s a trade-off!
    • DFS – down, using a Stack
    • BFS – across, using a Queue (see the sketch after this list)
    • Simple, less memory: DFS
    • DFS if need to explore all graph nodes
    • BFS if expected location is close to the root
  • classic problem: find the shortest path between nodes
    • try Dijkstra’s Algorithm
    • uses a priority queue (as opposed to BFS, which uses an auxiliary queue)
    • huge area? try Bidirectional Search
  • Google, PageRank Algorithm
    • modeled the web as a graph
    • web page = node, links = edges
    • the more inbound edges (links) a page has, the higher the rank! 
  • network, workflow, cost issues? Linear optimization problems are likely best solved with Simplex Method
  • Final Takeaways
    • I think the key here is first to see how your data is modeled, and then look for an algorithm that matches the problem at hand
    • Beware of choice! Successful algorithms can have pitfalls, drawbacks
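
A Ruby sketch of the BFS/DFS trade-off (the toy graph is my own): the only structural difference is queue vs. stack.

graph = { a: [:b, :c], b: [:d], c: [:d], d: [] }

def bfs(graph, start)
  visited = []
  queue = [start]
  until queue.empty?
    node = queue.shift # shift = queue (across); pop would make this depth-first
    next if visited.include?(node)
    visited << node
    queue.concat(graph[node])
  end
  visited
end

puts bfs(graph, :a).inspect # => [:a, :b, :c, :d]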

Chapter 6: Databases

  • The definition of Normalize, Normalization is not what I have used/heard
    • book: The process of transforming a database with replicated data to one without
    • familiar: prepare and/or flatten data for ease of consumption
    • I think I have been wrong here (the very reason I am writing this blog & reading these books lol)!
  • An index is basically a self-balancing binary search tree
  • NoSQL (btw, it can be pronounced either way)
    • most widely known type: document store
      • a data entry contains all info app needs 
      • data entry: document
      • group of documents: collection
  • Graph Databases
    • data entry = node, relationship = edge
    • the most flexible type of DB!
    • data modeled like a network? Graph is likely best
  • Big Data: Volume, Velocity, Variety, (Variability, Veracity)
  • SQL vs. NoSQL => it’s a trade-off!
    • data-centered: SQL
    • maximize structure & eliminate duplication: SQL
    • application-centered: NoSQL
    • faster development: NoSQL
    • time spent on consistency: SQL
  • Distributed
    • Single-Master Replication
      • master receives all queries
      • forwards to slave (which contains a replica of DB)
    • Multi-Master Replication
      • load balancer distributes queries
      • all masters connected
      • write queries propagated amongst all masters
    • Sharding
      • each computer has portion of DB
      • query router sends each query to the correct computer
      • use with replication to avoid one shard failing and having portion of DB unavailable
    • be wary of Data Consistency; Eventual Consistency (many writes among distribution take time to catch up & sync) might not be good enough
  • Serialization: SQL, XML, JSON, CSV (sketched below)
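
A quick Ruby sketch of two of those serialization formats, using only the standard library (the record is my own toy example):

require 'json'
require 'csv'

record = { 'id' => 7, 'name' => 'Ada' }

json = JSON.generate(record) # => {"id":7,"name":"Ada"}
JSON.parse(json)             # round-trips back to a Hash

csv = CSV.generate { |out| out << record.keys << record.values }
# => "id,name\n7,Ada\n"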

Chapter 7: Computers

  • bus: group of wires for transmitting data (think individual RAM component)
    • address bus: transmit address/location data (unidirectional)
    • data bus: transmit data to and from
  • register: internal memory cell in CPU
  • instruction set: collection of all operations
  • at the core of the never-ending CPU cycle is the Program Counter (PC)
    • a special register
    • stores the memory address of the next instruction to be executed
    • the BIOS (separate, permanent memory) holds the immutable core logic for computer startup
  • CPU clock: number of basic operations per second
    • 2 MHz = two million operations/second
    • quad-core 2 GHz = two billion cycles/second per core, approaching eight billion operations/second overall
  • bit architecture
    • 4 bit = processing binary number instructions up to 4 digits 
    • 8 bit = up to 8, 32 bit = 32 digits, etc
    • thus a 64-bit program can’t run on 32-bit
    • 64-bit addresses can reference 2^64 bytes = over 17 billion gigabytes of memory
  • endian (see the sketch after this list)
    • big-endian = most significant byte stored first (the order we write numbers)
    • little-endian = least significant byte stored first
  • compiler
    • converts programming language into machine instructions
    • turing-complete: read/write data, performs conditional branching
    • scripting languages (JS, Ruby, Python) use an interpreter to skip compiling (much slower! but immediate code execution)
    • once compiled, original code is impossible to recover
    • BUT is possible to decode the binary (disassembly). this is reverse-engineering, and frequently how programs are hacked (think pirated software that bypasses auth/download code)
  • memory hierarchy
    • Processor-Memory Gap: RAM is slower
    • the following help bridge the gap
      • Temporal Locality: if used once, likely to be used again
      • Spatial Locality: when address used, near-by addresses likely to be used shortly
      • L1/L2/L3 Cache: contents of memory with high probability of being accessed
    • Main Memory (RAM) – primary
    • Main Storage (DISK) – secondary, could be tape or SSD
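
The endian sketch promised above, in Ruby via Array#pack ('N' is 32-bit big-endian, 'V' is 32-bit little-endian):

n = 0x12345678
big    = [n].pack('N').bytes.map { |b| b.to_s(16) }
little = [n].pack('V').bytes.map { |b| b.to_s(16) }
puts big.inspect    # => ["12", "34", "56", "78"]  most significant byte first
puts little.inspect # => ["78", "56", "34", "12"]  least significant byte first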

Chapter 8: Programming

  • Values are first-class citizens. I never thought of it this way; I always associated the term “first-class” with functions & JavaScript
  • Paradigms
    • Imperative
      • first!
      • exact instructions at each step with specific commands
      • Machine Code: mnemonic instructions (ex – CP, MOV using Assembly ASM)
      • Structured: GOTO, JUMP to control execution flow, eventually conditionals (for much better control)
      • Procedural: an advancement, allowing DRY code & reusability
    • Declarative
      • what you want, not how
      • Functional:
        • functions are first-class, & thus higher-order functions
        • closures (sketched after this list)
          • can “remember” a var, and access it in the future
          • a cleaner alternative to global vars
      • Logic: best for AI, natural language processing
  • Final Takeaway
    • Once, I failed an interview code challenge. I now know why: I did not understand closures!
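
The closure sketch promised above, in Ruby (the counter example is my own): the lambda “remembers” count from the scope where it was born.

def counter
  count = 0         # local to this call...
  -> { count += 1 } # ...but captured by the lambda
end

tick = counter
tick.call # => 1
tick.call # => 2, the lambda still sees (and mutates) count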

Bug Troubleshooting w/ New Tools

Tech: Ruby, TigerConnect, Splunk, SSH, CSSHX, grep, curl, HTTParty

Challenge: an application is configured to log an event in Splunk after a successful send of a TigerConnect HIPAA-compliant alert message. The entry point is working, however the alerts are not being sent AND the event is not being logged. After successful troubleshooting to determine what’s NOT wrong, I turned to CSSHX and grep to scour the logs. Why CSSHX? Our production instances run on 4 servers concurrently, and I want to quickly navigate in 1 terminal tab w/ 4 windows and grep the logs!

Code:

CLI (for 4 instances) => csshx username@dserver.location.extension username@dserver.location.extension username@dserver.location.extension username@dserver.location.extension

 

grep for various strings => grep -A 10 "error" log_file.log

-A # is for number of lines after the string

-B # is for number of lines before

 

I also wanted to check my HTTParty gem code that transmits the event from app to Splunk. I used a curl statement to mimic the HTTParty call.

CLI => curl https://url.extension:####/services/collector -k -H 'content-type: application/json' -H 'authorization: XXXXXX' -d '{"event":{"app": "data"}, "sourcetype": "_json"}'

-k (--insecure) allows for insecure server connections

-H sets a request header (key: value)

 

I was quite proud to implement the use of these tools when troubleshooting. I have used them in different contexts and it was cool to bring everything together to discover the issue. However (and sadly), the issue was much simpler.

Ruby ENV variables are strings. Even if the string looks like a “boolean”.

in .env … VAR=false

if ENV['VAR'] == false then <do something> end

=> never executes… ENV['VAR'] is the string "false", not the boolean false, so the comparison always fails. Compare against the string instead: ENV['VAR'] == 'false'.
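
A small sketch of how I might guard against this now (the helper name is my own, not from the original code):

def env_true?(name)
  ENV[name].to_s.downcase == 'true' # ENV values are always strings
end

# with VAR=false in .env
env_true?('VAR') # => false, as intended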

 

Pattern: Iteration within Iteration (pt. 2)

Tech: Ruby

Challenge: returning to iterating! Last time, I wanted the sub-iteration to skip records already looked at, as well as skip the index forward. Let’s gooo!

Code:

i = 0

patients.each_with_index do |row, index|
  next if i > index # skip rows already processed in the while loop below

  # using while-equal makes it a little easier to understand than part 1’s solution
  while patients[i][:id] == patients[index][:id]
    # i will increase; as long as it keeps matching our entry index’s :id, we are within the same patient’s records
    # do logic
    i += 1
    break if !patients[i] # no more records
  end
end
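
For context, a sketch of the shape of data this expects; rows must be sorted so each patient’s records are adjacent (the :visit field is my own illustration):

patients = [
  { id: 1, visit: 'a' },
  { id: 1, visit: 'b' }, # same patient: handled in the inner while loop
  { id: 2, visit: 'c' },
]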

 

Takeaway: I previously implemented part 1’s solution, and this solution in production code. I wonder if there is a scenario where part 1 would be optimal comparatively (I think no because rows are redundantly processed).


2018 Commitment: Update w/ OKR

Wow, with 2 months to go in 2018, I am nowhere near my stated new year’s goals. This has been the busiest year of my life… In addition to getting married in August (a huge time commitment!), I traveled to 17 cities for business, pleasure, and weddings (five in total; at one of which I was a groomsman). I am disappointed I did not come close to my 2018 stated goals but, looking back, the year has been productive!

Work projects have increased and provide daily challenges. This year, I have worked on Blue Button integration, an Angular project fork, a webapp rewrite in React (from a ColdFusion/PHP stack), and learned & launched applications with new-to-me technologies, Redis and TigerConnect. Frankly, the “9 to 5” work has been demanding enough to satisfy my learning hunger at night & on the weekends. Outside of the daily grind, I released a “live scoring” feature for my college football pick-em app, and won a hackathon.

Recently, I have been reading a very interesting book, Measure What Matters by John Doerr. The theories & practices put forth by Doerr enable rapid growth and help teams exceed expectations. I feel I am a third (eh, maybe half) of the way there… I set the goals, but a yearly timeframe is too long. I set deliverables, but typically very large ones (like release v 1.0 of app!). So, taking Measure What Matters to heart, I am revising my goals into 2-month chunks, and looking to map out the next 6-8 months.

Ostinato Rigore!

Ruby => JS: Fill a New Array

Tech: JavaScript, Ruby

Challenge: in a React app, I have to format some data coming from an API… I want to fill out the API data array to a certain length with empty strings. Coming from a Ruby background, I have grown very used to Ruby shortcuts. How do I tackle this in JS?

Code:

incoming API data = ['string', 'string']

application needs an array with length 5… like ['string', 'string', "", "", ""]

 

In Ruby, I calculate the difference in lengths and create a new array with empty strings, contained within the .new args:

Array.new(5 - apiArray.length, "")

 

JavaScript is just slightly different. Just combine new Array with the ES6 Array.prototype.fill method!

new Array(5 - apiArray.length).fill("")
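
Putting the two steps together as one Ruby helper (pad_to is my own name, and the guard against an already-full array is my addition):

def pad_to(arr, size, filler = '')
  arr + Array.new([size - arr.length, 0].max, filler)
end

pad_to(['string', 'string'], 5) # => ["string", "string", "", "", ""]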

 

 

Reference:

https://ruby-doc.org/core-1.9.3/Array.html#method-i-fill

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/fill

 


HSX MarketStreet Hackathon: Victory!

Tech: Ruby on Rails, Node.js backend, HSX API

Challenge: Hackathon! The goal was to facilitate the exchange of patient clinical data for the betterment of patients’ health and healthcare management. I am proud to say our team, The DICE Group @ Jefferson, took first place!

https://www.healthshareexchange.org/news/hsx-marketstreet-launches-hackathon-series-successful-innovative-first-hack

[photo: hsx_hack_2]

 

Our Idea: 

A business-to-business application facilitating practically instant transmittal of pertinent patient data (immunizations, allergies, conditions), a process that typically takes multiple steps, people, and/or points of contact over 1-3 weeks. In addition to time and logistic savings, a target business like a staffing agency typically pays for new tests & immunizations its client has already received but has no record of (like a tetanus booster or measles vaccination).

Target businesses: staffing agencies, college/universities (think incoming freshmen or a student studying abroad), school system/district, day cares, summer camps.

 

Here are some basic screenshots of the web application demonstrating the user interface:

[screenshot: hsx_1]

[screenshot: hsx_2]

[screenshot: hsx_3]

[screenshot: hsx_4]

 

And another shot of the winning team!

[photo: hsx_hack_1]

JavaScript: Evolved Thinking

Tech: JavaScript in a React application

Challenge: refactor an archaic iteration to identify the correct string within an array of objects. A code review by a senior programmer identified a section of code ripe for refactor. JavaScript has some great built-in object methods; let’s fully utilize!

Code:

Here is our array with objects identifying URL paths and additional properties

[screenshot: blog_evolved_thinking_pages (the array of page objects)]

 

And the original iteration to identify the API/database’s value for lastScreenSaved, ex – ‘factor_affect’.

[screenshot: blog_evolved_thinking_original_iteration (the original loop)]

What’s wrong?? For starters, the iteration continues after identifying the correct index. Second, there are better tools to identify a substring (which we will see in the refactored code).

 

Let’s take a look at the refactored code:

[screenshot: blog_evolved_thinking_refactored (findIndex + String.search)]

Wow! Much more compact and clean. Allows for quicker understanding, easier testing, and faster refactoring if URL extension changes in future.
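
Since the screenshots may not survive here, a rough Ruby analogue of the refactor (the real code is JavaScript using findIndex and String.search; the page data and URL below are my own illustration):

pages = [
  { path: 'factor_affect', title: 'Factors' },
  { path: 'overview',      title: 'Overview' },
]
current_url = 'https://app.example.com/screens/factor_affect'

# one pass, stops at the first match
index = pages.index { |page| current_url.include?(page[:path]) }
last_screen_saved = pages[index][:path] # => "factor_affect"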

 

Takeaway: I tend to think in basic data structures and iterations from my early coding days. It pays to take a minute prior to merge requests to look for & refactor code to better utilize the tools available in mature languages. 

 

Reference:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/findIndex

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/search