Things I Learned Building a Gem

Tech: Ruby, RuboCop, Splunk, Docker, CI, RSpec, TigerConnect

Challenge: a particular component of one application (HIPAA-compliant text messaging) proved valuable in other applications. The code was organically customized and integrated into several of them. Eventually, we had a clear case to pull the duplicated code out of the various repos (DRY!), standardize it, and package it as a gem.

What I learned:

  1. send is not actually a Ruby keyword – it is the built-in method (Object#send) for dynamically calling a method by name, so avoid defining your own method called send (a quick sketch follows this list)
  2. How to pass credentials for a private GitLab/GitHub repo in the Gemfile, so the gem can be installed (e.g., via the gitlab-ci-token user in CI)
  3. After updating the gem and running bundle update in an app that uses it, sometimes you must force a fresh pull from the private repo to get the latest changes.
  4. GitLab CI w/ RSpec & RuboCop!
  5. Garbage collection (visualized for Ruby & Python): https://blog.codeship.com/visualizing-garbage-collection-ruby-python/
  6. Splunk search is NOT case sensitive!
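To make items 1 and 2 concrete, here is a minimal sketch (the class, gem, group, and host names are made up for illustration, not the real project):

# Object#send invokes a method by name at runtime, which is why defining
# your own method named `send` (say, on a messaging class) causes surprises.
class Alert
  def page_on_call; 'paging on-call'; end
end

Alert.new.send(:page_on_call)   # => "paging on-call"

# Gemfile – one common pattern for pulling a gem from a private GitLab repo
# in CI, using the gitlab-ci-token user and the CI_JOB_TOKEN variable:
gem 'secure_messaging',
    git: "https://gitlab-ci-token:#{ENV['CI_JOB_TOKEN']}@gitlab.example.com/group/secure_messaging.git"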

 

And topics/pitfalls for further research:

  1. Oddly, my Mail settings took effect in the local environment, but not in the application Docker image utilizing the gem
  2. In the local environment, one dynamic "require all" in the main module worked nicely for files in ./lib. However, I needed multiple require statements (one per file) when implementing the gem elsewhere.
  3. Ignoring Gemfile.lock in commits was helpful when integrating the gem into applications.

 

Reference:

Generic Gem Template

Building a Gem Guide

Naming, Versioning, Dependencies, etc.

Bundler – creating gem

GitLab Token


Book: Designing Data-Intensive Applications

December’s Book: Designing Data-Intensive Applications by Martin Kleppmann

Why this book: I want a deep dive into databases, and to think more along the lines of a software/systems engineer. 

Takeaway: this is a huge, detailed textbook. I used a highlighter liberally. For this post, I want to create a quick list of common, critical questions & topics (basically, an outline!).

  1. Reliability, Scalability, Maintainability
  2. What is the load on the system (think bottleneck)?
  3. Relational Model vs. Document Model (RDBMS vs. NoSQL)
  4. What to index?
  5. How to log?
  6. Compatibility – Backward, Forward
  7. Encoding… JSON, XML, Protocol Buffers, Avro, etc.
  8. Distributed Data – Scalability, Fault Tolerance/Availability, Latency
  9. Replication: single-leader, multi-leader, leaderless
  10. Replication – synchronous vs. async
  11. Partitioning (Sharding) combined with Replication
  12. ACID: Atomicity, Consistency, Isolation, Durability
  13. Dirty Reads, Dirty Writes
  14. Serializability
  15. Distributed Systems problems: unreliable networks, faults/partial failures, timeouts, unreliable clocks, process pauses
  16. Consistency through Linearizability
  17. Systems of Record vs. Derived Data Systems
  18. Batch Processing vs. Stream Processing

What kind of SQL … ??

Tech: Relational Database Management Systems

Challenge: I wanted to take a few moments to read, research, and write down the differences among relational databases. I have experienced this many times – when asking a colleague a high-level conceptual database question, the initial response is “what kind of database is it?”… I don’t always think that is relevant to the question, but nonetheless I’d like to know the key advantages & disadvantages among these popular systems.

SQLite

  • Advantages
    • light-weight, easy to embed in software
    • very fast read/write
    • no install/configure
    • minimal bugs
    • serverless
    • open-source
  • Disadvantages
    • restricted to 2GB in most cases
    • can only handle low volume transactions
    • no concurrent writes (bad for write-intensive)
    • not multi-user
    • no data type checking (can insert string into integer field)

 
MySQL (side note: LAMP-stack… Linux, Apache, MySQL, PHP)

  • Advantages
    • open source
    • most secure & reliable – used by WordPress, Facebook, Twitter
    • great data security & support for transactional processing (eCommerce)
    • configurable for flawless performance
    • optimized for web applications
    • can run on all major platforms, and supports apps in all popular languages
    • easy to learn
  • Disadvantages
    • might not be great for high concurrency levels
    • helmed by Oracle now… progress has slowed/halted
    • some key features depend on specific storage engines or add-ons (e.g., full-text search, ACID compliance)
    • can be limited in areas such as warehousing, fault tolerance, performance diagnostics

 
PostgreSQL

  • Advantages
    • open source
    • exhaustive library & framework support
    • superior query optimizer (great for complex data models)
    • built-in NoSQL key-value store
    • can practically be used for any data problem situation
    • integration with Heroku
    • very reliable
  • Disadvantages
    • so expansive, can be tough to learn
    • slower than MySQL

 
Oracle

  • Advantages
    • all instances backward compatible
    • high functionality for large data sets – many large international banks utilize Oracle
    • high data integrity (aces ACID test)
    • efficient data-recovery tech (Flashback)
    • supports cursors (making programming easier)
  • Disadvantages
    • expensive
    • complex
    • high level of expertise typically needed to properly administer

 
Microsoft SQL Server (MS SQL)

  • Advantages
    • enterprise level 
    • excellent data recovery
    • easy to install
    • great security features (security audits, events can be automatically written to log)
    • great scalability
    • integration w/ .NET framework
  • Disadvantages
    • expensive
    • limited compatibility (Windows-based)
    • uses a custom core language (Transact-SQL)

 

Reference:

Wikipedia: RDBMSs

 


Book: Computer Science Distilled

November’s Book: Computer Science Distilled by Ferreira Filho

Why this book: I do not have a degree in computer science and this looks to be a good primer to get my mind grapes flowing. My college major, Operations & Information Systems Management, did explore some of these concepts but not in depth. I think it is time for a refresh and a deeper dive.

Final, Final Takeaways:

  • Each chapter ends with a reference list of books to further explore the chapter’s topics. 
  • The colophon states the cover image is from an 1845 schematic by Charles Babbage; the first programmable computer!

Notes/Thoughts:

Chapter 1: Basics

  • Flowchart
    • states & instructions = rectangle
    • decision step = diamond
  • Factorials get BIG FAST. 10! = 3,628,800
  • Fun fact: human DNA has about 3 billion base pairs, replicated in each of the 3 trillion cells of the human body
  • Final Takeaways
    • Great refresher on mathematical terms/concepts: truth tables, permutations (with/without identical items), combinations, probabilities: counting and independent, complementary, and mutually exclusive events. Definitely can use this chapter as a reference!
    • XKCD is a great comic
    • The Zebra Puzzle is fun

Chapter 2: Complexity

  • Two main components when considering what an algorithm will “cost”: time and memory (ongoing calculations take up space)
  • Exponential algorithms (2^n) are far more expensive than quadratic algorithms (n^2) – for n = 30, n^2 is 900 while 2^n is over a billion
  • Final Takeaways
    • I want to learn more about Big O notation – I recognize the term & idea, and feel it is the most vital content of the chapter 
    • I have a book in mind for December (Designing Data-Intensive Applications), but Big O would be a good topic for another month. Eventually, I’d like to be a software or systems engineer and obviously knowing more in this realm will be key!

Chapter 3: Strategy

  • merge algorithm O(n)
    • fixed number of operations
    • think 2 lists of fish, sorting alphabetically
  • power set algorithm O(2^n)
    • double the operations when input size increases by 1
    • power set = truth table
    • think fragrance combinations from a set of flowers
  • recursion
    • remember the base case
    • think palindrome or fibonacci function
    • recursion vs. iteration => it’s a trade-off
      • more expensive: recursion
      • faster speed: iteration
      • higher complexity: iteration
  • brute force (exhaustive search) O(n^2)
    • number of pairs in an interval increases quadratically as interval increases
    • think best trade: in one month, find the day where buying & selling nets the most profit. OR find the optimal buy/sell timeframe
  • backtracking
    • think 8 Queens Puzzle
    • use recursion… once you get a false value, rollback to the last true value and try again
    • “fail early, fail often”
  • heuristics
    • method that leads to a solution that is good enough
    • think chess (man vs. computer): after your first 4 moves, 288 billion possible positions, wow! find a move that is good enough.
    • Greedy Approach
      • make best choice at each step, and don’t look back
      • think burglar filling a knapsack. No time to remove or consider items already in the knapsack
  • divide and conquer
    • look for optimal substructure and use recursion
  • memoization
    • think about knapsack… some calculations done repeatedly
    • store the result of the repeated calc (a quick Ruby sketch follows at the end of this chapter’s notes)
  • branch and bound
    • divide into subproblems
    • find upper & lower bounds of each subproblem
    • compare subproblem bounds w/ all branches
  • Final Takeaway
    • will be very helpful to return to this chapter when tackling a data-parsing problem
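Tying together the recursion and memoization bullets above, a tiny Ruby sketch (my own example, not the book’s): the naive recursive Fibonacci repeats the same sub-calculations over and over; caching them makes it fast.

def fib(n, memo = {})
  return n if n < 2
  memo[n] ||= fib(n - 1, memo) + fib(n - 2, memo)  # store & reuse the repeated calc
end

fib(40)  # => 102334155, returns instantly; without the memo hash this takes noticeably longer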

Chapter 4: Data

  • Abstract Data Types: how variables of a given data type are operated
    • stack – LIFO (last in, first out)
    • queue – FIFO (first in, first out)
    • list – more flexible than stack or queue; many available data operations
    • sorted list – fewer operators than list, but items always ordered
    • map – stores mappings with a key and value (kinda like a Hash?)
    • set – unordered group of unique items
  • Structures: how variables/data organized & accessed
    • array – instant access, sequential memory space, but can be impractical with large data sets
    • linked list – each item has a pointer to memory location of next item
    • double linked list – each item has pointers in both directions
      • for either linked list… you cannot jump directly to the nth item; you must traverse to it
    • array vs. list => it’s a trade-off
      • faster insert/delete: list
      • insert/delete at random points: list
      • random, unordered data access: array
      • extreme performance access: array
    • tree
      • think about traversing HTML
      • nodes & edges
      • root node – no parent
      • leaf node – no children
      • height – level of deepest node
    • binary search tree
      • at most, each node can have only 2 children
      • left node < parent, right node > parent
      • binary heap – tree where parent must be greater (or smaller) than both child nodes
    • graph – no child, parent, or root node! most flexible structure!
    • hash – each item is given a memory position; still needs a large chunk of memory set aside
  • Final Takeaway
    • How is a hash different from a map? I know one falls under how data is operated, and the other organized/accessed, but the definitions feel very similar.
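Following up on the binary search tree bullet, a minimal Ruby sketch (my own example, not the book’s) – smaller values go left, larger values go right, so a lookup can skip half of the remaining tree at every step:

Node = Struct.new(:value, :left, :right)

def insert(node, value)
  return Node.new(value) if node.nil?
  if value < node.value
    node.left = insert(node.left, value)
  elsif value > node.value
    node.right = insert(node.right, value)
  end
  node
end

def contains?(node, value)
  return false if node.nil?
  return true if node.value == value
  value < node.value ? contains?(node.left, value) : contains?(node.right, value)
end

root = [8, 3, 10, 1, 6].reduce(nil) { |tree, v| insert(tree, v) }
contains?(root, 6)   # => true
contains?(root, 7)   # => false

(Chapter 6 later notes that a database index is basically a self-balancing version of this structure.)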

Chapter 5: Algorithms

  • Most important, an efficient algorithm likely exists already to solve your issue!
  • Lots of sorting algos: Selection, Insertion, Merge, Quick
  • Searching: Sequential, Binary, or use a Hash!
  • Graphs: Depth-First Search vs. Breadth-First Search => it’s a trade-off! (a quick Ruby sketch follows at the end of this chapter’s notes)
    • DFS – down, using a Stack
    • BFS – across, using a Queue
    • Simple, less memory: DFS
    • DFS if need to explore all graph nodes
    • BFS if expected location is close to the root
  • classic problem: find the shortest path between nodes
    • try Dijkstra’s Algorithm
    • uses a priority queue (as opposed to BFS, which uses a plain FIFO queue)
    • huge area? try Bidirectional Search
  • Google, PageRank Algorithm
    • modeled the web as a graph
    • web page = node, links = edges
    • the more incoming edges (links) a page has – especially from highly ranked pages – the higher the rank!
  • network, workflow, cost issues? Linear optimization problems are likely best solved with Simplex Method
  • Final Takeaways
    • I think the key here is first to see how your data is modeled, and then look for an algorithm that matches the problem at hand
    • Beware of choice! Successful algorithms can have pitfalls, drawbacks
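As referenced in the DFS/BFS bullet, a small Ruby sketch (hypothetical graph, my own example): the only difference between the two traversals is whether the frontier behaves as a stack (pop) or a queue (shift).

graph = {
  'a' => ['b', 'c'],
  'b' => ['d'],
  'c' => ['e'],
  'd' => [],
  'e' => []
}

def traverse(graph, start, mode)
  frontier = [start]
  visited = []
  until frontier.empty?
    node = (mode == :dfs ? frontier.pop : frontier.shift)  # stack vs. queue
    next if visited.include?(node)
    visited << node
    frontier.concat(graph[node])
  end
  visited
end

traverse(graph, 'a', :dfs)  # => ["a", "c", "e", "b", "d"] – goes down first
traverse(graph, 'a', :bfs)  # => ["a", "b", "c", "d", "e"] – goes across first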

Chapter 6: Databases

  • The definition of Normalize, Normalization is not what I have used/heard
    • book: The process of transforming a database with replicated data to one without
    • familiar: prepare and/or flatten data for ease of consumption
    • I think I have been wrong here (the very reason I am writing this blog & reading these books lol)!
  • An index is basically a self-balancing binary search tree
  • NoSQL (btw, it can be pronounced either way)
    • most widely known type: document store
      • a data entry contains all info app needs 
      • data entry: document
      • group of documents: collection
  • Graph Databases
    • data entry = node, relationship = edge
    • the most flexible type of DB!
    • data modeled like a network? Graph is likely best
  • Big Data: Volume, Velocity, Variety, (Variability, Veracity)
  • SQL vs. NoSQL => it’s a trade-off!
    • data-centered: SQL
    • maximize structure & eliminate duplication: SQL
    • application-centered: NoSQL
    • faster development: NoSQL
    • time spent on consistency: SQL
  • Distributed
    • Single-Master Replication
      • master receives all queries
      • forwards to slave (which contains a replica of DB)
    • Multi-Master Replication
      • load balancer distributes queries
      • all masters connected
      • write queries propagated amongst all masters
    • Sharding
      • each computer has portion of DB
      • a query router sends each query to the correct computer
      • use with replication to avoid one shard failing and having portion of DB unavailable
    • be wary of Data Consistency; Eventual Consistency (writes take time to propagate & sync across the distributed copies) might not be good enough
  • Serialization: SQL, XML, JSON, CSV

Chapter 7: Computers

  • bus: group of wires for transmitting data (think individual RAM component)
    • address bus: transmit address/location data (unidirectional)
    • data bus: transmit data to and from
  • register: internal memory cell in CPU
  • instruction set: collection of all operations
  • at the core of the never-ending CPU cycle is the Program Counter (PC)
    • a special register that stores the memory address of the next instruction to be executed
    • at startup, it points into the BIOS – the immutable core logic for computer startup
  • CPU clock: number of basic operations per second
    • 2 MHz = two million operations/second
    • quad-core 2 GHz ≈ eight billion operations/second
  • bit architecture
    • 4 bit = processing binary number instructions up to 4 digits 
    • 8 bit = up to 8, 32 bit = 32 digits, etc
    • thus a 64-bit program can’t run on 32-bit
    • a 64-bit address can reference 2^64 bytes ≈ over 17 billion gigabytes
  • endian
    • little-endian = least significant byte stored first
    • big-endian = most significant byte stored first (a quick Ruby check follows at the end of this chapter’s notes)
  • compiler
    • converts programming language into machine instructions
    • turing-complete: read/write data, performs conditional branching
    • scripting languages (JS, Ruby, Python) use an interpreter to skip compiling (much slower! but immediate code execution)
    • once compiled, original code is impossible to recover
    • BUT it is possible to decode the binary (disassembly); this is reverse engineering, and frequently how programs are hacked (think pirated software that bypasses auth/download code)
  • memory hierarchy
    • Processor-Memory Gap: RAM is slower
    • the following help bridge the gap
      • Temporal Locality: if used once, likely to be used again
      • Spatial Locality: when an address is used, nearby addresses are likely to be used shortly
      • L1/L2/L3 Cache: contents of memory with high probability of being accessed
    • Main Memory (RAM) – primary
    • Main Storage (DISK) – secondary, could be tape or SSD
  • Final Takeaway
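As an aside on the endian notes above, a quick Ruby check (my own sketch) of the byte order of the machine running it:

# Pack the integer 1 as a 32-bit native-endian value and inspect the first byte:
# 1 means the least significant byte comes first (little-endian); 0 means big-endian.
first_byte = [1].pack('l').bytes.first
puts first_byte == 1 ? 'little-endian' : 'big-endian'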

Chapter 8: Programming

  • Values are first-class citizens. I never thought of it this way; I always associated the term “first-class” with functions & JavaScript
  • Paradigms
    • Imperative
      • first!
      • exact instructions at each step with specific commands
      • Machine Code: mnemonic instructions (ex – CP, MOV using Assembly ASM)
      • Structured: GOTO, JUMP to control execution flow; eventually conditionals (for much better control)
      • Procedural: an advancement, allowing DRY code & reusability
    • Declarative
      • what you want, not how
      • Functional:
        • functions are first-class, & thus higher-order functions
        • closures 
          • can “remember” a var, and access it in future 
          • allow for a cleaner alternative to global variables
    • Logic: best for AI, natural language processing
  • Final Takeaway
    • Once, I failed an interview code challenge. I now know why: I did not understand or even know about closures!
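Since closures are what tripped me up, a minimal Ruby sketch (my own example): the lambda “remembers” the count variable from the scope where it was created – exactly the cleaner-than-a-global behavior described above.

def make_counter
  count = 0            # local to make_counter...
  -> { count += 1 }    # ...but captured ("remembered") by the lambda
end

counter = make_counter
counter.call        # => 1
counter.call        # => 2
make_counter.call   # => 1 – a fresh closure gets its own count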

Bug Troubleshooting w/ New Tools

Tech: Ruby, TigerConnect, Splunk, SSH, CSSHX, grep, curl, HTTParty

Challenge: an application is configured to log an event in Splunk after a successful send of a TigerConnect HIPAA-compliant alert message. The entry point is working; however, the alerts are not being sent AND the event is not being logged. After troubleshooting to determine what’s NOT wrong, I turned to CSSHX and grep to scour the logs. Why CSSHX? Our production instances run on 4 servers concurrently, and I wanted to quickly navigate in 1 terminal tab w/ 4 windows and grep the logs!

Code:

CLI (for 4 instances) => csshx username@dserver.location.extension username@dserver.location.extension username@dserver.location.extension username@dserver.location.extension

 

grep for various strings => grep -A 10 "error" log_file.log

-A # prints that many lines of context after the matching line

-B # prints that many lines before

 

I also wanted to check my HTTParty gem code that transmits the event from app to Splunk. I used a curl statement to mimic the HTTParty call.

CLI => curl https://url.extension:####/services/collector -k -H 'content-type: application/json' -H 'authorization: XXXXXX' -d '{"event":{"app": "data"}, "sourcetype": "_json"}'

-k (--insecure) allows insecure server connections (skips certificate verification)

-H adds a request header (key: value)
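For comparison, a sketch of what the HTTParty side of that call might look like (the host, port, and token are placeholders standing in for the redacted values above; the real application code differs):

require 'httparty'

response = HTTParty.post(
  'https://url.extension:8088/services/collector',   # 8088 is Splunk's default HEC port, used as a stand-in here
  verify: false,                                      # same effect as curl -k
  headers: {
    'content-type' => 'application/json',
    'authorization' => 'XXXXXX'                       # redacted, as in the curl example
  },
  body: { event: { app: 'data' }, sourcetype: '_json' }.to_json
)

puts response.code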

 

I was quite proud to implement the use of these tools when troubleshooting. I have used them in different contexts and it was cool to bring everything together to discover the issue. However (and sadly), the issue was much simpler.

Ruby ENV variables are strings – even when the string looks like a boolean.

in .env … VAR=false

if ENV['VAR'] == false then <do something> end

=> the condition is never true: ENV['VAR'] is the String "false", not the boolean false. Worse, a bare truthiness check (if ENV['VAR']) is always TRUE, because any non-empty String is truthy.

 

Pattern: Iteration within Iteration (pt. 2)

Tech: Ruby

Challenge: returning to iterating! Last time, I wanted the sub-iteration to skip records already looked at, as well as skip the index forward. Let’s gooo!

Code:

i = 0

patients.each_with_index do |row, index|
  next if i > index # skip rows already processed in the while loop below

  # using "while equal" makes it a little easier to understand than part 1's solution
  while patients[i][:id] == patients[index][:id]
    # i will increase; as long as patients[i][:id] keeps matching our entry index's :id,
    # we are within the same patient's records

    # do logic

    i += 1
    break if !patients[i] # no more records
  end
end

 

Takeaway: I previously implemented part 1’s solution, and now this one, in production code. I wonder if there is a scenario where part 1 would be optimal by comparison (I think not, because rows are redundantly processed).
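As an aside, if the records are already adjacent by :id (as they are here), Ruby’s Enumerable#chunk_while can express the same grouping without the manual index bookkeeping – a sketch, not the production code:

patients.chunk_while { |a, b| a[:id] == b[:id] }.each do |patient_records|
  # patient_records holds all consecutive rows for one patient
  # do logic
end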


2018 Commitment: Update w/ OKR

Wow, with 2 months to go in 2018, I am nowhere near my stated new year’s goals. This has been the busiest year of my life… In addition to getting married in August (a huge time commitment!), I traveled to 17 cities for business, pleasure, and weddings (five total; one of which I was a groomsman). I am disappointed I did not come close to my 2018 stated goals but, looking back, the year has been productive!

Work projects have increased and provide daily challenges. This year, I have worked on Blue Button integration, an Angular project fork, a webapp rewrite in React (from a ColdFusion/PHP stack), and learned & launched applications with new-to-me technologies, Redis and TigerConnect. Frankly, the “9 to 5” work has been demanding enough to satisfy my learning hunger at night & on the weekends. Outside of the daily grind, I released a “live scoring” feature for my college football pick-em app, and won a hackathon.

Recently, I have been reading a very interesting book, Measure What Matters by John Doerr. The theories & practices put forth by Doerr are meant to enable rapid growth and exceeded expectations. I feel I am a third (eh, maybe half) of the way there… I set the goals, but a yearly timeframe is too long. I set deliverables, but typically very large ones (like “release v1.0 of the app!”). So, taking Measure What Matters to heart, I am revising my goals into 2-month chunks, and looking to map out the next 6-8 months.

Ostinato Rigore!


Published!

Tech: Angular, Angular-Cli, Docker, nginx

Challenge: making & releasing the app itself. The app is utilized by the American Association of Colleges of Osteopathic Medicine for studying empathy, and I am happy to report no bugs or issues since the release approximately 1 year ago! The icing on the cake is a small acknowledgment in a published research paper. The proof is in the pudding…

[screenshots: the acknowledgment in the published research paper]


Undesired rc-slider onAfterChange Event

Tech: JavaScript, npm rc-slider, React

Challenge: The webapp designer chose to have both the rc-slider’s initial state (null) and the lowest actual value (1) represented by the lowest/left-most node. If the User simply clicks the left-most node, the null value updates to 1 through the RcSlider onAfterChange API. The problem: whenever the next click occurs… anywhere on the page OR on any element… onAfterChange fires again, which is undesired!

Code:

The rc-slider is wrapped with a custom element. We want the rc-slider onAfterChange event to bubble up, so we pass down our “onClick” function. (Note: these elements & functions have been stripped for ease of explanation.)

onClick function:

onSliderClick = (factorIndex, value) => {
  if (value === 0) {
    this.props.sliderClickedFromDefault(factorIndex, 0)
  }
}

React elements:

parent: <Slider onSliderClick={() => onSliderClick(factorIndex, value)} />

child: <RcSlider onAfterChange={onSliderClick} />

 

The initial click of the rc-slider works great! During troubleshooting, we capture the document.activeElement:

[screenshot: document.activeElement after the first click]

 

It is the second click, anywhere on the page OR on any element, that causes the issue. Here is the Redux Inspector Action Event Log:

[screenshot: Redux Inspector action event log]

 

After reading through known rc-slider issues, and speaking with the Great Google, I have not yet found a solid solution or explanation as to why this is happening. My workaround…

Let’s check the activeElement the second time onAfterChange fires. If it is not my desired element (the slider!), do not dispatch an action!

onSliderClick = (factorIndex, value) => {
  const activeClass = document.activeElement.className

  if (value === 0 && activeClass === 'rc-slider-handle rc-slider-handle-click-focused') {
    this.props.sliderClickedFromDefault(factorIndex, 0)
  }
}

 

Takeaway: I do not love this solution and am convinced something else is going on here. Although the workaround is effective, I am writing this post in hopes of coming back to it with a real solution & understanding.

 

References:

https://www.npmjs.com/package/rc-slider