Things I Learned Building a Gem

Tech: Ruby, RuboCop, Splunk, Docker, CI, RSpec, TigerConnect

Challenge: a particular component (HIPAA-compliant text messaging) of an application proved valuable in other applications. The code was organically customized and integrated into several applications. Eventually, we had a clear case to pull out the duplicated code (DRY!) from the various repos, standardize, and package as a gem.

What I learned:

  1. send is a built-in Ruby method (Object#send, for dynamically calling methods by name), not a reserved keyword; defining your own send method overrides it
  2. How to pass creds for a private GitLab/GitHub repo into a Gemfile, so an app can install the gem (gitlab-ci-token)
  3. After updating the gem and running bundle update in an app that uses the gem, sometimes you must force a pull from the private repo to get the latest changes.
  4. GitLab CI w/ RSpec, RuboCop!
  5. Garbage collection: https://blog.codeship.com/visualizing-garbage-collection-ruby-python/
  6. Splunk search terms are NOT case sensitive by default!
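
To make the first point concrete, here is a minimal sketch of dynamic dispatch. The Messenger class and its deliver_* methods are hypothetical (they are not the gem's API). public_send is used instead of send because it refuses to call private methods, which is usually the safer default.

```ruby
# Hypothetical class for illustration only; not the gem's real API.
class Messenger
  def deliver_sms(to, body)
    "SMS to #{to}: #{body}"
  end

  def deliver_email(to, body)
    "Email to #{to}: #{body}"
  end
end

messenger = Messenger.new
channel = :sms

# Build the method name at runtime and call it by name.
result = messenger.public_send("deliver_#{channel}", 'Dr. Smith', 'Labs are ready')
puts result

# send is an ordinary method that every object already responds to,
# which is why a custom method named send would shadow it.
puts messenger.respond_to?(:send)
```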

 

And topics/pitfalls for further research:

  1. Oddly, my Mail settings took in the local environment, but not in the application Docker image utilizing the gem
  2. In the local environment, a single dynamic "require all" in the main module worked nicely for files in ./lib. However, I needed multiple require statements (one for each file) when using the gem elsewhere.
  3. Ignoring Gemfile.lock in commits was helpful when trying to use the gem in applications.
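
On the second pitfall, one possible explanation (an assumption, I have not confirmed it) is that a path like ./lib resolves against the directory the host app was started from, not the gem's own folder. A sketch of a "require all" that uses absolute paths instead; require_all is a hypothetical helper, and the temp directory just stands in for a gem's lib folder:

```ruby
require 'tmpdir'

# Hypothetical helper: require every .rb file under base_dir using
# absolute paths, so the result does not depend on the current directory.
def require_all(base_dir)
  pattern = File.join(File.expand_path(base_dir), '**', '*.rb')
  Dir.glob(pattern).sort.each { |path| require path }
end

# Demo against a throwaway directory standing in for a gem's lib folder.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'greeter_demo.rb'), "GREETER_DEMO_LOADED = true\n")
  require_all(dir)
end

puts defined?(GREETER_DEMO_LOADED) ? 'loaded' : 'missing'
```

Inside a gem's main file, the call would look something like require_all(File.join(__dir__, 'my_gem')), where my_gem is a placeholder name.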

 

Reference:

Generic Gem Template

Building a Gem Guide

Naming, Versioning, Dependencies, etc.

Bundler – creating gem

GitLab Token


Book: Computer Science Distilled

November’s Book: Computer Science Distilled by Ferreira Filho

Why this book: I do not have a degree in computer science and this looks to be a good primer to get my mind grapes flowing. My college major, Operations & Information Systems Management, did explore some of these concepts but not in depth. I think it is time for a refresh and a deeper dive.

Final, Final Takeaways:

  • Each chapter ends with a reference list of books to further explore the chapter’s topics. 
  • The colophon states the cover image is from an 1845 schematic by Charles Babbage; the first programmable computer!

Notes/Thoughts:

Chapter 1: Basics

  • Flowchart
    • states & instructions = rectangle
    • decision step = diamond
  • Factorials get BIG FAST. 10! = 3,628,800
  • Fun fact: human DNA has about 3 billion base pairs, replicated in each of the 3 trillion cells of the human body
  • Final Takeaways
    • Great refresher on mathematical terms/concepts: truth tables, permutations (with/without identical items), combinations, probabilities: counting and independent, complementary, and mutually exclusive events. Definitely can use this chapter as a reference!
    • XKCD is a great comic
    • The Zebra Puzzle is fun
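
A quick sketch to check the factorial figure above and the counting formulas from this chapter. The method names are my own; nothing here comes from the book's code.

```ruby
# n! as a product of 1..n (0! is 1 because reduce starts from 1).
def factorial(n)
  (1..n).reduce(1, :*)
end

# Ordered selections of r items out of n: n! / (n - r)!
def permutation_count(n, r)
  factorial(n) / factorial(n - r)
end

# Unordered selections of r items out of n: n! / (r! * (n - r)!)
def combination_count(n, r)
  permutation_count(n, r) / factorial(r)
end

puts factorial(10)             # 3628800
puts permutation_count(10, 3)  # 720
puts combination_count(10, 3)  # 120
```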

Chapter 2: Complexity

  • Two main components when considering what an algorithm will “cost”: time and memory (ongoing calculations take up space)
  • Exponential algorithms (2^n) are much more prohibitively expensive than quadratic algorithms (n^2)
  • Final Takeaways
    • I want to learn more about Big O notation – I recognize the term & idea, and feel it is the most vital content of the chapter 
    • I have a book in mind for December (Designing Data-Intensive Applications), but Big O would be a good topic for another month. Eventually, I’d like to be a software or systems engineer and obviously knowing more in this realm will be key!

Chapter 3: Strategy

  • merge algorithm O(n)
    • fixed number of operations
    • think 2 lists of fish, sorting alphabetically
  • power set algorithm O(2^n)
    • double the operations when input size increases by 1
    • power set = truth table
    • think fragrance combinations from a set of flowers
  • recursion
    • remember the base case
    • think palindrome or fibonacci function
    • recursion vs. iteration => it’s a trade-off
      • more expensive: recursion
      • faster speed: iteration
      • higher complexity: iteration
  • brute force (exhaustive search) O(n^2)
    • number of pairs in an interval increases quadratically as interval increases
    • think best trade: in one month, find the day where buying & selling nets the most profit. OR find the optimal buy/sell timeframe
  • backtracking
    • think 8 Queens Puzzle
    • use recursion… once you get a false value, rollback to the last true value and try again
    • “fail early, fail often”
  • heuristics
    • method that leads to a solution that is good enough
    • think chess (man vs. computer): after your first 4 moves, 288 billion possible positions, wow! find a move that is good enough.
    • Greedy Approach
      • make best choice at each step, and don’t look back
      • think burglar filling a knapsack. No time to remove or consider items already in the knapsack
  • divide and conquer
    • look for optimal substructure and use recursion
  • memoization
    • think about knapsack… some calculations done repeatedly
    • store the result of the repeated calc
  • branch and bound
    • divide into subproblems
    • find upper & lower bounds of each subproblem
    • compare subproblem bounds w/ all branches
  • Final Takeaway
    • will be very helpful to return to this chapter when tackling a data-parsing problem
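
To tie the recursion and memoization notes together, a small sketch using the Fibonacci function mentioned above. Both methods are my own illustration, not code from the book.

```ruby
# Plain recursion: the base case stops the descent, but the same
# subproblems are recomputed many times.
def fib_naive(n)
  return n if n < 2 # base case

  fib_naive(n - 1) + fib_naive(n - 2)
end

# Memoized recursion: store each result the first time it is computed,
# so every value of n is calculated only once.
def fib_memo(n, memo = {})
  return n if n < 2 # base case

  memo[n] ||= fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
end

puts fib_naive(20) # 6765
puts fib_memo(50)  # 12586269025
```

fib_naive(50) would take a very long time because the call tree grows exponentially; the memoized version finishes almost instantly.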

Chapter 4: Data

  • Abstract Data Types: how variables of a given data type are operated
    • stack – LIFO (last in, first out)
    • queue – FIFO (first in, first out)
    • list – more flexible than stack or queue; many available data operations
    • sorted list – fewer operators than list, but items always ordered
    • map – stores mappings with a key and value (kinda like a Hash?)
    • set – unordered group of unique items
  • Structures: how variables/data organized & accessed
    • array – instant access, sequential memory space, but can be impractical with large data sets
    • linked list – each item has a pointer to memory location of next item
    • double linked list – each item has pointers in both directions
      • for either linked list… cannot jump directly to the nth item; you must walk the pointers from one end
    • array vs. list => it’s a trade-off
      • faster insert/delete: list
      • insert/delete at random points: list
      • random, unordered data access: array
      • extreme performance access: array
    • tree
      • think about traversing HTML
      • nodes & edges
      • root node – no parent
      • leaf node – no children
      • height – level of deepest node
    • binary search tree
      • at most, each node can have only 2 children
      • left node < parent, right node > parent
      • binary heap – tree where parent must be greater (or smaller) than both child nodes
    • graph – no child, parent, or root node! most flexible structure!
    • hash – each item is given a memory position; still needs a large chunk of memory set aside
  • Final Takeaway
    • How is a hash different from a map? I know one falls under how data is operated, and the other organized/accessed, but the definitions feel very similar.
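
A sketch of the abstract data types above using Ruby built-ins. On the hash vs. map question, my current understanding is that "map" names the operations (store and look up by key), while a hash table is one structure that can implement them; Ruby's Hash is a map backed by a hash table, which may be why the two feel so similar.

```ruby
require 'set'

# Stack (LIFO): push and pop at the same end.
stack = []
stack.push(:a)
stack.push(:b)
last_in = stack.pop # :b comes out first

# Queue (FIFO): push at the back, shift from the front.
queue = []
queue.push(:a)
queue.push(:b)
first_in = queue.shift # :a comes out first

# Map: key => value lookups (Ruby's Hash).
inventory = { 'apple' => 3, 'pear' => 5 }
apples = inventory['apple']

# Set: unique items, duplicates are ignored.
tags = Set.new([1, 2, 2, 3])

puts last_in, first_in, apples, tags.size
```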

Chapter 5: Algorithms

  • Most important, an efficient algorithm likely exists already to solve your issue!
  • Lots of sorting algos: Selection, Insertion, Merge, Quick
  • Searching: Sequential, Binary, or use a Hash!
  • Graphs: Depth First Search vs. Breadth First Search => it’s a trade-off!
    • DFS – down, using a Stack
    • BFS – across, using a Queue
    • Simple, less memory: DFS
    • DFS if need to explore all graph nodes
    • BFS if expected location is close to the root
  • classic problem: find the shortest path between nodes
    • try Dijkstra’s Algorithm
    • uses a priority queue (as opposed to BFS, which is an auxiliary queue)
    • huge area? try Bidirectional Search
  • Google, PageRank Algorithm
    • modeled the web as a graph
    • web page = node, links = edges
    • the more incoming edges (links from other pages) a page has, the higher the rank!
  • network, workflow, cost issues? Linear optimization problems are likely best solved with Simplex Method
  • Final Takeaways
    • I think the key here is first to see how your data is modeled, and then look for an algorithm that matches the problem at hand
    • Beware of choice! Successful algorithms can have pitfalls, drawbacks
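
The DFS vs. BFS notes can be sketched on a tiny graph. The graph and method names are made up for illustration; the only real difference between the two methods is whether the next node comes off a queue or a stack.

```ruby
# Adjacency list: each node maps to the nodes it links to.
GRAPH = { a: %i[b c], b: %i[d], c: %i[d], d: [] }.freeze

# Breadth-first: explore across, level by level, using a queue.
def bfs(graph, start)
  visited = [start]
  queue = [start]
  until queue.empty?
    node = queue.shift
    graph[node].each do |neighbor|
      next if visited.include?(neighbor)

      visited << neighbor
      queue << neighbor
    end
  end
  visited
end

# Depth-first: explore down one path before backing up, using a stack.
def dfs(graph, start)
  visited = []
  stack = [start]
  until stack.empty?
    node = stack.pop
    next if visited.include?(node)

    visited << node
    # Push in reverse so the first neighbor is explored first.
    graph[node].reverse_each { |neighbor| stack.push(neighbor) }
  end
  visited
end

p bfs(GRAPH, :a) # [:a, :b, :c, :d]
p dfs(GRAPH, :a) # [:a, :b, :d, :c]
```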

Chapter 6: Databases

  • The definition of Normalize, Normalization is not what I have used/heard
    • book: The process of transforming a database with replicated data to one without
    • familiar: prepare and/or flatten data for ease of consumption
    • I think I have been wrong here (the very reason I am writing this blog & reading these books lol)!
  • An index is basically a self-balancing binary search tree
  • NoSQL (btw, it can be pronounced either way)
    • most widely known type: document store
      • a data entry contains all info app needs 
      • data entry: document
      • group of documents: collection
  • Graph Databases
    • data entry = node, relationship = edge
    • the most flexible type of DB!
    • data modeled like a network? Graph is likely best
  • Big Data: Volume, Velocity, Variety, (Variability, Veracity)
  • SQL vs. NoSQL => it’s a trade-off!
    • data-centered: SQL
    • maximize structure & eliminate duplication: SQL
    • application-centered: NoSQL
    • faster development: NoSQL
    • time spent on consistency: SQL
  • Distributed
    • Single-Master Replication
      • master receives all queries
      • forwards to slave (which contains a replica of DB)
    • Multi-Master Replication
      • load balancer distributes queries
      • all masters connected
      • write queries propagated amongst all masters
    • Sharding
      • each computer has portion of DB
      • query router sends query to correct computer
      • use with replication to avoid one shard failing and having portion of DB unavailable
    • be wary of Data Consistency; Eventual Consistency (many writes among distribution take time to catch up & sync) might not be good enough
  • Serialization: SQL, XML, JSON, CSV

Chapter 7: Computers

  • bus: group of wires for transmitting data (think individual RAM component)
    • address bus: transmit address/location data (unidirectional)
    • data bus: transmit data to and from
  • register: internal memory cell in CPU
  • instruction set: collection of all operations
  • at core of never-ending CPU cycle is the Program Counter (PC)
    • special register
    • stores the memory address of next instruction to be executed
    • at power-up, the PC starts at the BIOS, which holds immutable core logic for computer startup
  • CPU clock: number of basic operations per second
    • 2 MHz = two million operations/second
    • 2 GHz = two billion operations/second per core, so a quad-core 2 GHz CPU can do several billion operations/second
  • bit architecture
    • 4 bit = processing binary number instructions up to 4 digits 
    • 8 bit = up to 8, 32 bit = 32 digits, etc
    • thus a 64-bit program can’t run on 32-bit
    • 64-bit register = 2^64 = over 17 billion gigabytes
  • endian
    • little-endian = least significant byte stored first
    • big-endian = most significant byte stored first
  • compiler
    • converts programming language into machine instructions
    • turing-complete: read/write data, performs conditional branching
    • scripting languages (JS, Ruby, Python) use an interpreter to skip compiling (much slower! but immediate code execution)
    • once compiled, original code is impossible to recover
    • BUT is possible to decode the binary (disassembly). this is reverse-engineering, and frequently how programs are hacked (think pirated software that bypasses auth/download code)
  • memory hierarchy
    • Processor-Memory Gap: RAM is slower
    • the following help bridge the gap
      • Temporal Locality: if used once, likely to be used again
      • Spatial Locality: when address used, near-by addresses likely to be used shortly
      • L1/L2/L3 Cache: contents of memory with high probability of being accessed
    • Main Memory (RAM) – primary
    • Main Storage (DISK) – secondary, could be tape or SSD
  • Final Takeaway
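
For the endian notes in this chapter, a small sketch with Ruby's Array#pack, which can lay out the same 32-bit number in either byte order. The value 0x12345678 is just an easy-to-read example.

```ruby
number = 0x12345678

# Format each byte of a packed string as two hex digits.
def hex_bytes(packed)
  packed.bytes.map { |byte| format('%02x', byte) }
end

# 'L<' = 32-bit unsigned, little-endian: least significant byte first.
little = hex_bytes([number].pack('L<'))

# 'L>' = 32-bit unsigned, big-endian: most significant byte first.
big = hex_bytes([number].pack('L>'))

p little # ["78", "56", "34", "12"]
p big    # ["12", "34", "56", "78"]
```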

Chapter 8: Programming

  • Values are first-class citizens. I never thought of it this way; I always associated the term “first-class” with functions & JavaScript
  • Paradigms
    • Imperative
      • first!
      • exact instructions at each step with specific commands
      • Machine Code: mnemonic instructions (ex – CP, MOV using Assembly ASM)
      • Structured: GOTO, JUMP to control execution flow, eventually conditionals (for much better control)
      • Procedural: an advancement, allowing dry code & reusability
    • Declarative
      • what you want, not how
      • Functional:
        • functions are first-class, & thus higher-order functions
        • closures 
          • can “remember” a var, and access it in future 
          • allow for clean approach to global var
    • Logic: best for AI, natural language processing
  • Final Takeaway
    • Once, I failed an interview code challenge. I now know why: I did not understand or know about closures!
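
Since closures were the gap, here is a minimal sketch in Ruby (the book's examples are not reproduced here; make_counter is my own name). The lambda "remembers" count even after make_counter has returned, and no global variable is needed.

```ruby
# Returns a lambda that closes over the local variable count.
def make_counter
  count = 0
  -> { count += 1 }
end

counter = make_counter
first = counter.call  # 1
second = counter.call # 2

# Each call to make_counter creates a fresh, independent count.
other = make_counter
other_first = other.call # 1

puts first, second, other_first
```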

Pattern: Iteration within Iteration (pt. 2)

Tech: Ruby

Challenge: returning to iterating! Last time, I wanted the sub-iteration to skip records already looked at, as well as skip the index forward. Let’s gooo!

Code:

i = 0

patients.each_with_index do |row, index|
  next if i > index # will skip through rows already processed in the while loop

  # using while equal makes it a little easier to understand than part 1’s solution
  while patients[i][:id] == patients[index][:id]
    # i will increase, and as long as it keeps matching our entry index :id, we are within the same patient’s records

    # do logic

    i += 1
    break unless patients[i] # no more records
  end
end

 

Takeaway: I previously implemented part 1’s solution, and this solution in production code. I wonder if there is a scenario where part 1 would be optimal comparatively (I think no because rows are redundantly processed).
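
A different technique that might replace the manual index bookkeeping: Enumerable#chunk_while groups consecutive rows that share an :id, so each patient's records arrive as one array. This is a sketch with made-up sample data, and it assumes (as the loop above does) that a patient's rows are adjacent.

```ruby
# Made-up sample rows; only :id matches the post, :location is hypothetical.
patients = [
  { id: 1, location: 'ER' },
  { id: 1, location: 'ICU' },
  { id: 2, location: 'Clinic' }
]

# Consecutive rows stay in the same group while their ids match.
groups = patients.chunk_while { |a, b| a[:id] == b[:id] }.to_a

groups.each do |records|
  # records holds every row for one patient; do logic here
  puts "patient #{records.first[:id]}: #{records.size} record(s)"
end
```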


2018 Commitment: Update w/ OKR

Wow, with 2 months to go in 2018, I am nowhere near my stated new year’s goals. This has been the busiest year of my life… In addition to getting married in August (a huge time commitment!), I traveled to 17 cities for business, pleasure, and weddings (five total; one of which I was a groomsman). I am disappointed I did not come close to my 2018 stated goals but, looking back, the year has been productive!

Work projects have increased and provide daily challenges. This year, I have worked on Blue Button integration, an Angular project fork, a webapp rewrite in React (from a ColdFusion/PHP stack), and learned & launched applications with new-to-me technologies, Redis and TigerConnect. Frankly, the “9 to 5” work has been demanding enough to satisfy my learning hunger at night & on the weekends. Outside of the daily grind, I released a “live scoring” feature for my college football pick-em app, and won a hackathon.

Recently, I have been reading a very interesting book, Measure What Matters by John Doerr. The theories & practices put forth by Doerr enable rapid growth and exceeded expectations. I feel I am a third (eh, maybe half) of the way there… I set the goals, but a yearly timeframe is too long. I set deliverables, but typically very large ones (like release v 1.0 of app!). So, taking Measure What Matters to heart, I am revising my goals into 2 month chunks, and looking to map out the next 6-8 months.

Ostinato Rigore!


Undesired rc-slider onAfterChange Event

Tech: javascript, npm rc-slider, react

Challenge: The webapp designer chose to have both the rc-slider’s initial state (null), and the lowest actual value (1), be represented in the lowest/left-most node. If the user simply clicks the left-most node, the null value updates to 1 through the RcSlider onAfterChange API. The problem is whenever the next click occurs… anywhere on the page OR on any element… onAfterChange fires again, which is undesired!

Code:

The rc-slider is wrapped with a custom element. We want the rc-slider onAfterChange event to bubble up, so we pass down our “onClick” function. (Note: these elements & functions have been stripped for ease of explanation.)

onClick function:

onSliderClick = (factorIndex, value) => {
  if (value === 0) {
    this.props.sliderClickedFromDefault(factorIndex, 0)
  }
}

React elements:

parent: <Slider onSliderClick={() => onSliderClick(factorIndex, value)} />

child: <RcSlider onAfterChange={onSliderClick} />

 

The initial click of the rc-slider works great! During troubleshooting, we capture the document.activeElement:

[Screenshot: document.activeElement after the first click]

 

It is the second click, anywhere on the page OR on any element, that causes the issue. Here is the Redux Inspector Action Event Log:

[Screenshot: Redux Inspector action event log]

 

After reading through known rc-slider issues, and speaking with the Great Google, I have not yet found a solid solution or explanation as to why this is happening. My workaround…

Let’s check the activeElement the second time onAfterChange fires. If it is not my desired element (the slider!), do not dispatch an action!

onSliderClick = (factorIndex, value) => {
  const activeClass = document.activeElement.className

  if (value === 0 && activeClass === 'rc-slider-handle rc-slider-handle-click-focused') {
    this.props.sliderClickedFromDefault(factorIndex, 0)
  }
}

 

Takeaway: I do not love this solution and am convinced something else is going on here. Although the workaround is effective, I am writing this post in hopes of coming back to it with a real solution & understanding.

 

References:

https://www.npmjs.com/package/rc-slider

Recovering Data with ActiveRecord, AXLSX Gem

Tech: Rails, SQLite, ActiveRecord, Ruby gems activerecord-import, axlsx, and axlsx_rails

Challenge: a co-worker is facing a data corruption issue. Don’t ask me all the details. The current situation… we have two spreadsheets (one with 985K records, the other with 1.4 million records) and need to identify rows/records that are not present in both spreadsheets.

We initially tackled in Microsoft Access but performance issues became a real nuisance. Access would freeze on file upload, when scrolling through query results, etc.

Additionally, we were seeing funny results. We want to LEFT JOIN on 4 columns, and include those 4 columns in the WHERE clause (column IS NULL). Joining and where-ing on 1 column, then 2 and 3 columns brought back the expected record count. However, joining on all 4 returned curious results — just the records in “table A” of the left join, which didn’t make sense.

Solution: Since working with a DB administrator wasn’t a viable option, I thought hard about what tools I could use to improve performance while investigating the issue. I have used the above-mentioned tech for database & Excel work, and although wary of the size of the Excel sheets & corresponding query results, I decided to give it a try. Couldn’t be worse than Access, and it would also help verify our query results.

Code: 

First, load excel records into SQLite w/ ActiveRecord. Fairly easy with activerecord-import. The only issue was the large number of records… but that was easily solved by batching the uploads. In /lib/seeds/, create a seeds.rb file and add:

[Screenshot: seeds.rb import code]

 

Second, set up the export route in the controller with the corresponding SQL:

[Screenshot: controller export route and SQL]

 

And the axlsx export file in views:

[Screenshot: axlsx export view]

 

Conclusion: the results look promising, but more research is needed. If the 4th column (COMPL_DTE) is included in the ON clause, we just see all the records of Table A and not what we want: the records from Table A that have no corresponding record in Table B. Additionally, the export to Excel took a long time! The export of orphaned Table B records, approximately 785K, took 3-4 hours.
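
As a cross-check on the SQL, the same "rows in A with no match in B" comparison can be sketched in plain Ruby with composite keys. Only COMPL_DTE is a real column name from this post; the other three are placeholders. One thing worth investigating (a guess, not a confirmed cause): in SQL, NULL = NULL is not true, so rows with an empty COMPL_DTE never match in a join, whereas in Ruby nil == nil does match.

```ruby
require 'set'

# COMPL_DTE is from the post; the other column names are hypothetical.
KEY_COLUMNS = %w[PATIENT_ID SITE_CODE VISIT_TYPE COMPL_DTE].freeze

# Composite key: the values of the four join columns, in order.
def key_for(row)
  row.values_at(*KEY_COLUMNS)
end

# Rows of table_a whose composite key never appears in table_b.
def orphans(table_a, table_b)
  b_keys = table_b.map { |row| key_for(row) }.to_set
  table_a.reject { |row| b_keys.include?(key_for(row)) }
end

table_a = [
  { 'PATIENT_ID' => 1, 'SITE_CODE' => 'N', 'VISIT_TYPE' => 'A', 'COMPL_DTE' => '2018-01-05' },
  { 'PATIENT_ID' => 2, 'SITE_CODE' => 'N', 'VISIT_TYPE' => 'B', 'COMPL_DTE' => nil },
  { 'PATIENT_ID' => 3, 'SITE_CODE' => 'S', 'VISIT_TYPE' => 'A', 'COMPL_DTE' => '2018-02-01' }
]
table_b = [
  { 'PATIENT_ID' => 1, 'SITE_CODE' => 'N', 'VISIT_TYPE' => 'A', 'COMPL_DTE' => '2018-01-05' },
  { 'PATIENT_ID' => 2, 'SITE_CODE' => 'N', 'VISIT_TYPE' => 'B', 'COMPL_DTE' => nil }
]

missing = orphans(table_a, table_b)
puts missing.map { |row| row['PATIENT_ID'] }.inspect
```

Here the row with a nil COMPL_DTE counts as matched; a SQL join on the same column would treat it as unmatched unless NULLs are handled explicitly.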

 

Reference:

https://support.office.com/en-us/article/compare-two-tables-and-find-records-without-matches-cb20ad48-4eba-402a-b20d-eaf10a5d1cb4

https://stackoverflow.com/questions/6613708/how-can-i-join-two-tables-but-only-return-rows-that-dont-match

https://mattboldt.com/importing-massive-data-into-rails/

 


Pattern: Iteration within Iteration (pt. 1)

Tech: Ruby

Challenge: need to iterate over a set of patient records…within the set, each patient has several records… so when the initial iteration gets to a new patient, the code needs to include a sub-iteration specific to the patient. For example, a patient’s subset of data includes their location & treatment history. We want to capture the patient’s last location, but their first instance of treatment. The data is extracted into a patient-specific data hash.

Code:

patients.each_with_index do |row, index|
  i = index
  until patients[i]['ID'] != patients[index]['ID']
    update_patient_hash(row['ID'], data)

    i += 1
    break if !patients[i] # no more records
  end
end

 

Takeaway: this could benefit from refactoring…

  1. The sub-iteration repeats the processing of records because it references row for data capture, instead of patients[i]
  2. Could the initial iteration’s index be reset/moved to the new patient once the sub-iteration is finished?
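
One possible refactor, sketched with made-up data: a single pass that builds the patient-specific hash directly, keeping the last location seen and the first non-nil treatment. The 'LOCATION' and 'TREATMENT' keys are hypothetical; only 'ID' appears in the code above.

```ruby
# Sample rows; 'LOCATION' and 'TREATMENT' are hypothetical column names.
patients = [
  { 'ID' => 1, 'LOCATION' => 'ER',     'TREATMENT' => nil },
  { 'ID' => 1, 'LOCATION' => 'ICU',    'TREATMENT' => 'IV fluids' },
  { 'ID' => 1, 'LOCATION' => 'Ward',   'TREATMENT' => 'Antibiotics' },
  { 'ID' => 2, 'LOCATION' => 'Clinic', 'TREATMENT' => 'Vaccine' }
]

summary = patients.each_with_object({}) do |row, acc|
  entry = acc[row['ID']] ||= { last_location: nil, first_treatment: nil }
  entry[:last_location] = row['LOCATION']     # overwritten, so the last row wins
  entry[:first_treatment] ||= row['TREATMENT'] # set once, so the first value wins
end

p summary
```

Each row is visited exactly once, so there is no index to reset and no repeated processing.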

Ruby Template => Data to JSON

Tech: Ruby, JSON, Docker

Challenge: data for daily reporting is captured in a class instance variable hash. If/when the docker container restarts, I want the data, which is regularly written to a JSON file, to be reloaded into the @hash.

Code:

require 'json'

def initialize
  # File.exist? works on current Ruby; the exists? spelling was removed in Ruby 3.2
  if File.exist?('./data.json')
    @instance_var_hash = JSON.parse(File.read('./data.json'))
  else
    @instance_var_hash = { ... }
  end
end

Potential Pitfall: when sandboxing, I had an instance variable hash with symbol keys. When reading in the JSON file and writing to my instance variable, I generated an error in my methods… :key in original instance variable versus 'key' from incoming JSON.
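
One way around the symbol vs. string key mismatch is the symbolize_names option of JSON.parse. A minimal sketch; the hash contents are made up:

```ruby
require 'json'

# Made-up report data with symbol keys, like the original instance variable.
original = { patients_seen: 12, site: 'north' }

json = JSON.pretty_generate(original)

# Default parsing returns string keys, which breaks lookups like hash[:site].
string_keys = JSON.parse(json)

# symbolize_names: true converts the keys back to symbols.
symbol_keys = JSON.parse(json, symbolize_names: true)

p string_keys['site'] # "north"
p string_keys[:site]  # nil
p symbol_keys[:site]  # "north"
```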
 
Does the JSON file exist?

if File.exist?('./file_name.json')

Read file

JSON.parse(File.read('./file_name.json'))

Open the file & write

File.open('./temp.json', 'w') do |f|
  f.write(JSON.pretty_generate(object))
end

 
Reference:

https://github.com/lukekedz/ruby_json_template

http://ruby-doc.org/stdlib-1.9.3/libdoc/json/rdoc/JSON.html

http://ruby-doc.org/core-2.2.0/FileTest.html

https://stackoverflow.com/questions/5507512/how-to-write-to-a-json-file-in-the-correct-format/5507535

https://hackhands.com/ruby-read-json-file-hash/