
Book: Designing Data-Intensive Applications

December’s Book: Designing Data-Intensive Applications by Martin Kleppmann

Why this book: I want a deep dive into databases, and to think more along the lines of a software/systems engineer. 

Takeaway: this is a huge, detailed textbook. I used a highlighter liberally. For this post, I want to create a quick list of common, critical questions & topics (basically, an outline!).

  1. Reliability, Scalability, Maintainability
  2. What is the load on the system (think bottleneck)?
  3. Relational Model vs. Document Model (RDBMS vs. NoSQL)
  4. What to index?
  5. How to log?
  6. Compatibility – Backward, Forward
  7. Encoding… JSON, XML, Protocol Buffers, Avro, etc.
  8. Distributed Data – Scalability, Fault Tolerance/Availability, Latency
  9. Replication: single-leader, multi-leader, leaderless
  10. Replication – synchronous vs. async
  11. Partitioning (Sharding) combined with Replication
  12. ACID: Atomicity, Consistency, Isolation, Durability
  13. Dirty Reads, Dirty Writes
  14. Serializability
  15. Distributed Systems problems: unreliable networks, faults/partial failures, timeouts, unreliable clocks, process pauses
  16. Consistency through Linearizability
  17. Systems of Record vs. Derived Data Systems
  18. Batch Processing vs. Stream Processing

What kind of SQL … ??

Tech: Relational Database Management Systems

Challenge: I wanted to take a few moments to read, research, and write down the differences in relational databases. I have experienced this many times – when asking a colleague a high-level conceptual database question, the initial response is “what kind of database is it?”… I don’t always think that is relevant considering the question, but nonetheless I’d like to know the key advantages & disadvantages among these popular systems.

SQLite

  • Advantages
    • light-weight, easy to embed in software
    • very fast read/write
    • no install/configure
    • minimal bugs
    • serverless
    • open-source
  • Disadvantages
    • restricted to 2GB in most cases
    • can only handle low volume transactions
    • no concurrent writes (bad for write-intensive)
    • not multi-user
    • no data type checking (can insert a string into an integer column; see the quick sketch after this list)
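
As a quick illustration of that last point, here is a minimal sketch using the sqlite3 Ruby gem (the table and column names are made up for the example):

require "sqlite3"

db = SQLite3::Database.new(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")

# SQLite's flexible typing happily stores a string in an INTEGER column
db.execute("INSERT INTO users (age) VALUES ('not a number')")

p db.execute("SELECT age FROM users")   # => [["not a number"]]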

 
MySQL (side note: LAMP-stack… Linux, Apache, MySQL, PHP)

  • Advantages
    • open source
    • strong reputation for security & reliability – used by WordPress, Facebook, Twitter
    • great data security & support for transactional processing (eCommerce)
    • highly configurable/tunable for performance
    • optimized for web applications
    • can run on all major platforms, and supports apps in all popular languages
    • easy to learn
  • Disadvantages
    • might not be great for high concurrency levels
    • helmed by Oracle now… progress has slowed/halted
    • some key features (e.g., full-text search, full ACID compliance) depend on specific storage engines or add-ons
    • can be limited in areas such as warehousing, fault tolerance, performance diagnostics

 
PostgreSQL

  • Advantages
    • open source
    • exhaustive library & framework support
    • superior query optimizer (great for complex data models)
    • built-in NoSQL-style storage (hstore key-value, JSON)
    • versatile enough for practically any data problem
    • integration with Heroku
    • very reliable
  • Disadvantages
    • so expansive, can be tough to learn
    • historically slower than MySQL for simple, read-heavy workloads

 
Oracle

  • Advantages
    • new versions are backward compatible
    • high functionality for large data sets – many large international banks utilize Oracle
    • high data integrity (aces ACID test)
    • efficient data-recovery tech (Flashback)
    • supports cursors (making programming easier)
  • Disadvantages
    • expensive
    • complex
    • high level of expertise typically needed to properly administer

 
Microsoft SQL Server (MS SQL)

  • Advantages
    • enterprise level 
    • excellent data recovery
    • easy to install
    • great security features (security audits, events can be automatically written to log)
    • great scalability
    • integration w/ .NET framework
  • Disadvantages
    • expensive
    • limited compatibility (Windows-based)
    • uses its own SQL dialect (Transact-SQL)

 

Reference:

Wikipedia: RDBMSs

 

Rails: Migration Add Default Value to Column

Tech: Rails, Ruby, Heroku

Challenge: normally, my Rails apps are purely hobby projects, and there is little-to-no need for adding change migrations to existing schemas. Releasing Shepic! changed that. Heroku is running a production application, and I need to add a default value to the ‘admin’ boolean column in the User table.

Code:

From the CLI, we create the change file:

rails generate migration AddDefaultValueToUserAdmin

Resulting in:

[Screenshot: the generated migration file]
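
The generated change method is empty; here is a minimal sketch of what I filled in (assuming Rails 5 migration syntax and a boolean admin column on users):

class AddDefaultValueToUserAdmin < ActiveRecord::Migration[5.0]
  def change
    # new User rows default to non-admin; from/to keeps the migration reversible
    change_column_default :users, :admin, from: nil, to: false
  end
end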

 

Commit the change, and push to Heroku:

git push heroku master

 

And migrate!

heroku run rake db:migrate

heroku restart

 

Reference:

http://guides.rubyonrails.org/active_record_migrations.html#creating-a-standalone-migration

https://stackoverflow.com/questions/7098602/add-a-default-value-to-a-column-through-a-migration

 

Error! Is port 5432 allocated?

Tech: PostgreSQL, Docker, Rails, Ruby, grep

Challenge: seeing error messages when launching PostgreSQL. I have been seeing this in a few different scenarios. One is when Docker is (or was) running a PG container and the computer has gone to sleep or been closed.

LOG:  could not bind IPv4 socket: Address already in use
HINT:  Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING:  could not create listen socket for "localhost"
FATAL:  could not create any TCP/IP sockets
LOG:  database system is shut down

Code:

In the case of a Docker/Rails Server conflict, I have been grepping for PG instances and shutting them down.

ps -ef | grep postgres

What’s going on here? …

ps = process status (reports running processes)

-e = every process, -f = full-format listing

Let’s take out PG and start over…

sudo pkill -u postgres

 

Challenge: another time, a similar error appeared when the Rails database configuration was not fully set up.

could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/pgsql_socket/.s.PGSQL.5432"?

Code:

I tried the above, encountered the same problem, and then looked at the config/database.yml file:

simply add… host: localhost (under the relevant environment, e.g. development)

 

Reference:

http://stackoverflow.com/questions/18721149/check-if-a-particular-service-is-running-on-ubuntu

http://askubuntu.com/questions/547434/how-to-nicely-stop-all-postgres-processes

API / HTTP Request “FitBit”

Tech: Ruby, Dashing, gem Rest-Client

Challenge: I have a project that makes a standard http request each minute. I want to monitor the health of the connection… is the connection/request/response regularly resolving? If it is failing, what percentage of the time?

Code:

In the connection feed class, initialize an array with 60 zeros. Each array element represents one request (0 = success, 1 = failure); 60 elements = 60 minutes.

@failed_requests = Array.new(60, 0)

 

The request can fail from time to time for various reasons, so I build in a rescue and retry clause (the RestClient call below stands in for the actual feed request, and retry_count lives outside the begin block so a retry does not reset it):

retry_count = 0

begin

  # placeholder for the real request made via the rest-client gem
  response = RestClient.get(feed_url)

  # success: push a 0, then shift out the oldest entry so the array stays at 60 elements (FIFO)
  @failed_requests.push(0).shift

rescue StandardError => error

  retry_count += 1

  if retry_count < 3
    retry
  else
    # 3 retries constitutes a failed request:
    # push a 1 (failure), briefly growing the array to 61 elements,
    # then shift out array[0] to get back to 60 (FIFO)
    @failed_requests.push(1).shift
  end

end

 

In the Dashing portion of the project, we check the array every minute for status:

SCHEDULER.every '1m', :first_in => 0 do |job|

  failed_req_count  = @failed_requests.inject(:+)
  failed_percentage = ((failed_req_count.to_f / 60.to_f) * 100).to_i

  if failed_percentage >= 10
    # send alert email!
  end

end

 

Note: this code features some pseudo elements. The actual workflow includes logging, net/smtp for emailing, and a timer once an alert is sent (with a 1-hour delay between alerts).
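
For reference, here is a rough sketch of the email-alert piece with the 1-hour cooldown (the SMTP host, addresses, and method name below are placeholders, not the project’s actual code):

require "net/smtp"

ALERT_COOLDOWN = 60 * 60   # seconds; wait at least an hour between alert emails
@last_alert_sent_at = nil

def send_alert_email(failed_percentage)
  # skip if we already alerted within the last hour
  return if @last_alert_sent_at && (Time.now - @last_alert_sent_at) < ALERT_COOLDOWN

  message = "From: dashboard@example.com\n" \
            "To: me@example.com\n" \
            "Subject: Feed health alert\n\n" \
            "#{failed_percentage}% of requests failed over the last 60 minutes.\n"

  Net::SMTP.start("localhost", 25) do |smtp|
    smtp.send_message(message, "dashboard@example.com", "me@example.com")
  end

  @last_alert_sent_at = Time.now
end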

 

Heroku: Scheduling DB Backups

Tech: Heroku, PostgreSQL, Rails, Ruby

Challenge: setup and verify production database backup

Code:

brew install heroku (if not already installed)

git remote add heroku https://git.heroku.com/app_name.git

(you can find the remote URL in Settings, then Info)

 

run a one-time backup => heroku pg:backups:capture --app app_name

check last backup => heroku pg:backups:info

 

and scheduling a regular backup….

heroku pg:backups:schedule DATABASE_URL --at '05:00 America/New_York' --app app_name

(Note: DATABASE_URL is a Heroku config variable. If your CLI is properly configured to connect to Heroku, you pass the literal string ‘DATABASE_URL’ to the command. If you want to see the actual DB URL being backed up, head to Settings, then Config Variables.)

Success!

[Screenshot: scheduled backup confirmation]

 

view your backup info:

heroku pg:backups --app app_name

(A few days after writing this post, I checked back in on the progress…. )

[Screenshot: backup list a few days later]

 

bonus: set the default timezone for the app’s environment (Heroku Postgres itself stays on UTC):

heroku config:add TZ="America/New_York"

 

Reference:

https://devcenter.heroku.com/articles/heroku-postgres-backups#creating-a-backup

https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

Kinvey API Console: Adding Users to Group

Tech: Kinvey, Angular2, JS

Problem: users in the Kinvey data collection have a field “isAdmin” that drives dynamic Angular2 page elements. However, when a user whose “isAdmin” is TRUE logs in, the proper elements are not displayed.

[Screenshot: Angular2 template gated on isAdmin]

 

Solution: admin users need to be added to the Kinvey group “admin” due to a postFetch hook.

[Screenshot: Kinvey postFetch hook]

Since the users were not in the group “admin”, their field value for “isAdmin” was being overwritten to FALSE in the postFetch hook.

 

Code: let’s take a look at the Kinvey API Console.

GET /appdata/KINVEY_APP_KEY/ => returns high-level project data

GET /appdata/KINVEY_APP_KEY/project => returns data from collection “project”. Pass in a data-collection name here to see what’s available.

GET /group/KINVEY_APP_KEY/admin => returns data from the group admin. Referencing the above example, this was initially empty:

[Screenshot: no “admin” group found]

 

Let’s create and add users to an Admin group.

POST /group/KINVEY_APP_KEY

{
  "_id": "admin",
  "users": {
    "all": "false",
    "list": [
      { "_type": "KinveyRef", "_collection": "user", "_id": "########" },
      { "_type": "KinveyRef", "_collection": "user", "_id": "########" }
    ]
  },
  "groups": [
    { "_type": "KinveyRef", "_collection": "group", "_id": "G1" },
    { "_type": "KinveyRef", "_collection": "group", "_id": "G5" }
  ]
}

 

Reference note: there are pretty strong resources online; this post is mainly for my own reference, tied to the bug we were experiencing.

http://devcenter.kinvey.com/rest/guides/users#usergroups

App Launch! ==> Research Job Board (pt. 1)

Concept: within a large organization, each department has a unique workflow for advertising research positions and gathering potential applicants. The Research Job Board standardizes and centralizes the process for all departments, with a simple CRUD interface for posting a job and a comprehensive search for potential applicants. It is very easy to apply for a position, and functionality is in place for future expansion so an applicant can manage their activity within the app. It was developed as an MVP, with a soft version 1.0 launch imminent to gather feedback and evaluate potential organization-wide implementation.

[Screenshot: job search page]

 

Stack:

  • Language: Ruby
  • Framework: Rails, Docker, Docker Compose
  • Database: PostgreSQL, Oracle
  • CSS: Twitter Bootstrap
  • Gems: devise, simple_form, paperclip, simplecov, hirb, faker, actionmailer, dotenv
  • Additional Tools: Trello

 

[Screenshot: search results]

 

Challenge: due to the nature & preferences of the development Docker server, I initially used a PostgreSQL database. This helped me deploy quickly because of past experience and established patterns from the dev-ops team. However, in production the database of choice is Oracle, so I developed in both environments on different branches to ensure a smooth launch.
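
For what it's worth, a rough sketch of how the two branches differed at the Gemfile level (these are the commonly used adapter gems, not necessarily the exact versions pinned in the project):

# Gemfile (PostgreSQL branch)
gem 'pg'

# Gemfile (Oracle branch)
gem 'ruby-oci8'
gem 'activerecord-oracle_enhanced-adapter'

config/database.yml changes accordingly (adapter: postgresql vs. adapter: oracle_enhanced).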

 

[Screenshot: job posting list]

 

Takeaway: I began this project over a year ago in an effort to keep progressing & learning as a developer outside of my 9-5 dev job. (The majority of the work was done in about 50 hours for the initial review & delivery.) Little did I know it would lead to a great position within a great organization. I have 5 priorities in life… one of them is a continuous commitment to learning. It really paid off here!

Note: since this is an MVP, little effort was put into the design of the site. The focus was on functionality and delivery.

[Screenshot: application form]

 

Reference: Many of the concepts I used came from learning through codeplace.com.