Tech: Rails, SQLite, ActiveRecord, Ruby gems activerecord-import, axlsx, and axlsx_rails
Challenge: a co-worker is facing a data corruption issue. Don’t ask me all the details. The current situation… we have two spreadsheets (one with 985K records, the other with 1.4 million records) and need to identify rows/records that are not present in both spreadsheets.
We initially tackled in Microsoft Access but performance issues became a real nuisance. Access would freeze on file upload, when scrolling through query results, etc.
Additionally, we were seeing funny results. We want to
LEFT JOIN on 4 columns, and include those 4 columns in the
WHERE clause (column
IS NULL). Joining and where-ing on 1 column, then 2 and 3 columns brought back the expected record count. However, joining on all 4 returned curious results — just the records in “table A” of the left join, which didn’t make sense.
Solution: Since working with a DB administrator wasn’t a viable option, I though hard on what tools I could use to improve performance while investigating the issue. I have used the above mentioned tech for database & excel work, and although wary of the size of the excel sheets & corresponding query results, I decided to give it a try. Couldn’t be worse than Access, and it would also help verify our query results.
First, load excel records into SQLite w/ ActiveRecord. Fairly easy with activerecord-import. The only issue was the large number of records… but that was easily solved by batching the uploads. In
/lib/seeds/, create a
seeds.rb file and add:
Second, setup the export route in controller with corresponding SQL:
And the axlsx export file in views:
Conclusion: the results look promising but more research is needed. If the 4th column (COMPL_DTE) is included in the ON column, we just see all the records of Table A and not what we want… the records from Table A that have no corresponding record in Table B. Additionally, the export out to excel took a long time! The export from orphaned Table B records, approximately 785K, took 3-4 hours.