AWS has become a fairly ubiquitous hosting option for small companies. But developers typically work on local dev setups outside the host and occasionally need access to the main dev and staging databases located on AWS. How do you do this and still run in a secure environment?
It's ubiquitous on data driven web sites: the result set pager. We've all used them whether we built them from scratch or used one provided by the framework.
Pagers are by nature performance suckers because we're asking the database to re-run the same query for each "page", slicing off just one set of contiguous rows for each page. If your result set is 10,000 rows long but you're only paging through them 10 rows at a time, that's potentially 1,000 database requests to view the entire set.
But it's worse than that because in order to provide those nifty pager controls, like those in the image above, the software has to know how many rows are in the larger result set so it can do the math to populate the navigation for those page numbers. In other words, using the above example the software needs to know that there is a Page 14 to jump to.
A little background first. Internally, garden variety pagers are pretty much the same. They request a fixed number of rows to display per page, like 10. That becomes your LIMIT filter in the database query:
SELECT * FROM people LIMIT 10;
To create the page navigation you need to do some math to generate the OFFSET. For instance, using a page size of 10, the query for a Page 3 display would look like this:
SELECT * FROM people OFFSET 20 LIMIT 10;
Any developer who has worked with HIPAA compliancy knows that the law is murky at best and the fed doesn't publish a programmers' guide to make your life any easier. However, one of the cardinal rules is the requirement to keep track of who sees HIPAA data, who modifies it and when this was done. Another is that if you delete/update patient data you need to log what was deleted/updated in order to provide an audit trail, if only for the lawyers. Failure to do so can subject a company to some pretty draconian penalties.
This creates a challenge on the database side because SQL UPDATE obliterates a record's history. There are a few potential solutions, such as maintaining a changelog which such updates are written based on table triggers. I've done this but the log of atomic changes can grow immense. It's also difficult to reconstruct a large record based on potentially dozens or even hundreds of changes to records which must be retained for up to six years. That's how a traditional RCS would handle rollbacks but it's not practical inside the confines of a database.
Nevertheless, a resource control system (RCS) approach is what's needed, where a SQL UPDATE would maintain a copy of the pre-updated record and freeze it from further changes. RCS does its work by storing just the changes, or diffs, made to a document. While it would be technically possible to do this with a database record -- for instance, using a BLOB in a sibling table -- there's a simpler and more practical method that also maintains relational integrity.
One issue I run across occasionally is a table with duplicate entries such as two entries for the same company in an accounts payable system. This can create embarrassing problems with billing if ACME Inc #1 is 90 days overdue because someone posted a payment, and now a credit, to ACME Inc #2.
A common reporting requirement in database applications is selecting the latest purchase, or latest job, or oldest chlld for each person in a table. Any beginning student of SQL knows how to return the most recent date in a table:
SELECT MAX(rec_date) FROM sales;
But that doesn't tell us which person owns that record. We can do this using a sub, or nested, query on the same table:
SELECT cust_name FROM sales WHERE rec_date = (SELECT MAX(rec_date) FROM sales);
cust_name | rec_date
Dwight Schrute | 2011-03-15 18:04:45.178057-04
This could return multiple records with the same rec_date. But let's say we're not interested in who owns the most recent record. Instead, we want to know the latest record for everyone in the table, such as the last sales date for every customer. Here's one of the possible solutions for that: