A common reporting requirement in database applications is selecting the latest purchase, or latest job, or oldest chlld for each person in a table. Any beginning student of SQL knows how to return the most recent date in a table:
SELECT MAX(rec_date) FROM sales;
max
——————————-
2011-03-15 18:04:45.178057-04
(1 row)
But that doesn’t tell us which person owns that record. We can do this using a sub, or nested, query on the same table:
SELECT cust_name FROM sales WHERE rec_date = (SELECT MAX(rec_date) FROM sales);
cust_name | rec_date
------------+-------------------------------
Dwight Schrute | 2011-03-15 18:04:45.178057-04
(1 row)
This could return multiple records with the same rec_date. But let’s say we’re not interested in who owns the most recent record. Instead, we want to know the latest record for everyone in the table, such as the last sales date for every customer. Here’s one of the possible solutions for that:
SELECT cust_name, MAX(rec_date) FROM sales GROUP BY cust_name;
That works, although typically you would want to assign a unique numeric ID to each person in a separate normalized table because eventually you’re going to wind up with at least two different John Williams and this query will return the latest record for only one of them.
A second problem with this approach is that you can’t add more reporting data to that query. For instance, let’s say you also want to add the person’s telephone number:
SELECT cust_name, phone, MAX(rec_date) FROM sales GROUP BY cust_name, phone;
This will work until someone changes his phone number. Then you’re going to see him reported twice because of the GROUP BY clause. Normalizing a separate ‘person’ table with a single phone number will fix that but that doesn’t really address that inherent weakness in this query. What we need is something that works like second query but which will also let us take a bunch of unGROUPed data along for the ride.
There are several different approaches. Here’s one of them which uses a query similar to the previous one, but we put it in a subquery. In this case, we’re returning the latest chronological record for each person in a table:
SELECT T1.person_id, T1.phone, T1.rec_date, T1.cust_name
FROM sales T1,
(SELECT person_id, MAX(rec_date) AS max_date
FROM sales GROUP BY person_id) T2
WHERE T1.person_id = T2.person_id AND T1.rec_date = T2.max_date;
This approach also gives us the flexibility of being able to pull data from other tables for our report using a JOIN:
SELECT C.company_name, T1.person_id, T1.phone, T1.rec_date, T1.cust_name
FROM sales T1,
(SELECT person_id, MAX(rec_date) AS max_date
FROM sales GROUP BY person_id) T2
LEFT JOIN company AS C ON (C.company_id = T1.company_id)
WHERE T1.person_id = T2.person_id AND T1.rec_date = T2.max_date;
Last month we were given a tricky reporting requirement by a client. The source is a large database of high school students who visited school health clinics. The client wanted only the latest health clinic encounter reported for each student for the selected month AND selected year, to be presented in a tabular format.
I knew from the regularity of the report that the last part of the requirement would be satisfied with a SQL UNION, with one query reporting the selected month and the other reporting the current year. The issue was going to be the date ranges for those two queries. Since those were coming from the application, which is written in PHP, I added those variables to the query. (Note: these variables could and should be inserted using SQL placeholders in the query and a bind, which is the safer way to do this, I didn’t do that here for simplicity).
This PostgreSQL database is unusual in that because it’s medical data it has HIPAA restrictions requiring a clear audit trail of record updates of what was changed, when and by whom. This is accomplished by never using SQL UPDATE to make changes to a record. Instead, the current record is copied to a new one, the updates are made to the latter and the old record is deprecated by setting its entity_status to “inactive. The queries filter out all entity_statuses with aren’t marked “active”. A normalized parent table maintains the publicly displayed and consistent patient record number.
An additional requirement was that they wanted the report to target single school clinics out of possible 40+ in the database.
The query we delivered was this:
function report_results($report_month, $report_year, $site_id)
{
$school_year_start_date = $report_year . ‘-07-01’;
$school_year_end_date = strval($report_year + 1) . ‘-06-30’;
// If we’re into the next calendar year, increment year for the month query.
if (intval($report_month) < 7) {
$report_year++;
}
$school_month_start_date = sprintf(‘%s-%s-01’, $report_year,$report_month)
// Figure out the last day of the month.
$last_day = date(‘t’, strtotime($school_month_start_date));
$school_month_end_date = sprintf(‘%s-%s-%02d’, $report_year, $report_month, $last_day);
/************************
* Patient count query
************************/
$sql =<<<END_OF_SQL
SELECT
‘month’ AS utype,
SUB1.*,
ED.*,
PD.*
FROM (
SELECT
SUB2.person_id AS sub1_person_id,
SUB2.visit_date AS sub1_visit_date,
MAX(ED.encounter_id) AS sub1_encounter_id
FROM (
— SELECT the most recent appt visit for each person during reporting period
SELECT
MAX(ED1.visit_date) AS visit_date,
PD1.person_id
FROM
encounter_detail ED1
LEFT JOIN person_detail AS PD1 ON (ED1.person_id = PD1.person_id)
WHERE
visit_date BETWEEN ‘$school_month_start_date’::DATE AND ‘$school_month_end_date’::DATE
AND PD1.entity_status = ENTITY_ACTIVE()
AND ED1.entity_status = ENTITY_ACTIVE()
AND ED1.site_id = $site_id
GROUP BY PD1.person_id
) SUB2
LEFT JOIN encounter_detail AS ED ON (ED.person_id = SUB2.person_id)
WHERE ED.entity_status = ENTITY_ACTIVE()
AND ED.visit_date = SUB2.visit_date
GROUP BY SUB2.person_id, SUB2.visit_date
) SUB1
LEFT JOIN encounter_detail AS ED ON (ED.encounter_id = SUB1.sub1_encounter_id AND ED.entity_status = ENTITY_ACTIVE())
LEFT JOIN person_detail AS PD ON (PD.person_id = ED.person_id AND PD.entity_status = ENTITY_ACTIVE())
UNION
SELECT
‘year’ AS utype,
SUB1.*,
ED.*,
PD.*
FROM (
SELECT
SUB2.person_id AS sub1_person_id,
SUB2.visit_date AS sub1_visit_date,
MAX(ED.encounter_id) AS sub1_encounter_id
FROM (
— SELECT the most recent appt visit for each person during reporting period
SELECT
MAX(ED1.visit_date) AS visit_date,
PD1.person_id
FROM
encounter_detail ED1
LEFT JOIN person_detail AS PD1 ON (ED1.person_id = PD1.person_id)
WHERE
visit_date BETWEEN ‘$school_year_start_date’::DATE AND ‘$school_year_end_date’::DATE
AND PD1.entity_status = ENTITY_ACTIVE()
AND ED1.entity_status = ENTITY_ACTIVE()
AND ED1.site_id = $site_id
GROUP BY PD1.person_id
) SUB2
LEFT JOIN encounter_detail AS ED ON (ED.person_id = SUB2.person_id)
WHERE ED.entity_status = ENTITY_ACTIVE()
AND ED.visit_date = SUB2.visit_date
GROUP BY SUB2.person_id, SUB2.visit_date
) SUB1
LEFT JOIN encounter_detail AS ED ON (ED.encounter_id = SUB1.sub1_encounter_id AND ED.entity_status = ENTITY_ACTIVE())
LEFT JOIN person_detail AS PD ON (PD.person_id = ED.person_id AND PD.entity_status = ENTITY_ACTIVE())
END_OF_SQL;
….