This tip builds on the analysis of a 🍌banana store🍌 introduced in Summarizing Data in SQL tip.
How long do people wait for their tasty banana orders? Using basic SQL we can compute average wait time, but if the distribution is skewed away from normal (as many internet-driven (and banana?) distributions often are), this may not give us a complete picture of how long most people are waiting. In addition to computing the average, we might (and should) ask, what are the 25th, 50th, 75th percentiles of wait-time, and how does that number vary day to day?
Many databases (including Postgres 9.4, Redshift, SQL Server) have built in percentile functions. Here’s an example using the function percentile_cont
which is a window function that computes the percentile of wait-time, split (pun intended!) by day:
The structure of the percentile_cont
is similar to other window functions: we specify how to order the data, how to group it - and the database does the rest. If we wanted to add more dimensions to our query (e.g. time of day), we’d add them to the partition
and group by
clause.
If our database doesn’t support percentile_cont
(sorry MySQL, Postgres < 9.4), the query is more complicated, but fear not, still possible! The challenge is to order the rows by increasing wait-time (per date of course) and then pick out the middle value (for median). In MySQL, we can use local variables to keep track of the order, and in Postgres, we can use the row-number function. Here’s the Postgres version: