How quickly can you understand data from your database? Excel croaks at ten thousands rows, R is difficult to run on a cluster, and your eye certainly can’t summarize large tables. SQL to the rescue!
Summary statistics are the fastest way to learning about your dataset. Are there outliers? How does the distribution look? What are the relationships hiding inside the rows and columns? You’ll always want to ask these questions when faced with a new dataset. SQL (an uninspired acronym for Structured Query Language) is the native language for database computation. Many summary methods are available as built in SQL functions in modern databases and more complex measures can be implemented directly in SQL as well.
In this post, we’ll walk through 4 ways to summarize your dataset in SQL so you can run it directly on your database.
Here’s a relatable example: suppose we work in a 🍌banana stand’s🍌 back office and we analyze banana data.
With just a few SQL commands, we’ll be able to calculate the basic stats: count
, distinct
, sum
, average
, variance
, … If we have a table of banana transactions, let’s calculate the total number of customers, unique customers, number of bananas sold, as well as total and average revenue per day:
This should be a familiar SQL pattern (and if not, come to the next free Wagon SQL class!). With just one query, we can calculate important aggregations over very large datasets. If we dress it up with a few where
statements or join
with customer lookup tables, we can quickly and effectively slice and dice our data. Unfortunately, there are some tricky questions that can’t be answered with the regular SQL functions alone.