This tip builds on the analysis of a 🍌banana store🍌 introduced in the Summarizing Data in SQL tip.
Comparing two numerical measures is very important as we try to summarize our data. Though the math becomes more complicated, SQL can still handle it. For example, we might be nervous that if people have to wait too long, they end up spending less money when they get to the front of the line. To get a rough idea of the relationship between wait-time and revenue, we might write a query very similar to the histogram one:
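(A sketch: the purchases table and its wait_time and revenue columns are stand-ins for illustration. It buckets wait-time the way the histogram query buckets revenue, then looks at spending per bucket.)

select
  floor(wait_time / 5) * 5 as wait_time_bucket,  -- 0, 5, 10, ... minute buckets
  count(*) as purchases,
  avg(revenue) as avg_revenue
from purchases
group by 1
order by 1;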
These queries give us some insight into the joint distribution (plug: Wagon makes charts like these super easy to generate). We might also be interested in more statistical measurements like covariance or correlation. Most of the popular SQL implementations have these statistics built in, so for example in Postgres/Redshift we can write:
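(A sketch: corr and covar_pop are built-in aggregate functions in Postgres; the purchases table and its columns are again stand-ins.)

select
  corr(wait_time, revenue) as correlation,      -- Pearson correlation coefficient
  covar_pop(wait_time, revenue) as covariance   -- population covariance
from purchases;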
Postgres has all kinds of cool aggregate functions, including linear regressions - right in our SQL query!
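For instance, here is a quick sketch of a least-squares fit of revenue against wait-time using the regr_slope, regr_intercept, and regr_r2 aggregates (same assumed table and columns as above):

select
  regr_slope(revenue, wait_time)     as slope,      -- change in revenue per extra minute of waiting
  regr_intercept(revenue, wait_time) as intercept,  -- predicted revenue at zero wait
  regr_r2(revenue, wait_time)        as r_squared   -- how well the line fits
from purchases;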
This tip builds on the analysis of a 🍌banana store🍌 introduced in the Summarizing Data in SQL tip.
It’s often useful to have a rough idea of the distribution of the data in a table or query result. Generating a histogram is a great way to understand that distribution. What’s the distribution of revenue per purchase at the banana stand? I mean, how much could a banana cost? We might (naively) write a query like:
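(A naive sketch, assuming a purchases table with a revenue column.)

select revenue, count(*) as purchases  -- one row per distinct revenue value
from purchases
group by 1
order by 1;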
It’s likely this query will return far too many rows to eyeball the revenue distribution (1 row per distinct price). Instead, we’d like to group revenue values together into buckets so that we can understand the shape of the data. We might be interested in the number of banana purchases which generated between $0 to $5, $5 to $10, etc. There’s no “correct” number of buckets to use, it’s a choice we can make and experiment with to get a better understanding of the distribution. Probably 1000 buckets is too many, but 2 is too few. A reasonable rule of thumb is to use somewhere around 20 buckets. Our query will specify the width of the bucket, rather than the total number. The larger the width, the fewer buckets we’ll end up with.
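Here's a sketch of that bucketing query with $5-wide buckets (again assuming a purchases table with a revenue column):

select
  floor(revenue / 5) * 5 as revenue_bucket,  -- round down to the nearest multiple of 5
  count(*) as purchases
from purchases
group by 1
order by 1;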
This is a simple but tricky query that generates a histogram for us. It rounds each revenue data point down to the nearest multiple of 5 and then groups by that rounded value. It has one shortcoming: if a bucket has no data (e.g. no purchases of $55 to $60), that row will not appear in the results. We can fix that with a more complex query, but let’s skip it for now. In order to choose our bucket size, it helps to first calculate the min/max values so we know how many buckets we would end up with. If we want the buckets to have slightly nicer labels, we can format them with a query like:
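(Another sketch, building on the bucketing query above; the subquery keeps the grouping simple.)

select
  '$' || bucket || ' - $' || (bucket + 5) as revenue_bucket,  -- e.g. "$5 - $10"
  purchases
from (
  select floor(revenue / 5) * 5 as bucket, count(*) as purchases
  from purchases
  group by 1
) buckets
order by bucket;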
This tip builds on the analysis of a 🍌banana store🍌 introduced in the Summarizing Data in SQL tip.
How long do people wait for their tasty banana orders? Using basic SQL we can compute average wait time, but if the distribution is skewed away from normal (as many internet-driven (and banana?) distributions often are), this may not give us a complete picture of how long most people are waiting. In addition to computing the average, we might (and should) ask, what are the 25th, 50th, 75th percentiles of wait-time, and how does that number vary day to day?
Many databases (including Postgres 9.4, Redshift, and SQL Server) have built-in percentile functions. Here’s an example using percentile_cont, a window function that computes the percentile of wait-time, split (pun intended!) by day:
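(A sketch in the Redshift window-function form; the purchases table with wait_time and created_at columns is assumed. In Postgres 9.4+, percentile_cont is an ordered-set aggregate, so you would drop the over clause and group by the day instead.)

select distinct
  date_trunc('day', created_at) as day,
  percentile_cont(0.25) within group (order by wait_time)
    over (partition by date_trunc('day', created_at)) as percentile_25,
  percentile_cont(0.50) within group (order by wait_time)
    over (partition by date_trunc('day', created_at)) as median,
  percentile_cont(0.75) within group (order by wait_time)
    over (partition by date_trunc('day', created_at)) as percentile_75
from purchases
order by day;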
The structure of percentile_cont is similar to other window functions: we specify how to order the data and how to group it, and the database does the rest. If we wanted to add more dimensions to our query (e.g. time of day), we’d add them to the partition and group by clauses.
If our database doesn’t support percentile_cont (sorry MySQL, Postgres < 9.4), the query is more complicated, but fear not, it’s still possible! The challenge is to order the rows by increasing wait-time (per date, of course) and then pick out the middle value (for the median). In MySQL, we can use local variables to keep track of the order, and in Postgres, we can use the row_number function. Here’s the Postgres version:
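(A sketch using row_number and a CTE over the same assumed purchases table; it numbers each day's rows by wait-time and then averages the middle one or two values.)

with ordered as (
  select
    date_trunc('day', created_at) as day,
    wait_time,
    row_number() over (partition by date_trunc('day', created_at)
                       order by wait_time) as row_num,
    count(*) over (partition by date_trunc('day', created_at)) as ct
  from purchases
)
select
  day,
  avg(wait_time) as median_wait_time  -- averages the two middle rows when ct is even
from ordered
where row_num in ((ct + 1) / 2, (ct + 2) / 2)
group by day
order by day;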
How quickly can you understand data from your database? Excel croaks at tens of thousands of rows, R is difficult to run on a cluster, and your eye certainly can’t summarize large tables. SQL to the rescue!
Summary statistics are the fastest way to learn about your dataset. Are there outliers? How does the distribution look? What are the relationships hiding inside the rows and columns? You’ll always want to ask these questions when faced with a new dataset. SQL (an uninspired acronym for Structured Query Language) is the native language for database computation. Many summary methods are available as built-in SQL functions in modern databases, and more complex measures can be implemented directly in SQL as well.
In this post, we’ll walk through 4 ways to summarize your dataset in SQL so you can run it directly on your database.
Here’s a relatable example: suppose we work in a 🍌banana stand’s🍌 back office and we analyze banana data.
With just a few SQL commands, we’ll be able to calculate the basic stats: count, distinct, sum, average, variance, … If we have a table of banana transactions, let’s calculate the total number of customers, unique customers, number of bananas sold, as well as total and average revenue per day:
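(A sketch, assuming a purchases table with purchase_time, customer_id, quantity, and revenue columns.)

select
  date_trunc('day', purchase_time) as day,
  count(*) as customers,                            -- one row per transaction
  count(distinct customer_id) as unique_customers,
  sum(quantity) as bananas_sold,
  sum(revenue) as total_revenue,
  avg(revenue) as avg_revenue
from purchases
group by 1
order by 1;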
This should be a familiar SQL pattern (and if not, come to the next free Wagon SQL class!). With just one query, we can calculate important aggregations over very large datasets. If we dress it up with a few where statements or join with customer lookup tables, we can quickly and effectively slice and dice our data. Unfortunately, there are some tricky questions that can’t be answered with the regular SQL functions alone.
Timestamps are great for debugging logs. But for analysis, we want to group by day, week, and month.
Timestamps look like this: 07/04/1776 00:28:39.682, but you’d rather see “July 4, 1776”, “July 1776” or “Q3 - 1776” 🇺🇸. Let’s clean up those messy timestamps in SQL!
In Postgres & Redshift, the to_char
function is handy for formatting dates. It takes two arguments: a timestamp and the output format pattern. For example, the following returns “July 04, 1776”:
select to_char('07/04/1776 00:28:39.682'::timestamp, 'Month DD, YYYY');
The next SQL statement returns “Q3 - 1776”, by concatenating “Q” to the quarter and year from the to_char function.
select 'Q' || to_char('07/04/1776 00:28:39.682'::timestamp, 'Q - YYYY');
Pro Tip: Use to_char to format numbers too! See all the possibilities in the official docs.
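The same format strings work as grouping keys for analysis. For example, a rough sketch (assuming a purchases table with purchase_time and revenue columns) that rolls revenue up by month:

select
  to_char(purchase_time, 'YYYY-MM') as month,  -- e.g. "1776-07"
  sum(revenue) as total_revenue
from purchases
group by 1
order by 1;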
In MySQL, the date_format function will make those timestamps prettier. Here’s how to make ‘1776-07-04 00:28:39.682’ look like “July 4, 1776”:
select date_format('1776-07-04 00:28:39.682', '%M %e, %Y');
What’s the difference between a database, schema, and a table? Let’s break it down.
In relational databases, data is organized in a hierarchy, like files and folders. Databases have schemas. Schemas have tables. Tables have columns.
The standard flow is you connect to a database, browse schemas, and query a table’s columns. There are a few differences in this organization across Postgres, Redshift, and MySQL, so here are some tips for navigating your database.
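If you’re not sure what’s in your database, the information_schema catalog (available in Postgres, Redshift, and MySQL) is a quick way to look around; for example, this sketch lists every table you can see, organized by schema:

select table_schema, table_name
from information_schema.tables
order by table_schema, table_name;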
In Postgres and Redshift, you connect to a single database and can only access schemas and their tables within that database.
For example, in Wagon’s demo database, there is a schema named dvd containing tables such as products, orders, and customers. To query one of these tables, you need to prefix your table with dvd.:
select * from dvd.orders;
If you are frequently querying the same schema, it can be tedious to retype the schema over and over. You can inform the query engine that you’d like to use a specific schema by using set search_path to schema_name:
select * from orders; -- will generate “relation "orders" does not exist” error!
set search_path to dvd;
select * from orders; -- this works now!
Pro tip: You can set search_path to multiple schemas too: set search_path to schema1, schema2, schema3.
A good question: Is every table in a schema? Tables that are created without a specific schema are put in a default schema called public. This is the default search_path when you first connect to your database.
In MySQL, databases and schemas are synonyms. You connect to a specific database, but can still query tables in other databases.
To query data from another database, you must prefix table names with the database name. For example, if you are connected to the dvd database and want to query the baseball database, prefix table names with baseball.:
select * from baseball.batting;
To switch databases, run the use command:
use new_database;
Someone dumped JSON into your database! {"uh": "oh", "anything": "but json"}. What do you do?
Relational databases are beginning to support document types like JSON. It’s an easy, flexible data type to create but can be painful to query.
Here’s how to extract values from nested JSON in SQL 🔨:
{"userId": "555", "transaction": {"id": "222", "sku": "ABC"}
Let’s select a column for each userId, id, and sku. The following queries return 555, 222, ABC.
In Postgres, use the ->> operator to extract a value as text, and the -> operator to extract a JSON object:
select
my_json_field ->> 'userId',
my_json_field -> 'transaction' ->> 'id',
my_json_field -> 'transaction' ->> 'sku'
from my_table;
In Redshift, use the json_extract_path_text function:
select
json_extract_path_text(my_json_field, 'userId'),
json_extract_path_text(my_json_field, 'transaction', 'id'),
json_extract_path_text(my_json_field, 'transaction', 'sku')
from my_table;
MySQL 5.7 includes JSON support. Hurray! Use the json_extract function:
select
json_extract(my_json_field, '$.userId'),
json_extract(my_json_field, '$.transaction.id'),
json_extract(my_json_field, '$.transaction.sku')
from my_table;
What’s the easiest way to match case insensitive text anywhere in a string?
For example, let’s say you’re a fruit company marketing executive and you want all records matching “orange”, including “navel oranges”, “Syracuse Orange,” “Orange is the New Black”, “The Real Housewives of Orange County”, “orange you glad I didn’t say banana”, “🍊” and so on.
Here’s a quick way to match case insensitive text.
In Postgres and Redshift, use the ~* operator:
select *
from my_table
where item ~* 'orange';
The ~* operator searches for ‘orange’ anywhere in the string, thanks to the power of regular expressions.
In MySQL, use like:
select *
from my_table
where item like '%orange%';
In MySQL, like is case insensitive and much faster than regexp (the MySQL equivalent of ~* in Postgres and Redshift). Add the % wildcard to both sides of ‘orange’ for a substring match.