The pandas workflow is a common favorite among data analysts and data scientists. The workflow looks something like this:
The pandas workflow works well when:
- the data fits in memory (a few gigabytes but not terabytes)
- the data is relatively static (doesn’t need to be loaded into memory every minute because the data has changed)
- only a single person is accessing the data (shared access to memory is difficult)
- security isn’t important (security is critical for company scale production situations)
When the data changes frequently, requires shared access, doesn’t fit in memory, and security is critical, a database is a much better solution. A database is a data representation that lives on disk that can be queried, accessed, and updated without using much memory. We primarily interact with a database using a database management system or DBMS for short.
In the pandas workflow, we spend most of our time thinking about what functions and methods to use, where to store intermediate results in variables, and juggling all of these. To work with data stored in a database, we instead use a language called SQL (or structured query language). In SQL, we express each unique request (whether it be fetching a subset of or editing values in the data) as a single query and then ask the DBMS to run the query and display any results.
For example, to fetch a specific subset of the data from a database, we would:
- write the SQL query:
SELECT * FROM salaries
- ask the DBMS to run the query and display the results to us
Here’s what the database workflow looks like:
Because the data lives on disk, we can work with datasets that consume multiple terabytes of disk space. Many data science teams in industry have servers and setups in cloud environments like Microsoft Azure or Amazon Web Services that let team members work with this scale of data. Robust and popular DBMS tools like Postgres and MySQL include powerful features for managing user credentials, security, and high data throughput (quickly changing data). In this course and the next, we’ll learn the fundamentals of SQL using a small, portable DBMS called SQLite. SQLite is the most popular database in the world and is lightweight enough that the SQLite DBMS is included as a module in Python. In later courses, we’ll dive into production systems like Postgres.
In this course, we’ll explore data from the American Community Survey on job outcome statistics based on college majors. While the original CSV version can be found on FiveThirtyEight’s Github, we’ll be using a slightly modified version of the data that’s stored as a database. We’ll be working with a of the data that contains the 2010-2012 data for recent college grads only. In this post, we’ll learn how to write SQL queries to explore and start to understand the dataset.
Previewing A Table Using SELECT
Whenever we encountered a new dataset in the past, we displayed the first few rows to get familiar with the different columns, types of values, and some sample data.
We’ve loaded the dataset on job outcome statistics into a database. A database usually consists of multiple, related tables of data. Each table contains rows and columns, just like a CSV file. We’ll be working with the database file jobs.db
, which contains a single table named recent_grads
. In later courses, we’ll learn how to work with a database containing multiple tables.
To display the first 5 rows from the recent_grads
table, we need to:
- write SQL code that expresses this request
- ask the SQLite RDBMS software to run the code and display the results.
Like other programming languages, code in SQL has to adhere to a defined structure and vocabulary. To specify that we want to return the first 5 rows from recent_grads
, we need to run the following SQL query:
SELECT * FROM recent_grads LIMIT 5
In this query, we specified:
- the columns we wanted using
SELECT *
- the table we wanted to query using
FROM recent_grads
- the number of rows we wanted using
LIMIT 5
Here’s a visual breakdown of the different components of the query:
Writing and running SQL queries in our interface is similar to writing and running Python code. Type the query in the code cell and click Run to execute the query against the database. If you write multiple queries in a code cell, SQLite will only display the last query’s results.
Let’s write a SQL query that returns the first 10
rows from recent_grads
.
SELECT * FROM recent_grads LIMIT 10
Filtering Rows Using WHERE
SQLite ran our query and returned the first 10 rows and all columns from the recent_grads
table. Head to the dataset page and spend some time getting familiar with what each column represents.
Based on this dataset preview and an understanding of what each column represents, here are some questions we may have:
- Which majors had mostly female students? Which ones had mostly male students?
- Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
- Which engineering majors had the highest full time employment rates?
Let’s start by focusing on the first question. The SQL workflow revolves around translating the question we want to answer to the subset of data we want from the database. To determine which majors had mostly female students, we want the following subset:
- only the
Major
column - only the rows where
ShareWomen
is greater than0.5
(corresponding to 50%)
To return only the Major
column, we need to add the specific column name in the SELECT
statement part of the query (instead of using the *
operator to return all columns):
SELECT Major FROM recent_grads
This will return all of the values in the Major
column. We can specify multiple columns this way as well and the results table will preserve the order of the columns:
SELECT Major, Major_category FROM recent_grads
To return only the values where ShareWomen
is greater than or equal to 0.5
, we need to add a WHERE
clause:
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
Finally, we can limit the number of rows returned using LIMIT
:
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
LIMIT 5
Here’s a breakdown of the different components:
While in the SELECT
part of the query, we express the specific column we want, in the WHERE
part we express the specific rows we want. The beauty of SQL is that these can be independent.
Let’s write a SQL query that returns the majors where females were a minority. We’ll Only return the Major
and ShareWomen
columns (in that order) and don’t limit the number of rows returned.
SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen < 0.5
Expressing Multiple Filter Criteria Using AND
To filter rows by specific criteria, we need to use the WHERE
statement. A simple WHERE
statement requires three things:
- The column we want the database to filter on:
ShareWomen
- A comparison operator that specifies how we want to compare a value in a column:
>
- The value we want the database to compare each value to:
0.5
Here are the comparison operators we can use:
- Less than:
<
- Less than or equal to:
<=
- Greater than:
>
- Greater than or equal to:
>=
- Equal to:
=
- Not equal to:
!=
The comparison value after the operator must be either text or a number, depending on the field. Because ShareWomen
is a numeric column, we don’t need to enclose the number 0.5
in quotes. Finally, most database systems require that the SELECT
and FROM
statements come first, before WHERE
or any other statements.
We can use the AND
operator to combine multiple filter criteria. For example, to determine which engineering majors had majority female, we’d need to specify 2 filtering criteria.
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
It looks like only 2 majors met this criteria. If we wanted to “zoom” back out to look at all of the columns for both of these majors to see if they shared some other common attributes, we can modify the SELECT
statement and use the symbol *
to represent all columns
SELECT * FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
The ability to quickly iterate on queries as you think of new questions is the appeal of SQL. The SQL workflow lets data professionals focus on asking and answering questions, instead of lower level programming concepts. There’s a clear separation of concerns between the engine that stores, organizes, and retrieves the data and the language that let’s people interface with the data easily.
As the scale of data has increased, engineers have maintained the interface of SQL while swapping out the database engine underneath. This allows people who need to ask and answer questions to easily transfer their SQL experience, even as database technologies change. For example, the Presto project lets you query using SQL but use data from database systems like MySQL, from a distributed file system like HDFS, and more.
Let’s write a SQL query that returns all majors with majority female and all majors had a median salary greater than 50000
. Let’s only include the following columns in the results and in this order:
SELECT Major, Major_category, Median, ShareWomen FROM recent_grads WHERE ShareWomen > 0.5 AND Median > 50000
Returning One of Several Conditions With OR
We used the AND
operator to specify that our filter needs to pass two Boolean conditions. Both of the conditions had to evaluate to True
for the record to appear in the result set. If we wanted to specify a filter that meets either of the conditions instead, we would use the OR
operator.
SELECT [column1, column2,...] FROM [table1]
WHERE [condition1] OR [condition2]
We’ll dive straight into a practice problem because we use the OR
and AND
operators in similar ways.
Write a SQL query that returns the first 20 majors that either have a Median
salary greater than or equal to 10,000
, or have less than or equal to 1,000
Unemployed
people. Let’s only include the following columns in the results and in this order:
Major
Median
Unemployed
SELECT Major, Median, Unemployed FROM recent_grads WHERE Median >= 10000 OR Unemployed <= 1000 LIMIT 20
Grouping Operators With Parentheses
There’s a certain class of questions that we can’t answer using only the techniques we’ve learned so far. For example, if we wanted to write a query that returned all Engineering
majors that either had mostly female graduates or an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.
The three raw conditions we’ll need are:
Major_category = 'Engineering'
ShareWomen >= 0.5
Unemployment_rate < 0.051
What the SQL query looks like using parantheses:
SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);
The first thing you may notice is that we didn’t capitalize any of the operators or statements in the query. SQL’s built-in keywords are case-insensitive, which means we don’t have to capitalize operators like AND
or statements like SELECT
. This also goes for the column names (you can use either major_category
or Major_category
). We’ll stick to using capitalized SQL and the original column names to stay consistent.
The second thing you may notice is how we enclosed the logic we wanted to be evaluated together in parentheses. This is very similar to how we group mathematical calculations together in a particular order. The parentheses makes it explictly clear to the database that we want all of the rows where both of the expressions in the statements evaluate to True
:
(Major_category = 'Engineering' AND ShareWomen > 0.5) -> True or False
(ShareWomen > 0.5 OR Unemployment_rate < 0.051) -> True or False
If we had written the where
statement without any parentheses, the database would guess what our intentions are, and actually execute the following query instead:
WHERE (Major_category = 'Engineering' AND ShareWomen > 0.5) OR (Unemployment_rate < 0.051)
Leaving the parentheses out implies that we want the calculation to happen from left to right in the order in which the logic is written, and wouldn’t return us the data we want. Now let’s run our intended query and see the results!
Let’s run the query we explored above, which returns all Engineering
majors that either had mostly women graduates or had an unemployment rate below 5.1%, which was the rate in August 2015. Let’s only include the following columns in the results and in this order:
Major
Major_category
ShareWomen
Unemployment_rate
SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
Ordering Results Using ORDER BY
The results of every query we’ve written so far have been ordered by the Rank
column. Recall a query from early in the post, where we wrote a query that returned all of the columns and didn’t filter rows on any specific criteria:
SELECT * FROM recent_grads LIMIT 5
As the questions we want to answer get more complex, we want more control over how the results are ordered. We can specify the order using the ORDER BY clause. For example, we may want to understand which majors that met the criteria in the WHERE
statement had the lowest unemployment rate. The following query will return the results in ascending order by the Unemployment_rate
column.
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate
If we instead want the results ordered by the same column but in descending order, we can add the DESC
keyword:
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate DESC
Let’s write a query that returns all majors where ShareWomen
is greater than 0.3
and Unemployment_rate
is less than .1
. Let’s only include the following columns in the results and in this order:
Major
,ShareWomen
,Unemployment_rate
We’ll order the results in descending order by the ShareWomen
column.
SELECT Major, ShareWomen, Unemployment_rate FROM recent_grads
WHERE ShareWomen > 0.3 AND Unemployment_rate < .1
ORDER BY ShareWomen DESC
SQL is a powerful language for accessing data and we hope you got a taste for it in this post If you’d like to learn more, we encourage you to check out the SQL Fundamentals course, from which this blog post is based on. In the course, we dive into how to:
- calculate summary statistics
- segment data using grouping
- write more complex queries using subqueries
- create your own local SQLite database and query it using Python
In the course, you’ll continue to work with salary data from recent college graduates as well as data from the CIA World Factbook.