SQL Fundamentals

The pandas workflow is a common favorite among data analysts and data scientists. The workflow looks something like this:

Pandas Workflow

The pandas workflow works well when:

  • the data fits in memory (a few gigabytes but not terabytes)
  • the data is relatively static (doesn’t need to be loaded into memory every minute because the data has changed)
  • only a single person is accessing the data (shared access to memory is difficult)
  • security isn’t important (security is critical for company scale production situations)

When the data changes frequently, requires shared access, doesn’t fit in memory, and security is critical, a database is a much better solution. A database is a data representation that lives on disk that can be queried, accessed, and updated without using much memory. We primarily interact with a database using a database management system or DBMS for short.

In the pandas workflow, we spend most of our time thinking about what functions and methods to use, where to store intermediate results in variables, and juggling all of these. To work with data stored in a database, we instead use a language called SQL (or structured query language). In SQL, we express each unique request (whether it be fetching a subset of or editing values in the data) as a single query and then ask the DBMS to run the query and display any results.

For example, to fetch a specific subset of the data from a database, we would:

  • write the SQL query: SELECT * FROM salaries
  • ask the DBMS to run the query and display the results to us

Here’s what the database workflow looks like:

Database Workflow

Because the data lives on disk, we can work with datasets that consume multiple terabytes of disk space. Many data science teams in industry have servers and setups in cloud environments like Microsoft Azure or Amazon Web Services that let team members work with this scale of data. Robust and popular DBMS tools like Postgres and MySQL include powerful features for managing user credentials, security, and high data throughput (quickly changing data). In this course and the next, we’ll learn the fundamentals of SQL using a small, portable DBMS called SQLite. SQLite is the most popular database in the world and is lightweight enough that the SQLite DBMS is included as a module in Python. In later courses, we’ll dive into production systems like Postgres.

In this course, we’ll explore data from the American Community Survey on job outcome statistics based on college majors. While the original CSV version can be found on FiveThirtyEight’s Github, we’ll be using a slightly modified version of the data that’s stored as a database. We’ll be working with a of the data that contains the 2010-2012 data for recent college grads only. In this post, we’ll learn how to write SQL queries to explore and start to understand the dataset.

Previewing A Table Using SELECT

Whenever we encountered a new dataset in the past, we displayed the first few rows to get familiar with the different columns, types of values, and some sample data.

We’ve loaded the dataset on job outcome statistics into a database. A database usually consists of multiple, related tables of data. Each table contains rows and columns, just like a CSV file. We’ll be working with the database file jobs.db, which contains a single table named recent_grads. In later courses, we’ll learn how to work with a database containing multiple tables.

SQL Table

To display the first 5 rows from the recent_grads table, we need to:

  • write SQL code that expresses this request
  • ask the SQLite RDBMS software to run the code and display the results.

Like other programming languages, code in SQL has to adhere to a defined structure and vocabulary. To specify that we want to return the first 5 rows from recent_grads, we need to run the following SQL query:

SELECT * FROM recent_grads LIMIT 5
index Rank Major_code Major Major_category Total Sample_size Men Women ShareWomen Employed Full_time Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING Engineering 2339 36 2057 282 0.120564 1976 1849 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING Engineering 756 7 679 77 0.101852 640 556 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING Engineering 856 3 725 131 0.153037 648 558 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 1258 16 1123 135 0.107313 758 1069 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING Engineering 32260 289 21239 11021 0.341631 25694 23170 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972

In this query, we specified:

  • the columns we wanted using SELECT *
  • the table we wanted to query using FROM recent_grads
  • the number of rows we wanted using LIMIT 5

Here’s a visual breakdown of the different components of the query:

SQL Select Breakdown 2

Writing and running SQL queries in our interface is similar to writing and running Python code. Type the query in the code cell and click Run to execute the query against the database. If you write multiple queries in a code cell, SQLite will only display the last query’s results.

Let’s write a SQL query that returns the first 10 rows from recent_grads.

SELECT * FROM recent_grads LIMIT 10
index Rank Major_code Major Major_category Total Sample_size Men Women ShareWomen Employed Full_time Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING Engineering 2339 36 2057 282 0.120564 1976 1849 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING Engineering 756 7 679 77 0.101852 640 556 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING Engineering 856 3 725 131 0.153037 648 558 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 1258 16 1123 135 0.107313 758 1069 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING Engineering 32260 289 21239 11021 0.341631 25694 23170 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972
5 6 2418 NUCLEAR ENGINEERING Engineering 2573 17 2200 373 0.144967 1857 2038 264 1449 400 0.177226 65000 50000 102000 1142 657 244
6 7 6202 ACTUARIAL SCIENCE Business 3777 51 832 960 0.535714 2912 2924 296 2482 308 0.095652 62000 53000 72000 1768 314 259
7 8 5001 ASTRONOMY AND ASTROPHYSICS Physical Sciences 1792 10 2110 1667 0.441356 1526 1085 553 827 33 0.021167 62000 31500 109000 972 500 220
8 9 2414 MECHANICAL ENGINEERING Engineering 91227 1029 12953 2105 0.139793 76442 71298 13101 54639 4650 0.057342 60000 48000 70000 52844 16384 3253
9 10 2408 ELECTRICAL ENGINEERING Engineering 81527 631 8407 6548 0.437847 61928 55450 12695 41413 3895 0.059174 60000 45000 72000 45829 10874 3170

Filtering Rows Using WHERE

SQLite ran our query and returned the first 10 rows and all columns from the recent_grads table. Head to the dataset page and spend some time getting familiar with what each column represents.

Based on this dataset preview and an understanding of what each column represents, here are some questions we may have:

  • Which majors had mostly female students? Which ones had mostly male students?
  • Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
  • Which engineering majors had the highest full time employment rates?

Let’s start by focusing on the first question. The SQL workflow revolves around translating the question we want to answer to the subset of data we want from the database. To determine which majors had mostly female students, we want the following subset:

  • only the Major column
  • only the rows where ShareWomen is greater than 0.5 (corresponding to 50%)

To return only the Major column, we need to add the specific column name in the SELECT statement part of the query (instead of using the * operator to return all columns):

SELECT Major FROM recent_grads

This will return all of the values in the Major column. We can specify multiple columns this way as well and the results table will preserve the order of the columns:

SELECT Major, Major_category FROM recent_grads

To return only the values where ShareWomen is greater than or equal to 0.5, we need to add a WHERE clause:

SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5

Finally, we can limit the number of rows returned using LIMIT:

SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
LIMIT 5
Major
ACTUARIAL SCIENCE
COMPUTER SCIENCE
ENVIRONMENTAL ENGINEERING
NURSING
INDUSTRIAL PRODUCTION TECHNOLOGIES

Here’s a breakdown of the different components:

SQL WHERE Breakdown

While in the SELECT part of the query, we express the specific column we want, in the WHERE part we express the specific rows we want. The beauty of SQL is that these can be independent.

Let’s write a SQL query that returns the majors where females were a minority. We’ll Only return the Major and ShareWomen columns (in that order) and don’t limit the number of rows returned.

SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen < 0.5
Major ShareWomen
PETROLEUM ENGINEERING 0.120564
MINING AND MINERAL ENGINEERING 0.101852
METALLURGICAL ENGINEERING 0.153037
NAVAL ARCHITECTURE AND MARINE ENGINEERING 0.107313
CHEMICAL ENGINEERING 0.341631
NUCLEAR ENGINEERING 0.144967
ASTRONOMY AND ASTROPHYSICS 0.441356
MECHANICAL ENGINEERING 0.139793
ELECTRICAL ENGINEERING 0.437847
COMPUTER ENGINEERING 0.199413
AEROSPACE ENGINEERING 0.196450
BIOMEDICAL ENGINEERING 0.119559
MATERIALS SCIENCE 0.310820
ENGINEERING MECHANICS PHYSICS AND SCIENCE 0.183985
BIOLOGICAL ENGINEERING 0.320784
INDUSTRIAL AND MANUFACTURING ENGINEERING 0.343473
GENERAL ENGINEERING 0.252960
ARCHITECTURAL ENGINEERING 0.350442
COURT REPORTING 0.236063
FOOD SCIENCE 0.222695
ELECTRICAL ENGINEERING TECHNOLOGY 0.325092
MATERIALS ENGINEERING AND MATERIALS SCIENCE 0.292607
MANAGEMENT INFORMATION SYSTEMS AND STATISTICS 0.278790
CIVIL ENGINEERING 0.227118
CONSTRUCTION SERVICES 0.342229
OPERATIONS LOGISTICS AND E-COMMERCE 0.322222
MISCELLANEOUS ENGINEERING 0.189970
PUBLIC POLICY 0.251389
ENGINEERING TECHNOLOGIES 0.090713
MISCELLANEOUS FINE ARTS 0.410180
GEOLOGICAL AND GEOPHYSICAL ENGINEERING 0.324838
FINANCE 0.355469
ECONOMICS 0.340825
BUSINESS ECONOMICS 0.249190
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL … 0.430537
ACCOUNTING 0.253583
MATHEMATICS 0.244103
PHYSICS 0.448099
MEDICAL TECHNOLOGIES TECHNICIANS 0.434298
STATISTICS AND DECISION SCIENCE 0.281936
ENGINEERING AND INDUSTRIAL MANAGEMENT 0.174123
MEDICAL ASSISTING SERVICES 0.178982
COMPUTER PROGRAMMING AND DATA PROCESSING 0.269194
GENERAL BUSINESS 0.417925
ARCHITECTURE 0.321770
INTERNATIONAL BUSINESS 0.282903
PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTR… 0.451465
MOLECULAR BIOLOGY 0.077453
MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION 0.200023
MISCELLANEOUS ENGINEERING TECHNOLOGIES 0.000000
MECHANICAL ENGINEERING RELATED TECHNOLOGIES 0.377437
INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY 0.436302
PHYSICAL SCIENCES 0.426924
MILITARY TECHNOLOGIES 0.429685
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLO… 0.232444
MARKETING AND MARKETING RESEARCH 0.382900
POLITICAL SCIENCE AND GOVERNMENT 0.485930
GEOGRAPHY 0.473190
COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY 0.180883
COMPUTER NETWORKING AND TELECOMMUNICATIONS 0.305005
GEOLOGY AND EARTH SCIENCE 0.470197
PUBLIC ADMINISTRATION 0.476461
COMMUNICATIONS 0.305109
CRIMINAL JUSTICE AND FIRE PROTECTION 0.125035
COMMERCIAL ART AND GRAPHIC DESIGN 0.374356
SPECIAL NEEDS EDUCATION 0.366177
TRANSPORTATION SCIENCES AND TECHNOLOGIES 0.321296
NEUROSCIENCE 0.475010
MULTI/INTERDISCIPLINARY STUDIES 0.495397
ATMOSPHERIC SCIENCES AND METEOROLOGY 0.124950
EDUCATIONAL ADMINISTRATION AND SUPERVISION 0.448732
PHILOSOPHY AND RELIGIOUS STUDIES 0.416810
ENGLISH LANGUAGE AND LITERATURE 0.339671
SCIENCE AND COMPUTER TEACHER EDUCATION 0.423209
MUSIC 0.444582
COSMETOLOGY SERVICES AND CULINARY ARTS 0.383719

Expressing Multiple Filter Criteria Using AND

To filter rows by specific criteria, we need to use the WHERE statement. A simple WHERE statement requires three things:

  • The column we want the database to filter on: ShareWomen
  • A comparison operator that specifies how we want to compare a value in a column: >
  • The value we want the database to compare each value to: 0.5

Here are the comparison operators we can use:

  • Less than: <
  • Less than or equal to: <=
  • Greater than: >
  • Greater than or equal to: >=
  • Equal to: =
  • Not equal to: !=

The comparison value after the operator must be either text or a number, depending on the field. Because ShareWomen is a numeric column, we don’t need to enclose the number 0.5 in quotes. Finally, most database systems require that the SELECT and FROM statements come first, before WHERE or any other statements.

We can use the AND operator to combine multiple filter criteria. For example, to determine which engineering majors had majority female, we’d need to specify 2 filtering criteria.

SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
Major
ENVIRONMENTAL ENGINEERING
INDUSTRIAL PRODUCTION TECHNOLOGIES

It looks like only 2 majors met this criteria. If we wanted to “zoom” back out to look at all of the columns for both of these majors to see if they shared some other common attributes, we can modify the SELECT statement and use the symbol * to represent all columns

SELECT * FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
index Rank Major_code Major Major_category Total Sample_size Men Women ShareWomen Employed Full_time Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
30 31 2410 ENVIRONMENTAL ENGINEERING Engineering 4047 26 2639 3339 0.558548 2983 2384 930 1951 308 0.093589 50000 42000 56000 2028 830 260
38 39 2503 INDUSTRIAL PRODUCTION TECHNOLOGIES Engineering 4631 73 528 1588 0.750473 4428 3988 597 3242 129 0.028308 46000 35000 65000 1394 2454 480

The ability to quickly iterate on queries as you think of new questions is the appeal of SQL. The SQL workflow lets data professionals focus on asking and answering questions, instead of lower level programming concepts. There’s a clear separation of concerns between the engine that stores, organizes, and retrieves the data and the language that let’s people interface with the data easily.

As the scale of data has increased, engineers have maintained the interface of SQL while swapping out the database engine underneath. This allows people who need to ask and answer questions to easily transfer their SQL experience, even as database technologies change. For example, the Presto project lets you query using SQL but use data from database systems like MySQL, from a distributed file system like HDFS, and more.

Let’s write a SQL query that returns all majors with majority female and all majors had a median salary greater than 50000. Let’s only include the following columns in the results and in this order:

SELECT Major, Major_category, Median, ShareWomen FROM recent_grads WHERE ShareWomen > 0.5 AND Median > 50000
Major Major_category Median ShareWomen
ACTUARIAL SCIENCE Business 62000 0.535714
COMPUTER SCIENCE Computers & Mathematics 53000 0.578766

Returning One of Several Conditions With OR

We used the AND operator to specify that our filter needs to pass two Boolean conditions. Both of the conditions had to evaluate to True for the record to appear in the result set. If we wanted to specify a filter that meets either of the conditions instead, we would use the OR operator.

SELECT [column1, column2,...] FROM [table1]
WHERE [condition1] OR [condition2]

We’ll dive straight into a practice problem because we use the OR and AND operators in similar ways.

Write a SQL query that returns the first 20 majors that either have a Median salary greater than or equal to 10,000, or have less than or equal to 1,000 Unemployed people. Let’s only include the following columns in the results and in this order:

  • Major
  • Median
  • Unemployed
SELECT Major, Median, Unemployed FROM recent_grads WHERE Median >= 10000 OR Unemployed <= 1000 LIMIT 20
Major Median Unemployed
PETROLEUM ENGINEERING 110000 37
MINING AND MINERAL ENGINEERING 75000 85
METALLURGICAL ENGINEERING 73000 16
NAVAL ARCHITECTURE AND MARINE ENGINEERING 70000 40
CHEMICAL ENGINEERING 65000 1672
NUCLEAR ENGINEERING 65000 400
ACTUARIAL SCIENCE 62000 308
ASTRONOMY AND ASTROPHYSICS 62000 33
MECHANICAL ENGINEERING 60000 4650
ELECTRICAL ENGINEERING 60000 3895
COMPUTER ENGINEERING 60000 2275
AEROSPACE ENGINEERING 60000 794
BIOMEDICAL ENGINEERING 60000 1019
MATERIALS SCIENCE 60000 78
ENGINEERING MECHANICS PHYSICS AND SCIENCE 58000 23
BIOLOGICAL ENGINEERING 57100 589
INDUSTRIAL AND MANUFACTURING ENGINEERING 57000 699
GENERAL ENGINEERING 56000 2859
ARCHITECTURAL ENGINEERING 54000 170
COURT REPORTING 54000 11

Grouping Operators With Parentheses

There’s a certain class of questions that we can’t answer using only the techniques we’ve learned so far. For example, if we wanted to write a query that returned all Engineering majors that either had mostly female graduates or an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.

The three raw conditions we’ll need are:

Major_category = 'Engineering'
ShareWomen >= 0.5
Unemployment_rate < 0.051

What the SQL query looks like using parantheses:

SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);

The first thing you may notice is that we didn’t capitalize any of the operators or statements in the query. SQL’s built-in keywords are case-insensitive, which means we don’t have to capitalize operators like AND or statements like SELECT. This also goes for the column names (you can use either major_category or Major_category). We’ll stick to using capitalized SQL and the original column names to stay consistent.

The second thing you may notice is how we enclosed the logic we wanted to be evaluated together in parentheses. This is very similar to how we group mathematical calculations together in a particular order. The parentheses makes it explictly clear to the database that we want all of the rows where both of the expressions in the statements evaluate to True:

(Major_category = 'Engineering' AND ShareWomen > 0.5) -> True or False
(ShareWomen > 0.5 OR Unemployment_rate < 0.051) -> True or False

If we had written the where statement without any parentheses, the database would guess what our intentions are, and actually execute the following query instead:

WHERE (Major_category = 'Engineering' AND ShareWomen > 0.5) OR (Unemployment_rate < 0.051) 

Leaving the parentheses out implies that we want the calculation to happen from left to right in the order in which the logic is written, and wouldn’t return us the data we want. Now let’s run our intended query and see the results!

Let’s run the query we explored above, which returns all Engineering majors that either had mostly women graduates or had an unemployment rate below 5.1%, which was the rate in August 2015. Let’s only include the following columns in the results and in this order:

  • Major
  • Major_category
  • ShareWomen
  • Unemployment_rate
SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
Major Major_category ShareWomen Unemployment_rate
PETROLEUM ENGINEERING Engineering 0.120564 0.018381
METALLURGICAL ENGINEERING Engineering 0.153037 0.024096
NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 0.107313 0.050125
MATERIALS SCIENCE Engineering 0.310820 0.023043
ENGINEERING MECHANICS PHYSICS AND SCIENCE Engineering 0.183985 0.006334
INDUSTRIAL AND MANUFACTURING ENGINEERING Engineering 0.343473 0.042876
MATERIALS ENGINEERING AND MATERIALS SCIENCE Engineering 0.292607 0.027789
ENVIRONMENTAL ENGINEERING Engineering 0.558548 0.093589
INDUSTRIAL PRODUCTION TECHNOLOGIES Engineering 0.750473 0.028308
ENGINEERING AND INDUSTRIAL MANAGEMENT Engineering 0.174123 0.033652

Ordering Results Using ORDER BY

The results of every query we’ve written so far have been ordered by the Rank column. Recall a query from early in the post, where we wrote a query that returned all of the columns and didn’t filter rows on any specific criteria:

SELECT * FROM recent_grads LIMIT 5
index Rank Major_code Major Major_category Total Sample_size Men Women ShareWomen Employed Full_time Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING Engineering 2339 36 2057 282 0.120564 1976 1849 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING Engineering 756 7 679 77 0.101852 640 556 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING Engineering 856 3 725 131 0.153037 648 558 133 340 16 0.024096 73000 50000 105000 456 176 0
3 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 1258 16 1123 135 0.107313 758 1069 150 692 40 0.050125 70000 43000 80000 529 102 0
4 5 2405 CHEMICAL ENGINEERING Engineering 32260 289 21239 11021 0.341631 25694 23170 5180 16697 1672 0.061098 65000 50000 75000 18314 4440 972

As the questions we want to answer get more complex, we want more control over how the results are ordered. We can specify the order using the ORDER BY clause. For example, we may want to understand which majors that met the criteria in the WHERE statement had the lowest unemployment rate. The following query will return the results in ascending order by the Unemployment_rate column.

SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate
Rank Major Major_category ShareWomen Unemployment_rate
15 ENGINEERING MECHANICS PHYSICS AND SCIENCE Engineering 0.183985 0.006334
1 PETROLEUM ENGINEERING Engineering 0.120564 0.018381
14 MATERIALS SCIENCE Engineering 0.310820 0.023043
3 METALLURGICAL ENGINEERING Engineering 0.153037 0.024096
24 MATERIALS ENGINEERING AND MATERIALS SCIENCE Engineering 0.292607 0.027789
39 INDUSTRIAL PRODUCTION TECHNOLOGIES Engineering 0.750473 0.028308
51 ENGINEERING AND INDUSTRIAL MANAGEMENT Engineering 0.174123 0.033652
17 INDUSTRIAL AND MANUFACTURING ENGINEERING Engineering 0.343473 0.042876
4 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 0.107313 0.050125
31 ENVIRONMENTAL ENGINEERING Engineering 0.558548 0.093589

If we instead want the results ordered by the same column but in descending order, we can add the DESC keyword:

SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate DESC
Rank Major Major_category ShareWomen Unemployment_rate
31 ENVIRONMENTAL ENGINEERING Engineering 0.558548 0.093589
4 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 0.107313 0.050125
17 INDUSTRIAL AND MANUFACTURING ENGINEERING Engineering 0.343473 0.042876
51 ENGINEERING AND INDUSTRIAL MANAGEMENT Engineering 0.174123 0.033652
39 INDUSTRIAL PRODUCTION TECHNOLOGIES Engineering 0.750473 0.028308
24 MATERIALS ENGINEERING AND MATERIALS SCIENCE Engineering 0.292607 0.027789
3 METALLURGICAL ENGINEERING Engineering 0.153037 0.024096
14 MATERIALS SCIENCE Engineering 0.310820 0.023043
1 PETROLEUM ENGINEERING Engineering 0.120564 0.018381
15 ENGINEERING MECHANICS PHYSICS AND SCIENCE Engineering 0.183985 0.006334

Let’s write a query that returns all majors where ShareWomen is greater than 0.3 and Unemployment_rate is less than .1. Let’s only include the following columns in the results and in this order:

  • Major,
  • ShareWomen,
  • Unemployment_rate

We’ll order the results in descending order by the ShareWomen column.

SELECT Major, ShareWomen, Unemployment_rate FROM recent_grads
WHERE ShareWomen > 0.3 AND Unemployment_rate < .1
ORDER BY ShareWomen DESC
Major ShareWomen Unemployment_rate
EARLY CHILDHOOD EDUCATION 0.967998 0.040105
MATHEMATICS AND COMPUTER SCIENCE 0.927807 0.000000
ELEMENTARY EDUCATION 0.923745 0.046586
ANIMAL SCIENCES 0.910933 0.050862
PHYSIOLOGY 0.906677 0.069163
MISCELLANEOUS PSYCHOLOGY 0.905590 0.051908
HUMAN SERVICES AND COMMUNITY ORGANIZATION 0.904075 0.037819
NURSING 0.896019 0.044863
GEOSCIENCES 0.881294 0.024374
MASS MEDIA 0.877228 0.089837
COGNITIVE SCIENCE AND BIOPSYCHOLOGY 0.854523 0.075236
ART HISTORY AND CRITICISM 0.845934 0.060298
EDUCATIONAL PSYCHOLOGY 0.817099 0.065112
GENERAL EDUCATION 0.812877 0.057360
SOCIAL WORK 0.810704 0.068828
TEACHER EDUCATION: MULTIPLE LEVELS 0.798920 0.036546
COUNSELING PSYCHOLOGY 0.798746 0.053621
MATHEMATICS TEACHER EDUCATION 0.792095 0.016203
PSYCHOLOGY 0.779933 0.083811
GENERAL MEDICAL AND HEALTH SERVICES 0.774577 0.082102
HEALTH AND MEDICAL ADMINISTRATIVE SERVICES 0.770901 0.089626
SOIL SCIENCE 0.764427 0.000000
AREA ETHNIC AND CIVILIZATION STUDIES 0.758060 0.063429
APPLIED MATHEMATICS 0.753927 0.090823
FAMILY AND CONSUMER SCIENCES 0.752144 0.067128
INDUSTRIAL PRODUCTION TECHNOLOGIES 0.750473 0.028308
SOCIAL PSYCHOLOGY 0.747561 0.029650
HUMANITIES 0.745662 0.068584
HOSPITALITY MANAGEMENT 0.733992 0.061169
SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION 0.733968 0.054083
THEOLOGY AND RELIGIOUS VOCATIONS 0.728495 0.062628
FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN L… 0.728033 0.075566
INTERDISCIPLINARY SOCIAL SCIENCES 0.721866 0.092306
MISCELLANEOUS AGRICULTURE 0.719974 0.059767
JOURNALISM 0.719859 0.069176
MISCELLANEOUS EDUCATION 0.718365 0.059212
COMPUTER AND INFORMATION SYSTEMS 0.707719 0.093460
COMMUNICATION DISORDERS SCIENCES AND SERVICES 0.707136 0.047584
MISCELLANEOUS HEALTH MEDICAL PROFESSIONS 0.702020 0.081411
LIBERAL ARTS 0.700898 0.078268
FORESTRY 0.690365 0.096726
OCEANOGRAPHY 0.688999 0.056995
ART AND MUSIC EDUCATION 0.686024 0.038638
PHYSICAL FITNESS PARKS RECREATION AND LEISURE 0.683943 0.051467
ADVERTISING AND PUBLIC RELATIONS 0.673143 0.067961
HUMAN RESOURCES AND PERSONNEL MANAGEMENT 0.672161 0.059570
MULTI-DISCIPLINARY OR GENERAL SCIENCE 0.669999 0.055807
FINE ARTS 0.667034 0.084186
COMPOSITION AND RHETORIC 0.666119 0.081742
HISTORY 0.651741 0.095667
ECOLOGY 0.651660 0.054475
GENETICS 0.643331 0.034118
TREATMENT THERAPY PROFESSIONS 0.640000 0.059821
NUTRITION SCIENCES 0.638147 0.068701
ZOOLOGY 0.637293 0.046320
INTERNATIONAL RELATIONS 0.632987 0.096799
UNITED STATES HISTORY 0.630716 0.047179
DRAMA AND THEATER ARTS 0.629505 0.077541
CRIMINOLOGY 0.618223 0.097244
MICROBIOLOGY 0.615727 0.066776
PLANT SCIENCE AND AGRONOMY 0.606889 0.045455
BIOLOGY 0.601858 0.070725
SECONDARY TEACHER EDUCATION 0.601752 0.052229
AGRICULTURE PRODUCTION AND MANAGEMENT 0.594208 0.050031
PRE-LAW AND LEGAL STUDIES 0.591001 0.071965
AGRICULTURAL ECONOMICS 0.589712 0.077250
STUDIO ARTS 0.584776 0.089552
ENVIRONMENTAL SCIENCE 0.584556 0.078585
BUSINESS MANAGEMENT AND ADMINISTRATION 0.580948 0.072218
COMPUTER SCIENCE 0.578766 0.063173
LANGUAGE AND DRAMA EDUCATION 0.576360 0.050306
MISCELLANEOUS BIOLOGY 0.566641 0.058545
NATURAL RESOURCES MANAGEMENT 0.564639 0.066619
ENVIRONMENTAL ENGINEERING 0.558548 0.093589
HEALTH AND MEDICAL PREPARATORY PROGRAMS 0.556604 0.069780
MISCELLANEOUS SOCIAL SCIENCES 0.543405 0.073080
ACTUARIAL SCIENCE 0.535714 0.095652
SOCIOLOGY 0.532334 0.084951
BOTANY 0.528969 0.000000
INFORMATION SCIENCES 0.526476 0.060741
PHARMACOLOGY 0.524153 0.085532
GENERAL AGRICULTURE 0.515543 0.019642
BIOCHEMICAL SCIENCES 0.515406 0.080531
INTERCULTURAL AND INTERNATIONAL STUDIES 0.507377 0.083634
PHYSICAL AND HEALTH EDUCATION TEACHING 0.506721 0.074667
CHEMISTRY 0.505141 0.053972
MULTI/INTERDISCIPLINARY STUDIES 0.495397 0.070861
NEUROSCIENCE 0.475010 0.048482
GEOLOGY AND EARTH SCIENCE 0.470197 0.075449
PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTR… 0.451465 0.055521
EDUCATIONAL ADMINISTRATION AND SUPERVISION 0.448732 0.000000
PHYSICS 0.448099 0.048224
MUSIC 0.444582 0.075960
ASTRONOMY AND ASTROPHYSICS 0.441356 0.021167
ELECTRICAL ENGINEERING 0.437847 0.059174
MEDICAL TECHNOLOGIES TECHNICIANS 0.434298 0.036983
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL … 0.430537 0.071540
PHYSICAL SCIENCES 0.426924 0.035354
SCIENCE AND COMPUTER TEACHER EDUCATION 0.423209 0.047264
GENERAL BUSINESS 0.417925 0.072861
PHILOSOPHY AND RELIGIOUS STUDIES 0.416810 0.096052
MISCELLANEOUS FINE ARTS 0.410180 0.089375
COSMETOLOGY SERVICES AND CULINARY ARTS 0.383719 0.055677
MARKETING AND MARKETING RESEARCH 0.382900 0.061215
MECHANICAL ENGINEERING RELATED TECHNOLOGIES 0.377437 0.056357
COMMERCIAL ART AND GRAPHIC DESIGN 0.374356 0.096798
SPECIAL NEEDS EDUCATION 0.366177 0.041508
FINANCE 0.355469 0.060686
ARCHITECTURAL ENGINEERING 0.350442 0.061931
INDUSTRIAL AND MANUFACTURING ENGINEERING 0.343473 0.042876
CONSTRUCTION SERVICES 0.342229 0.060023
CHEMICAL ENGINEERING 0.341631 0.061098
ECONOMICS 0.340825 0.099092
ENGLISH LANGUAGE AND LITERATURE 0.339671 0.087724
ELECTRICAL ENGINEERING TECHNOLOGY 0.325092 0.087557
GEOLOGICAL AND GEOPHYSICAL ENGINEERING 0.324838 0.075038
OPERATIONS LOGISTICS AND E-COMMERCE 0.322222 0.047859
TRANSPORTATION SCIENCES AND TECHNOLOGIES 0.321296 0.072725
BIOLOGICAL ENGINEERING 0.320784 0.087143
MATERIALS SCIENCE 0.310820 0.023043
COMMUNICATIONS 0.305109 0.075177

SQL is a powerful language for accessing data and we hope you got a taste for it in this post If you’d like to learn more, we encourage you to check out the SQL Fundamentals course, from which this blog post is based on. In the course, we dive into how to:

  • calculate summary statistics
  • segment data using grouping
  • write more complex queries using subqueries
  • create your own local SQLite database and query it using Python

In the course, you’ll continue to work with salary data from recent college graduates as well as data from the CIA World Factbook.