Python for data science: Getting started

Python is an increasingly popular language for data science, and with good reason. It’s easy to learn, has powerful data science libraries, and integrates well with databases and tools like Hadoop and Spark. With Python, we can carry out the full lifecycle of a data science project: reading data in, analyzing it, visualizing it, and making predictions with machine learning.

In this post, we’ll walk through getting started with Python for data science. If you want to dive more deeply into the topics we cover, visit Dataquest, where we teach every component of the Python data science lifecycle in depth.

We’ll be working with a dataset of political contributions to candidates in the 2016 US presidential election, which can be found here. The file is in CSV format, and each row represents a single donation to the campaign of a single candidate. The dataset has several interesting columns, including:

cand_nm – name of the candidate receiving the donation.
contbr_nm – name of the contributor.
contbr_state – state where the contributor lives.
contbr_employer – where the contributor works.
contbr_occupation – the occupation of the contributor.
contb_receipt_amount – the size…
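
To make the row-per-donation layout concrete, here is a minimal sketch that reads rows with these column names and totals donations per candidate. It uses only Python's standard library so it runs anywhere. The sample rows are made-up placeholders rather than real contributions, the helper names `load_donations` and `total_by_candidate` are our own, and the column names are assumed to match the list above exactly, so check them against the header of the actual file.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows, NOT real donations: they only mimic the column layout
# described above so the example can run without the actual file.
SAMPLE_CSV = """cand_nm,contbr_nm,contbr_state,contbr_employer,contbr_occupation,contb_receipt_amount
Candidate A,Donor One,CA,Example Corp,Engineer,50.00
Candidate B,Donor Two,NY,Example LLC,Teacher,25.00
Candidate A,Donor Three,TX,Self,Consultant,100.00
"""


def load_donations(handle):
    """Read donation rows from an open CSV file object into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        # Every CSV field arrives as a string; convert the amount to a number.
        row["contb_receipt_amount"] = float(row["contb_receipt_amount"])
        rows.append(row)
    return rows


def total_by_candidate(rows):
    """Sum the donation amounts for each candidate name."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["cand_nm"]] += row["contb_receipt_amount"]
    return dict(totals)


donations = load_donations(io.StringIO(SAMPLE_CSV))
totals = total_by_candidate(donations)
print(len(donations), "rows loaded")
print(totals)
```

To run the same code against the downloaded dataset, we would pass an open file instead of the in-memory sample, for example `with open("contributions.csv", newline="") as f: donations = load_donations(f)`, where `contributions.csv` is a hypothetical filename.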