Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. In fact, a lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. Therefore, if you are just stepping into this field or […]
Category: Pandas
Adding Axis Labels to Plots With pandas
Pandas plotting methods provide an easy way to plot pandas objects. Often though, you’d like to add axis labels, which involves understanding the intricacies of Matplotlib syntax. Thankfully, there’s a way to do this entirely using pandas. Let’s start by importing the required libraries: import pandas as pd import numpy as np import matplotlib.pyplot as […]
Pandas Concatenation Tutorial
You’d be hard pressed to find a data science project which doesn’t require multiple data sources to be combined together. Often times, data analysis calls for appending new rows to a table, pulling additional columns in, or in more complex cases, merging distinct tables on a common key. All of these tricks are handy to […]
Using Excel with pandas
Excel is one of the most popular and widely-used data tools; it’s hard to find an organization that doesn’t work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and serious data crunching. With Excel being so pervasive, data professionals must be familiar with […]
Using pandas with large data
Tips for reducing memory usage by up to 90% When working using pandas with small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer, and cause code to fail entirely due to insufficient memory. While tools […]
Git-pandas caching for faster analysis
Git-pandas is a python library I wrote to help make analysis of git data easier when dealing with collections of repositories. It makes a ton of cool stuff easier, like cumulative blame plots, but they can be kind of slow, especially with many large repositories. In the past we’ve made that work with running analyses […]
Understanding SettingwithCopyWarning in pandas
SettingWithCopyWarning is one of the most common hurdles people run into when learning pandas. A quick web search will reveal scores of Stack Overflow questions, GitHub issues and forum posts from programmers trying to wrap their heads around what this warning means in their particular situation. It’s no surprise that many struggle with this; there […]
Pandas Cheat Sheet – Python for Data Science
Pandas is arguably the most important Python package for data science. Not only does it give you lots of methods and functions that make working with data easier, but it has been optimized for speed which gives you a significant advantage compared with working with numeric data using Python’s built-in functions. The printable version of […]
Pandas Tutorial: Data analysis with Python: Part 2
We covered a lot of ground in Part 1 of our pandas tutorial. We went from the basics of pandas DataFrames to indexing and computations. If you’re still not confident with Pandas, you might want to check out the Dataquest pandas Course. In this tutorial, we’ll dive into one of the most powerful aspects of […]
Pandas Cheat Sheet for Data Science in Python
by Karlijn Willems | November 30, 2016 This post originally appeared on the DataCamp blog. Big thanks to Karlijn and all the fine folks at DataCamp for letting us share with the Yhat audience! Pandas library The Pandas library is one of the most preferred tools for data scientists to do data manipulation and analysis, […]
pandas Cheat Sheet (via yhat)
The folks over at yhat just released a cheat sheet for pandas. You can download the cheat sheet in PDF for here. There’s a couple important functions that I use all the time missing from their cheat sheet (actually….there are a lot of things missing, but its a great starter cheat sheet). A few things […]
Pandas Tutorial: Data analysis with Python: Part 1
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and makes importing and analyzing data much easier. Pandas builds on packages like NumPy and matplotlib to give you a single, convenient, place to do most of your data analysis […]
Working with SQLite Databases using Python and Pandas
SQLite is a database engine that makes it simple to store and work with relational data. Much like the csv format, SQLite stores data in a single file that can be easily shared with others. Most programming languages and environments have good support for working with SQLite databases. Python is no exception, and a library […]
Python & JSON: Working with large datasets using Pandas
Working with large JSON datasets can be a pain, particularly when they are too large to fit into memory. In cases like this, a combination of command line tools and Python can make for an efficient way to explore and analyze the data. In this post, we’ll look at how to leverage tools like Pandas […]