How to do Descriptive Statistics in Python using NumPy

This post was originally published here

In this short post we are going to revisit how to carry out summary/descriptive statistics in Python. In the previous post I used Pandas (along with SciPy and NumPy; see Descriptive Statistics Using Python), but now we are only going to use NumPy. The descriptive statistics we are going to calculate are the central tendency (in this case only the mean), standard deviation, percentiles (25th and 75th), min, and max.

Loading the data

In this example I am going to use the ToothGrowth dataset (download here). It is pretty easy to load a CSV file using the genfromtxt function:

import numpy as np

data_file = 'ToothGrowth.csv'
data = np.genfromtxt(data_file, names=True, delimiter=",", dtype=None)

Notice the arguments we pass. The first row contains the column names, which is why we set the argument ‘names’ to True. One of the columns also contains strings. Setting ‘dtype’ to None makes NumPy determine the type of each column individually, so the string column can be loaded alongside the numeric ones.
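To see what these arguments do, here is a minimal sketch that reads a small in-memory CSV instead of the real file. The three rows are made up for illustration (they only mimic the column names used in this post), and encoding=None is my own addition so that the text column comes back as ordinary Python strings.

```python
import io
import numpy as np

# Hypothetical sample mimicking the columns used in this post (not the real data)
csv_text = "len,supp,dose\n4.2,VC,0.5\n11.5,VC,1.0\n15.2,OJ,0.5\n"

sample = np.genfromtxt(io.StringIO(csv_text), names=True, delimiter=",",
                       dtype=None, encoding=None)

print(sample.dtype.names)  # column names taken from the header row
print(sample.dtype)        # one inferred type per column
print(sample['len'])       # a column is accessed by its name
```

The result is a one-dimensional structured array: each row is a record, and each column keeps its own type.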

Descriptive statistics using NumPy

In the next code chunk we are going to loop through each level of the two factors (i.e., ‘supp’ and ‘dose’) and create a subset of the data for each crossed level. If you are familiar with Pandas, you may notice that subsetting a NumPy ndarray is pretty simple (data[data[yourvar] == level]). The summary statistics are then appended to a list.
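The subsetting idea can be sketched on its own with a tiny hand-made structured array. The values and the name toy are hypothetical; only the column names follow this post.

```python
import numpy as np

# Hypothetical structured array with the same column names as in this post
toy = np.array([(4.2, 'VC', 0.5), (11.5, 'VC', 1.0), (15.2, 'OJ', 0.5),
                (21.5, 'OJ', 1.0), (16.5, 'VC', 1.0)],
               dtype=[('len', 'f8'), ('supp', 'U2'), ('dose', 'f8')])

# Each comparison yields a boolean array; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy['supp'] == 'VC') & (toy['dose'] == 1.0)
subset = toy[mask]

print(mask)
print(subset['len'])
```

Indexing with the boolean mask keeps only the rows where both conditions hold.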

summary_stats = []
for supp_lvl in np.unique(data['supp']):
    for dose_lvl in np.unique(data['dose']):
        # Subset the rows for this supp/dose combination
        data_to_sum = data[(data['supp'] == supp_lvl) & (data['dose'] == dose_lvl)]
        # Calculate the descriptives for tooth length
        mean = data_to_sum['len'].mean()
        sd = data_to_sum['len'].std()  # ddof=0 by default (population SD)
        max_supp = data_to_sum['len'].max()
        min_supp = data_to_sum['len'].min()
        ps = np.percentile(data_to_sum['len'], [25, 75])
        summary_stats.append((mean, sd, max_supp, min_supp, ps[0], ps[1], supp_lvl, dose_lvl))
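One detail worth knowing when comparing these numbers with Pandas output: NumPy's std divides by n by default (ddof=0), whereas Pandas' std divides by n - 1 (ddof=1). The sketch below uses made-up numbers to show the difference, along with how np.percentile interpolates between data points by default.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # made-up numbers for illustration

sd_population = values.std()        # ddof=0, NumPy's default: divides by n
sd_sample = values.std(ddof=1)      # divides by n - 1, matching Pandas' default

# Percentiles use linear interpolation between neighbouring values by default
q25, q75 = np.percentile(values, [25, 75])

print(sd_population, sd_sample)
print(q25, q75)
```

If the sample standard deviation is what you want, passing ddof=1 to std in the loop above would be the change to make.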

The results

From the list of summary statistics we are going to create a NumPy array. The reason for doing this is that it gives us slightly prettier output, especially when we set the print options with np.set_printoptions (the second line below).

results = np.array(summary_stats, dtype=None)
np.set_printoptions(suppress=True)
print(results)
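A caveat: because each tuple mixes floats with a string label, np.array falls back to a string type for every cell, so numeric print options have little to work with. One possible alternative, sketched here with made-up rows and hypothetical field names, is to give the array an explicit structured dtype so the numeric columns stay numeric. It assumes the supp labels are ordinary strings.

```python
import numpy as np

# Made-up summary rows, in the same field order as the loop in this post
rows = [(10.0, 2.0, 14.0, 6.0, 8.5, 11.5, 'OJ', 0.5),
        (20.0, 3.0, 26.0, 15.0, 18.0, 22.0, 'VC', 1.0)]

plain = np.array(rows)   # mixed floats and strings: every cell becomes a string
print(plain.dtype)

# Hypothetical field names; 'f8' is a 64-bit float, 'U2' a 2-character string
result_dtype = [('mean', 'f8'), ('sd', 'f8'), ('max', 'f8'), ('min', 'f8'),
                ('p25', 'f8'), ('p75', 'f8'), ('supp', 'U2'), ('dose', 'f8')]
table = np.array(rows, dtype=result_dtype)

np.set_printoptions(suppress=True, precision=2)
print(table['mean'])     # still floats, so the print options apply
print(table['supp'])
```

With named fields, a single statistic can also be pulled out for all groups at once, e.g. table['sd'].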

Results from Numpy and Python descriptive statistics

That was it. I still prefer doing my descriptive statistics using Pandas, primarily because the output is much nicer, but also because Pandas dataframes are easier to work with than NumPy arrays.

The post How to do Descriptives Statistics in Python using Numpy appeared first on Erik Marsja.
