Descriptive Statistics using Python

Descriptive Statistics

After data collection, most Psychology researchers summarise their data in one way or another. In this tutorial we will learn how to do descriptive statistics in Python. Python, being a programming language, offers many ways to carry out descriptive statistics.
Pandas makes data manipulation and summary statistics quite similar to how you would do it in R. I believe that the dataframe in R is very intuitive to use, and Pandas offers a DataFrame object similar to R's. Also, many Psychology researchers may have experience with R.

Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start by using Pandas to obtain summary statistics and some variance measures. After that we continue with measures of central tendency (e.g., mean and median) using Pandas and NumPy. The harmonic, geometric, and trimmed means are not available as Pandas or NumPy methods, so we use SciPy. Towards the end we learn how to get some measures of variability (e.g., variance using Pandas).

import numpy as np
from pandas import DataFrame as df
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean

Simulate response time data

In experimental psychology, response time is often the dependent variable. I will simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data will also have two independent variables (IVs): “iv1” has 2 levels and “iv2” has 3 levels. The data are simulated at the same time as the dataframe is created, and the first descriptive statistics are obtained using the method describe.

N = 20
P = ["noise","quiet"]
Q = [1,2,3]

values = [[998,511], [1119,620], [1300,790]]

mus = np.concatenate([np.repeat(value, N) for value in values])

data = df(data = {'id': list(range(N))*(len(P)*len(Q))
,'iv1': np.concatenate([np.array([p]*N) for p in P]*len(Q))
,'iv2': np.concatenate([np.array([q]*(N*len(P))) for q in Q])
,'rt': np.random.normal(mus, scale=112.0, size=N*len(P)*len(Q))})
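
The same design can also be built cell by cell, which makes it easier to see which mean belongs to which combination of levels. The following is only a sketch: the names rng and sim and the seed 42 are my own assumptions, and the seeded generator is there only so that the numbers are reproducible.

```python
import numpy as np
import pandas as pd

N = 20                                            # participants per cell
P = ["noise", "quiet"]                            # levels of iv1
Q = [1, 2, 3]                                     # levels of iv2
values = [[998, 511], [1119, 620], [1300, 790]]   # cell means, one row per level of iv2

rng = np.random.default_rng(42)                   # arbitrary seed (assumption)

rows = []
for q, cell_means in zip(Q, values):
    for p, mu in zip(P, cell_means):
        # One cell of the design: N response times around the cell mean
        rts = rng.normal(loc=mu, scale=112.0, size=N)
        rows.append(pd.DataFrame({"id": range(N), "iv1": p, "iv2": q, "rt": rts}))

sim = pd.concat(rows, ignore_index=True)

# Every combination of iv1 and iv2 should contain N observations
cell_sizes = sim.groupby(["iv1", "iv2"]).size()
print(cell_sizes)
```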

Descriptive statistics using Pandas

data.describe()

Calling this method makes Pandas return summary statistics for the numeric columns. The output is a table.

Output table of data.describe()

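Typically, a researcher is interested in the descriptive statistics for each level of the IVs. Therefore, I group the data by these variables. Using describe on the grouped data returns the summary statistics for each combination of levels of the IVs. On its own, this output can be somewhat hard to read. Note that the method unstack is used to get the mean, standard deviation (std), etc. as columns, which makes the output somewhat easier to read. In recent versions of Pandas, describe on a grouped column already returns these statistics as columns, so unstack may not be needed there.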

grouped_data = data.groupby(['iv1', 'iv2'])

grouped_data['rt'].describe().unstack()

Output from describe on the grouped data
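
To make the shape of the grouped output concrete, here is a small sketch on made-up data (the DataFrame toy and its values are my own, not from the simulation above). It assumes a reasonably recent version of Pandas, where describe on a grouped column returns one row per group and one column per statistic.

```python
import pandas as pd

# Tiny made-up data set: two observations per level of iv1
toy = pd.DataFrame({
    "iv1": ["noise", "noise", "quiet", "quiet"],
    "iv2": [1, 1, 1, 1],
    "rt": [1000.0, 1100.0, 500.0, 600.0],
})

# One row per (iv1, iv2) combination, one column per statistic
summary = toy.groupby(["iv1", "iv2"])["rt"].describe()
print(summary)
```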

Central tendency

Often we want to know something about the “average” or “middle” of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode can also be obtained using Pandas, but for the mode and the trimmed mean I will use functions from SciPy.

Mean

There are at least two ways of doing this using our grouped data. First, Pandas has the method mean:

grouped_data['rt'].mean().reset_index()

But the method aggregate in combination with NumPy's mean can also be used:

grouped_data['rt'].aggregate(np.mean).reset_index()

Both methods give the same output, but the aggregate method has some advantages that I will explain later.

Output of mean and aggregate using NumPy – Mean
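
That the two routes agree can be checked on a small made-up example. This is only a sketch: the data in toy are my own, and I pass the string "mean" to aggregate instead of np.mean, which I assume is the spelling recent Pandas versions prefer.

```python
import pandas as pd

toy = pd.DataFrame({
    "iv1": ["noise", "noise", "quiet", "quiet"],
    "rt": [1000.0, 1100.0, 500.0, 600.0],
})
grouped = toy.groupby("iv1")["rt"]

direct = grouped.mean()              # the built-in method
via_agg = grouped.aggregate("mean")  # the same statistic through aggregate

print(direct.equals(via_agg))
```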

Geometric & Harmonic mean

Sometimes the geometric or harmonic mean can be of interest. These two descriptives can be obtained using the method apply with the functions gmean and hmean (from SciPy) as arguments. That is, there is no method in Pandas or NumPy that calculates the geometric or harmonic mean directly.

Geometric
grouped_data['rt'].apply(gmean, axis=None).reset_index()
Harmonic
grouped_data['rt'].apply(hmean, axis=None).reset_index()
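
For intuition, here is a small sketch with made-up numbers where both means are easy to verify by hand. The data and the column name cond are my own, and I import gmean and hmean from scipy.stats rather than scipy.stats.mstats. For the values 2 and 8, the geometric mean is sqrt(2 * 8) = 4 and the harmonic mean is 2 / (1/2 + 1/8) = 3.2.

```python
import pandas as pd
from scipy.stats import gmean, hmean

toy = pd.DataFrame({
    "cond": ["a", "a", "b", "b"],
    "rt": [2.0, 8.0, 1.0, 4.0],
})
grouped = toy.groupby("cond")["rt"]

# Pass each group's values to the SciPy functions as plain NumPy arrays
geo = grouped.apply(lambda s: gmean(s.to_numpy()))
har = grouped.apply(lambda s: hmean(s.to_numpy()))

print(geo)
print(har)
```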

Trimmed mean

Trimmed means are, at times, used. Pandas and NumPy do not seem to have methods for obtaining the trimmed mean. However, we can use the function trim_mean from SciPy. By using apply on our grouped data we can call the function (‘trim_mean’) with an argument that removes the 10 % largest and the 10 % smallest values before the mean is calculated.

trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
trimmed_mean.reset_index()
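
A small sketch with made-up response times shows what the trimming does. With ten values and a proportion of 0.1, one value is cut from each end, so the outlier of 100 no longer pulls the mean upwards.

```python
import numpy as np
from scipy.stats import trim_mean

# Nine ordinary values and one extreme value (made-up data)
rts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

plain = rts.mean()             # uses all ten values
trimmed = trim_mean(rts, 0.1)  # drops the smallest and the largest value first

print(plain, trimmed)
```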

Output from the mean values above (trimmed, harmonic, and geometric means):

Trimmed Mean

Harmonic Mean

Geometric Mean

Median

As with the mean there are also at least two ways of obtaining the median;

grouped_data['rt'].median().reset_index()
grouped_data['rt'].aggregate(np.median).reset_index()

Output of aggregate using NumPy – Median.

Mode

There is a method (i.e., pandas.DataFrame.mode()) for getting the mode for a DataFrame object. However, it cannot be used on the grouped data so I will use mode from SciPy:

grouped_data['rt'].apply(mode, axis=None).reset_index()
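
An alternative that stays within Pandas is to call the Series method mode inside aggregate. This is only a sketch on made-up integer data, since continuous response times rarely contain repeated values. When several values are tied, mode returns them sorted, and picking the first one is my own choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "cond": ["a", "a", "a", "b", "b", "b"],
    "rt": [500, 500, 620, 700, 810, 810],
})

# Series.mode() returns all modes in sorted order; keep the first (smallest) one
modes = toy.groupby("cond")["rt"].agg(lambda s: s.mode().iloc[0])

print(modes)
```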

Most of the time I would probably want to see several descriptives at the same time. Luckily, aggregate accepts a list of functions, including many NumPy and SciPy functions. In the example below the median, standard deviation (std), mean, and trimmed mean are all in the same output. Note that we have to add the trimmed means afterwards.

descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index()

descr['trimmed_mean'] = trimmed_mean.values
descr

Output of aggregate using some of the methods.
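
The trimmed mean can also be computed in the same call as the other statistics by wrapping trim_mean in a small function. The sketch below uses made-up data, a hypothetical helper called trimmed_10, and the keyword form of aggregate (one keyword per output column), which I assume is available in your version of Pandas.

```python
import pandas as pd
from scipy.stats import trim_mean

toy = pd.DataFrame({
    "cond": ["a"] * 10 + ["b"] * 10,
    "rt": [float(x) for x in list(range(1, 10)) + [100] + list(range(11, 21))],
})

def trimmed_10(s):
    """Hypothetical helper: mean after cutting 10 % from each end."""
    return trim_mean(s.to_numpy(), 0.1)

# Each keyword becomes a column name; the value is the function to apply
table = toy.groupby("cond")["rt"].agg(
    median="median",
    std="std",
    mean="mean",
    trimmed_mean=trimmed_10,
).reset_index()

print(table)
```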

Measures of variability

Central tendency (e.g., the mean & median) is not the only type of summary statistic that we want to calculate. When analysing data we also want a measure of the variability of the data.

Standard deviation

grouped_data['rt'].std().reset_index()

Interquartile range

Note that unstack() is used here as well, so that the quantiles become columns and the output is easier to read.

grouped_data['rt'].quantile([.25, .5, .75]).unstack()

Interquartile range (IQR) using Pandas quantile
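
The quantile table gives the quartiles, and the interquartile range itself is the difference between the 75th and the 25th percentile. A minimal sketch on made-up data (the names toy, quartiles, and iqr are my own):

```python
import pandas as pd

toy = pd.DataFrame({
    "cond": ["a"] * 5 + ["b"] * 5,
    "rt": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per group, one column per requested quantile
quartiles = toy.groupby("cond")["rt"].quantile([0.25, 0.75]).unstack()

# IQR = third quartile minus first quartile
iqr = quartiles[0.75] - quartiles[0.25]

print(iqr)
```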

Variance

grouped_data['rt'].var().reset_index()

Variance using pandas var method
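
One detail worth knowing when comparing numbers across libraries: the Pandas method var divides by n - 1 by default (the sample variance), whereas NumPy's var divides by n unless ddof=1 is given. The sketch below uses made-up values with a mean of 5 and a sum of squared deviations of 32.

```python
import numpy as np
import pandas as pd

rts = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = rts.var()                         # Pandas default: ddof=1, i.e. 32 / 7
pop_var = np.var(rts.to_numpy())               # NumPy default: ddof=0, i.e. 32 / 8
numpy_sample = np.var(rts.to_numpy(), ddof=1)  # matches the Pandas default

print(sample_var, pop_var, numpy_sample)
```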

That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy really make these calculations almost as easy as doing them in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can pass in other methods or functions to obtain other types of descriptives.

I am sorry that the images (i.e., the tables) are so ugly. If you happen to know a good way to output tables and figures from Python (something like Knitr & Rmarkdown) please let me know.

The post Descriptive Statistics using Python appeared first on Erik Marsja.