Repeated measures ANOVA using Python

Within-subjects designs are common in experimental psychology, and one way to analyse data collected with such designs is repeated measures ANOVA. I recently wrote a post on how to conduct a repeated measures ANOVA using Python and rpy2. I wrote that post because the great Python package statsmodels does not include repeated measures ANOVA. However, the rpy2 approach requires that the R statistical environment is installed. Recently, I found a Python library called pyvttbl with which you can do within-subjects ANOVAs. Pyvttbl enables you to create multidimensional pivot tables, process data, and carry out statistical tests. Using the anova method on pyvttbl's DataFrame, we can carry out repeated measures ANOVA using only Python.

Why within-subjects designs?

A within-subjects design has at least two advantages. First, more information is obtained from each subject than in a between-subjects design. Each subject is measured in all conditions, whereas in a between-subjects design each subject is typically measured in one or more, but not all, conditions. A within-subjects design thus requires fewer subjects to reach a given level of statistical power. In situations where it is costly to find subjects, this kind of design is clearly better than a between-subjects design. Second, the variability due to individual differences between subjects is removed from the error term. That is, each subject is his or her own control, and extraneous error variance is reduced.
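To make the second point concrete, here is a minimal sketch using only the standard library. The scores are made up for illustration (they are not data from this post): five hypothetical subjects differ a lot in baseline speed, but each one slows down by roughly the same amount in noise.

```python
from statistics import stdev

# Hypothetical response times in ms: subject id -> (quiet, noise).
# Subjects differ a lot in overall speed; the noise effect is similar for all.
scores = {
    1: (400, 505),
    2: (520, 610),
    3: (650, 760),
    4: (780, 875),
    5: (900, 1000),
}

quiet = [q for q, n in scores.values()]

# Between-subjects view: the spread of raw scores within one condition
# includes all the individual differences in baseline speed.
sd_quiet = stdev(quiet)

# Within-subjects view: each subject is their own control, so only the
# spread of the individual noise-minus-quiet differences counts as error.
diffs = [n - q for q, n in scores.values()]
sd_diffs = stdev(diffs)

print(round(sd_quiet, 1), round(sd_diffs, 1))  # 199.2 7.9
```

The baseline differences between subjects dominate the raw scores but cancel out in the difference scores, which is why the within-subjects error term is so much smaller here.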

Repeated measures ANOVA in Python

Installing pyvttbl

pyvttbl can be installed using pip:

pip install pyvttbl

If you are using Linux you may need to add ‘sudo’ before the pip command. This method installs pyvttbl and, hopefully, any missing dependencies.

Python script

I continue by simulating a response time data set. If you have your own data set that you want to analyse, you can use the method read_tbl to load your data from a CSV file.

from collections import namedtuple

from numpy.random import normal
import pyvttbl as pt

N = 40                      # number of subjects
P = ["noise", "quiet"]      # levels of the within-subjects factor
rts = [998, 511]            # mean response time (ms) for each condition

Sub = namedtuple('Sub', ['Sub_id', 'rt', 'condition'])
df = pt.DataFrame()
for subid in range(N):
    for i, condition in enumerate(P):
        # One row per subject and condition (long format)
        rt = normal(rts[i], scale=112., size=1)[0]
        df.insert(Sub(subid + 1, rt, condition)._asdict())

Conducting the repeated measures ANOVA with pyvttbl is pretty straightforward. You take the pyvttbl DataFrame object and call its anova method. The first argument is your dependent variable (e.g., response time), and with sub you specify the column that holds the subject IDs (e.g., sub='Sub_id'). Finally, you add your within-subjects factor(s) with wfactors, which takes a list of column names containing your within-subjects factors. In my simulated data there is only one ('condition').

aov = df.anova('rt', sub='Sub_id', wfactors=['condition'])
print(aov)

Tests of Within-Subjects Effects

Measure: rt

Source                                Type III Sum of Squares  ε      df      MS           F        Sig.   η2G    Obs.    SE of x̄   ±95% CI  λ        Obs. Power
condition         Sphericity Assumed  4209536.428              -      1.000   4209536.428  309.093  0.000  4.165  40.000  19.042    37.323   317.019  1.000
                  Greenhouse-Geisser  4209536.428              1.000  1.000   4209536.428  309.093  0.000  4.165  40.000  19.042    37.323   317.019  1.000
                  Huynh-Feldt         4209536.428              1.000  1.000   4209536.428  309.093  0.000  4.165  40.000  19.042    37.323   317.019  1.000
                  Box                 4209536.428              1.000  1.000   4209536.428  309.093  0.000  4.165  40.000  19.042    37.323   317.019  1.000
Error(condition)  Sphericity Assumed  531140.646               -      39.000  13618.991
                  Greenhouse-Geisser  531140.646               1.000  39.000  13618.991
                  Huynh-Feldt         531140.646               1.000  39.000  13618.991
                  Box                 531140.646               1.000  39.000  13618.991

As can be seen in the output table, the Sum of Squares used is Type III, which is what common statistical software uses when calculating ANOVA (the F-statistic), e.g., SPSS or R packages such as 'afex' or 'ez'. The table further contains corrections in case our data violate the assumption of sphericity (which, for a factor with only two levels, as in the simulated data, is nothing to worry about). As you can see, we also get generalized eta squared as an effect size measure and 95% confidence intervals. The docstring for the class Anova states that standard errors and 95% confidence intervals are calculated according to Loftus and Masson (1994). Furthermore, generalized eta squared allows comparability across between-subjects and within-subjects designs (see Olejnik & Algina, 2003).
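The F-ratio in the table can be understood by computing a one-way repeated measures ANOVA by hand. The following is a minimal sketch in plain Python, not pyvttbl's implementation; the function name rm_anova_oneway and the small data set are made up for illustration. For the table above, the same arithmetic gives MS = 531140.646 / 39 ≈ 13618.99 for the error term and F = 4209536.428 / 13618.991 ≈ 309.09.

```python
def rm_anova_oneway(data):
    """One-way repeated measures ANOVA (sphericity assumed).

    data: list of rows, one row per subject, one column per condition.
    Returns (F, df_condition, df_error).
    """
    n = len(data)       # number of subjects
    k = len(data[0])    # number of conditions
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    # Partition the total sum of squares into condition, subject and error.
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj   # subject variance is removed

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    F = (ss_cond / df_cond) / (ss_error / df_error)
    return F, df_cond, df_error


# Hypothetical response times (ms): columns are (quiet, noise).
data = [[400, 505], [520, 610], [650, 760], [780, 875], [900, 1000]]
F, df1, df2 = rm_anova_oneway(data)
print(F, df1, df2)
```

With only two conditions, this F equals the square of the paired-samples t statistic, which is a handy sanity check.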

Conveniently, if you ever want to transform your data, you can add the argument transform. There are several options here: log or log10, reciprocal or inverse, square-root or sqrt, arcsine or arcsin, and windsor10. For instance, if you want to use a log transformation, you just add the argument transform='log' (any of the previously mentioned options can be passed as a string):

aovlog = df.anova('rt', sub='Sub_id', wfactors=['condition'], transform='log')
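To get a feel for what such transformations do to skewed response times, here is a small sketch in plain Python. It is not pyvttbl's implementation: the response times are made up, and I assume that 'log' means the natural logarithm, so check the pyvttbl documentation before relying on that.

```python
import math

# Hypothetical response times in ms, with one slow outlier.
rts = [412.0, 455.0, 498.0, 530.0, 1250.0]

# Plain-Python stand-ins for some of the transform options.
transforms = {
    'log': math.log,                  # natural logarithm (assumed meaning)
    'log10': math.log10,
    'reciprocal': lambda x: 1.0 / x,
    'sqrt': math.sqrt,
}

transformed = {name: [f(x) for x in rts] for name, f in transforms.items()}

# How far the largest value sits from the median, before and after log.
ratio_raw = max(rts) / rts[2]
ratio_log = max(transformed['log']) / transformed['log'][2]
print(round(ratio_raw, 2), round(ratio_log, 2))
```

The slow outlier is pulled much closer to the rest of the sample after the log transformation, which is the usual reason for transforming response times.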

Using pyvttbl we can also analyse mixed-design/split-plot (within-between) data. Doing a split-plot analysis is easy: just add the argument bfactors with a list of your between-subjects factors. If you are interested in one-way ANOVA for independent measures, see my newer post: Four ways to conduct one-way ANOVAs with Python.
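The data layout for a mixed design can be sketched as follows. The between-subjects factor 'group' and its levels are hypothetical, added only for illustration, and the rows are built with the standard library so that the snippet runs without pyvttbl. The pyvttbl calls in the closing comment follow the pattern used earlier in this post.

```python
from collections import namedtuple
import random

# 'group' is a hypothetical between-subjects factor added for illustration.
Sub = namedtuple('Sub', ['Sub_id', 'rt', 'condition', 'group'])

random.seed(1)
conditions = ['noise', 'quiet']
means = {'noise': 998, 'quiet': 511}

rows = []
for subid in range(1, 9):
    # Each subject belongs to exactly ONE group ...
    group = 'young' if subid <= 4 else 'old'
    # ... but is measured in ALL conditions.
    for condition in conditions:
        rt = random.gauss(means[condition], 112.0)
        rows.append(Sub(subid, rt, condition, group)._asdict())

# With pyvttbl installed, each row could be added with df.insert(row)
# and the mixed-design analysis run along the lines of:
#   df.anova('rt', sub='Sub_id', wfactors=['condition'], bfactors=['group'])
```

The key point is that the between-subjects column is constant within a subject, while the within-subjects column takes every level for each subject.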

Finally, I created a function that extracts the F-statistic, mean square error, generalized eta squared, and the p-value from the results obtained with the anova method. It takes a factor as a string, an ANOVA object, and the values you want to extract. Keys for your different factors can be found using the keys method (e.g., aov.keys()).

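def extract_for_apa(factor, aov, values=('F', 'mse', 'eta', 'p')):
    """Return the requested statistics for one factor as a dict."""
    results = {}
    # .items() works on both Python 2 and Python 3
    for key, result in aov[(factor,)].items():
        if key in values:
            results[key] = result

    return results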
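As a usage sketch, the extracted values can be turned into a string for reporting. Everything here besides extract_for_apa is an assumption for illustration: fake_aov is a plain dict standing in for the object returned by df.anova, format_apa is a made-up helper, and the p-value is a placeholder because the table only shows 0.000. The degrees of freedom are passed in by hand.

```python
def extract_for_apa(factor, aov, values=('F', 'mse', 'eta', 'p')):
    """Return the requested statistics for one factor as a dict."""
    return {key: result
            for key, result in aov[(factor,)].items()
            if key in values}


# Hypothetical stand-in for the result of df.anova(...). The numbers are
# copied from the table in this post, except p, which is a placeholder.
fake_aov = {
    ('condition',): {'F': 309.093, 'mse': 13618.991, 'eta': 4.165,
                     'p': 0.0004, 'ss': 4209536.428},
}


def format_apa(res, df1, df2):
    """Format F, MSE and p in a common reporting style (made-up helper)."""
    p = res['p']
    if p < 0.001:
        p_text = 'p < .001'
    else:
        # Drop the leading zero, e.g. 0.045 -> .045
        p_text = 'p = {:.3f}'.format(p).replace('0.', '.', 1)
    return 'F({}, {}) = {:.2f}, MSE = {:.2f}, {}'.format(
        df1, df2, res['F'], res['mse'], p_text)


res = extract_for_apa('condition', fake_aov)
report = format_apa(res, 1, 39)
print(report)  # F(1, 39) = 309.09, MSE = 13618.99, p < .001
```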

Note, the table with the results in this post was created with the private method _within_html. To create an HTML table you will have to import SimpleHTML:

import SimpleHTML
output = SimpleHTML.SimpleHTML('Title of your HTML-table')
aov._within_html(output)
output.write('results_aov.html')

That was all. There is at least one downside to using pyvttbl for within-subjects analysis (ANOVA) in Python: pyvttbl is not compatible with the commonly used Pandas DataFrame. However, this may not be a problem since pyvttbl, as we have seen, has its own DataFrame class. There are also some ways to aggregate and visualize data using pyvttbl. Another downside is that pyvttbl no longer seems to be maintained. You can find Pyvttbl documentation here.

References

Loftus, G. R., & Masson, M. E. (1994). Using confidence intervals in within-subjects designs. Psychonomic Bulletin & Review, 1(4), 476–490.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434–447. http://doi.org/10.1037/1082-989X.8.4.434

The post Repeated measures ANOVA using Python appeared first on Erik Marsja.