Git-pandas caching for faster analysis


Git-pandas is a Python library I wrote to make analysis of git data easier when dealing with collections of repositories. It makes a ton of cool stuff easier, like cumulative blame plots, but those analyses can be slow, especially with many large repositories. In the past we've worked around that by running analyses offline and by sampling, but most of the work is repeated from run to run.

Enter caching. There are a few places in the codebase where we can cache result sets by revision key and get pretty significant performance boosts when using the library in something like gitnoc. And it turns out it's pretty straightforward.

Currently in develop, we've got a new module with a custom Python decorator to handle caching by different mechanisms:

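def multicache(key_prefix, key_list, skip_if=None):
    # EphemeralCache, RedisDFCache and CacheMissException are git-pandas'
    # own cache backends and cache-miss exception
    def multicache_nest(func):
        def deco(self, *args, **kwargs):
            # No backend configured: call straight through
            if self.cache_backend is None:
                return func(self, *args, **kwargs)
            else:
                # Let the caller opt out for specific kwargs (e.g. rev='HEAD')
                if skip_if is not None:
                    if skip_if(kwargs):
                        return func(self, *args, **kwargs)

                # Build the key from the prefix, the repo name and the selected kwargs
                key = key_prefix + self.repo_name + '_'.join([str(kwargs.get(k)) for k in key_list])
                try:
                    if isinstance(self.cache_backend, EphemeralCache):
                        ret = self.cache_backend.get(key)
                        return ret
                    elif isinstance(self.cache_backend, RedisDFCache):
                        ret = self.cache_backend.get(key)
                        return ret
                    else:
                        raise ValueError('Unknown cache backend type')
                except CacheMissException:
                    # Cache miss: compute the result, store it, then return it
                    ret = func(self, *args, **kwargs)
                    self.cache_backend.set(key, ret)
                    return ret

        return deco
    return multicache_nest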

It looks pretty convoluted, but ends up being pretty useful. It creates a decorator that we can use on any method in the Repository class. The instance supplies a cache_backend (currently we have in-memory ephemeral and Redis-based options), and the decorator takes a key_prefix, a list of kwarg keys to use in the cache key, and optionally a lambda that is applied to the kwargs and returns whether to skip caching.
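
To make the key format concrete, here is a small sketch that mirrors the key expression from the decorator. The helper name build_key and the sample values are made up for illustration; only the expression itself comes from the code above.

```python
def build_key(key_prefix, repo_name, key_list, kwargs):
    """Mirror the key expression used inside the decorator."""
    return key_prefix + repo_name + '_'.join([str(kwargs.get(k)) for k in key_list])


# Hypothetical prefix, repo name and kwargs, chosen only for the example
key = build_key('blame_', 'git-pandas', ['rev', 'ignore_globs'], {'rev': 'abc123'})
print(key)  # blame_git-pandasabc123_None
```

Two things fall out of this: a kwarg that was not passed shows up in the key as the string 'None', and positional arguments never reach the key, so cached methods are best called with keyword arguments.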

The lambda is particularly useful for cases like not wanting to cache the results for rev='HEAD', since that can change from moment to moment.
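
A minimal sketch of how that might look in use. Everything here is a stand-in so the example runs on its own: DictCache and FakeRepo are hypothetical, and the decorator is a trimmed version without the backend type checks, not the library's code.

```python
class CacheMissException(Exception):
    """Raised by the stand-in cache when a key is absent."""


class DictCache:
    """Hypothetical stand-in for a cache backend: a plain dict with get/set."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        if key not in self._store:
            raise CacheMissException(key)
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value


def multicache(key_prefix, key_list, skip_if=None):
    """Trimmed decorator: same flow as the post's, minus the isinstance checks."""
    def multicache_nest(func):
        def deco(self, *args, **kwargs):
            if self.cache_backend is None or (skip_if is not None and skip_if(kwargs)):
                return func(self, *args, **kwargs)
            key = key_prefix + self.repo_name + '_'.join(str(kwargs.get(k)) for k in key_list)
            try:
                return self.cache_backend.get(key)
            except CacheMissException:
                ret = func(self, *args, **kwargs)
                self.cache_backend.set(key, ret)
                return ret
        return deco
    return multicache_nest


class FakeRepo:
    """Hypothetical class; the decorator only needs repo_name and cache_backend."""

    def __init__(self, cache_backend=None):
        self.repo_name = 'demo'
        self.cache_backend = cache_backend
        self.calls = 0  # counts how often the expensive method really runs

    @multicache(key_prefix='blame_', key_list=['rev'],
                skip_if=lambda kwargs: kwargs.get('rev') in (None, 'HEAD'))
    def blame(self, rev='HEAD'):
        self.calls += 1
        return 'blame for ' + rev


repo = FakeRepo(cache_backend=DictCache())
repo.blame(rev='abc123')
repo.blame(rev='abc123')   # second call is served from the cache
pinned_calls = repo.calls  # the method body ran once
repo.blame(rev='HEAD')
repo.blame(rev='HEAD')     # skip_if fires, so both calls run the method
total_calls = repo.calls
```

The lambda only sees keyword arguments that were actually passed, so this sketch also treats a missing rev as HEAD. Otherwise a call that relies on the default would be cached under a key ending in 'None'.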

Each of the two caching backends implements the basic get/set/purge functionality, and lets you set a maximum number of keys to get something like an LRU cache.
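
As a rough illustration of that interface, here is one way an in-memory backend with a key limit could be written. The class name TinyLRUCache and the max_keys parameter are assumptions for this sketch; the actual backends in git-pandas may be implemented differently.

```python
from collections import OrderedDict


class CacheMissException(Exception):
    """Raised when a key is not in the cache."""


class TinyLRUCache:
    """Hypothetical in-memory cache with get/set/purge and a key limit."""

    def __init__(self, max_keys=1000):
        self.max_keys = max_keys
        self._store = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._store:
            raise CacheMissException(key)
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self.max_keys:
            self._store.popitem(last=False)  # evict the least recently used key

    def purge(self):
        self._store.clear()


cache = TinyLRUCache(max_keys=2)
cache.set('a', 1)
cache.set('b', 2)
cache.get('a')      # 'a' becomes the most recently used key
cache.set('c', 3)   # over the limit, so 'b' is evicted
remaining = list(cache._store)
```

Raising an exception on a miss, instead of returning None, lets a decorator tell "nothing cached" apart from a cached value that happens to be empty.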

One interesting nugget from the Redis cache: the objects we cache in git-pandas are always pandas DataFrames. To store those in Redis we can serialize and deserialize them with:

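# self._cache is a connection to redis
# Note: to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0,
# so this needs an older pandas release
# set: serialize the dataframe v under key k, expiring after self.ttl seconds
self._cache.set(k, v.to_msgpack(compress='zlib'), ex=self.ttl)
# get: read the bytes back and rebuild the dataframe
df = pd.read_msgpack(self._cache.get(k))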
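
Since the msgpack methods are gone from pandas 1.0 onward, one possible substitute is pickling plus zlib compression from the standard library. This is a swapped-in technique, not what git-pandas does; the serialize and deserialize helpers are hypothetical, and a plain dict stands in for the DataFrame so the sketch runs without pandas or Redis.

```python
import pickle
import zlib


def serialize(obj):
    """Pickle an object and zlib-compress the bytes for storage."""
    return zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def deserialize(blob):
    """Reverse serialize(): decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))


# A plain dict stands in for the DataFrame in this sketch
table = {'committer': ['alice', 'bob'], 'loc': [120, 45]}
blob = serialize(table)       # bytes, usable as a Redis value
restored = deserialize(blob)  # equal to the original object
```

A DataFrame is picklable, so the same two helpers should work on one unchanged. Unpickling can execute arbitrary code, so this only makes sense against a Redis instance you trust.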

It's still being tested, and is probably one of the last things we will cram in before releasing git-pandas 2.0.0, so check out the repository over on GitHub, try it out, and let me know what you think.
