Will McGinnis – Page 2

Dummy-Spark Release: pure python mock of spark for testing

September 19, 2016December 4, 2016 Will McGinnis

In a previous post, I mentioned a little project to provide a pure-python mock of apache spark’s RDD object for testing and quick prototyping. Thanks to some help from contributors, we’ve made a bit of progress and now a good bit of the RDD API is supported, including using the newHadoopAPI with elasticsearch-hadoop, and pulling […]

Category Encoders now on conda forge

September 17, 2016December 4, 2016 Will McGinnis

My scikit-learn compatible library of categorical data encoders (category_encoders) is now published on conda forge! Conda, if you didn’t know, is an open source package manager for python (and other things) developed primarily by continuum analytics. Thanks to continuum developer https://github.com/bollwyvl for doing pretty much all of the work to get it working. Check out the […]

Using twitter-pandas to find friends who don’t follow you back

July 28, 2016December 4, 2016 Will McGinnis

Over the past couple of months we’ve been gradually working on twitter-pandas, a pandas dataframe based interface to twitter data (powered by tweepy behind the scenes). I’ve posted about the first limited release previously here. The initial release was focused on just replicating the tweepy API as best as we could as a first building […]

Data Science Things Roundup #9

July 26, 2016December 4, 2016 Will McGinnis

Things got a bit busy and I feel off the wagon posting, but here we are back for the ninth edition of the data science things roundup. If you haven’t seen previous editions, it’s basically just 3 data science or python related articles or packages that I’ve stumbled across recently and thought were interesting. This […]

Projects Update: July 2016

July 24, 2016December 9, 2016 Will McGinnis

About a quarter ago (April), I posted my first regular update on all of the various projects I’m working on. As side projects tend to go, some fall into and out of favor, and occasionally new ones crop up. As I develop on projects, I post regular updates, but it’s helpful to me (and hopefully […]

PyData Atlanta Recap: July 2016

July 23, 2016December 4, 2016 Will McGinnis

Previously I posted a little bit about the PyData Atlanta meetup in June, where I presented a lightning talk on my project: git-pandas. The July meetup was this past Wednesday, and it was as always very interesting so I thought I would do a quick recap of what all was presented. Main The main talk this […]

PyData Atlanta Meetup: git pandas talk

July 18, 2016December 4, 2016 Will McGinnis

One of my favorite meetups in Atlanta is the PyData meetup, monthly at General Assembly (a pretty nice venue in ponce city market). I try to goto a few meetups a month around town, and this is the only one I’ve been to that is consistently well organized, interesting, and relaxed. It’s less like a […]

Twitter-Pandas, first release

July 2, 2016December 4, 2016 Will McGinnis

Thanks to some great help from contributors, we’ve just pushed the first release of twitter pandas, v0.0.1. The first release is aimed at replicating the data-providing (no create/update/delete functions) from the tweepy API with the git-pandas style pandas interface. To install twitterpandas, just use pip pip install twitterpandas And then you can use it right […]

Beyond One-Hot: incremental improvements in categorical encoding

June 17, 2016June 17, 2016 Will McGinnis

The beyond-one-hot project has started to grow up. Last fall, I did a couple of posts comparing different methods of encoding categorical variables for machine learning problems. You can check them out here and here respectively. Those posts were pretty well received, so the hacky little script that was used to make the plots got […]

Introducing unified glob-syntax in git-pandas

June 15, 2016June 15, 2016 Will McGinnis

In an effort to improve the user interface to git-pandas, I’m introducing a new way of specifying which files in a repository you care about, which will become the sole way of specifying this kind of thing in version 2.0.0. Currently, for any given function, you can specify a list of extensions you’d like to […]

Mixed-mode estimation in petersburg

June 13, 2016June 13, 2016 Will McGinnis

A couple of months ago I posted an overview of simple estimation of hierarchical events using python and petersburg. At the time it probably seemed a little bit trivial, just building a structured frequency model and drawing samples from it. But I have finally implemented the next step to complete the intended functionality. This post […]

Parallelizing cumulative blame in git-pandas with joblib

June 12, 2016June 12, 2016 Will McGinnis

It’s been a little while since I’ve posted anything about git-pandas, as I’ve been working on getting a sister project, twitter-pandas up and running. Work has continued though, and today I’d like to show a currently experimental feature, parallelized cumulative blame. Cumulative blame is one of the more popular features used in git-pandas, as it […]

Data Science Things Roundup #7

May 24, 2016November 24, 2016 Will McGinnis

This weeks edition of the Data Science Things Roundup is pretty python-heavy, as opposed to previous editions that were a bit more machine learning and dataviz heavy. At the end of the day, some kind of software is backing most of data science, so getting a bit lower level can be useful sometimes. This week […]

Twitter-Pandas: like git-pandas, but for twitter.

May 19, 2016May 19, 2016 Will McGinnis

I’ve got a python library that I’ve posted here before, that people seem to like called git-pandas. The idea is to provide a pandas-centric interface to the data in a git repository. To start with, we added simple representations of common datasets (commits, file changes, branches, etc), and as the library grew, we added in […]

Data Science Things Roundup #6

May 17, 2016November 24, 2016 Will McGinnis

Time again for the weekly data science things roundup. If you haven’t seen this before, check out some of the previous ones to get a feel for it. Each Tuesday I run through 3 things I’ve found interesting and bookmarked recently, generally related to python and data science (with some admitted diversions). This week is […]