Projects Update: April 2016

Over the past year or so I’ve made a conscious effort to put out more public facing work through this site, some side projects and through github.  Now, as a few have started to mature and take shape, I wanted to do the first of what will become an occasionally recurring kind of post: projects updates.

The idea here is to touch on all of the things I’m working on here in one place once every quarter or so. I’ll try to go through what each project is, maybe a bit about it’s intended purpose and what my plans for it are.  If you’re a reader of this blog and there’s something you’d like to contribute to, reach out to me here or on github.

Side Projects

First off, side projects.

Ce1pQGgWAAIjRvL
I’ve posted a couple of times recently about www.pedalwrencher.com.  That project was first launched about this time last year, and more or less ran unattended for 10 of those 12 months.  In recent weeks I’ve made an attempt to reboot the styling and usability to maybe grow it into something more useful.

Open Source Projects

I’ve managed to publish a pretty decent collection of projects over the past year or two, some of which are much more actively developed than others, and most of which have been touched on here at one point or another.  In no particular order:

  • git-pandas: this is my most starred repo, and is probably one of the most actively developed as well.  I’ve posted quite a bit about it, so I’ll be breif, but it is a library for analyzing git-based data using the popular dataframe package, pandas.   There are some open issues I’d like to solve here, notably support for glob syntax for ignoring or selecting particular portions of a codebase.
  • categorical_encoding: this library was originally just a script backing a blog post, but eventually grew into a standalone pip-installable package that is being used in production in at least a handful of places that I know about.  I’d like to improve testing and performance to help solidify it’s usefulness in that regard, and maybe one day reduce some of the external requirements (patsy, pandas, statsmodels, etc.) to integrate with scikit-learn without so much baggage.
  • pypi-publisher: ppp is a little command line interface for publishing things to pypi, which I use personally and professionally nearly daily, so maintenance will for sure continue there.  I’d like to continue work on integrating support for publishing documentation with gh-pages and sphinx, as well as on verification of deploys.
  • gitnoc: gitnoc was originally a really hacky flask app for playing with git-pandas.  Over time it’s evolved into a bit more polished but still pretty hacky flask app for playing with git-pandas.  It has, however, gotten to be pretty capable, and the needs of it now tend to lead the eventual requirements of git-pandas.  I use it personally very regularly, particularly the risk identification pages, so expect continued development.
  • cookiecutter-flask: I haven’t touched this in quite a while, but people seem to continue using it which is great.  Many of my side projects use this, though it’s often a bit heavyweight for what I’m doing.
  • pygeohash: pygeohash is in use in production in a number of places, and is stable, but at this point, I’m not sure if there is anything useful to add to it, so it will be in maintenance mode unless someone suggests something for it.
  • DummyRDD: this particular project is an interesting one because as-is it meets my personal needs but is clearly a long, long, long way from mocking out the whole Spark API. I’d be very interested to get some feedback or help from any contributors, it would be a great way for a python dev to learn the internals of how Apache Spark works.
  • pyculiarity: this is a pure python implementation of Twitter’s time series anomaly detection algorithm. Like a few other projects listed here it’s complete and in production, and will be maintained but there are no new features planned for it at this stage.
  • petersburg: this is totally a pet project of mine.  I’ve posted about it a few times here, and I doubt anyone is using it, but I have some big plans that may make it actually really useful for some practical data science problems.
  • flink-python-examples: This is woefully out of date since the v1.0.0 release of Apache Flink a few weeks ago, and in need of an update.  The python API for Flink is not super well supported and barely documented, so it’s a pretty decent time commitment, but I plan to work on that in coming weeks.

Miscellanea

Aside from those above, there are a few projects mentioned here, or committed to once or twice that are probably one offs or failed experiments.  Unless something changes with these, they won’t be in the next update.

The post Projects Update: April 2016 appeared first on Will’s Noise.