Will’s Noise



http://www.willmcginnis.com | Data Science, Technology, Atlanta

Published December 24, 2017

Bowl Game Pick 'em Results



If you haven't read my previous post on picking bowl game winners with elote, this may not make a whole lot of sense, but basically I wrote a rating system, trained it on the college football season thus far, and used it to predict winners for every bowl game. In this post, I'm tracking how it did.

I'll update this as the games happen; currently we're at 6/15 wins, which is not great.

All matchups are in the format: “team I picked to win – team I picked to lose”

Matchup – Result
West Virginia – Utah
N Illinois – Duke
UCLA – Kansas State
Florida State – Southern Miss
BC – Iowa
Purdue – Arizona
Missou – Texas
Navy – Virginia
Oklahoma St. – VT
TCU – Stanford
Michigan St. – Washington St.
Texas A&M – Wake Forest
Arizona St. – NC State
Northwestern – Kentucky
New Mexico State – Utah State
Ohio State – USC
Miss St. – Louisville
Memphis – Iowa St.
Penn St. – Washington
Miami – Wisconsin
USCe – Michigan
Auburn – UCF
LSU – Notre Dame
Oklahoma – UGA
Clemson – Alabama
N. Texas – Troy WRONG
Georgia State – WKU RIGHT
Oregon – Boise St. WRONG
Colorado St. – Marshall WRONG
Arkansas St. – MTSU WRONG
Grambling – NC A&T WRONG
Reinhardt – St. Francis IN WRONG
FL Atlantic – Akron RIGHT
SMU – Louisiana Tech WRONG
Florida International – Temple WRONG
Ohio – UAB RIGHT
Wyoming – C. Michigan RIGHT
S. Florida – Texas Tech RIGHT
Army – SDSU RIGHT
Toledo – App State WRONG
Houston – Fresno State

Published December 9, 2017

On taking things too seriously: holiday edition



For some reason Atlanta got a pretty significant amount of snow yesterday, and because of that I've been mostly stuck at home. When faced with that kind of time on hand, sometimes I spend too much time on things that don't really matter all that much. Recently, I've been fascinated with rating systems (see a post on Elote here), so that was in the front of my mind this week.

Every year, around this time, my family does a college football bowl game pick 'em pool. We all pick who we think is going to win each respective bowl game, and whoever gets the most right at the end of it all (sometimes weighted by the tier of the game) wins a prize. The prize is unimportant; what's important is that I've never won. And that bothers me.

So for the past day I've been continuing to develop elote, a python package for building rating systems, and two complementary projects that I just published:

  1. keeks: a python package for bankroll allocation strategies, like the Kelly Criterion
  2. keeks-elote: a python package for backtesting coupled rating systems and bankroll allocation strategies
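For reference, the Kelly Criterion that keeks implements boils down to a one-line formula. This is a sketch of the textbook version, not keeks' actual code:

```python
def kelly_fraction(p_win, decimal_odds):
    """Classic single-bet Kelly fraction: f* = (b*p - q) / b, where
    b is the net payout per unit staked and q = 1 - p_win. Returns 0
    when the edge is negative (no bet). The textbook formula, not
    keeks' actual implementation."""
    b = decimal_odds - 1.0
    q = 1.0 - p_win
    return max((b * p_win - q) / b, 0.0)

# a 60% chance at even money (decimal odds 2.0) -> bet 20% of bankroll
```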

So with all 3 of these, some historical odds data, and the data for this season of college football games, I can develop a rating system capable of ranking football teams at each week of the season, a prediction component which estimates likelihood of victory between any two teams using those rankings, a bankroll allocation strategy to turn those estimates and odds into a set of bets, and a backtesting system to evaluate the whole thing. That sounds like a lot, because it is.

So here's what the script actually looks like at the end (I recommend reading the elote post before this if you haven't already):

from elote import LambdaArena, EloCompetitor, ECFCompetitor, GlickoCompetitor, DWZCompetitor
from keeks import KellyCriterion, BankRoll, Opportunity, AllOnBest
from keeks_elote import Backtest
import datetime
import json


# we already know the winner, so the lambda here is trivial
def func(a, b):
    return True


# the matchups are filtered down to only those between teams deemed 'reasonable', by me.
filt = {x for _, x in json.load(open('./data/cfb_teams_filtered.json', 'r')).items()}
games = json.load(open('./data/cfb_w_odds.json', 'r'))

# batch the games by week of year
games = [(datetime.datetime.strptime(x.get('date'), '%Y%m%d'), x) for x in games]
start_date = datetime.datetime(2017, 8, 21)
chunks = dict()
for week_no in range(1, 20):
    end_date = start_date + datetime.timedelta(days=7)
    chunks[week_no] = [v for k, v in games if k > start_date and k <= end_date]
    start_date = end_date

# set up the objects
arena = LambdaArena(func, base_competitor=GlickoCompetitor)
bank = BankRoll(10000, percent_bettable=0.05, max_draw_down=1.0, verbose=1)
# strategy = KellyCriterion(bankroll=bank, scale_bets=True, verbose=1)
strategy = AllOnBest(bankroll=bank, verbose=1)

backtest = Backtest(arena)

# simulates the betting
backtest.run_explicit(chunks, strategy)

# prints projected results of the next week based on this weeks rankings
backtest.run_and_project(chunks)

All of this, including the source data, is in the repo for keeks-elote, under examples.

So to begin with we are basically just setting up our data. Keeks-elote takes data of the form:

{
    period: [
        {
            "winner": label,
            "loser": label,
            "winner_odds": float,
            "loser_odds": float
        },
        ...
    ],
    ...
}

So each week of the season is a period, and each game that week is a nested blob with winner and loser indicated, and odds if we have them. Keeks-elote will iterate through the weeks, making bets and then updating the rankings based on the results of the week.
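Stripped of the betting details, that weekly loop can be sketched with a plain Elo update. This is hypothetical code illustrating the idea, not keeks-elote's internals:

```python
def run_backtest(chunks, ratings, k=32):
    """Walk the season week by week: score each game against the
    current ratings, then fold the result back in. A rough sketch of
    the loop keeks-elote runs, not its actual code."""
    record = []
    for week_no in sorted(chunks):
        for game in chunks[week_no]:
            w, l = game["winner"], game["loser"]
            ratings.setdefault(w, 1500.0)
            ratings.setdefault(l, 1500.0)
            # expected score of the eventual winner, before the game
            e = 1.0 / (1.0 + 10 ** ((ratings[l] - ratings[w]) / 400.0))
            record.append(e > 0.5)  # did the ratings favor the winner?
            ratings[w] += k * (1.0 - e)  # winner scored 1.0
            ratings[l] -= k * (1.0 - e)  # loser scored 0.0
        # in keeks-elote, bets for the coming week would be placed here
    return record
```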

As the user, you can see we really only have to define a few things once the data is in the correct format:

  • Arena: we need to define a lambda arena, which takes in the data as passed. As I work with some more datasets, I expect that this can be handled under the hood by the backtester, but we will see.
  • Bankroll: the bankroll is only needed if you are using a strategy, which is only needed if you are going to use run_explicit to simulate bets. It takes a starting value; you can optionally set a max drawdown percentage to quit at and a percentage of the total that is bettable each period.
  • Strategy: the strategy is what converts likelihoods and odds into a set of bets. Currently there are two types implemented, both shown here: KellyCriterion attempts to be clever, while AllOnBest just puts the max amount bettable on the bet most likely to be correct.
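As an illustration of how simple the AllOnBest idea is, it reduces to a max over likelihoods. This is a sketch with hypothetical structure, not keeks' real Opportunity/strategy API:

```python
def all_on_best(opportunities, bettable):
    """Put the entire bettable amount on the single opportunity with
    the highest estimated likelihood of winning. A sketch of the idea
    only; keeks' real API differs."""
    if not opportunities:
        return []
    best = max(opportunities, key=lambda o: o["likelihood"])
    return [(best["label"], bettable)]
```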

As configured in that script, with the data I have, I get this output (betting doesn't start until we have a few weeks of ratings):

running with week 1

running with week 2

running with week 3

running with week 4
evaluating 500.0 on Opportunity: Buffalo over FL_Atlantic
depositing 384.62 in winnings
bankroll: 10384.62

running with week 5
evaluating 519.23 on Opportunity: Stanford over Arizona_St
depositing 51.92 in winnings
bankroll: 10436.54

running with week 6
evaluating 521.83 on Opportunity: W_Michigan over Buffalo
depositing 177.49 in winnings
bankroll: 10614.03

running with week 7
evaluating 530.7 on Opportunity: Arkansas_St over Coastal_Car
depositing 63.71 in winnings
bankroll: 10677.74

running with week 8
evaluating 533.89 on Opportunity: Colorado_St over New_Mexico
depositing 144.29 in winnings
bankroll: 10822.04

running with week 9
evaluating 541.1 on Opportunity: Notre_Dame over NC_State
depositing 184.05 in winnings
bankroll: 11006.08

running with week 10
evaluating 550.3 on Opportunity: Arkansas over Coastal_Car
depositing 16.51 in winnings
bankroll: 11022.59

running with week 11
evaluating 551.13 on Opportunity: Oklahoma over TCU
depositing 181.89 in winnings
bankroll: 11204.49

running with week 12
evaluating 560.22 on Opportunity: Wake_Forest over NC_State
depositing 368.57 in winnings
bankroll: 11573.05

running with week 13
evaluating 578.65 on Opportunity: Washington over Washington_St
depositing 162.09 in winnings
bankroll: 11735.14

running with week 14
evaluating 586.76 on Opportunity: Boise_St over Fresno_St
depositing 152.41 in winnings
bankroll: 11887.54

Seems reasonable to me.

So before I get to my bowl picks from this system: these projects are pretty fun, and we can make some interesting projections on a lot of things, both within and outside of sports. If you're interested in this kind of thing, comment here or find any of the projects on github and get involved.

Ok, here are the picks, based on Glicko1 ratings, which performed well in backtests (and, more importantly, have Auburn winning and Alabama losing). I'll do another post in about a month with how we did:

Winner – Loser
West Virginia – Utah
N Illinois – Duke
UCLA – Kansas State
Florida State – Southern Miss
BC – Iowa
Purdue – Arizona
Missou – Texas
Navy – Virginia
Oklahoma St. – VT
TCU – Stanford
Michigan St. – Washington St.
Texas A&M – Wake Forest
Arizona St. – NC State
Northwestern – Kentucky
New Mexico State – Utah State
Ohio State – USC
Miss St. – Louisville
Memphis – Iowa St.
Penn St. – Washington
Miami – Wisconsin
USCe – Michigan
Auburn – UCF
LSU – Notre Dame
Oklahoma – UGA
Clemson – Alabama
N. Texas – Troy
Georgia State – WKU
Oregon – Boise St.
Colorado St. – Marshall
Arkansas St. – MTSU
Grambling – NC A&T
Reinhardt – St. Francis IN
FL Atlantic – Akron
SMU – Louisiana Tech
Florida International – Temple
Ohio – UAB
Wyoming – C. Michigan
S. Florida – Texas Tech
Army – SDSU
Toledo – App State
Houston – Fresno State

Published December 6, 2017

Elote: a python package of rating systems



Recently I've been interested in rating systems. Around here, the application most front of mind is college football rankings. In general, imagine any case where you have a large population of things you want to rank, and only a limited set of head-to-head matchups between those things to use for building your ratings. Without a direct comparison point between each possible pair, you've got to try to be clever.

Another classical example of a rating system is chess rankings, and it's in that domain that Arpad Elo developed his rating system, the Elo rating system. Since then the Elo rating has been used in many other domains, and both before and after Elo there have been tons of other rating systems out there. So with that in mind, I wanted to build a nice python package with a good, digestible API that implements lots of these systems. I find it much easier to grok things like this when they can be run side by side on the same dataset anyway.
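The Elo system itself fits in a few lines. This is the standard formulation (400-point scale, K-factor update), not elote's internal code:

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating, expected, actual, k=32):
    """New rating after a bout: actual is 1.0 win, 0.5 draw, 0.0 loss."""
    return rating + k * (actual - expected)

# an upset: a 1400-rated player beats a 1600-rated one and gains ~24 points
e = elo_expected(1400, 1600)
new_rating = elo_update(1400, e, 1.0)
```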

So here's elote: a python package for rating systems. So far I've only implemented Elo and the first version of the Glicko rating system, but I think I've got the structure in a way that makes sense.  Elote is broken down into a few main concepts:

  • Competitors: a competitor is a “thing” which you would like to rank. If you're ranking college football teams, you'd have one competitor per team. The different rating systems are implemented at the competitor level, but as you'll see soon, that's largely abstracted away. In concept, a competitor is just a hashable python object that you can use to identify the “thing”, usually just a string label.
  • Bouts: a bout is a head-to-head matchup between two competitors, generally defined by some lambda function that takes in two competitors and returns True if the first wins, False if the second wins, or None if it's a draw.
  • Arenas: an arena is the part that ties it all together, a central object that creates the competitors, takes in the lambda function and a list of bouts, then evaluates everything. State can be saved from arenas, so you can do 10 bouts, save, then do 10 more later if you want.

Now, a simple example:

from elote import LambdaArena
import json
import random


# sample bout function which just compares the two inputs
def func(a, b):
    return a > b

matchups = [(random.randint(1, 10), random.randint(1, 10)) for _ in range(1000)]

arena = LambdaArena(func)
arena.tournament(matchups)

print(json.dumps(arena.leaderboard(), indent=4))

So here we are using the lambda arena to evaluate a bunch of bouts where the competitor labels are just random integers, and the lambda is just evaluating greater-than. Since the competitors are numbers and larger numbers win, we'd expect to end up with a ranking that looks a lot like a sorted list from 1 to 10. Notice that we don't do anything with competitors here; the arena creates them for us and manages them fully, here using the default EloCompetitor for Elo rating.

Finally, we pass the bouts to the arena using arena.tournament() and dump the leaderboard, yielding something like this:

[
    {
        "rating": 560.0,
        "competitor": 1
    },
    {
        "rating": 803.3256886926524,
        "competitor": 2
    },
    {
        "rating": 994.1660057704563,
        "competitor": 3
    },
    {
        "rating": 1096.0912814220258,
        "competitor": 4
    },
    {
        "rating": 1221.000354671287,
        "competitor": 5
    },
    {
        "rating": 1351.4243548137367,
        "competitor": 6
    },
    {
        "rating": 1401.770230395329,
        "competitor": 7
    },
    {
        "rating": 1558.934907485894,
        "competitor": 8
    },
    {
        "rating": 1607.6971796462033,
        "competitor": 9
    },
    {
        "rating": 1708.3786662956998,
        "competitor": 10
    }
]

And there we have it, a very slow sort function!

There's a bunch more we can do in elote, though. For the full list, check out the examples here in the repo. But while we are here, we can also skip the arena and interact with competitors directly, even using them to predict likelihood of future matchups:

from elote import EloCompetitor

good = EloCompetitor(initial_rating=400)
better = EloCompetitor(initial_rating=500)

print('probability of better beating good: %5.2f%%' % (better.expected_score(good) * 100, ))
print('probability of good beating better: %5.2f%%' % (good.expected_score(better) * 100, ))

good.beat(better)

print('probability of better beating good: %5.2f%%' % (better.expected_score(good) * 100, ))
print('probability of good beating better: %5.2f%%' % (good.expected_score(better) * 100, ))

We can save the state from an arena and re-instantiate a new one pretty simply (here using GlickoCompetitor instead of Elo):

from elote import GlickoCompetitor

saved_state = arena.export_state()
arena = LambdaArena(func, base_competitor=GlickoCompetitor, initial_state=saved_state)

And we can change the variables used in specific rating systems across the entire set of competitors in an arena:

arena = LambdaArena(func)
arena.set_competitor_class_var('_k_factor', 50)

And that's about it for now. Still very early days, so if you're interested in rating systems and want to help out, find me on github or here and let's make some stuff.

http://www.willmcginnis.com/2017/12/06/elote-python-package-rating-systems/feed/ 0 772 http://www.willmcginnis.com/2017/11/28/ripyr-sampled-metrics-datasets-using-pythons-asyncio/ http://www.willmcginnis.com/2017/11/28/ripyr-sampled-metrics-datasets-using-pythons-asyncio/#comments Tue, 28 Nov 2017 15:00:52 +0000

Ripyr: sampled metrics on datasets using python's asyncio



Today I'd like to introduce a little python library I've toyed around with here and there for the past year or so, ripyr. Originally it was written just as an excuse to try out some newer features in modern python: asyncio and type hinting. The whole package is type hinted, which turned out to be a pretty low level of effort to implement, and the asyncio ended up being pretty speedy.

But first the goal: we wanted to stream through large datasets stored on disk in a memory efficient way, and parse out some basic metrics from them. Things like cardinality, what a date field's format might be, an inferred type, that sort of thing.

It's an interesting use-case, because in many cases, pandas is actually really performant here. The way pandas pulls data off of disk into a dataframe can balloon memory consumption for a short time, making analysis on very large files prohibitive, but short of that it's pretty fast and easy.  So keep that in mind if you're dealing with smaller datasets, YMMV.

So using asyncio to lower memory overhead is a great benefit, but additionally, I wanted to make a nicer interface for the developer. Anyone who's written a lot of pandas-based code can probably parse what this is doing (and it probably can be done in some much nicer ways), but it's not super pretty:

import pandas as pd

df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
report = {
    "columns": df.columns.values.tolist(),
    "metrics": {
        "A": {
            "count": df.shape[0]
        },
        "B": {
            "count": df.shape[0],
            "approximate_cardinality": len(pd.unique(df['B']))
        },
        "C": {
            "count": df.shape[0],
            "approximate_cardinality": len(pd.unique(df['C'])),
            "max": df['C'].max()
        },
        "D": {
            "count": df.shape[0]
        },
        "date": {
            "count": df.shape[0],
            "estimated_schema": 'pandas-internal'
        }
    }
}

The equivalent with ripyr is:

cleaner = StreamingColCleaner(source=CSVDiskSource(filename='sample.csv'))
cleaner.add_metric_to_all(CountMetric())
cleaner.add_metric('B', [CountMetric(), CardinalityMetric()])
cleaner.add_metric('C', [CardinalityMetric(size=10e6), MaxMetric()])
cleaner.add_metric('date', DateFormat())
cleaner.process_source(limit=10000, prob_skip=0.5)
print(json.dumps(cleaner.report(), indent=4, sort_keys=True))

I think that is a lot more readable. In that second-to-last line you'll also see the third and final cool thing we do in ripyr. In this example, we will end up with a sampling of 10,000 records regardless of the total size of the dataset, and will skip rows as they come off of disk with a probability of 50%. So running this analysis on just the first N rows, or a random sampling of M rows, is super easy. In many cases, that's all you need to do, so we don't need to be pulling huge datasets into memory in the first place.
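That sampling logic is simple to sketch; a generator along these lines (not ripyr's actual code) yields at most `limit` rows, skipping each incoming row with the given probability:

```python
import random

def sample_rows(rows, limit, prob_skip=0.0):
    """Stream rows, skipping each with probability prob_skip, and stop
    once `limit` rows have been yielded, so memory stays constant no
    matter how big the source is. A sketch, not ripyr's code."""
    taken = 0
    for row in rows:
        if taken >= limit:
            break
        if random.random() < prob_skip:
            continue
        taken += 1
        yield row
```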

Currently there are two supported source types:

  • CSVDisk: a CSV on disk, backed by the standard python CSVReader
  • JSONDisk: a file of one-json-blob-per-row data, without newlines in the json itself anywhere.
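The JSONDisk format is just JSON-lines, so reading it is a matter of parsing each line independently. A sketch of the idea, not ripyr's source class:

```python
import io
import json

def iter_json_rows(fileobj):
    """Yield one parsed object per line, skipping blank lines. Each
    row must be a complete JSON blob with no embedded newlines."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

rows = list(iter_json_rows(io.StringIO('{"a": 1}\n{"a": 2}\n')))
```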

For a given source, you can apply any of a number of metrics:

  • categorical
    • approximate cardinality, based on a bloom filter
  • dates
    • date format inference
  • inference
    • type inference
  • numeric
    • count
    • min
    • max
    • histogram
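The bloom-filter approach to approximate cardinality is worth a quick sketch: hash each value into a fixed-size bit array and count it as new only if it flips at least one bit. Memory stays fixed, at the cost of a slight undercount on collisions. This is a toy version, not ripyr's implementation:

```python
import hashlib

class ApproxCardinality:
    """Toy bloom-filter cardinality counter; not ripyr's actual code."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.count = 0

    def _positions(self, value):
        # derive num_hashes bit positions from salted md5 digests
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        new = False
        for pos in self._positions(value):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit
                new = True
        if new:
            self.count += 1
```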

It's not on PyPI or anything yet, and isn't super actively developed, since it was mostly just a learning exercise for asyncio and type hinting, but if you think it's interesting, find me on github, or comment below. I'd be happy to collaborate with others that find this sort of thing useful.

https://github.com/predikto/ripyr

http://www.willmcginnis.com/2017/11/28/ripyr-sampled-metrics-datasets-using-pythons-asyncio/feed/ 1 778 http://www.willmcginnis.com/2017/11/22/category-encoders-v1-2-5-release/ http://www.willmcginnis.com/2017/11/22/category-encoders-v1-2-5-release/#respond Wed, 22 Nov 2017 15:00:57 +0000

Category Encoders v1.2.5 Release



This release was actually cut a couple of weeks ago, but I forgot to put a post here. It's been a release of mainly incremental changes, but also one of increased contributions from the community, so while not a huge feature-packed release, it's one I'm particularly proud of.  Here's to more like this.

It came around 4 months after the last release, which I think is a pretty decent cadence, considering our level of development.

Some highlights:

  • Andrethrill did some work to make the usage of binary encoding more stable when training/transforming on datasets with different counts of categories
  • The same thing got done in BaseNEncoder
  • Cameron Davison updated the type coercion code for pandas DataFrames to quiet some deprecation warnings.
  • Cameron Davison also did some work to ensure consistent ordering of categories in the ordinal encoder, and the encoders which use it.
  • HBGHHY added leave-one-out encoding, a new method for us, found on Kaggle.

So if you haven't used it already, check out category encoders, it's great. If you do use it and like it, hop on over to github and join us, there's always something new to work on.

https://github.com/scikit-learn-contrib/categorical-encoding

http://www.willmcginnis.com/2017/11/22/category-encoders-v1-2-5-release/feed/ 0 769 http://www.willmcginnis.com/2017/11/20/standing-peachtree-park/ http://www.willmcginnis.com/2017/11/20/standing-peachtree-park/#respond Mon, 20 Nov 2017 23:38:55 +0000

Standing Peachtree Park



I find it very easy to forget the world we live in. I grew up in North Atlanta, near the Chattahoochee, and spent a huge portion of it riding bikes to the river, going to some park on the river, or otherwise being around it.  I've lived intown now as an adult for some years, and kind of forgot it was there, to be totally honest.  Our bridges are too wide to see it even when you drive over it.  We pull 180 million gallons of water out of it every day to bathe, drink, and cook but don't think about it all that much.

It recently came to my attention that the water treatment plant I ride my bike past most weeks has a park in the middle of it, sorta. It's got a strange, unwelcoming gate, lined with barbed wire and fronted with a keypad. It seems a whole lot like it's not a park, but sure enough, that's where Standing Peachtree Park is.

It sits at the confluence of Peachtree Creek and the Chattahoochee, on the former grounds of Fort Peachtree, and before that of a Creek Indian trading post: Standing Peachtree. The Creeks made the two mistakes of being Native American and siding with the British in the War of 1812, which inspired the idea to tear down their trading post, build a fort to control them, and connect it to Fort Daniel (near Hog Mountain) with what is now Old Peachtree Road (and a shitload of other Peachtree-something roads).

It's an interesting park, but I wouldn't call it great. It's mostly a weird kind of trail or dirt road that snakes between a water treatment facility and Peachtree Creek. It's kind of overgrown so it's hard to see the creek, but you can hear it. The trail isn't marked all that well, so if you see a sign that says no cars past this point, that's the trail.  Once you get past the government-y field of fencing and wire, you drop into a wooded area that does start to feel like nature for a moment.

Eventually you get to a clearing with another barbed wire fence and a sign that simultaneously pulls you out of the forest into some strange industrial trespassing vibe, and reminds you that you may, in fact, be in a public space (a public space with many specific rules). I have to wonder why a parking sign is posted in the farthest possible place you can be in the park from a parking space.  Left of frame there is for some reason a pull-up bar, and nothing else. No benches, other signs, or anything like that.

Once you get past the vague feeling of trespassing, you can look past the sign and see the thing we were looking for in the first damn place: the river. As far as I know this is the closest in-town view of the river there is. Certainly not as nice as up around Sope Creek and the Palisades, but closer to the capital, for whatever that's worth.

With a little bit more courage, you can make out a little whisper of a path through the brush, in the middle of that photo. If you take that past the tree line, and then down a steep, slippery embankment, you find yourself on the bank of the Chattahoochee, and the bank of Peachtree Creek, right in the apex of the confluence itself.  Downstream in the foreground you can see the CSX rail line proposed to be refitted to connect the Silver Comet Trail to the Beltline (more info here). Behind that is the Atlanta Rd. bridge, from which you see none of this if you're just driving along.

To your left, you can look up Peachtree Creek. At this point I feel like it'd be irresponsible not to mention: it's kinda stinky here. I mean, there are treatment facilities basically all around you at this point, and you are, without a doubt, in nature, but it's not full-on nature. There's some trash on the beach. Go with open eyes, is what I'm saying. This is where we get a ton of our drinking water, and it's a pretty nice place, warts and all.

http://www.willmcginnis.com/2017/11/20/standing-peachtree-park/feed/ 0 762 http://www.willmcginnis.com/2017/09/23/data-science-things-roudup-11/ http://www.willmcginnis.com/2017/09/23/data-science-things-roudup-11/#respond Sat, 23 Sep 2017 18:34:29 +0000

Data Science Things Roundup #11



Once again it's time for the data science things roundup, a few links to articles or projects I've stumbled across and found interesting. This is the 11th one in the extremely irregular series, so if you think it's cool, check out some of the others.

This time we've got quite a diverse set of links, so without any more delay:

Keynote Address At the SEC-Rock Center on Corporate Governance Silicon Valley Initiative

This is a kind of wordy one: it's a transcript of an address given by SEC Chair Mary Jo White at Stanford to the broader startup/tech/VC community, from the perspective of the SEC. If you're in the startup world and have exposure to the finance side of things, it's an interesting read into the environment your financiers are operating in. Check it out here.

An introduction to inference

An introduction to inference is Vincent Warmerdam's very graphical intro to basic Bayesian inference. It's a pretty succinct and gentle introduction, which I found quite nice. Check it out here.

Altair

Altair is a declarative python interface to Vega, a statistical visualization engine.  If you've ever burned a day trying to get matplotlib to just do what you want, this might be a more user-centric alternative. Check it out here.

http://www.willmcginnis.com/2017/09/23/data-science-things-roudup-11/feed/ 0 633 http://www.willmcginnis.com/2017/08/13/modernizing-pedalwrencher-whatever-means/ http://www.willmcginnis.com/2017/08/13/modernizing-pedalwrencher-whatever-means/#respond Sun, 13 Aug 2017 17:36:13 +0000

Modernizing Pedalwrencher: whatever that means.



I've got a side project that I've maintained (badly) for the past couple of years, pedalwrencher.com. It's a pretty simple idea: if you ride bikes and use strava.com, you can sign up with pedalwrencher and set up mileage-based alerts. So if you want to replace your chain every 2000 miles, you can get an email or SMS message (via twilio) every 2000 miles with that reminder. Pretty straightforward.

Architecturally, it was originally built as a single flask app backed by PostgreSQL, running on Heroku.  Separately there was a tiny EC2 box with a cron job running to do the actual batch processing to send out notifications once an hour.  This was a pretty simple way to get things up and running, but the reality of a side project is that it's not getting babysat. If I get busy and am not riding personally, things can break, the cron job can fail and I may not know about it.  And that's basically what happened. I didn't have any kind of good monitoring or error reporting set up, so something silently failed for a couple of months before I started getting some user emails about it.

So I took a day a couple of weeks ago and decided to migrate the existing app onto a more “modern” deployment model. Mostly just to get some experience with newer technologies but also to get things back up and running and more observable.

The general plan was:

  • Dockerize the flask app
  • Migrate it onto AWS ECS
  • Move the domain to Route 53
  • Set up autoscaling on ECS cluster
  • Dockerize the batch processing job
  • Add it as a scheduled task in ECS
  • Migrate database from Heroku Postgresql to AWS RDS
  • Write a job to check database status for stale data, Dockerize and schedule
  • Forward logs from all 3 docker containers to Cloudwatch
  • Setup Cloudwatch alerts

It was, admittedly, a long day, but it really did take only one day, so not the most difficult migration out there.

The Docker containers gave me repeatable builds with pinned versions of everything, so I could more accurately test performance of the app locally and have confidence in the deployments once they're out.  Moving onto ECS lets me host those Docker containers in a straightforward way with a nice build/deploy pipeline.  It also makes autoscaling straightforward should I somehow get users that need it, though there's no way that actually happens with this project.  The scheduled-task option lets me use the containers just the few hours a day I need them, instead of keeping a dedicated EC2 box for a little cron job.  Moving the domain to Route 53 and migrating the database are more for simplicity of management than anything else.

The most annoying part was probably moving the domain, but now that it's done it should be cheaper to maintain because AWS offers free SSL certs, which were expensive on Heroku. All in all, total cost is now roughly the same; it's hard to judge exactly because I have some other projects running on the same ECS cluster and RDS instance.

Most importantly, the app itself is up and running, the processing job emails me via CloudWatch if it breaks, and the daily job that checks database status emails me nightly with any issues (stale data, no recent pulls from Strava, etc.).
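The staleness check itself can be very simple. Something like this sketch (hypothetical code and names, not the real job) is enough to drive a nightly email:

```python
from datetime import datetime, timedelta

def stale_athletes(last_pulls, max_age=timedelta(days=2), now=None):
    """Given a dict of athlete_id -> last successful Strava pull time,
    return the ids whose data is older than max_age."""
    now = now or datetime.utcnow()
    return sorted(a for a, pulled_at in last_pulls.items() if now - pulled_at > max_age)
```

If the returned list is non-empty, the job formats it into the alert email; otherwise it stays quiet.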

I know this has been light on detail, but over the next couple of weeks I'm hoping to dig into the architecture in a bit more depth, with some code samples, to show how to deploy a Python/Flask app onto ECS cheaply, with background tasks/database/HTTPS/logging/autoscaling.  So suffice it to say, more detail later.

If there is any particular aspect of this kind of architecture that you're interested in, leave a comment below and I can take a look at that first.

http://www.willmcginnis.com/2017/07/25/git-pandas-caching-faster-analysis/ Wed, 26 Jul 2017 01:16:32 +0000

Git-pandas caching for faster analysis



Git-pandas is a Python library I wrote to make analysis of git data easier when dealing with collections of repositories.  It makes a ton of cool stuff easier, like cumulative blame plots, but those can be kind of slow, especially with many large repositories. In the past we've worked around that by running analyses offline and by sampling, but really, most of the work run-to-run is repeated.

Enter caching. There are a few places in the codebase where we can cache result-sets by revision key and get pretty significant performance boosts when using the library in something like gitnoc. And it turns out, it's pretty straightforward.

Currently in develop, we've got a new module with a custom python decorator to handle caching by different mechanisms:

# EphemeralCache, RedisDFCache and CacheMissException come from git-pandas' cache module
def multicache(key_prefix, key_list, skip_if=None):
    def multicache_nest(func):
        def deco(self, *args, **kwargs):
            # no cache backend configured: just call through
            if self.cache_backend is None:
                return func(self, *args, **kwargs)

            # optionally skip caching for certain kwargs (e.g. rev='HEAD')
            if skip_if is not None and skip_if(kwargs):
                return func(self, *args, **kwargs)

            # build the cache key from the prefix, repo name and selected kwargs
            key = key_prefix + self.repo_name + '_'.join([str(kwargs.get(k)) for k in key_list])
            try:
                if isinstance(self.cache_backend, (EphemeralCache, RedisDFCache)):
                    return self.cache_backend.get(key)
                else:
                    raise ValueError('Unknown cache backend type')
            except CacheMissException:
                # cache miss: compute the result, store it, and return it
                ret = func(self, *args, **kwargs)
                self.cache_backend.set(key, ret)
                return ret

        return deco
    return multicache_nest

It looks pretty convoluted, but ends up being pretty useful.  It creates a decorator we can use on any method in the Repository class, where one can specify a cache_backend (currently we have in-memory ephemeral and Redis-based options), a key_prefix to use, a list of kwarg keys to use in the cache key, and optionally a lambda function applied to the kwargs that returns whether to skip caching.

The lambda is particularly useful for cases like not wanting to cache results for rev='HEAD', since those can change moment to moment.

Each of the two caching backends implements basic get/set/purge functionality, and lets you set a maximum number of keys, giving you something like an LRU cache.
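For a feel of what a backend like that involves, here's a minimal in-memory sketch (an illustration of the get/set/purge contract with a key cap, not git-pandas' actual EphemeralCache):

```python
from collections import OrderedDict

class MiniCache:
    """Tiny in-memory cache: get/set/purge plus a max key count,
    evicting the least recently used entry when over the cap."""

    def __init__(self, max_keys=1000):
        self.max_keys = max_keys
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            raise KeyError(key)  # git-pandas raises CacheMissException instead
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_keys:
            self._store.popitem(last=False)  # drop the least recently used entry

    def purge(self):
        self._store.clear()
```

The decorator above only needs get to raise on a miss and set to store a result, so any backend with this shape slots in.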

One interesting nugget from the Redis cache: in git-pandas, the objects we are caching are always pandas DataFrames. To store those in Redis we can serialize/deserialize the dfs with:

# self._cache is a connection to redis
self._cache.set(k, v.to_msgpack(compress='zlib'), ex=self.ttl)
df = pd.read_msgpack(self._cache.get(k))

It's still being tested, and is probably one of the last things we'll cram in before releasing git-pandas 2.0.0, so check out the repository over on GitHub, try it out, and let me know what you think.

http://www.willmcginnis.com/2017/07/12/category-encoders-v1-2-4-release/ Wed, 12 Jul 2017 14:09:30 +0000

Category Encoders v1.2.4 Release



I've just cut a fresh release of the scikit-learn-contrib library category_encoders.  This one includes a lot of great contributions from the broader community, which has been really encouraging. A few selected features now available:

  • Leave-one-out encoding: a new encoder based on a popular Kaggle post by Owen Zhang, detailed here and here. (proposal)
  • Maintenance fixes in upstream libraries (should get fewer pandas warnings, issue)
  • Bugfix for calling fit on the same thing many times (issue)
  • Consistent category ordering (proposal)
  • Consistent output shape for datasets with inconsistent category appearances (issue)
  • Missing value and unknown category handling made consistent across all encoders.
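To make the leave-one-out idea concrete, here's a rough pure-Python sketch of the encoding (an illustration only, not category_encoders' implementation):

```python
def leave_one_out(categories, targets):
    """Encode each row's category as the mean target of the *other* rows
    sharing that category; singleton categories fall back to the global mean."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            encoded.append(global_mean)
    return encoded
```

Because each row is excluded from its own category mean, a row never leaks its own label into its encoding, which is the point of the technique.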

Install or upgrade using the command:

pip install -U category_encoders

All in all, a fairly large release by our standards, and there are still some open issues to work on. So upgrade, try it out, let me know what you think, and if you'd like to get involved, find us on GitHub here.
