Ah yes, it’s that wonderful time of year again: fantasy baseball time. Anyone who plays fantasy sports can tell you that other than getting a check for winning your league (something I would know nothing about), the best part of the season is the draft.
My fantasy draft preparation always involves the creation of a spreadsheet. This nugget of gold will start with ratings and stats from various sites and sources before getting my own personal notes, organization, and shuffling (aka “messing it up”). and what kind of card-carrying software engineer would do such a compilation by hand…
My basic strategy for compiling this sucker this year:
- Scrape some stats and rankings from a few popular sites and dump them locally.
- Load the stats from local, merge em, and kick out a csv.
- Mess with stuff in the spreadsheet.
- Win millions.
I’ve used hpricot in the past but wanted a bit of a refresher — Man, this thing makes this task nice and easy. Most of the sites have a fairly sane markup scheme for the tables they store players data in, so it’s normally as simple as (this is Ruby btw):
doc = Hpricot(open(url)) players = doc.search(".playerDataRow") players.each do |player| meta = player.search("td .playerMeta") stats = player.search("td .playerStats") < do stuff > end
Update: Here are a couple example files (they should be .rb files, but are docs to make WordPress happy)
Last year, I wrote this as a single script (scrape -> csv). Things got hairy when I needed to tweak the merger and thus had to re-run the whole thing (annoying) or hack it to run only a portion… and the same portion on each site (annoying). So, this year I got a little smart and wanted to dump the scrape results locally.
I fully intended to evaluate a few options here but as it turns out, the first try was just dead simple and worked perfectly: yaml. My scrapers each end up with a couple Arrays: one for hitters and one for pitchers. Each entry is a Hash of player data. I considered skipping the Array here but didn’t want to have the logic for name collisions in the scraper.
How hard was it to write my +player_data+ Hash to yaml?
YAML.dump(player_data, File.open(File.dirname(__FILE__) + "/yaml/" + filename, "w"))
Merging the scrape sources
This is a separate script here now…
Resurrecting the player data proved as simple as storing it in the first place.
YAML.load(File.open(File.dirname(__FILE__) + "/yaml/" + filename))
First pass I just wanted to merge based on player names – ignoring the imperfections that surely come with that… I was pleasantly surprised at how well things actually came out. I wrote a little throwaway script (as if this whole thing isn’t throwaway) to tell me how much of a problem I actually have. Basically: how many players in the top 400 of any site don’t have a match from the other sites? The answer was basically: a lot of Latin guys and a few others.
So, one problem I had was character sets used in the Latin player names. This is an area where “you’re going to throw this away” came into play: I just grabbed the few codes that I was having trouble with and regex’d to replace them with their friendlier counterparts. This is now on my list of “things to figure out how to do right”.
That out of the way, I re-ran and found that I really only had problems with about 20 guys. People with names like “Mike Smith”, that were “Michael Smith” in the other set. Here again: cheated. 20 guys? I can handle typing 20 names instead of figuring this one out, so I just modified the yaml manually here (this immediately breaks down if I have to re-run the scraper, but… I didn’t).
After determining which of the dumped data I wanted to keep (and which site could trump the others when they both had something like HRs), I created my csv with FasterCSV and I was rockin’.
In the spreadsheet
At this point, I was realizing that I should have created some new fields in that script… maybe look for things like big differences between sources (why does Y! think this guy is #20 and ESPN thinks he’s #100?) or even just: what’s the average of all the sites scraped. I was enamored with my spreadsheet though and haven’t
had to been able to work with formulas in spreadsheets since my consulting-at-AmFam days, so I thought I’d give it a spin… and was quickly reminded of how easy they make it… to bash your head against the wall: