TIL – Baseball Data

Ever since taking on the new position at Datalogix (now Oracle), I have been coming up to speed on a variety of topics related to AdTech. It’s been a fun, but challenging, learning curve.

One of the distinct differences between this role and every other role I have had designing and building software products is that we have two distinct sets of “builders” inside the company: engineering and analytics. My engineering Kung Fu is pretty strong, but my analytics is, well, not so much. I have a degree in what used to be called Decision Sciences (the department is now called Operations and Information Management), and we did a lot of modeling in Excel and some database work, but nothing like the current state of the art in what today is called Data Science or Big Data.

My strong belief is that to be a great product manager, I need to be able to speak to all of my engineering counterparts, understand what they are saying, and have the ability and credibility to call BS when necessary. To get myself further up the curve, I have decided to get super nerdy with statistical learning and modeling. That’s a longer-term outcome, though, and in the near term I have to set some more pedestrian goals. Part of that required a project I could sink my teeth into, one that stays interesting to me over time and can become as difficult as I want to make it. Given the abundance of the data available, I have decided to target baseball analytics. Nothing groundbreaking there. There are plenty of books and blogs out there that go into incredible depth on the topic. That should make it easier to find help when I need it, and that’s super important when trying to pick up new skills.

To get started, I pulled down the Retrosheet database, thanks to a great post walking through how to import it into MySQL. The Retrosheet data is play-by-play data, which is helpful for game analysis. There’s also pitch-by-pitch data out on the net, which I will get to at some point. I also pulled down the Lahman database, which holds season summary statistics, and that’s where I plan to start building my SQL skills back up. The discrete event data from Retrosheet will be really helpful for building training data when I start modeling.

Long preamble for my TIL. Today I Learned that in 2014, Colorado had the best team slugging percentage, and Detroit had the best team batting average. It took some doing to get there, but here’s the SQL query against the Lahman database.

SELECT teamID,
       COUNT(*) AS batters,
       ROUND(SUM(H) / SUM(AB), 3) AS team_AVG,
       -- 2B and 3B are backtick-quoted: unquoted MySQL identifiers can't start with a digit
       ROUND((SUM(H) + SUM(`2B`) + 2 * SUM(`3B`) + 3 * SUM(HR)) / SUM(AB), 3) AS team_SLG
FROM batting
WHERE yearID = 2014
GROUP BY teamID
ORDER BY team_AVG DESC;

I ran it twice, ordering by team average and team slugging percentage.
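The arithmetic in that query can be sanity-checked on a toy table. Here’s a quick sketch against an in-memory SQLite database; the team IDs and stat lines below are made up for illustration, not the real 2014 figures. Note that SQLite, unlike MySQL, truncates integer division, hence the `1.0 *` casts.

```python
import sqlite3

# Toy stand-in for the Lahman batting table, with invented numbers.
# Columns mirror the ones the query uses: AB, H, 2B, 3B, HR.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE batting (
    teamID TEXT, yearID INTEGER,
    AB INTEGER, H INTEGER, "2B" INTEGER, "3B" INTEGER, HR INTEGER)""")
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("COL", 2014, 500, 150, 30, 5, 20),
     ("DET", 2014, 480, 145, 28, 3, 12)])

# Same arithmetic as the MySQL query: H + 2B + 2*3B + 3*HR expands to
# 1B + 2*2B + 3*3B + 4*HR (total bases), since H already counts every hit once.
rows = conn.execute("""
    SELECT teamID,
           ROUND(1.0 * SUM(H) / SUM(AB), 3) AS team_AVG,
           ROUND(1.0 * (SUM(H) + SUM("2B") + 2 * SUM("3B") + 3 * SUM(HR))
                 / SUM(AB), 3) AS team_SLG
    FROM batting
    WHERE yearID = 2014
    GROUP BY teamID
    ORDER BY team_AVG DESC""").fetchall()

for team, avg, slg in rows:
    print(team, avg, slg)
```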

[Query results, ordered by team_SLG and by team_AVG]

The Colorado Rockies having the best team slugging percentage is likely due to the thin air at their home ballpark. The next query to validate this would be to see how much better team slugging percentages were when teams played at Coors Field. That’s an exercise for another day. However, to validate that I computed the right numbers, I headed over to ESPN to check team averages.

[ESPN team batting statistics]

Things checked out, which is always nice. Nothing really groundbreaking in terms of analysis yet, but getting all of the data onto one computer, and being able to write queries against it and then share the results, well, that’s pretty astounding. The Internet really has brought about so much empowerment for individuals. I’m looking forward to spending more time in the data, and sharing along the way.