There’s a massive amount of data available on the PA Open Data Portal. Go check it out! I’ll be here when you get back.
When I started noodling around with Tableau I needed to find some real-live data to play with and learn from. It had to be big enough to contain naturally occurring patterns and it had to be interesting enough to hold my attention. One of my favorite sets is the Crash Data from the Pennsylvania Department of Transportation. Over two and a half million rows of tasty, tasty information, starting back in 1997!
So maybe this isn’t dataviz rockstar level work but even so, I bet you can see something interesting. Here’s a picture of crashes involving drivers aged 75 and older, in a histogram by month (where 1 = January). What do you observe?
Looks like October (10) stands out as a pretty tough month. Angle-style crashes and rear-end collisions are both contributing to the higher number overall. Consider weather and road conditions for Pennsylvania in October. What do you think is happening?
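If you'd rather rebuild that kind of month-by-month count outside Tableau, here is a minimal sketch in plain Python. The field names `crash_month` and `driver_75plus` are my own stand-ins, not the portal's actual column names, and the sample rows are invented purely to show the mechanics.

```python
from collections import Counter

def crashes_by_month(rows):
    """Count crashes per month (1 = January) for rows flagged as involving a 75+ driver."""
    counts = Counter()
    for row in rows:
        # 'driver_75plus' and 'crash_month' are hypothetical field names
        if row["driver_75plus"]:
            counts[int(row["crash_month"])] += 1
    # Return all twelve months so that empty months show up as zero
    return {month: counts.get(month, 0) for month in range(1, 13)}

# Toy rows invented for illustration only; these are not PennDOT figures.
sample_rows = [
    {"crash_month": 10, "driver_75plus": True},
    {"crash_month": 10, "driver_75plus": True},
    {"crash_month": 1, "driver_75plus": True},
    {"crash_month": 10, "driver_75plus": False},
]

monthly = crashes_by_month(sample_rows)

# Print a crude text histogram, one row per month
for month, n in monthly.items():
    print(f"{month:>2} {'#' * n} {n}")
```

With the real export you would read the rows from the downloaded file instead of typing them in, after checking what the columns are actually called.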
Note this picture doesn’t say why, though it’s nearly irresistible to go there, isn’t it? Such is the danger. I may have played a little joke on you. Or what passes for humor to a data nerd.
Ahem. My apologies.
In the book Keeping Up with the Quants, authors Davenport and Kim describe three stages and six steps of quantitative analysis. We are only at the beginning, the first step of the Framing the Problem stage, which is Problem Recognition: why is October such an outlier in terms of crashes involving older drivers? We’d all prefer to drive on safe roads with fewer crashes, right? What’s going on there?
The next step is Review of Previous Findings, where you familiarize yourself with prior studies of the problem. That’s how I found this NYT article. Watch carefully for when the underlying data “dries up” and we go from facts to conclusions or speculation. No doubt the author’s opinions are well-informed, and indeed this was posted to an opinion column. I offer this to you merely to illustrate the rapid jump from data to “I know why.”
From the same book, only after Review of Previous Findings do we move to Solving the Problem. Here we have three parts:
- Modeling (Variable Selection – what factors will you study?)
- Data Collection (Conduct your study and measure – how do your variables interact?)
- Data Analysis (Review your information – so what is it you found?)
Now, these steps are beyond where we’ll go in this post. When you are presented with information and observe a pattern, what are your next steps? Our brains leap in with all sorts of possible reasons why, and those reasons may be well-informed or even very probable. But they are not reliable facts until you have tested them. Where is your data?
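One small way to start prodding a pattern like the October bump is to rule out the boring explanations first. October has 31 days and September has 30, so a raw monthly count slightly favors the longer month. The sketch below converts monthly totals into crashes per calendar day. The function name and the toy counts are my own inventions for illustration, and this is only one sanity check, not the analysis the book describes.

```python
import calendar

def crashes_per_day(monthly_counts, first_year, last_year):
    """Convert monthly crash totals into crashes per calendar day.

    monthly_counts maps month number (1 = January) to the total number of
    crashes in that month, summed over first_year..last_year inclusive.
    """
    rates = {}
    for month, count in monthly_counts.items():
        # monthrange returns (weekday of the 1st, days in month); leap years are handled
        days = sum(
            calendar.monthrange(year, month)[1]
            for year in range(first_year, last_year + 1)
        )
        rates[month] = count / days
    return rates

# Toy totals invented for illustration only; these are not PennDOT figures.
toy_counts = {9: 300, 10: 310}
rates = crashes_per_day(toy_counts, 2001, 2001)

for month, rate in rates.items():
    print(f"month {month}: {rate:.2f} crashes per day")
```

In this made-up case October has more crashes in total, yet the per-day rate is identical to September's. If the real October bump survives this adjustment, you have a slightly sturdier pattern to carry into the modeling step, though still no explanation for it.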
Have the courage to prod your conclusions for weak spots. If it’s too tough for you on your own, find your special friend the troublemaker. The one who asks annoying questions? He or she would be delighted to tell you where you’re wrong.
Demand evidence. Require data. Ask for numbers. Make this a natural, expected part of conversation. You’ll know you’re on the right track when your team says things like, “I knew you’d ask how I got this figure so I brought that as well.”
For anyone who is interested in the book, here’s a webinar that does a good job compiling and summarizing the best aspects:
https://hbr.org/2013/11/keeping-up-with-the-quants