I am the Chief Data Scientist at the Sorenson Impact Center, an applied academic institution that is part of the University of Utah’s Business School. My job is to help governments, non-profits, and philanthropists use data to better address difficult social problems. (DISCLAIMER: I don’t speak for the Center or any of its affiliates on this blog. These are my opinions.)
Previously, I served as the Chief of Staff and Chief Data Scientist to Mayor Joe Curtatone of Somerville, Massachusetts, and as an Innovations in American Government Fellow at the Harvard Kennedy School.
Bootstrapping has long been one of my favorite statistical procedures. The nonparametric version requires few assumptions, and it shares attributes with both simulation and common Bayesian models, both of which I love. At the end of the day, I wonder: why settle for a point estimate and two confidence bounds when you can create a whole distribution and visualize your CIs? When my friend Rich first introduced me to bootstrapping, he said that if statistics had been invented in the computer age, this is where most classes would start.
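To make the idea concrete, here is a minimal nonparametric bootstrap in base R. It is a generic sketch on simulated data, not code from the post:

```r
# Nonparametric bootstrap of the median with a percentile CI.
# The sample is simulated; swap in real data to use this in practice.
set.seed(42)
x <- rexp(100, rate = 1 / 10)  # a skewed sample where the median is informative

boot_medians <- replicate(10000, {
  resample <- sample(x, replace = TRUE)  # draw n values with replacement
  median(resample)
})

# The whole distribution is available to plot; the CI is just quantiles.
ci <- quantile(boot_medians, c(0.025, 0.975))
hist(boot_medians, breaks = 40, main = "Bootstrap distribution of the median")
abline(v = ci, lty = 2)
```

This is exactly the appeal described above: instead of two bare numbers, you get 10,000 plausible medians to look at.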
Book Title: Irreversible Things Author: Lisa Van Orman Hadley
I should begin this review with a caveat: I am Lisa’s husband, and make no pretensions to describe her work impartially or devoid of context. To my surprise, the character who resembles me is portrayed sparingly, but more than generously. I have no axe to grind! Instead of hiding my biases, therefore, I will lean in to them a bit, and give a sense of how my reading of Irreversible Things was made richer by the details I know about Lisa and her family.
In this post, I describe how I used Empirical Bayesian methods to estimate the accuracy of NBA three-point shooters. This analysis closely follows the process outlined by David Robinson in his excellent book Introduction to Empirical Bayes: Examples from Baseball Statistics, and is performed using his ebbr package in R. The goal is to make a reasoned ranking of the top sharpshooters, despite inconsistent and imperfect records of how often players make the shots they attempt.
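The post itself relies on ebbr; as a self-contained sketch of the underlying idea, here is beta-binomial shrinkage with a method-of-moments prior. The players and shooting records below are invented for illustration:

```r
# Simplified empirical Bayes shrinkage for three-point accuracy.
# Players and counts are made up; the real analysis uses NBA records.
shots <- data.frame(
  player = c("A", "B", "C", "D"),
  made   = c(12, 180, 40, 300),
  att    = c(20, 500, 90, 820)
)

# Fit a Beta(alpha, beta) prior by method of moments on the raw rates.
p <- shots$made / shots$att
m <- mean(p)
v <- var(p)
alpha <- m * (m * (1 - m) / v - 1)
beta  <- (1 - m) * (m * (1 - m) / v - 1)

# The posterior mean shrinks small-sample players toward the league prior.
shots$eb_rate <- (shots$made + alpha) / (shots$att + alpha + beta)
shots[order(-shots$eb_rate), ]
```

Note player A: a raw 60% on only 20 attempts gets pulled well back toward the prior, which is the whole point of the ranking exercise.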
Last night as I was dozing off, I had a sudden inclination to try to add to the attempts that have been made to identify the anonymous New York Times op-ed writer. I’ve had some success in the past with machine learning and stylometry, and this is one of the most intriguing authorship questions in years.
By 2:00 am I was convinced the data had singled out Mike Pence. I even started to wonder what the ethical thing to do is when one has hard evidence about the source of text whose author wished to remain anonymous.
At the 2018 RStudio conference in San Diego, my colleague Jon and I gave a talk about how we use R Markdown to quickly go from nothing, to analysis, to a branded report that we can pass off to clients. This workflow took some time to set up but, like most automation tasks, has ultimately saved us more time and headache than it cost. If you want to skip to the talk,
A while ago, the popular data journalism site 538 posted a challenging probability puzzle:
On the table in front of you are two coins. They look and feel identical, but you know one of them has been doctored. The fair coin comes up heads half the time while the doctored coin comes up heads 60 percent of the time. How many flips — you must flip both coins at once, one with each hand — would you need to give yourself a 95 percent chance of correctly identifying the doctored coin?
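The puzzle is also easy to attack by simulation. This sketch (mine, not 538’s published solution) estimates the success probability of the natural rule: guess that the coin showing more heads is doctored, and treat a tie as a 50/50 guess.

```r
# Monte Carlo estimate of the chance of identifying the doctored coin
# after n simultaneous flips of each, guessing the coin with more heads.
set.seed(1)
p_correct <- function(n, reps = 20000) {
  fair <- rbinom(reps, n, 0.5)
  doc  <- rbinom(reps, n, 0.6)
  mean(doc > fair) + 0.5 * mean(doc == fair)  # a tie is a 50/50 guess
}

# Scan flip counts for the first that clears 95 percent.
ns  <- seq(50, 200, by = 5)
acc <- sapply(ns, p_correct)
ns[which(acc >= 0.95)[1]]
```

How you handle ties (guess at random versus keep flipping) changes the exact answer, so treat this as a sanity check rather than the definitive solution.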
While planning a holiday gift exchange this week, my wife casually challenged me with a sort of tricky probability puzzle:
Sara and I were talking today and realized that we were off by one on the rotation because last year we went sledding instead of buying gifts.
It should actually be:

* Sara gives to Jonny
* Jimmy gives to Thurop
* Amy gives to Lisa
* Jonny gives to Sara
* Thurop gives to Jimmy
* Lisa gives to Amy
I had a chance to sit down with Rayid Ghani, the Director of the Center for Data Science and Public Policy at the University of Chicago. I have admired Rayid’s work ever since he became the Chief Scientist for the 2012 Barack Obama campaign. His strategic use of analytics set the precedent for a new era of data science in political campaigns. At the time, others were trying to reverse engineer his tactics, which meant he was always one step ahead of the other campaigns.
We recently wanted to brand several of our plots for publication in the local press. I looked around and found a couple of suggestions on how to add images to plots, but nothing that seemed modular or customizable. My colleague recommended the relatively new magick package, which provided all of the functionality I needed (plus a lot more). Here is a simple example along with the code to replicate it:
library(ggplot2)
library(magick)
library(here)     # For making the script run without a wd
library(magrittr) # For piping the logo

# Make a simple plot and save it
p <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  ggtitle("Cars")
ggsave(filename = paste0(here("/"), last_plot()$labels$title, ".png"), plot = p)
Recidivism, the rate at which people released from incarceration commit new crimes or return to prison, is one of society’s most difficult social problems. The official estimate is that 55% of former prisoners will return within 60 months.
Recidivism is also, I discovered, one of the most challenging things to model and understand statistically. In this blog post, I describe our efforts to build this simulation, including how we settled on some fairly basic control structures (for loops) without giving up too much in terms of efficiency and readability.
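To give a flavor of what such a for-loop simulation can look like, here is a stripped-down version with a constant monthly hazard calibrated to the 55%-within-60-months figure. The constant hazard is my simplifying assumption, not the actual model from the post:

```r
# Simulate 10,000 releases month by month, with a constant hazard chosen
# so that 55% return within 60 months (both figures from the post).
n_people <- 10000
months   <- 60
hazard   <- 1 - (1 - 0.55)^(1 / months)  # constant monthly return probability

returned        <- rep(FALSE, n_people)
month_of_return <- rep(NA_integer_, n_people)

set.seed(7)
for (m in 1:months) {
  at_risk <- which(!returned)                       # still out, still at risk
  events  <- at_risk[runif(length(at_risk)) < hazard]
  returned[events] <- TRUE
  month_of_return[events] <- m
}

mean(returned)  # lands near 0.55 by construction
```

Even this toy version shows why for loops are a natural fit: each person’s state in month m depends on their state in month m − 1.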
In the marketing world, big data is used to answer ostensibly minute questions every day: are computer mouse movements predictive of purchasing? Does an orange background increase user engagement? In every place with Silicon in its name, there are teams of data scientists asking these questions.
In the social sector, by contrast, answering helpful questions is more difficult. For instance, is our program reducing homelessness? How is health spending distributed across the state?
Edit, 3/28/18: RStudio just announced Python interoperability through the reticulate package. Rmd Notebooks are unbeatable, in my opinion.
Original Post: I started using Jupyter Notebooks back when they were called IPython. I even remember having to set up a virtual Linux environment because they were not available on Windows. As much as I have enjoyed their functionality, I recently switched entirely to R Markdown in an RStudio environment. Here’s why.
Kaggle is a forum for interacting with other data scientists and competing to see who can write code that will best predict features of data. It’s a way to test your skills at statistics and machine learning, and to do a lot of human learning in the process (sorry, bad pun).
When I entered the contest to categorize crimes that occurred in San Francisco, my initial goal was to do better than random chance.
R has been the perfect language for the back end of this government data dashboard I am developing.
* It has excellent packages to pipe in data from every significant source
* Tools like dplyr and tidyr make cleaning and munging data trivial
* It is ideal for automating analysis

In the R script that powers my dashboard, I have everything from simple averages and frequency tables to a complex algorithm that converts time-series figures to z-scores and then selects the top 3 variables to display based on standard scores from the last 7 days.
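The z-score-and-select step can be sketched in a few lines. The toy data and variable names here are mine, not the dashboard’s:

```r
# Standardize several daily series, then pick the 3 variables whose
# last 7 days deviate most from their own history.
set.seed(3)
ts_data <- data.frame(
  day     = 1:30,
  permits = rpois(30, 20),
  calls   = rpois(30, 100),
  visits  = rpois(30, 50),
  tickets = rpois(30, 10)
)

z      <- scale(ts_data[-1])                 # z-score each series
recent <- abs(z[(nrow(z) - 6):nrow(z), ])    # last 7 days, absolute deviation
top3   <- names(sort(colMeans(recent), decreasing = TRUE))[1:3]
top3
```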
As I did last year, I went through several of my favorite sites and curated what I consider to be the best writing on urban issues from 2015. One thing I love about planning, and that drew me to the profession in the first place, is that it encompasses many skills and areas of interest. I think that diversity is reflected in this year’s list. Caveat emptor: I use the term planning loosely.
Mayor Curtatone and I recently returned from the Smart Cities Expo in Barcelona, Spain, where we unveiled a new partnership between the City of Somerville and the car manufacturer Audi. We will be testing how autonomous vehicles work in an actual urban environment.
Driverless cars predominating on city streets sit in the realm of what Steven Johnson calls the “adjacent possible.” Uber just made headlines by purchasing a large chunk of Carnegie Mellon’s robotics department.
Somerville, MA has been fighting a war against rats for months, and now we have the data to show that it’s working: reported sightings have dropped 66% year-to-date. Some of that drop is due to weather patterns and random fluctuation, but a Bayesian model of the data estimates that the City’s policies have reduced calls by 40%.
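As a hedged illustration of the kind of before/after comparison involved (not the post’s actual model), a conjugate Gamma-Poisson analysis of monthly call counts takes only a few lines:

```r
# Toy Bayesian before/after comparison of monthly call counts using a
# conjugate Gamma-Poisson model. The counts are invented; the post's
# 40% estimate comes from a different, real model.
before <- c(30, 28, 35, 31, 29, 33)  # monthly calls, pre-intervention
after  <- c(18, 15, 20, 17, 16, 19)  # monthly calls, post-intervention

# With a Gamma(1, 1) prior, the posterior rate is Gamma(1 + sum, 1 + n).
set.seed(11)
rate_before <- rgamma(10000, 1 + sum(before), 1 + length(before))
rate_after  <- rgamma(10000, 1 + sum(after),  1 + length(after))

reduction <- 1 - rate_after / rate_before
quantile(reduction, c(0.025, 0.5, 0.975))  # posterior reduction in call rate
```

The payoff of the Bayesian framing is the same as with bootstrapping: a full posterior distribution for the reduction, not a single number.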
Three years ago, the city where I work was dealing with an onslaught of rats.
Here’s a problem governments are faced with every day: you have a limited amount of resources to maintain aging infrastructure, in this case streets. Do you spend more on crack sealing and preventive maintenance, or full depth reclamation? Which streets should you fix first?
I am not an engineer (in fact, part of the reason I am writing this post is to get feedback from engineers); but I have thought a lot about this, and I think I have a decent method for prioritizing roadway repairs that anyone could implement using the open-source program R.
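To invite concrete feedback, here is one crude way to score streets; the columns, weights, and numbers are all my invention for illustration, not a published method:

```r
# Rank streets by a benefit-per-dollar score: poor condition and heavy
# use raise priority, expensive repairs lower it. All values invented.
streets <- data.frame(
  name        = c("Elm St", "Oak Ave", "Main St", "Pine Rd"),
  pci         = c(45, 70, 30, 85),         # pavement condition index, 0-100
  traffic     = c(8000, 2000, 12000, 500), # vehicles per day
  cost_per_mi = c(9e5, 2e5, 1.2e6, 1e5)    # repair cost estimate per mile
)

streets$score <- with(streets, (100 - pci) * traffic / cost_per_mi)
streets[order(-streets$score), ]  # highest-priority street first
```

A real version would swap in engineering-grade condition data and cost curves, but the ranking skeleton stays the same.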
When I first started as an analyst in local government, I wasted a lot of time repeating tasks that had been done dozens of times before in Excel. SomerStat, the office where I worked and later became director, is one of the oldest local government divisions dedicated to crunching data. Inspired by the CitiStat model, which itself was inspired by CompStat, the idea was to use data to improve efficiency. And yet here I was, with fairly inefficient work routines that included pulling data into spreadsheets, munging one step at a time, and then repeating it all for the next ‘stat’ meeting.
Last year, my friend pulled 34 all-nighters, surfed 37 days, swam 62, helped to raise two kids, did 12,920 push-ups, worked a total of 3,008 hours as a new poli-sci professor, and tracked all of it in a spreadsheet. He averaged 8.2 hours of work per day, including weekends and holidays. As this heatmap shows, though, his hours varied a lot compared to us regular nine-to-fivers:
It’s interesting to read this chart both left to right, as an indicator of what weekdays he works hardest, and top to bottom, to see the days when he would push hard against a deadline and then give himself some time off to surf.
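A heatmap like that takes surprisingly little code. This sketch uses simulated hours rather than my friend’s actual spreadsheet:

```r
# Week-by-weekday heatmap of hours worked, on simulated data.
set.seed(5)
days  <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
hours <- matrix(pmax(0, rnorm(52 * 7, mean = 8, sd = 3)),
                nrow = 52, ncol = 7,
                dimnames = list(week = 1:52, day = days))

# Columns read left to right as weekdays; rows bottom to top as weeks.
image(t(hours), axes = FALSE, main = "Hours worked per day")
axis(1, at = seq(0, 1, length.out = 7), labels = days, tick = FALSE)
```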
For my job, I read a lot of articles on urban policy and planning. I believe that the best policies are usually borrowed from other cities rather than fabricated from nothing. In that spirit, I even borrowed from other top ten lists to create this post. I like to think that my list is more comprehensive than some of the others since I have no incentive to link to my own content.
Time is a weekly news magazine that was first published in New York City in 1923. After my last post, a generous redditor offered to share with me a dataset of every person who has appeared on the cover since its first issue. It turns out he had painstakingly collected this data for a very cool website he created called hugequiz.com, where there are multiple quizzes on this subject. Fortunately, I don’t think it will spoil any of the quizzes to see this chart of the most frequently featured people:
The 57th annual Grammy nominations were recently announced, cementing Beyonce as the most nominated female artist in history. She is now tied with Kanye for a career total of 53 nominations. This news made me curious enough to plot the top award winners of all time.
I did not recognize the knight at the top of the list (turns out he is an amazing conductor with a storied history as director of the CSO); but I sort of expected that.
Coakley received a lot of votes from residents of Massachusetts’s major cities. This is evident in the maps I posted last week, and in the charts below. What may be surprising is how many votes Baker received in cities, including Boston:
Baker received nearly 10,000 more votes in Boston than he did in 2010. If those had gone to Coakley instead, the spread between them would have been cut roughly in half.
In my last post, I displayed a series of maps from the 2014 Massachusetts midterm election. In all, I created 17 maps, all with fewer than 20 lines of code. Here’s how…
The basic idea is to use a for loop with ggmap to iterate through columns of a data frame. In my example (the code for which can be found here), I first read shapefiles from MassGIS into R and then combine them with election data.
Most of the maps I have seen so far color each city or town either red or blue based on the majority outcome. That works fine, for the most part, but I prefer to see the range of voting patterns. These heat maps go from light yellow to dark blue. The scale changes on each one in order to show the full spectrum. I managed to automate their creation in R.
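The loop itself can be sketched like this. The two stand-in polygons below replace the MassGIS shapefile join, and the column names are invented:

```r
# Loop over result columns and save one choropleth per column.
# The two squares stand in for the fortified town shapefiles.
library(ggplot2)

map_df <- data.frame(
  long  = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat   = c(0, 0, 1, 1, 0, 0, 1, 1),
  group = rep(c("A", "B"), each = 4),
  baker_share   = rep(c(0.45, 0.62), each = 4),
  coakley_share = rep(c(0.51, 0.33), each = 4)
)

for (col in c("baker_share", "coakley_share")) {
  p <- ggplot(map_df, aes(long, lat, group = group, fill = .data[[col]])) +
    geom_polygon() +
    scale_fill_gradient(low = "lightyellow", high = "darkblue") +  # per-map scale
    ggtitle(col)
  ggsave(paste0(col, ".png"), p, width = 5, height = 4)
}
```

Because the fill scale is rebuilt inside the loop, each map spans its own full light-yellow-to-dark-blue range, which is the design choice described above.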
I wanted to see if it was possible to train a model to detect the difference between two fictional authors created by the same novelist based only on the frequency of common stop words, e.g., “the.” It worked: the randomForest model correctly identified Nick 93% of the time and Amy 91% of the time.
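For the curious, the approach can be reconstructed roughly like this; the chunk features below are simulated stand-ins for real per-narrator stop-word counts:

```r
# Classify text chunks by stop-word frequency with a random forest.
# Simulated counts stand in for real per-chunk stop-word tallies.
library(randomForest)
set.seed(9)
stops <- c("the", "and", "of", "to", "a", "is", "it", "she")

# Counts per 1,000 words for n chunks; `shift` nudges one author's habits.
make_chunks <- function(n, shift) {
  m <- sapply(seq_along(stops), function(j) rpois(n, 40 + shift * j))
  colnames(m) <- stops
  as.data.frame(m)
}

train <- rbind(make_chunks(50, 0), make_chunks(50, 1.5))
train$author <- factor(rep(c("Nick", "Amy"), each = 50))

fit <- randomForest(author ~ ., data = train)
fit$confusion  # out-of-bag accuracy by author
```

The out-of-bag confusion matrix is what yields per-author accuracy figures like the 93%/91% reported above.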
Background When I first started using R for data analysis, I was mesmerized by all of the packages and what they made possible.
One of the most enjoyable aspects of my job is to partner with Harvard and other local universities to teach students through project-based experience. The City gets free consulting, which is ordinarily very high caliber, and the students get a chance to use their skills in a real-world environment. I just received this copy of a recent magazine article on one of our oldest and most mutually beneficial partnerships: Linda Bilmes’s Advanced Budgeting course.
I wanted to see if it was possible to train a model to detect the difference between two fictional authors created by the same novelist based only on the frequency of common stop words, e.g., “the,” “at,” and “is.”