Empirical Bayes to Estimate NBA Treys

All sports statistics are imperfect measures of a player’s performance. At best, they show relative differences in how well athletes shoot, steal, and rebound. At worst, they are marred by rule changes and outside factors that either exaggerate or handicap certain subgroups of players.

The NBA’s measure for three-point accuracy is one of worst kinds of the latter. Between 1995 and 1997, the NBA changed the distance of the three-point line, biasing all future comparisons between generations of players. During those years, it was as though we changed the distance of marathons to 21 miles and kept all of the fastest records.

What follows is an attemt to use a method called empirical Bayes estimation to rank the most accurate three-point shooters of all time. What makes this different, however, is that I base my estimation on all of the available data except for the seasons where the line was changed.

basic <- threes %>% 
  group_by(player_id) %>% 
  summarise(threes_made = sum(threes_made), threes_attempted = sum(threes_attempted),
            Player = Player[1]) %>% 
  mutate(three_pct = threes_made / threes_attempted) %>% 
  add_ebb_estimate(threes_made, threes_attempted, prior_subset = threes_made > 50)

basic %>% 
  ggplot(aes(.raw, .fitted, color = threes_attempted)) +
  geom_point()
## Warning: Removed 387 rows containing missing values (geom_point).

basic %>%
  arrange(-.fitted) %>% 
  top_n(20, wt = .fitted) %>% 
  mutate(rank = row_number()) %>% 
  rename('Measured rate' = three_pct, 'Empirical Bayes estimate' = .fitted) %>% 
  gather(type, rate, `Measured rate`, `Empirical Bayes estimate`) %>%
  ggplot(aes(rate, reorder(Player, -rank), color = type)) +
  geom_errorbarh(aes(xmin = .low, xmax = .high), color = "gray50") +
  geom_point(size = 3) +
  labs(x = "Three Point %",
       y = NULL, title = "Measured Rates, Empirical Bayesian Estimates, and Credible Intervals") +  theme_minimal() +
  theme(legend.title=element_blank())

no_line <- threes %>% 
  mutate(line_change = case_when(season >= 1995 & season <= 1997 ~ "during",
                                 season < 1995 ~ "before",
                                 TRUE ~ "after")) %>% 
  filter(line_change != "during") %>% 
  group_by(player_id) %>% 
  summarise(threes_made = sum(threes_made), threes_attempted = sum(threes_attempted),
            Player = Player[1], season = round(mean(season))) %>% 
  mutate(three_pct = threes_made / threes_attempted) %>% 
  filter(threes_made >= 7) %>% 
  add_ebb_estimate(threes_made, threes_attempted, method = "gamlss", 
                   mu_predictors = ~ season + log10(threes_attempted)) 

no_line %>% 
  ggplot(aes(.raw, .fitted, color = threes_attempted)) +
  geom_point()

no_line %>%
  arrange(-.fitted) %>% 
  top_n(20, wt = .fitted) %>% 
  mutate(rank = row_number()) %>% 
  rename('Measured rate' = three_pct, 'Empirical Bayes estimate' = .fitted) %>% 
  gather(type, rate, `Measured rate`, `Empirical Bayes estimate`) %>%
  ggplot(aes(rate, reorder(Player, -rank), color = type)) +
  geom_errorbarh(aes(xmin = .low, xmax = .high), color = "gray50") +
  geom_point(size = 3) +
  labs(x = "Three Point %",
       y = NULL, title = "Measured Rates, Empirical Bayesian Estimates, and Credible Intervals") +  theme_minimal() +
  theme(legend.title=element_blank())

Related