On Fri, 22 Feb 2002, Dean Oliver wrote:
> I will note that points tended to be positively skewed, whereas minutes
> were negatively skewed (because these 2 stats are essentially capped on
> opposite ends). I think there is a way of dealing with right-censored data
> (where values are capped at the high end) called probit analysis. I don't
> know much about it.
Probit by itself probably won't help us much. It's most often used when
the dependent variable is binary, e.g. does a team win, or does a team
from the West win the NBA championship, or does college player x get
drafted. It could be used in those circumstances, but those do not
address the problem of data which are skewed due to censoring.
Probit models can be combined with linear regression models to create
what are called Tobit models (after James Tobin, the Nobel-prize winning
economist from Yale). This could have some applicability but I don't
think it will add much to NBA statistics. Classic examples include the
demand for automobiles: some households have 0; others own n cars. So
there's the censoring (left-censoring in this case): you want a model which
will predict n, but that model should never predict negative n, which is
physically impossible. Hence the Tobit model: does the family own a car,
yes or no (that's the probit part), if so, what is n (that's the linear
regression part). Negative n predictions are automatically converted into
predictions of 0 cars owned.
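To make the car example concrete, here is a rough sketch of a one-predictor Tobit model censored at 0, using only the Python standard library. The function names, the single predictor and any parameter values are my own placeholders, not anything estimated from data. Note that in the standard Tobit form the yes/no part and the how-many part both come from the same latent linear index, which is what the likelihood below assumes.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution (the probit link)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(ys, xs, intercept, slope, sigma):
    """Log-likelihood of a Tobit model left-censored at 0.

    ys: observed outcomes (e.g. cars owned), never negative.
    xs: one hypothetical predictor per observation (e.g. income).
    """
    total = 0.0
    for y, x in zip(ys, xs):
        mu = intercept + slope * x  # latent linear prediction
        if y <= 0:
            # Censored case: probability the latent value is at or below 0.
            total += math.log(norm_cdf(-mu / sigma))
        else:
            # Uncensored case: ordinary normal regression density.
            total += math.log(norm_pdf((y - mu) / sigma) / sigma)
    return total

def expected_observed(mu, sigma):
    """E[max(0, y*)] for latent y* ~ Normal(mu, sigma); never negative."""
    z = mu / sigma
    return mu * norm_cdf(z) + sigma * norm_pdf(z)
```

Even when the latent prediction mu is negative, expected_observed(mu, sigma) stays above 0, which is the "never predict negative n" property.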
Often players DO play 0 minutes or score 0 points. But for starting
players, that's almost certainly going to be due to events such as
injuries and suspensions, events which are not going to be particularly
well correlated or predicted by our traditional basketball statistics.
So we could try to create Tobit models which include predictions of the
probability that Player Y will score more than 0 points, and if so,
how many. But those models will lack good predictor variables for the
score/no score probability (except maybe something like age or past injury
history) and so will probably not help us much.
Bench players might be different -- they are more likely to score 0 points
or get 0 minutes, and for reasons which are related to their playing
ability, rather than random events such as injuries. So maybe there are
Tobit possibilities there, although off the top of my head I don't
immediately see a useful application.
> Look at the distributions. Let me know if you see anything.
The obvious one of course is the heavily skewed distributions of bench
players/marginal players such as Foster. But we knew that already.
To me the interesting one is Karl Malone vs Reggie Miller, two players
that we'd listed as possible opposites. It's hard to measure anything
precisely from the diagrams, but here are some eyeball estimates of their
percentile scoring stats (not the "01" graphs but the non-01 ones, which I
assume are career figures):
Percentile   Malone   Miller
   10%         15       10
   25%         21       15
   50%         25       19
   75%         29       23
   90%         33       30
It may or may not be statistically significant, but it's at that 90% level
where I see Reggie starting to narrow the gap with Malone: Reggie's got a
10% probability of getting 30 or more; Malone a 10% chance of 33 or more.
Advantage Malone, but a smaller one than at the lower levels, where when
we look at the percentiles, Malone's always 5-6 points ahead of Reggie.
But not at the 90%ile level.
I.e., overall, the Mailman scores more than Reggie Miller. But when it
comes to really big games, 30 points or so, the gap is smaller. Reggie is
relatively more likely to erupt for 30 or more, given his lower overall
average. (Again, it is far from clear that these results are
statistically significant.)
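The gap at each percentile can be tabulated directly from the eyeball estimates above. This is only arithmetic on those rough readings, so treat the output as no more precise than the estimates themselves; the variable names are mine.

```python
# Eyeball estimates of points scored at each percentile (from the table above).
percentiles = [10, 25, 50, 75, 90]
malone = [15, 21, 25, 29, 33]
miller = [10, 15, 19, 23, 30]

# Malone's lead over Miller at each percentile.
gaps = {p: a - b for p, a, b in zip(percentiles, malone, miller)}

for p in percentiles:
    print(f"{p}%: Malone leads by {gaps[p]}")
# 10%: Malone leads by 5
# 25%: Malone leads by 6
# 50%: Malone leads by 6
# 75%: Malone leads by 6
# 90%: Malone leads by 3
```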
That was essentially a version of a skew or third-moment measure. How
about ordinary dispersion? It's hard to estimate standard deviations from
the graphs, but the interquartile range is simply the distance between the
25th and 75th percentiles: 21-29 or 8 points for Karl, and 15-23 or 8 points
for Reggie.
Same degree of dispersion as measured by interquartile range. BUT: as a
couple of us mentioned, given Malone's higher median and mean, one could
make a good argument that Reggie has a higher relative dispersion. E.g.
his coefficient of variation is almost certainly higher (which, granted, we
would expect from the results that Manley was apparently getting years ago. But
if he has about the same standard deviation as Malone, yet a substantially
lower mean, then under just about anybody's definition we would call that
higher relative dispersion.)
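Since I can't read standard deviations or means off the graphs, one rough stand-in for the coefficient of variation is the interquartile range divided by the median. That substitution is my own assumption here, not the coefficient of variation itself, and the inputs are the same eyeball estimates as before.

```python
# Eyeball estimates: (25th percentile, median, 75th percentile) in points.
quartiles = {
    "Malone": (21, 25, 29),
    "Miller": (15, 19, 23),
}

def relative_iqr(q25, median, q75):
    """Interquartile range scaled by the median (a stand-in for CV)."""
    return (q75 - q25) / median

relative = {name: relative_iqr(*q) for name, q in quartiles.items()}

for name, value in relative.items():
    print(f"{name}: IQR/median = {value:.2f}")
```

With these numbers both players have an 8-point interquartile range, but Miller's is larger relative to his median.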
It may be my imagination, but I do think that Karl Malone's graph shows
negative skew and Reggie Miller's shows positive skew. If so, this is in
line with what we've been claiming about the "reliable starters" vs the
"occasionally hot handed streak shooters".
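One way to check the skew impression without the raw data is a percentile-based skewness measure. The sketch below uses Bowley's quartile measure and Kelly's 10th/90th percentile measure, again fed with my eyeball estimates, so the signs are suggestive at best.

```python
def bowley_skew(q25, q50, q75):
    """Quartile skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1)."""
    return (q75 + q25 - 2 * q50) / (q75 - q25)

def kelly_skew(p10, p50, p90):
    """Percentile skewness: (P90 + P10 - 2*median) / (P90 - P10)."""
    return (p90 + p10 - 2 * p50) / (p90 - p10)

# Eyeball estimates of points at the 10/25/50/75/90th percentiles.
malone = {10: 15, 25: 21, 50: 25, 75: 29, 90: 33}
miller = {10: 10, 25: 15, 50: 19, 75: 23, 90: 30}

for name, p in (("Malone", malone), ("Miller", miller)):
    b = bowley_skew(p[25], p[50], p[75])
    k = kelly_skew(p[10], p[50], p[90])
    print(f"{name}: Bowley = {b:+.3f}, Kelly = {k:+.3f}")
```

On these numbers the quartile measure is zero for both players, while the 10/90 measure comes out slightly negative for Malone and slightly positive for Miller, so whatever skew there is shows up only in the tails.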
Those are my reactions.