Estimating Baseball Event Probabilities With log5
December 03, 2020
In his 1981 and 1983 Baseball Abstracts, pioneering sabermetrician Bill James proposed the log5 method for mixing event probabilities, which is similar to metrics used in other fields. Here are two motivating scenarios:
- Team A has winning percentage $p_A$. Team B has winning percentage $p_B$. What is the expected winning percentage of Team A against Team B?
- A pitcher strikes out 20% of batters he faces. A batter strikes out 10% of the time. What is the expected strikeout rate in this matchup?
Let’s start with the Team A vs. Team B scenario. We can derive the log5 formula easily. First, we observe that the winning percentages are essentially probabilities:
$$P(A \text{ wins}) = p_A, \qquad P(B \text{ wins}) = p_B$$
For convenience, we can also define the complement of each, the probability of losing:
$$q_A = 1 - p_A, \qquad q_B = 1 - p_B$$
Obviously, there are only two outcomes:
- A wins & B loses ($p_A$ and $q_B$ occur together)
- B wins & A loses ($p_B$ and $q_A$ occur together)
Next we ask, what’s the chance that A wins a game and B loses a game? That would be
$$p_A \, q_B$$
And what’s the chance that A loses a game and B wins a game? Simply
$$q_A \, p_B$$
This essentially tells us, given that A and B both play a game on the same day, not necessarily against each other, the chance of them achieving opposite outcomes. For example, if A wins 55% of games and B wins 60% of games, the probability that they play independent games on the same day, and A wins but B loses, is:
$$0.55 \times (1 - 0.60) = 0.22$$
On the flip side, the chance that they play independent games on the same day, and A loses but B wins, is:
$$(1 - 0.55) \times 0.60 = 0.27$$
Here’s the twist. If we know that A is playing B on this day, we know the events are not independent. Specifically, we know that these are the only two possible outcomes. A and B can’t both win or both lose. To make this a valid probability distribution, then, the two outcomes must add up to one, which we can guarantee by dividing each probability by their sum. So for A we could write:
$$P(A \text{ beats } B) = \frac{p_A \, q_B}{p_A \, q_B + q_A \, p_B}$$
Continuing the example, the probability of A winning is $0.22 / (0.22 + 0.27) \approx 0.449$. Of course, we can substitute in our previous definitions, $q_A = 1 - p_A$ and $q_B = 1 - p_B$, to get
$$P(A \text{ beats } B) = \frac{p_A (1 - p_B)}{p_A (1 - p_B) + (1 - p_A) \, p_B}$$
I find this a more understandable formulation of the core log5 idea. You may come across the equivalent expression on some websites:
$$P(A \text{ beats } B) = \frac{p_A - p_A \, p_B}{p_A + p_B - 2 \, p_A \, p_B}$$
but this makes no intuitive sense when looking at it.
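As a quick numerical check, the team-vs-team calculation can be sketched in a few lines of Python. The function and argument names here (`log5_team`, `p_a`, `p_b`) are illustrative choices, not part of any library, and the sketch assumes both inputs are strictly between 0 and 1.

```python
def log5_team(p_a, p_b):
    """Estimated probability that Team A beats Team B.

    p_a, p_b: each team's overall winning percentage, as a fraction.
    """
    a_wins_b_loses = p_a * (1 - p_b)  # A wins and B loses, treated as independent
    a_loses_b_wins = (1 - p_a) * p_b  # A loses and B wins, treated as independent
    # Renormalize so the two possible head-to-head outcomes sum to one.
    return a_wins_b_loses / (a_wins_b_loses + a_loses_b_wins)


print(round(log5_team(0.55, 0.60), 3))  # 0.449
```

Swapping the arguments gives B's chance of beating A, and the two results sum to one.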
Pitcher vs. Batter Matchups
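But applying this formula directly doesn’t always make sense, especially for pitcher and hitter matchups. We’ll look at strikeout rates. Notice that in this case, the stats we have for pitchers and batters refer to the same event, so we don’t have to invert one. A basic implementation, with $p_{\text{pitcher}}$ and $p_{\text{batter}}$ as the two strikeout rates, would be
$$P(\text{strikeout}) = \frac{p_{\text{pitcher}} \, p_{\text{batter}}}{p_{\text{pitcher}} \, p_{\text{batter}} + (1 - p_{\text{pitcher}})(1 - p_{\text{batter}})}$$
What if the pitcher’s strikeout rate is 20% and the batter’s is 15%?
$$\frac{0.20 \times 0.15}{0.20 \times 0.15 + 0.80 \times 0.85} = \frac{0.03}{0.71} \approx 0.042$$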
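To see the problem concretely, here is a minimal Python sketch of the unnormalized calculation for a 20% pitcher and a 15% batter. The name `naive_matchup` is made up for this illustration.

```python
def naive_matchup(p_pitcher, p_batter):
    """Unnormalized log5-style mix of two rates for the same event."""
    both = p_pitcher * p_batter                 # both "agree" the event happens
    neither = (1 - p_pitcher) * (1 - p_batter)  # both "agree" it does not
    return both / (both + neither)


print(round(naive_matchup(0.20, 0.15), 3))  # 0.042
```

The result, about 4.2%, is far below either player's own strikeout rate.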
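Hmm. That doesn’t seem right. To find out why, think about what information the stats are actually telling us. The pitcher’s strikeout rate says that, on average, he strikes out 20% of batters. “On average” means that if he faced an average batter, the strikeout rate would be 20%. Similarly, the batter’s strikeout rate of 15% means that, if facing an average pitcher, the batter would strike out 15% of the time. Thus, what we really need to do is normalize the stats by the league average (which represents both the average pitcher and the average batter). Let’s see how this changes our equation, writing $p_{\text{league}}$ for the league average rate:
$$P(\text{strikeout}) = \frac{\dfrac{p_{\text{pitcher}} \, p_{\text{batter}}}{p_{\text{league}}}}{\dfrac{p_{\text{pitcher}} \, p_{\text{batter}}}{p_{\text{league}}} + \dfrac{(1 - p_{\text{pitcher}})(1 - p_{\text{batter}})}{1 - p_{\text{league}}}}$$
This equation is unwieldy. For convenience, let $x = p_{\text{pitcher}}$, $y = p_{\text{batter}}$, $z = p_{\text{league}}$. Then we have:
$$P(\text{strikeout}) = \frac{\dfrac{xy}{z}}{\dfrac{xy}{z} + \dfrac{(1 - x)(1 - y)}{1 - z}}$$
Notice that we only divide by $z$, not $z^2$, because we only need to normalize one of the rates (pitcher or batter). To see how this works, assume the league average rate is 15%, the same as the batter’s. Then we get:
$$\frac{\dfrac{0.20 \times 0.15}{0.15}}{\dfrac{0.20 \times 0.15}{0.15} + \dfrac{0.80 \times 0.85}{0.85}} = \frac{0.20}{0.20 + 0.80} = 0.20$$
Which is exactly what we expect! The pitcher strikes out 20% of average batters. The batter, with a rate of 15%, is average. Thus, we should get a strikeout rate of 20% for the matchup. What if the batter is better than league average, with a rate of 10%?
$$\frac{\dfrac{0.20 \times 0.10}{0.15}}{\dfrac{0.20 \times 0.10}{0.15} + \dfrac{0.80 \times 0.90}{0.85}} \approx \frac{0.1333}{0.1333 + 0.8471} \approx 0.136$$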
A rate of 13.6% once again lines up with expectations. The batter is better than average, so the pitcher’s strikeout rate is brought down. Alternatively, the pitcher’s strikeout rate is higher than average, so the batter’s strikeout rate is brought up.
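The league-normalized calculation can be sketched in Python as follows. The name `log5_matchup` and its parameters are illustrative, and the sketch assumes all three rates are strictly between 0 and 1.

```python
def log5_matchup(p_pitcher, p_batter, p_league):
    """Matchup rate for one event, normalized by the league average rate."""
    # Event happens: product of the two rates, divided once by the league rate.
    event = p_pitcher * p_batter / p_league
    # Event does not happen: same idea using the complements.
    no_event = (1 - p_pitcher) * (1 - p_batter) / (1 - p_league)
    return event / (event + no_event)


# Average batter (15%) against a 20% pitcher: the pitcher's own rate comes back.
print(round(log5_matchup(0.20, 0.15, 0.15), 3))  # 0.2
# Better-than-average batter (10%) pulls the matchup rate down.
print(round(log5_matchup(0.20, 0.10, 0.15), 3))  # 0.136
```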
Why Did We Ignore League Averages for Teams?
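“Hold on a second,” you may be thinking. “Why didn’t we need league averages for the Team A vs. Team B scenario?” In that situation, we would use the league average winning percentage, which is 0.5. (Think about it. In the whole season, across all teams, the number of wins and losses will be equal. Every win has a corresponding loss by the opponent.) Plugging into the equation, with A’s winning percentage $p_A$ as the first rate and B’s losing percentage $1 - p_B$ as the second, we have:
$$\frac{\dfrac{p_A (1 - p_B)}{0.5}}{\dfrac{p_A (1 - p_B)}{0.5} + \dfrac{(1 - p_A) \, p_B}{1 - 0.5}} = \frac{p_A (1 - p_B)}{p_A (1 - p_B) + (1 - p_A) \, p_B}$$
which is the same as our previous equation for the team vs. team matchup. Notice that this hinges on the fact that the league average $z$ satisfies $z = 1 - z$ when $z = 0.5$.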
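This reduction is easy to confirm numerically. The sketch below defines both formulas under illustrative names (`log5_matchup` and `log5_team`) and compares them for the 55% vs. 60% example, passing B's losing percentage as the second rate.

```python
import math


def log5_matchup(x, y, z):
    """League-normalized matchup rate: x and y are the two rates, z the league average."""
    event = x * y / z
    no_event = (1 - x) * (1 - y) / (1 - z)
    return event / (event + no_event)


def log5_team(p_a, p_b):
    """Team-vs-team formula with no explicit league average."""
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + (1 - p_a) * p_b)


p_a, p_b = 0.55, 0.60
general = log5_matchup(p_a, 1 - p_b, 0.5)  # B's losing rate is the "same event" as A winning
team = log5_team(p_a, p_b)
print(math.isclose(general, team))  # True
```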
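We can rework the equation a little more, with some gratuitous algebra, to get something called an odds ratio, where $P$ is the matchup rate and $x$, $y$, $z$ are the pitcher, batter, and league average rates:
$$\frac{P}{1 - P} = \frac{x}{1 - x} \times \frac{y / (1 - y)}{z / (1 - z)}$$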
This conveniently splits our calculation up into independent terms for pitcher and (normalized) batter. We can use this to easily chain new factors. For example, if a pitcher has a home run rate of 2%, and the batter has a league average home run rate (roughly 3.5%, according to baseball-reference.com), then typically the matchup rate will be 2%. However, suppose the ballpark is Coors Field, which has a home run factor of 1.147 (from ESPN), meaning the odds are 1.147 HR to 1 HR in the average ballpark. Combining all this gives the odds for the matchup rate $P$:
$$\frac{P}{1 - P} = \frac{0.02}{0.98} \times \frac{0.035 / 0.965}{0.035 / 0.965} \times 1.147 \approx 0.0234$$
which corresponds to a rate of 2.3%. Thus, the ballpark causes the pitcher to perform slightly worse than usual against the average batter.
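The odds form makes chaining straightforward to sketch in Python. The helper names below (`to_odds`, `to_prob`, `chained_rate`) are made up for illustration, and the sketch assumes, as in the example above, that a park factor can be applied as a plain multiplier on the odds.

```python
def to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


def chained_rate(p_pitcher, p_batter, p_league, *odds_factors):
    """Matchup rate from pitcher odds, league-normalized batter odds, and extra factors."""
    odds = to_odds(p_pitcher) * to_odds(p_batter) / to_odds(p_league)
    for factor in odds_factors:  # e.g. a park factor, assumed to scale the odds
        odds *= factor
    return to_prob(odds)


# 2% pitcher, league-average batter (3.5%), Coors Field factor of 1.147.
print(round(chained_rate(0.02, 0.035, 0.035, 1.147), 3))  # 0.023
```

With no extra factors, `chained_rate` gives the same result as the normalized matchup formula, for example about 13.6% for a 20% pitcher, a 10% batter, and a 15% league average.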
Thus, thanks to Bill James and log5, you can evaluate specific pitcher vs. batter matchups and even include park factors or other environmental conditions. This method is restricted to binary events, but by creating a binary tree, it is possible to simulate all the outcomes of an at bat.