The Kalman Filter
Let’s say you’ve been going on more runs recently, and you want to track your mile pace. Obviously, you can find an average mile pace for each run by dividing your time by the distance. However, this could be noisy and unstable—on some days you might be sluggish while on others the perfect confluence of factors might set you up for a personal best. An alternative would be to average all your recorded mile times or take a running average of the last five. This would be better, but if you are steadily improving, the faster times of your recent runs would be dragged back toward older, slower times in the average. In general, the “hidden” information we want—your “true” mile time at this instant—is obscured by a variety of factors, so we can’t measure it directly.
A solution might be to blend two sources of information. The mile times actually measured are noisy and variable, but they do contain some information about the true mile time that we want. On the other hand, we also know that it should improve slightly with each run. We’ll call the first source the “observation” and the second the “model.” We can cross-check the expected true mile time according to the model and the observed mile time to produce a better estimate of the true, hidden value.
How do we combine them? If the model’s output and the observation are different, how do we know which one to believe or in what proportion? The critical piece for solving this puzzle is capturing the uncertainty or variability of each source. If the model is more confident, we’ll rely more on its prediction. If the measurement is more confident, we’ll be more willing to revise the model’s prediction to consider the measured mile time.
In most cases, the uncertainty surrounding our measured value is relatively constant. We might know that the measured mile time often swings up or down by one minute. The uncertainty around the model’s prediction is a little trickier because it involves two things. First, while the model might expect the true mile time to improve by 2% every run, there could be some real variability in this improvement. That is, the true mile time itself could have some variance intrinsically. In the observation, the variation is due to factors related to how we measure the value—temperature, rain, energy level, etc. For the model, we are implying that the actual hidden value also has some randomness due to fluctuations in your biological processes. Two identical runs in identical conditions could affect your true mile time differently. While the model can’t predict this “process variance,” we can guess its magnitude.
However, a second factor also influences the model’s overall variance: the confidence in the value fed into the model! Since we don’t ever get the true mile time, this input value is our previous estimate of the true mile time. Even if we are very confident initially, using the model to recursively predict future values intuitively decreases the confidence in each successive estimate. If we continue to churn along into the future without comparing to real measurements, our confidence in the model’s output will continue to decline since we get farther and farther removed from any grounding in reality.
With all this in mind, we can create our filtering algorithm. We start with an initial guess about the true, hidden mile time. After each run, we use our model to predict the new mile time by decreasing the old time by 2%. We then update our confidence in the model. Next, we update our estimate by adding a “measurement difference” term:
updated estimate = [model's predicted value] + gain × (
    [what we see] - [what we think we'd see based on the predicted value]
)
This “gain,” or weight of the correction term, is what captures how much we believe the observation. Bigger gain? Add in more of the measurement. Smaller gain? Ignore the measurement. With this we can link our uncertainty discussion to the gain. Big model uncertainty? Use a big gain to pull the estimate closer to the measurement. Low model uncertainty? Use a small gain to stay closer to the model’s prediction. The tricky part is that we need to scale by the measurement uncertainty. If we have a big model uncertainty, but the measurement uncertainty is still way bigger, we still want a small gain that puts more trust in the model. The gain ends up looking like this:
       [uncertainty of predicted value]
gain = ---------------------------------------
       [uncertainty of what we think we'd see]

                    [uncertainty of predicted value]
     = ------------------------------------------------------------
       [uncertainty of predicted value] + [measurement uncertainty]
This gain also tells us how to mix the uncertainties to get the final, combined uncertainty of our estimate. Bigger gain? We are closer to the measurement’s confidence. Smaller gain? We are closer to the confidence in the model’s output. With our updated estimate and corresponding uncertainty in hand, we are ready to repeat the process after the next run.
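To make this concrete, here is a minimal Python sketch of one such update. The function name and the example numbers are my own illustrative choices, not values from the text:

```python
def blend(predicted, predicted_var, measured, measured_var):
    """Blend a model prediction with a noisy measurement."""
    gain = predicted_var / (predicted_var + measured_var)  # how much to trust the measurement
    estimate = predicted + gain * (measured - predicted)   # correct the prediction
    estimate_var = (1 - gain) * predicted_var              # combined uncertainty shrinks
    return estimate, estimate_var

# Model predicts a 12.3-minute mile (variance 0.5); today we measured 11.8 (variance 1.0).
print(blend(12.3, 0.5, 11.8, 1.0))
```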
For a more visual explanation, let’s see how things might look without using the observations and just relying on the model’s predictions:
At each step, the black is the “hidden” true value, the blue is the observed time, and the green is our estimate. Initially we are very confident in our estimate, but over time the uncertainty continues to grow and grow, represented by the widening of the green region.
Now let’s see the same problem but using the gain to optimally mix the model and measurement:
The blue is still the measurement, but now we have a red curve representing the model’s predictions separate from the final estimate. The green curve is now the result of mixing the blue and red information. Although the uncertainty in the green estimate grows initially, it eventually stabilizes as the algorithm finds the balancing point between the observation and model. Leaning any further in either direction would only widen the green region of uncertainty: moving towards the observation increases uncertainty thanks to the larger uncertainty of the measurement (blue region), while moving towards the model increases uncertainty since we ignore the useful information provided by the measurement. By plotting the gain at each step, we can see it converge to this balancing point:
Since we start out being very confident in our model, the gain starts small—biasing the estimate towards the model’s prediction—until the uncertainty grows just enough to justify leaning more on the measurement. Then the gain flattens out.
Next, let’s dive into how we implement all this.
Implementing a Kalman Filter
In its basic form, here’s how we apply the Kalman filter to the mile time problem.
- First we define the model as $x_k = 0.98\,x_{k-1}$. This means that our model output is a 2% decrease from the previously estimated time. We also define the process variance $Q$ and the measurement variance $R$, with $Q$ smaller than $R$: natural variation in the process is smaller than the variation experienced by the measurement. We don’t need to specifically define a separate measurement equation because we assume the mile time—the state—is measured directly. Also, since we know our mile time can’t decay all the way to zero, we treat the state as the mile time above seven minutes.
- We pick the initial estimate to be $\hat{x}_0 = 5$, or a mile time of 5 + 7 = 12 minutes. We assume this is exactly right, so the initial estimate of the variance/squared error is just $P_0 = 0$.
- Whenever an observation $z_k$ arrives, we start by calculating the expected time using our model and previous estimate: $\hat{x}_k^- = 0.98\,\hat{x}_{k-1}$
- Next, we figure out the uncertainty/variance of this model output: $P_k^- = 0.98^2\,P_{k-1} + Q$
- At this point we can calculate the gain: $K_k = P_k^- / (P_k^- + R)$
- Combining the model output and actual observation, we get: $\hat{x}_k = \hat{x}_k^- + K_k\,(z_k - \hat{x}_k^-) = 5.01$. Adding this to our minimum time of seven minutes we get 5.01 + 7 = 12.01 minutes.
- Finally, we update our estimate of the error: $P_k = (1 - K_k)\,P_k^-$. Notice how this error is lower than the uncertainty of both the model output and the measurement individually!
We continue to repeat steps 3 to 7 to filter the whole set of observations:
A notebook code example can be found here: kalman.ipynb
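If you just want the gist without opening the notebook, here is a rough sketch of the scalar filter in Python. The variance values q and r and the fake run data are illustrative assumptions, not the values used in the notebook:

```python
import numpy as np

def filter_mile_times(observations, q=0.01, r=1.0, x0=5.0, p0=0.0, decay=0.98):
    """Scalar Kalman filter for the mile-time example (state = minutes above 7:00)."""
    x_hat, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: the model says the time above 7 minutes shrinks by 2% per run.
        x_pred = decay * x_hat
        p_pred = decay**2 * p + q
        # Update: blend the prediction with the observed time using the gain.
        gain = p_pred / (p_pred + r)
        x_hat = x_pred + gain * (z - x_pred)
        p = (1 - gain) * p_pred
        estimates.append(x_hat + 7.0)  # report as a full mile time
    return estimates

# Fake observed mile times (minutes above 7:00) for ten runs.
rng = np.random.default_rng(0)
true_times = 5.0 * 0.98 ** np.arange(10)
observed = true_times + rng.normal(0.0, 1.0, size=10)
print(filter_mile_times(observed))
```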
If you can bear with me for a jump from single values to matrices, we can write the general algorithm as:
- Define constants for the state transition $A$, process covariance $Q$, measurement matrix $H$, and measurement covariance $R$ using knowledge of the system.
- Set $\hat{x}_0$ and $P_0$ as the initial estimate of the state and state covariance (squared error).
- Calculate the new uncorrected estimate: $\hat{x}_k^- = A\,\hat{x}_{k-1}$
- Calculate the covariance of the uncorrected estimate: $P_k^- = A\,P_{k-1}\,A^T + Q$
- Calculate the gain: $K_k = P_k^-\,H^T \left(H\,P_k^-\,H^T + R\right)^{-1}$
- Collect measurement $z_k$
- Correct the estimate using the measurement information: $\hat{x}_k = \hat{x}_k^- + K_k\,(z_k - H\,\hat{x}_k^-)$
- Calculate the variance of the corrected estimate: $P_k = (I - K_k\,H)\,P_k^-$
- Repeat steps 3-8 for each time step
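As a sketch, these steps translate almost line for line into NumPy. The function signature and the way measurements are supplied are my own choices for illustration:

```python
import numpy as np

def kalman_filter(A, Q, H, R, x0, P0, measurements):
    """General linear Kalman filter following the steps above."""
    x_hat, P = x0, P0
    I = np.eye(len(x0))
    history = []
    for z in measurements:
        # Steps 3-4: predict the state and its covariance with the model.
        x_pred = A @ x_hat
        P_pred = A @ P @ A.T + Q
        # Step 5: the gain balances model confidence against measurement confidence.
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        # Steps 6-8: correct the prediction with the measurement.
        x_hat = x_pred + K @ (z - H @ x_pred)
        P = (I - K @ H) @ P_pred
        history.append(x_hat)
    return history
```

For the mile-time problem these would all be 1×1 matrices: A = [[0.98]], H = [[1.0]], with Q and R holding the scalar variances.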
Of course, you may be wondering where these equations come from. How do we know these are the right steps? Time for a little (okay, a lot) of math…
Deriving the Kalman Filter
This derivation is heavily based on Tutorial: The Kalman Filter. I’m not smart enough to figure it out myself.
We start by defining a model that describes how the state/system changes from one step to the next:

$$x_k = A\,x_{k-1} + w_{k-1} \tag{1}$$

where $x_{k-1}$ is the previous state, $A$ is called the state transition matrix, and $w_{k-1}$ is the white noise associated with the process. Since the state itself may be hidden, we can write a measurement equation that links the state to a sensor value we can actually observe:

$$z_k = H\,x_k + v_k \tag{2}$$

where $z_k$ is the measurement, $H$ is the observation matrix, and $v_k$ is the measurement white noise.
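For instance, in the mile-time example everything is one-dimensional: the state decays by 2% per run and is measured directly, so (under that reading, which is my own mapping) the matrices collapse to scalars:

$$A = 0.98, \qquad H = 1, \qquad x_k = 0.98\,x_{k-1} + w_{k-1}, \qquad z_k = x_k + v_k$$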
Our overall goal is to find a way to estimate the true state that minimizes the error. We define the error as:

$$e_k = x_k - \hat{x}_k \tag{3}$$

where $\hat{x}_k$ is our estimate of the state. If we ignore the white noise, we can use the state transition model to guess the state based on our previous estimate:

$$\hat{x}_k^- = A\,\hat{x}_{k-1} \tag{4}$$
We can then use our sensor measurements to “correct” this estimate:

$$\hat{x}_k = \hat{x}_k^- + K_k\,(z_k - H\,\hat{x}_k^-) = \hat{x}_k^- + K_k\,(H\,x_k + v_k - H\,\hat{x}_k^-) \tag{5}$$

where $K_k$ is a parameter that controls the correction based on the difference between the actual and predicted measurements. Notice that in the last step we swapped the observation $z_k$ with our model of the observation from Eq. (2).
Ultimately, we want to minimize the expected value of the squared error between our final estimate and the true state as shown in Eq. (3).
This isn’t arbitrary—minimizing the squared error is equivalent to maximizing the Gaussian likelihood of the model. Of course, this assumes that the state follows a Gaussian/normal distribution.
It turns out that the mean squared error is the trace (sum of diagonal elements) of the error covariance matrix:

$$E\!\left[\,e_k^T e_k\,\right] = \operatorname{tr}\!\left(E\!\left[\,e_k\,e_k^T\,\right]\right) = \operatorname{tr}(P_k)$$

where we label the error covariance matrix as $P_k$. Let’s start by calculating this error covariance:

$$P_k = E\!\left[\,(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T\,\right]$$
Substituting our definition of the estimate from Eq. (5) we get:

$$P_k = E\!\left[\big(x_k - \hat{x}_k^- - K_k\,(H\,x_k + v_k - H\,\hat{x}_k^-)\big)\big(x_k - \hat{x}_k^- - K_k\,(H\,x_k + v_k - H\,\hat{x}_k^-)\big)^T\right]$$

This is a mess. However, we can make a very helpful observation by looking at one of the factors (both are the same, but one is transposed):

$$x_k - \hat{x}_k^- - K_k\,(H\,x_k + v_k - H\,\hat{x}_k^-) = (I - K_k H)(x_k - \hat{x}_k^-) - K_k\,v_k$$
The matrices $K_k$ and $H$ are deterministic constants for a given step $k$, as is the initial prediction $\hat{x}_k^-$. On the other hand, both $x_k$ and $v_k$ are random variables as defined in Eqs. (1) and (2). The properties of a variance/covariance matrix (see Covariance matrix) tell us that

$$\operatorname{Var}(a + b) = \operatorname{Var}(a) + \operatorname{Var}(b) + \operatorname{Cov}(a, b) + \operatorname{Cov}(b, a), \qquad \operatorname{Var}(M a) = M\,\operatorname{Var}(a)\,M^T$$

If we let $a = (I - K_k H)(x_k - \hat{x}_k^-)$ and $b = -K_k\,v_k$, we can split up the variance/covariance of the error into

$$P_k = \operatorname{Var}(a + b) = \operatorname{Var}(a) + \operatorname{Var}(b) + \operatorname{Cov}(a, b) + \operatorname{Cov}(b, a)$$

Since the noise causing variation in the two random variables is white noise, we can assume that the two noise values $w_{k-1}$ and $v_k$ are not correlated with each other. This means that $\operatorname{Cov}(a, b) = \operatorname{Cov}(b, a) = 0$, which simplifies our expression to

$$P_k = \operatorname{Var}\!\big((I - K_k H)(x_k - \hat{x}_k^-)\big) + \operatorname{Var}\!\big(K_k\,v_k\big)$$
Inserting our definitions we then get

$$P_k = E\!\left[\big((I - K_k H)(x_k - \hat{x}_k^-)\big)\big((I - K_k H)(x_k - \hat{x}_k^-)\big)^T\right] + E\!\left[(K_k\,v_k)(K_k\,v_k)^T\right]$$

We can distribute the transpose and then remove the constants from the expectations:

$$P_k = (I - K_k H)\,E\!\left[(x_k - \hat{x}_k^-)(x_k - \hat{x}_k^-)^T\right](I - K_k H)^T + K_k\,E\!\left[v_k\,v_k^T\right]K_k^T$$

If we define the measurement noise variance as

$$R = E\!\left[v_k\,v_k^T\right]$$

and the variance of the initial error as

$$P_k^- = E\!\left[(x_k - \hat{x}_k^-)(x_k - \hat{x}_k^-)^T\right]$$

we can further simplify to get

$$P_k = (I - K_k H)\,P_k^-\,(I - K_k H)^T + K_k\,R\,K_k^T$$
Returning to our goal, we want to minimize the trace of this matrix. We can make things a little easier by first expanding our equation for $P_k$:

$$P_k = P_k^- - K_k\,H\,P_k^- - P_k^-\,H^T K_k^T + K_k\left(H\,P_k^-\,H^T + R\right)K_k^T$$

Applying the trace with properties shown in Trace (linear algebra), we get

$$\operatorname{tr}(P_k) = \operatorname{tr}(P_k^-) - 2\,\operatorname{tr}\!\left(K_k\,H\,P_k^-\right) + \operatorname{tr}\!\left(K_k\left(H\,P_k^-\,H^T + R\right)K_k^T\right)$$

We can minimize with respect to our free parameter $K_k$ by finding the critical point where the derivative of the trace with respect to $K_k$ is zero (see Properties of the Trace and Matrix Derivatives):

$$\frac{d\,\operatorname{tr}(P_k)}{d K_k} = -2\,(H\,P_k^-)^T + 2\,K_k\left(H\,P_k^-\,H^T + R\right) = 0$$
Rearranging this equation gives us:

$$K_k\left(H\,P_k^-\,H^T + R\right) = (H\,P_k^-)^T = P_k^-\,H^T$$

$$K_k = P_k^-\,H^T\left(H\,P_k^-\,H^T + R\right)^{-1}$$

where the last step comes from knowing that $P_k^-$ is symmetric. Roughly speaking, the final line of Eq. (10) shows that $K_k$ shrinks as the measurement variance $R$ grows. Intuitively, this means that as the uncertainty around the measurement grows, we down-weight the measurement-based correction factor in Eq. (5). However, if $P_k^-$—the uncertainty around our initial guess $\hat{x}_k^-$—grows, so will $K_k$. This means that as we become less sure about our guess, we lean more heavily on the information from the measurement in Eq. (5).
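As a sanity check against the plain-English gain from the beginning of the post: in the scalar mile-time case the measurement matrix is just $H = 1$, so the gain collapses to the same ratio of uncertainties,

$$K_k = \frac{P_k^-}{P_k^- + R}$$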
We still need one more piece of information. What actually is $P_k^-$? Let’s return to its definition:

$$P_k^- = E\!\left[(x_k - \hat{x}_k^-)(x_k - \hat{x}_k^-)^T\right]$$

Substituting our definition of $x_k$ from Eq. (1) and $\hat{x}_k^-$ from Eq. (4) we see

$$P_k^- = E\!\left[\big(A\,x_{k-1} + w_{k-1} - A\,\hat{x}_{k-1}\big)\big(A\,x_{k-1} + w_{k-1} - A\,\hat{x}_{k-1}\big)^T\right] = E\!\left[\big(A\,(x_{k-1} - \hat{x}_{k-1}) + w_{k-1}\big)\big(A\,(x_{k-1} - \hat{x}_{k-1}) + w_{k-1}\big)^T\right]$$

Swapping in the definitions of variance (the white-noise cross terms drop out just as before) we get

$$P_k^- = A\,E\!\left[(x_{k-1} - \hat{x}_{k-1})(x_{k-1} - \hat{x}_{k-1})^T\right]A^T + E\!\left[w_{k-1}\,w_{k-1}^T\right]$$

Notice that $E\!\left[(x_{k-1} - \hat{x}_{k-1})(x_{k-1} - \hat{x}_{k-1})^T\right] = P_{k-1}$ from our definition in Eq. (9). If we let $Q = E\!\left[w_{k-1}\,w_{k-1}^T\right]$ be the variance of the process noise from Eq. (1), we can write

$$P_k^- = A\,P_{k-1}\,A^T + Q$$
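Specializing to the scalar mile-time setup with $A = 0.98$, this is exactly the prediction-variance step used in the implementation section above:

$$P_k^- = 0.98^2\,P_{k-1} + Q$$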
Phew, we made it! If you go back to the end of the last section, you’ll find all these equations form the steps of the matrix-based Kalman algorithm!