Assessing DCDR's Accuracy

How I'm Tracking Performance

This might feel a little heavy for a Sunday email, but I wanted to share this note on DCDR's accuracy ASAP. (You can skip to the end to see the results if you don't want to wade through the maths.)

And if you are very impatient, the bottom line is that the baseline model outperforms random guesses without returning any wholly inaccurate results, which is a solid start.

Why We’re Tracking DCDR’s Accuracy

The accuracy of DCDR's stability assessments matters a great deal, primarily to the user but also to the project's viability. After all, if the assessments aren't better than random guesswork, we aren't adding any value.

So, I want a way to assess and share the accuracy of the output.

However, as obvious as sharing the accuracy of your assessment might sound, it's not something we see a lot. 

This makes sense when you think about it: forecasting is hard, and the chances of you getting a straight-A report are pretty low. So, the smart thing to do is not to publicize how accurate your results are. Or, if you do, you only share those instances where you got it right.

But that means the assessment provider doesn't have much skin in the game, making it too easy to brush over mistakes.

So, for better or worse, I've added an evaluation scoring module to DCDR. It reviews each assessment at the end of its assessment period and compares the actual outcome with the forecast. That way, I can track the effectiveness of the models over time, and users can have confidence in the products they are using.
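
In rough terms, each assessment gets a record that pairs the original forecast with what actually happened. This is just a hypothetical sketch in Python -- the field names and values are made up for illustration, not DCDR's actual schema:

  from dataclasses import dataclass
  from datetime import date
  from typing import Dict, Optional

  # Hypothetical record shape -- field names are illustrative, not DCDR's schema.
  @dataclass
  class Evaluation:
      assessed_on: date               # when the forecast was made
      period_ends: date               # end of the assessment period
      forecast: Dict[str, float]      # probability assigned to each possible outcome
      observed: Optional[str] = None  # filled in once the period closes

  # Example record with made-up values; the observed outcome is added
  # when the period ends, and the forecast/observed pair then feeds the
  # scoring step described below.
  record = Evaluation(
      assessed_on=date(2024, 1, 7),
      period_ends=date(2024, 4, 7),
      forecast={"deteriorate": 0.6, "no_change": 0.3, "improve": 0.1},
  )
  record.observed = "no_change"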

The Brier System for Scoring Forecasts

But how do you score these kinds of things? After all, there are three possible outcomes:

  1. The results matched the forecast: wholly accurate forecast

  2. The situation didn't change when it was supposed to: inaccurate forecast

  3. The situation went in the opposite direction: wholly inaccurate forecast (and misleading)

Next, we need a way to score these results where a wholly inaccurate, opposite result -- which could have serious consequences -- carries an exaggerated penalty and 'punishes' the overall score.

Luckily, Glenn W. Brier has already done the hard work here by developing the Brier Score to assess the effectiveness of things like weather predictions.  

Wikipedia describes this system as:

"...a strictly proper score function or strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities."

To put it another way: you assign a likelihood to each possible outcome, and the score compares those likelihoods with what actually happened. Because the errors are squared, confidence cuts both ways: a confident right answer scores very well, while a confident wrong answer is punished heavily. Hedged answers -- e.g., 33/33/33 -- don't get much 'reward' for being right, because they are getting close to random chance (which we will come back to in a moment, as this is important).
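
To make that concrete, here is a minimal Python sketch of Brier's original multi-category formulation. The outcome labels and probabilities are purely illustrative, not DCDR's actual categories or code:

  # Brier's original formulation: for every possible outcome, take the squared
  # difference between the forecast probability and what happened (1 for the
  # outcome that occurred, 0 for the rest), then sum. Lower is better.
  def brier_score(forecast, outcome):
      return sum(
          (p - (1.0 if label == outcome else 0.0)) ** 2
          for label, p in forecast.items()
      )

  # Illustrative labels and probabilities.
  confident = {"deteriorate": 0.80, "no_change": 0.15, "improve": 0.05}
  hedged = {"deteriorate": 0.34, "no_change": 0.33, "improve": 0.33}

  print(brier_score(confident, "deteriorate"))  # ~0.065 -- confident and right
  print(brier_score(hedged, "deteriorate"))     # ~0.653 -- hedged, so less 'reward'
  print(brier_score(confident, "improve"))      # ~1.565 -- confident and wrong

Note how the hedged forecast lands in the middle: being vague protects you from the worst score, but it doesn't earn much credit either.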

Importantly:

"the lower the Brier score is for a set of predictions, the better the predictions are calibrated" (Wikipedia)

So an exact match scores 0, and a completely wrong answer scores 1 or 2, depending on whether you score only the probability assigned to the event that happened (the two-outcome form) or sum the squared errors across every possible outcome (Brier's original formulation).
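
For example (again, just an illustrative sketch, not DCDR's code):

  # Two-outcome form: score only the probability assigned to the event.
  # Worst case: you said 100% and it didn't happen -> (1 - 0)^2 = 1.
  def brier_two_outcome(p_event, happened):
      return (p_event - (1.0 if happened else 0.0)) ** 2

  print(brier_two_outcome(1.0, False))  # 1.0

  # Original multi-category form: sum squared errors over every outcome.
  # Worst case: 100% on "deteriorate" when "improve" actually happens ->
  # (1 - 0)^2 + (0 - 0)^2 + (0 - 1)^2 = 2.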

This is how the Superforecasters at the Good Judgment Project track their accuracy, so it's a well-recognized approach. (See point 4 here for more.)

Please consider subscribing to Daily Research. That way, the good stuff comes straight to you.
