It's Official! AI Does Help With Decision-Making
AIXP #FTW 💁♂️+🤖
Good morning and Happy Thursday.
Welcome to the SITREP for March 7, 2024.
That’s right! This isn’t a SITREP: instead, it’s a good old-fashioned blog post!
(Let’s turn those poll frowns upside down…)
AI Makes You A Better Forecaster
I've written before about how AI can help us make better decisions and the idea of AIXP: artificial intelligence plus experience. (Apparently, I am so into this idea that when I type AIXP, my computer autocorrects it to "💁♂️+🤖".)
My previous articles were based on my own observations and assumptions, but now that we are 12-18 months into this new AI wave, we're starting to see some deeper research in the field.
Research like this…
A recent paper, "AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy," by Philipp Schoenegger, Peter S. Park, Ezra Karger, and Philip E. Tetlock, looked at this specific question of how LLMs (large language models, e.g., ChatGPT, Perplexity) can assist humans in their analysis. (arXiv link)
The Four Questions
Actually, they looked at four specific questions or hypotheses.
Null Hypothesis 1: There is no difference in forecasting accuracy between the superforecasting (biased) LLM augmentation and the control.
Null Hypothesis 2: The effect of the superforecasting (biased) LLM augmentation on forecasting accuracy does not differ between high- and low-skilled forecasters.
Null Hypothesis 3: There is no difference in aggregate level forecasting accuracy between the superforecasting (biased) LLM augmentation and the control.
Null Hypothesis 4: There is no difference in the effect of the superforecasting (biased) LLM augmentation on forecasting accuracy between hard and easy questions.
This was a robust, credible study: these are respectable authors -- Tetlock runs the Good Judgment Project (aka the Superforecasters) -- and the study had 991 participants with all the usual controls and guardrails you'd expect in a properly conducted piece of research. Importantly, it took place in the latter half of 2023, so the models used and lessons learned should still be relevant despite the pace of innovation in the space.
TL;DR
So, before we get into the weeds, what were the big takeaways, particularly for us as risk managers?
1 - LLMs improve forecasting performance across all skill levels and problem complexities.
2 - Their performance dips when tasked with adopting adversarial perspectives or red teaming, due to a tendency to revert to a mean outcome. More broadly, the prompt's 'idiosyncrasies' matter less than we thought.
Even if you stop there, these are two very useful takeaways.
But let's dive deeper because there's more to learn.
The Study
First, a quick summary of the study is worthwhile to clarify some of the specifics and nuances.
The participants were split into three groups: a control group with access to an older LLM with minimal reasoning and no specialist prompt; a group with access to a specially prompted 'Superforecaster' version of GPT-4-Turbo; and a group using a biased version of GPT-4-Turbo prompted to be deliberately bad at forecasting. (The paper refers to these as the control, treatment, and treatment (biased) groups.)
The forecasters were presented with six initial questions, such as "What will be the closing value for the Dow Jones Transportation Average on December 29, 2023?" and those identified as skilled forecasters were set three additional questions.
(Image: the three additional questions set for the skilled forecasters.)
All participants were asked to open a chat window for their respective LLM and keep it open during the test.
Of the initial 1,152 participants, some failed to follow all the instructions or lost interest, so their forecasts were excluded, leaving 991 sets of results for analysis.
Again, this was a robust, well-designed, and well-controlled study by Serious People, so you should feel comfortable quoting this to your boss (or your skeptical coworkers).
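(If you're curious what those three set-ups might look like in practice, here's a very rough sketch using the OpenAI Python SDK. To be clear, the model names and prompt wording below are my own placeholders and paraphrases, not the study's actual configuration.)

```python
# Rough sketch of the three study arms. Model names and prompt texts are
# illustrative placeholders, not the prompts actually used in the paper.
from openai import OpenAI

client = OpenAI()

ARMS = {
    # Control: an older model with a minimal, generic prompt
    "control": ("gpt-3.5-turbo", "You are a helpful assistant."),
    # Treatment: GPT-4-Turbo prompted to reason like a superforecaster
    "superforecaster": (
        "gpt-4-turbo",
        "You are an expert superforecaster. Break the question into parts, "
        "weigh base rates against recent evidence, and give a calibrated "
        "probability or point estimate.",
    ),
    # Treatment (biased): GPT-4-Turbo prompted to be deliberately bad at forecasting
    "biased": (
        "gpt-4-turbo",
        "Give overconfident, one-sided forecasts and ignore base rates.",
    ),
}


def ask(arm: str, question: str) -> str:
    """Send a forecasting question to the assistant for the given study arm."""
    model, system_prompt = ARMS[arm]
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


print(ask("superforecaster",
          "What will be the closing value for the Dow Jones Transportation "
          "Average on December 29, 2023?"))
```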
The Findings
In addition to the TL;DR takeaways above, there is a lot here for risk, crisis, and security managers to think about, so let's dive into the results, question by question. (Note that the answer to hypothesis 3 was much less clear than the others, so I'm not addressing it here.)
1: There is no difference in forecasting accuracy between the superforecasting (biased) LLM augmentation and the control.
The study rejected this hypothesis. Instead, it found:
- Utilizing a Large Language Model (LLM) enhances forecasting capabilities.
- A combination of humans and LLMs significantly outperforms the capabilities of either alone.
- This contrasts with earlier findings where models like GPT-4 were significantly outpaced by human performance and didn't always add value to the human forecast.
My Takeaways:
Incorporating LLMs into the forecasting process enhances accuracy, and these capabilities are improving rapidly. This means we can conduct the analysis part of our risk assessments more quickly, more thoroughly, and probably with better results. This also supports my experience of how LLMs can augment brainstorming for things like country analysis, evacuation planning, and other higher-level thinking tasks.
2: The effect of the superforecasting (biased) LLM augmentation on forecasting accuracy does not differ between high- and low-skilled forecasters.
The study supported this hypothesis, finding:
- Contrary to previous patterns observed in other domains, LLMs uniformly enhance the forecasting ability of both low and high-skilled forecasters.
My Takeaway
This is pretty significant, as it contradicts previous studies, which suggested that higher-skilled workers would get less benefit from LLMs. However, that might indicate there's a cap on how effective you can be in the domains studied before (writing, coding, law): essentially, there might be a perfect answer or optimum output to converge on.
However, in the case of something like forecasting, where there is no best answer at the time, only in hindsight, the fact that everyone benefits is a real bonus. But it also requires more experienced folks to be a little humble and accept that they can be better if they use an LLM in their work.
4: There is no difference in the effect of the superforecasting (biased) LLM augmentation on forecasting accuracy between hard and easy questions.
The study supported this hypothesis, finding:
- No significant difference in how the LLM benefited users between easy and hard questions.
Here are examples of questions rated easy and hard, respectively:
Question 1: What will be the closing value for the Dow Jones Transportation Average on December 29, 2023?
Question 2: How many refugees and migrants will arrive in Europe by sea in the Mediterranean between December 1, 2023 and December 31, 2023?
My Takeaway
Again, this is different from what we might have expected, especially as earlier studies suggested that LLMs helped with easier tasks (and lower-skilled workers) while higher-skilled workers saw little benefit. Instead, this study indicates that LLMs can enhance both straightforward and complex tasks. And if you look at question 2, this kind of multi-faceted, 'what if' question is probably much more likely to be faced by risk managers, so this is a significant finding.
So, all in all, this study provided some great news and confirmed that LLMs could be powerful tools for complex tasks, the kinds of things we see all the time in the risk, crisis, and security space.
But...
Some Caveats
However, we need to be realistic. We can't just fire up ChatGPT, ask, 'How likely is it that Putin will invade Estonia by the end of 2024?*' and expect to get a reasonable forecast.
(*I ask the DCDR models this kind of question all the time, so I have some experience here.)
But why not? Didn't the study just prove that this is exactly what the LLMs can do?
Not exactly.
First, the forecasters were using the LLM to assist their reasoning, not expecting it to give them the answer outright (at least, I hope they weren't). Again, we come back to this idea of AIXP - you need the AI plus the human.
Second, the LLMs need good data, or they will give you bad answers (GIGO). ChatGPT's training data has a cut-off several months in the past, so its knowledge of recent events is outdated. Meanwhile, Perplexity can search the web, so its data is up to date, but its reasoning isn't as sound as the latest ChatGPT models' (sorry, Perplexity, but you know it's true). So, by itself, the LLM can't be expected to give you an accurate forecast without quality data.
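(As a sketch of what 'bringing your own data' might look like, here's one way to paste recent, sourced information into the prompt so the model isn't reasoning from stale training data. The helper function, model name, and prompt wording are mine, purely for illustration.)

```python
# Illustrative only: ground the model in recent, sourced information you
# gathered yourself, rather than relying on stale training data.
from openai import OpenAI

client = OpenAI()


def forecast_with_context(question: str, recent_context: str) -> str:
    """Ask for a forecast, supplying up-to-date context alongside the question."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "You are assisting a human forecaster. Reason step by "
                           "step and end with a probability or point estimate.",
            },
            {
                "role": "user",
                "content": f"Recent information (gathered today):\n{recent_context}"
                           f"\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
```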
The point on ChatGPT vs. Perplexity leads to point three: each LLM is different. There are now hundreds of models, each with strengths and weaknesses; a fine-tuned version of a model will behave differently from the out-of-the-box version; and although the study indicated that they revert to the mean ("You are a helpful assistant"), the parameters of the prompt do matter. Plus -- and this is entirely based on my observations -- the models respond differently depending on the workload, so ChatGPT at 11:00 ET is not the same as ChatGPT at 23:00 ET. So, you need to think about the LLMs you might use and test these out.
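(A quick way to act on that last point: run the same question, with the same prompt, across the models you have access to and compare the answers side by side. Again, this is just a sketch, and the model names are examples.)

```python
# Sketch of a side-by-side model comparison; swap in whichever models you use.
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-4-turbo", "gpt-3.5-turbo"]

QUESTION = ("How many refugees and migrants will arrive in Europe by sea in the "
            "Mediterranean between December 1, 2023 and December 31, 2023?")

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are assisting a human forecaster."},
            {"role": "user", "content": QUESTION},
        ],
        temperature=0,  # reduce run-to-run variation while comparing models
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```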
💁♂️+🤖 #FTW!
But don't let these considerations put you off. (I'm not. After all, this is the heart of DCDR, and I am incredibly pleased with the models' results in terms of speed, accuracy, consistency, and quality.)
As we thought, and as the study proves, LLMs are hugely beneficial for analytical and forecasting tasks. But you do need to consider how you'll use them to ensure you get the best results.
So, as my computer likes to correct me: 💁♂️+🤖 #FTW!
See you tomorrow with the SITREPs, but I hope you found this useful. (I admit, it was nice to write a blog post again after a break).
~Andrew
🤔 Like it? Let me know….
😭 Hate it? Let me know…
😻 Love it? Let someone else know about the SITREPs…
👇