AI and Forecasting
There are many areas where AI might appear to do great, but on deeper inspection is actually awful. Forecasting is one of those areas, and this is that deeper inspection.
With the enormous increase in the power of AI (specifically LLMs), people are using them for all sorts of things, hoping to find areas where they’re better than, or at least cheaper than, humans. FiveThirtyNine (get it?1) is one such attempt, and its creators claim that AI can do forecasting better than humans.
Scott Alexander, of Astral Codex Ten, reviewed the service and concluded that they still have a long way to go. I have no doubt that this is the case, but one can imagine that it won’t always be. What then? My assertion would be that at the point when AI forecasting does “work” (should that ever happen), it will make the problems of superforecasting even worse.2
I- The problems of superforecasting
What are the problems of superforecasting? Well, I’m working on a book-length treatment of the subject, and I’ve also written extensively on the topic in previous posts, but mostly it comes down to the fact that superforecasting is a poor tool for dealing with black swans (rare, high-impact events). Given that reality is constructed out of these high-impact events (despite their rarity), an over-reliance on superforecasting pulls us away from preparing for them and towards increasing the accuracy of predictions with limited importance. Superforecasting excels at making predictions within the range of normal outcomes (the middle of the bell curve). Inside of this zone, expert forecasting of all sorts does better than pundits and normal people. Outside of this zone, things are less clear, but arguably worse.
For the purposes of this post let’s focus on four relevant aspects of superforecasting, and then examine how AI interacts with these elements:
First, the methodology of superforecasting needs data to work. Superforecasters achieve their results by considering a large body of past events and making models out of that history. The author of Superforecasting, Philip Tetlock, calls this the “goldilocks zone”: the methodology works best when it deals with things that aren’t obvious, but that aren’t obscure and unprecedented either.
Second, there’s an incentive towards accuracy. This may seem unobjectionable, but it falls prey to Goodhart’s law: the idea that when you build a process around a specific metric, that metric starts to distort the process. In the case of superforecasting, it pulls the entire endeavor towards predictions where accuracy is easy to determine.
Third, there’s an incentive towards quantity. As with any endeavor, the true goal is prestige and money, and more predictions increase the potential for both.
Finally, there’s the problem of laziness. Most actual problems are difficult and multifaceted. Superforecasting forces one to reduce things to a question which can be evaluated as either true or false at some future point. These are useful data points for making a decision, but they often end up being used as the decision. The prediction is treated as a recommendation, and its limitations are ignored.
My contention would be that AI superforecasting would exacerbate each and every one of these issues.
II- Methodology
I’ve mentioned that superforecasting works best in the goldilocks zone, that area where things are neither too obvious nor too obscure. Those who tout the potential of AI superforecasting imagine that they will be able to push the borders in the direction of the obscure. That they will be able to accurately offer predictions on events outside the reach of a human superforecaster. Mostly they will do this by having access to, and awareness of, far more data than a human superforecaster could ever hope to assemble. This sounds plausible, but is it?
I see two problems that could arise. First, people have noticed that LLMs end up being basically the average of all of the data they’ve ingested. Which is not to say that they necessarily give average responses, more that they cluster around common responses. They seek the mode rather than the mean. In which case this would be another effect pulling things back towards the middle of the bell curve—a problem superforecasting already suffers from. To be fair, it’s often the bell curve of expert consensus, but what we really need out of superforecasting is not one more expert saying the same thing, but something that can tell us when the experts are all wrong.
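As a toy illustration of the mode-versus-mean point (the numbers are entirely invented, and this is not a claim about how any particular model works): if a system returns the most common answer it has seen, the tail of less common, but collectively significant, views simply disappears.

```python
from statistics import mean

# Hypothetical spread of "expert" probability estimates for some outcome.
# Most experts cluster near 20%; a minority take the tail risk seriously.
expert_estimates = [0.20] * 80 + [0.25] * 10 + [0.60] * 7 + [0.90] * 3

# A system that "seeks the mode" returns the single most common answer...
modal_answer = max(set(expert_estimates), key=expert_estimates.count)

# ...even though the dissenters carry real probability mass.
print(f"Modal answer: {modal_answer:.2f}")            # 0.20 -- the consensus
print(f"Mean answer:  {mean(expert_estimates):.2f}")  # ~0.25 -- pulled up by the tail
print(f"Experts above 50%: {sum(e > 0.5 for e in expert_estimates) / len(expert_estimates):.0%}")
```

The consensus answer isn’t wrong, exactly; it just tells you nothing about the occasions when the dissenters turn out to be right.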
To offer up an example that’s still pretty contentious: youth gender medicine. With the release of the Cass Report in the UK, lots of people are coming around to the idea that any form of transitioning while people are still children is a very bad idea. (See for example this recent post by Andrew Sullivan.) Do you think there’s any chance that an AI superforecaster would have seen that coming, that it would have boldly predicted in 2020 that all of these attitudes, which seemed so entrenched, would end up being overturned? If anything we’ve seen the opposite, where, because of worries about AI bias, AIs have been more politically correct than the average person.
What happens if we try to tune the AI to go in the opposite direction? If we try to get it to avoid expert consensus? I think we end up with our second potential problem: AI hallucinations. Rather than getting an AI superforecaster that’s a bold iconoclast that sees things no human forecaster can, I think we’re far more likely to end up with an AI that just makes stuff up, proclaiming very confidently that Israel is going to nuke France. Perhaps we’ll figure out how to banish these hallucinations, though to the extent that we can, it’s not some switch we flip; it consists of training them to be well-behaved within certain bounds, which just takes us back to the problem of consensus.
Of the four areas I listed, this one is probably the most tractable; nevertheless, I think it’s going to present more difficulties than people expect.
III- Accuracy
This one gets a little more esoteric, but it starts with the question of how you train this AI. Perhaps someday (I don’t think it will be soon, but some people do) we’ll have an artificial superintelligence, and it won’t have any of these problems (though it may have a host of other problems). But for now our AIs are all large language models, and these require training and tuning. The chief quality of a superforecaster—the quality we’ve been optimizing for since Tetlock discovered how often “experts” are wrong—is accuracy. Consequently we want to train our AI superforecaster to be accurate. So far so good; what’s wrong with that? The question we return to is: how?
We don’t have a lot of time. No one is going to wait five years; in fact, it’s probable that no one is going to wait five months. Not only are there competitors out there who will beat you to market if you wait, but it’s also clear that, while we’re still deep in AI hype, no one really cares. This leads to a bias in the training regime. The AI is going to be rewarded for predicting short-term things where accuracy is easy to determine. Which is another way of saying that it’s going to be rewarded for predicting the status quo. This is a subtle point, which I’ve expanded on more elsewhere, but I’ll try to make it succinct. Any attempt to quickly train a superforecasting AI to be accurate is going to end up biasing it towards predictions about events in the near future where its confidence level is high, which once again biases us towards the middle of the bell curve. Or as I said above: Goodhart’s law.
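To make this concrete, here’s a minimal sketch with invented numbers (it has nothing to do with how any real forecasting service is actually trained): if the questions being scored rarely change over the evaluation window, a forecaster that confidently predicts “no change” beats one that takes disruption seriously on a standard accuracy metric like the Brier score.

```python
import random

random.seed(0)

def brier(prob, outcome):
    """Brier score for a single binary prediction: lower is better."""
    return (prob - outcome) ** 2

# Hypothetical world: 1,000 questions asking "will the status quo change?"
# Over a short evaluation window, changes are rare (here, about 3%).
outcomes = [1 if random.random() < 0.03 else 0 for _ in range(1000)]

# Forecaster A always predicts "no change" with high confidence.
status_quo = [0.02] * len(outcomes)

# Forecaster B takes tail risk seriously and flags more potential disruptions.
bold = [0.25] * len(outcomes)

score_a = sum(brier(p, o) for p, o in zip(status_quo, outcomes)) / len(outcomes)
score_b = sum(brier(p, o) for p, o in zip(bold, outcomes)) / len(outcomes)

print(f"Status-quo forecaster: {score_a:.4f}")  # lower (better) score...
print(f"Bold forecaster:       {score_b:.4f}")  # ...despite never anticipating change
```

Train against a metric like this over short horizons and the reward gradient points straight at the status quo.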
IV- Quantity
One of the chief benefits of an AI superforecaster, as well as one of its big drawbacks, is that it will be very easy to ask it for a forecast. A normal superforecaster may spend hours or even days on a forecast; an AI superforecaster should be able to spit one out in the same time it takes to get a response from ChatGPT. This will inevitably lead to lots of forecasts being created. From this point one of two things will happen: either this quantity will reveal the AI’s limitations and we’ll decide it doesn’t actually work, or it will work great on a large majority of the questions. Let’s say it’s the latter. What’s wrong with that?
I would contend that, should this be the case, we will end up with an exaggerated view of its utility. It will be impossible to resist its vast number of correct predictions, and entirely natural to ignore the outsized impact of the few predictions it got wrong, precisely because the quantity of the predictions will overwhelm any discussion of their impact. The problem is that the vast majority of the predictions we might ask this AI to make don’t matter, not really. But if it ends up being incredibly well calibrated on all these meaningless questions, then, when it misses the really important questions, we’ll assume that was just a fluke.
To put it another way, I used an example in a previous post of someone who was incredibly well calibrated on all of his predictions, but ended up being wrong about Trump and Brexit. In the end those misses represented only two predictions out of a hundred, but together they had far more effect on the world than the 98 questions he got right. With an AI superforecaster I think you’ll get the same thing, except it will be two predictions out of ten thousand. Which will lead to the strong temptation to entirely dismiss those two. After all, look at the other 9,998 predictions!
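To put some toy numbers on that (the impact weights here are entirely invented, just to show the shape of the problem): a near-perfect accuracy rate can coexist with missing most of what actually mattered.

```python
# 10,000 predictions: 9,998 low-stakes questions answered correctly,
# 2 high-stakes misses. "impact" is a made-up weight for how much
# each question actually mattered to the world.
predictions = [{"correct": True,  "impact": 1}] * 9998 + \
              [{"correct": False, "impact": 5000}] * 2

accuracy = sum(p["correct"] for p in predictions) / len(predictions)

total_impact = sum(p["impact"] for p in predictions)
missed_impact = sum(p["impact"] for p in predictions if not p["correct"])

print(f"Accuracy: {accuracy:.2%}")  # 99.98%
print(f"Share of total impact on the missed questions: {missed_impact / total_impact:.0%}")  # ~50%
```

The headline number (99.98% accurate) is what gets advertised; the second number is the one that determines whether anyone was actually prepared.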
V- Laziness
All of this takes us to the issue of laziness. Preparing for the future is difficult. It’s particularly difficult to prepare for black swans, like the pandemic. The chief utility of an AI superforecaster would be if it could do better than a human superforecaster at predicting these black swans, but so far all of its qualities have pulled us away from the extremes where those events are located. As such, there’s good reason to believe that AI superforecasters won’t be any better at predicting extreme events, and may, in fact, end up being worse. But they may end up being very easy to query, while also possessing a reputation for being very accurate. I have argued that this accuracy will be skewed, but certainly their proponents won’t advertise this limitation.
As a result, I fear we’ll end up in situations where CEOs and politicians are far more likely to make decisions based on the five minutes they spent asking the AI superforecaster their company pays for than to spend hours or days really grappling with a problem. Particularly if the former is sufficient to cover their you-know-what (CYA).
VI- Conclusion
Many people have accused me of worrying too much about the potential problems of prediction proliferation. And they may have a point, but the future is going to be full of strange situations, difficult decisions, and unknowable black swans. There will be no safety in normalcy; no refuge in the status quo. AI superforecasting, rather than helping us confront these difficulties, seems poised to empower our worst instincts: encouraging laziness, choosing quantity over quality, and reifying the status quo even further.
In closing, I ask you to consider the example of school closures during the pandemic. These closures are now widely seen to have gone on too long. What if we had had access to an AI superforecaster to help us decide when children should return to school and under what conditions? Would it have helped? Would we have opened the schools sooner?
It’s hard to imagine what true-false question we would have asked our AI forecaster. Or what data it would have used to make its forecast. And even if we solved those problems it’s hard to imagine that it would have boldly urged us to open the schools without delay.
On the other hand, it’s easy to imagine it telling us what we wanted to hear, and responding in a fashion that confirmed the biases we already had. And, as we saw, in most cases that consisted of keeping the schools closed. Moreover, an AI superforecaster would have given us another avenue of support for that policy. “Well, our expert AI system agrees with our decision to keep the schools closed.” This would have given us one more way of avoiding a very difficult decision. A decision that could only be made by considering a wide variety of factors. A decision that could not be reduced to a simple question whose accuracy could easily be judged.
That decision, like so many decisions we will have to confront, was always going to be messy. In this area, as in so many, AI promises to clean up the mess. I don’t think it can.
At the moment I’ve got my own mess I’m trying to whip into shape. I think they call it a book? I intend to eventually adapt this post into one of the chapters. Feedback is welcome, though it’s expected that the reader of the book will have developed a background with these issues that might make it go down easier. Still, asking “Where the heck are you getting assertion X?” or “I don’t understand Y” is always useful.
FiveThirtyEight is a well-known election prediction site, which was started by Nate Silver (watch this space for a review of his latest book in the next week or so). He has since left the site, and has his own substack. The name FiveThirtyEight comes from the number of Electoral College votes available in each presidential election.
I’m using “superforecasting” as an umbrella term to cover all forms of expert forecasting which have followed and been inspired by Philip Tetlock’s books: Expert Political Judgment and Superforecasting. This includes people following his methodology, as well as prediction markets and any system where accuracy is rigorously defined and carefully tracked.
Two totally different things I think may come together to provide some insight here.
Years ago some neonazis online started putting the names of Jewish people in (((multiple parentheses))). Called an 'echo', the idea was that during this normal period they would casually retweet or comment on the work of journalists or writers but add the little flag if they were Jewish. At some point in the future, when their revolt or seizure of power happened, they could Google the multiple marks to instantly get a list of 'known Jews'.
Also years ago, George Soros wrote multiple books pushing his idea of Reflexivity. In a nutshell, it is the idea that predictions alter the underlying reality. His prediction machine was not an AI or LLM but the financial markets... which is kind of the whole idea of AI, since even the largest players in the markets are only drops in the ocean.
His idea never really caught on. Like a lot of people who try to create a new theory by pushing it as a popular-audience book, it wasn't clear that his work could both be really unique and stand up to falsification. (Among other such failures, IMO, are Jordan Peterson, Steven Pinker, and Jared Diamond, though in this modern short-attention-span age you don't even need a book; Eric Weinstein has had a 'theory of everything' for years now that no one is allowed to see.)
What is interesting about these two roads is:
1. People attempting to exploit an algorithm or type of AI for their own ends. That would be the neonazis, who assume Google would just continue to run as always after they get power, as if it were some NPC in a video game*.
2. The 'prediction engine’ altering the outcomes of the reality it is trying to predict. In the case of bubbles, this is hard to ignore. In most cases, your bank judges your capacity to borrow by your home value plus your credit score. Rising home prices mean you become more worthy of lending. This alters the reality around you. The lumber stores in your area are more likely to increase in value if you're more able to build that mega deck you've always dreamed about. Or perhaps crypto is more valuable if your inclination is to borrow and play the markets. You can be a very prudent person and avoid borrowing even though you could walk out of your bank with a check for $50K anytime you want in less than an hour. But will all your neighbors do the same?
This means a system of 'super forecasters', whether people or AI, that we listen to will also become a group that gets gamed. It also means their predictions will generate feedback loops that alter reality. That may sound mystical, but it isn't: almost all the predictions you are interested in are essentially about human behavior, and since you want these forecasts precisely so you can act on them, you are proposing that humans alter their behavior by tapping into them. That's fundamentally different from a group of super forecasters making predictions about what new particles will come out of a new collider, as opposed to predictions about different lengths of school closures or tax policies.
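A bare-bones sketch of that feedback loop, with entirely invented numbers: a widely believed forecast of rising home prices draws in more borrowers, and the extra demand itself changes the outcome being forecast.

```python
# Toy reflexivity loop: the published forecast influences behavior,
# and that behavior feeds back into the quantity being forecast.

def simulate(forecast_growth, years=5, price=100.0):
    for _ in range(years):
        # An optimistic, widely publicized forecast brings extra buyers...
        extra_buyers = 2000 * forecast_growth
        # ...and the extra demand feeds back into actual price growth.
        actual_growth = 0.01 + extra_buyers / 10_000
        price *= 1 + actual_growth
    return price

print(f"No public forecast:     {simulate(forecast_growth=0.00):.1f}")  # ~105
print(f"'Prices will rise 5%':  {simulate(forecast_growth=0.05):.1f}")  # ~110
```

Same starting conditions, different outcome, and the only thing that changed was the existence of the forecast.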
Perhaps you need to consider adding a bit of Asimov to this idea. Remember in his Foundation Series the ‘super forecasting’ math required that the subject population could not be told of the predictions. In fact, I think they weren’t even allowed to know someone was out there making predictions about them.
* The 'echo' game got gamed on its own. People countered by putting their own names in parentheses to show solidarity; others put the nazis' names in so they would come up as well on any 'purge list'. Google almost certainly took note of the attempt to tweak search results and acted as well. A super forecaster charged with making predictions about what would happen if Nazis got power again would have to know about the echo trend. But if he did, why wouldn't others also know about it? And their efforts to counter it would frustrate any ability to predict that aspect of the question.