
The Case Against Superforecasting
The keystone chapter (Chapter Zero) of the book I'm working on.
Chapter Zero - The Case in a Blog Post
I- The Origins of Superforecasting
In the early ’80s, Philip Tetlock — then a professor of political science at Berkeley — decided to conduct a study on the accuracy of experts: specifically, people who made their living “commenting or offering advice on political and economic trends”. The study lasted for around twenty years and involved 284 people.1
Over the course of that time Tetlock collected 82,361 predictions, and after comparing those predictions to what actually unfolded he found:
The better known the expert, the less reliable they were likely to be.
The accuracy of the experts was inversely related to the confidence they assigned their predictions, and after a certain point their knowledge as well. (More actual knowledge about, say, Iran led them to make worse predictions about Iran than people who had less knowledge.)
Experts did no better at predicting than the average newspaper reader.
Experts were largely rewarded for making bold and sensational predictions, rather than making predictions which later turned out to be true.2
In 2005 he published these findings as a book, Expert Political Judgment: How Good Is It? How Can We Know? It created quite a stir, and these days it’s difficult to find a popular science book that doesn’t mention Tetlock’s work in one fashion or another.
Having established that experts weren’t any good at prediction, he went on to wonder: is there anyone out there who might do better?
As Tetlock set out on this quest he established the following, somewhat obvious criteria for determining whether someone’s predictions were “better”:
It should be obvious if a prediction is accurate or not. Predictions should clearly state both what will happen and the timeframe in which it will occur.
These predictions must have a confidence level attached. For example, a prediction that Iran will not have nuclear weapons by the end of 2024 might be assigned a confidence level of 90%, while a prediction that the Korean Peninsula will not be reunited might get one of 99%. When these predictions are graded, the ideal is for 90% of the 90% predictions to turn out to be true, not 95% or 85%. In the former case the person making the prediction was underconfident and in the latter case they were overconfident.3
Individual accuracy will be carefully tracked. Not only was this the whole point of the exercise, but Tetlock was interested in seeing if some individuals ended up being better on a broad range of subjects.
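The calibration criterion above is simple enough to sketch in code. Here’s a minimal illustration in Python (the data and the function name are invented for the example, not taken from any real forecasting dataset):

```python
# Group predictions by stated confidence, then compare each bucket's
# stated confidence to the fraction of its predictions that came true.
from collections import defaultdict

def calibration_table(predictions):
    """predictions: list of (stated_confidence, came_true) pairs."""
    buckets = defaultdict(list)
    for confidence, came_true in predictions:
        buckets[confidence].append(came_true)
    # For a well-calibrated forecaster, each bucket's hit rate should
    # match its stated confidence: 90% of the 90% predictions come true.
    return {conf: sum(hits) / len(hits) for conf, hits in sorted(buckets.items())}

# Ten predictions made at 90% confidence, nine of which came true:
preds = [(0.9, True)] * 9 + [(0.9, False)]
print(calibration_table(preds))  # {0.9: 0.9} -> perfectly calibrated
```

A forecaster whose 90% bucket comes in at 0.95 was underconfident; one whose bucket comes in at 0.85 was overconfident.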
Before this, it wasn’t that no one cared about the accuracy of predictions. Tetlock wasn’t the first person to challenge experts on the accuracy of their forecasts, but experts employed various strategies to defend themselves. They could argue they were “almost” correct. Or they could insist that the events they had predicted just hadn’t happened yet; they would still come to pass, and shortly. They might even claim that their predictions had served as a warning, averting the danger they predicted. All of these tactics allowed them to maintain credibility even when their predictions didn’t come to pass. But Tetlock wasn’t interested in expert credibility; he was interested in accuracy, and he plugged all of these holes.
As mentioned, Tetlock wanted to see if there were people who could outperform the experts on all of these measures. After conducting various tournaments, it turned out there were. He called these people Superforecasters, and they became the subject of his next book Superforecasting: The Art and Science of Prediction. Not only were these people better than the “experts”, but the predictions they made were accurate on a broad range of topics. Whereas historically “experts” only made forecasts in one or two subject areas, superforecasters made predictions on all manner of subjects.4 Nothing was off limits.
At first glance discovering that some people are superforecasters seems like an unalloyed good. We have gone from a world where experts made forecasts that were inaccurate, sensational, and limited in scope, to a world where superforecasters provide predictions which are accurate, commonplace, and broad.5 Who could have a problem with this development? Well, as it turns out, me.
II- Harlan Eustice’s Wife
In defense of my opposition, let’s turn to an analogy. I’m taking this analogy from Molly’s Game, a 2017 film written and directed by Aaron Sorkin. The film was adapted from a book of the same name by Molly Bloom about her experiences running exclusive illegal poker games. In the movie, one of Molly’s regular players, “Player X”, urges her to invite wealthy but inexperienced gamblers — commonly called “fish”. Molly is initially puzzled when Player X invites Harlan Eustice, an exceptionally skilled player, to the game. However, Harlan’s composure unravels one night after losing a hand to the table’s weakest player. This triggers what gamblers term “tilt”, a state where disciplined play collapses into reckless behavior. By the end of the night, Harlan had lost a staggering $1.2 million. Prior to this he had made over half a million dollars playing poker. Night after night of skillful play had been wiped out by one exceptionally bad night.
Could superforecasting have prevented this? Let’s imagine a hypothetical scenario wherein, before this all happens, Harlan’s wife wants to know what kind of risk he’s taking by playing poker.6 He assures her that he’s a great poker player, he’s mostly playing against rich idiots, and that gambling is actually a safe way of making money and improving their finances. Harlan appears to have a certain amount of expertise, but his wife has just read Tetlock’s book, and she knows apparent expertise can’t be trusted. So she hires some superforecasters to help her out.
This might sound like a fantastic idea; after all, they’re the best in the business, able to make accurate predictions on a broad range of topics. As Tetlock demonstrated, there’s no more accurate way of forecasting the future. Why wouldn’t she consult superforecasters?
One easy-to-overlook reason for caution is the mismatch between the question she wants answered and the question they’re willing to address. She wants to know if it’s a good idea for Harlan to play poker. The best they can do is offer a bounded prediction. For instance: will Harlan’s poker winnings exceed his losses in 2009? She figures this is close enough and requests the forecast.
Harlan really is a skillful player, and after examining his playing history the superforecasters predict with 90% confidence that he will make money in 2009.
She tells him to go ahead and play, reassured by the superforecasters. He’s clearly a good player, and they need the money. Exiting our hypothetical, Harlan really was a good player, and most nights he returns home with several thousand dollars. Some of the time, in line with the imaginary forecast, he loses some money, maybe even several nights in a row. But he keeps his nightly losses to a few thousand dollars, so on net he’s doing really well. But then one night, in a perfect storm of aggravation, bad luck, and an unknown psychological weak spot, Harlan loses over a million dollars. This completely wipes out all the money he’s ever won, and leaves him over half a million dollars in the hole. Suddenly the difference between the question she wanted answered and the question the superforecasters actually answered is enormous. The superforecasters’ accuracy is entirely intact, but Harlan’s life is destroyed.
What if, before allowing Harlan to play, she had tried to dig deeper? Let’s say she really wanted to know what happens in the 10% of scenarios where he loses. The deeper she digs into the probability of a rare event happening the less useful superforecasting becomes. Superforecasting is about accuracy, data, and models. Rare events are difficult to model and there’s very little data, all of which make accuracy impossible. If superforecasters can’t achieve accuracy, they won’t make a prediction.
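The gap between a probability and a recommendation can be made concrete with a little arithmetic. In the toy Python sketch below (all dollar figures invented for illustration), two gambles both satisfy a “90% chance he makes money” forecast, yet have opposite expected values:

```python
# Two gambles that look identical to a bare "90% chance of winning"
# forecast, but differ enormously once payoffs are taken into account.

def expected_value(outcomes):
    """outcomes: list of (probability, payoff) pairs; probabilities sum to 1."""
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
    return sum(p * payoff for p, payoff in outcomes)

# Gamble A: modest, symmetric stakes.
gamble_a = [(0.90, 5_000), (0.10, -5_000)]
# Gamble B: same win probability, but the losing branch is catastrophic.
gamble_b = [(0.90, 5_000), (0.10, -1_200_000)]

print(expected_value(gamble_a))  # 4000.0
print(expected_value(gamble_b))  # -115500.0
```

The 90% figure is accurate in both cases; only the payoffs reveal that one gamble is reasonable and the other is ruinous. That information is exactly what the bounded prediction leaves out.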
In the story of Harlan and his hapless wife we encounter the two central problems of superforecasting:
First, it has a large blindspot with respect to rare, high-impact events—like Harlan losing everything in a single night.
Second, predictions are naturally assumed to be recommendations. But for something to be a recommendation it has to take into account the impact of all possible outcomes. Given the aforementioned blindspot to high-impact events, it can’t do that.
I understand these are significant claims, so let’s spend the next few sections unpacking them.
III- Black Swans
The vast majority of the time Harlan won, or lost very little, so losing $1.2 million was both rare and high impact. It overwhelmed all of his past poker success. There is a term for such rare, high-impact events. They’re called black swans.
This term comes from a book by Nassim Nicholas Taleb, and I’ll use it a lot going forward.7 Taleb points out that because such events are high impact they matter a great deal to the overall shape of the world. Additionally, their rarity makes them very difficult to predict. The question of whether Harlan should play poker hinged on the outcome of a single night, a single black swan.
Let’s turn to another example of a black swan: the COVID-19 pandemic. We’ll open the discussion with a quote from Alex Tabarrok of Marginal Revolution:
The Coronavirus Pandemic may be the most warned about event in human history. Surprisingly, we even did something about it. President George W. Bush started a pandemic preparation plan and so did Governor Arnold Schwarzenegger in CA but in both cases when a pandemic didn’t happen in the next several years those plans withered away. We ignored the important in favor of the urgent.
It is evident that the US government finds it difficult to invest in long-term projects, perhaps especially in preparing for small probability events with very large costs. Pandemic preparation is exactly one such project. How can we improve the chances that we are better prepared next time?
There are some interesting parallels between the situation of Harlan and the US government. People have been warning about the potential for a pandemic for a very long time. Similarly, there is no shortage of people who will warn about the risks of gambling, particularly the no-limit, illegal kind Harlan was engaged in. The government did make some efforts to hedge against these risks; similarly Harlan was a very meticulous and careful gambler most of the time. Eventually however both became lax and as a result they were unprepared for the black swan when it did arrive. Adding the methodology of superforecasting would have just encouraged this laxity. Superforecasters would have told the government that there was a 98% chance of there not being a pandemic in any given year, hardly a number to inspire a redoubling of effort. Those who truly want to be prepared need a different methodology, not one that asks what is likely to happen, but rather something closer to “What’s the worst that could happen?”
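That 98% figure looks even less reassuring once you let it compound. A quick back-of-the-envelope sketch in Python, assuming (as a simplification) a flat, independent 2% chance of a pandemic each year:

```python
# How a reassuring 2% annual probability compounds over decades.

def cumulative_risk(p_annual, years):
    """Probability of at least one event across `years` independent years."""
    return 1 - (1 - p_annual) ** years

for years in (1, 10, 30, 50):
    print(f"{years:>2} years: {cumulative_risk(0.02, years):.1%}")
# Roughly: 2.0%, 18.3%, 45.5%, 63.6% -> near a coin flip within 30 years
```

An annual number that sounds like a rounding error becomes close to a coin flip within a single generation, which is precisely the horizon on which pandemic preparedness pays off.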
For an example of how this “worst case scenario” attitude might play out, consider the story of 3M. In the wake of SARS, 3M considered maintaining a surge capacity for N95 masks — a form of black swan preparedness. It was immediately obvious that such a thing would be, in their own words, “very unusual in a modern lean supply chain”. It might negatively affect their valuation. But 3M took that “unusual” step and maintained their surge capacity, even though it basically “gathered dust” for 20 years. When COVID-19 hit they were in a much better position than the vast majority of companies. Perhaps you remember the enormous supply chain shocks? Most companies had a “lean supply chain”, and this efficiency ultimately translated into fragility when the black swan emerged.8
There’s a phrase that often gets used for this behavior: “picking up pennies in front of a steamroller”. In a sense that’s what Harlan, and the government, and most companies were doing, while 3M was leaving the pennies on the ground in order to focus on not getting crushed. The former were focused on the 98% of years when there’s not a pandemic. 3M bothered to consider the 2% of the time when there is.
At a societal level, black swans (and approaching steam rollers) are so consequential that they deserve as much focus as we can give them. Unfortunately, our inclination is to ignore them, or to lose focus after a very short amount of time. Superforecasting is poorly suited to helping predict and prepare for such events. It works best in the 98% of the years where there is no pandemic, or the 90% of the time when Harlan is winning money. Since it’s in the area of rare events where all the impact occurs, increasing our reliance on superforecasting should therefore be subject to a very critical appraisal, which is the entire point of this book.
You may have noticed that in all these examples I didn’t offer any actual accounts of superforecasting as it relates to pandemic preparedness. That’s because in the years leading up to the pandemic, superforecasters didn’t have much, if anything, to say about the potential dangers of a global pandemic.
One can argue, as I basically have, that superforecasting is ill-suited to helping us prepare for black swans like the pandemic, and as such it’s entirely appropriate that superforecasters had nothing to say about it. However, given the enormous impact black swans have on the shape of the future, if superforecasting cedes this ground (and as we’ll see in Section V, I don’t believe it has) then it relegates itself to the status of a mostly academic exercise. Sure, it’s useful around the margins, able to predict which shade of the status quo we will end up with, but not something that actually helps us deal with the true uncertainties we face.
Once the pandemic actually got started, we do have an example of this tension to draw on. On February 20, 2020, the Good Judgment Project (GJP)9 arrived at a 3% probability that there would be 200,000+ coronavirus cases a month later. They had to wait until the pandemic was already in progress, and when they did make a prediction they woefully underestimated its severity.10
Right at the moment when preparation and vigilance were needed the most, superforecasting pointed people in exactly the opposite direction.
IV- Misaligned Incentives
The goal of the GJP and similar efforts is to improve forecasting and predictions, specifically by increasing the proportion of accurate predictions. Accuracy is their number one criterion; everything about the system is designed around increasing that metric. It’s how they’re graded. If they say that 90% of the time Harlan is going to win money playing poker, or that in any given year there is a 98% chance there will not be a pandemic, and those percentages end up being accurate, they are judged to be amazing forecasters. This is regardless of what happens the 2% of the time when there is a pandemic, or how much money Harlan might lose on a really bad night. In other words, regardless of whether the impact of 2% of the outcomes far outstrips the impact of the other 98%.
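This is easy to see in the arithmetic of the Brier score, the accuracy metric standardly used in forecasting tournaments. A minimal sketch (the two questions below are invented for illustration):

```python
# Brier score: the mean squared gap between stated probability and
# outcome (0 or 1). Lower is better; every question weighs the same.

def brier(forecasts):
    """forecasts: list of (stated_probability, outcome) with outcome 0 or 1."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Forecaster A misses a trivial question and nails the grave one;
# Forecaster B does the reverse. (Both stated 90% on each question.)
a = [(0.9, 0), (0.9, 1)]  # wrong on "safe-seat incumbent wins?", right on "no pandemic?"
b = [(0.9, 1), (0.9, 0)]  # right on the trivial question, wrong on the grave one

print(brier(a) == brier(b))  # True: the metric can't tell them apart
```

The stakes of the two questions never enter the calculation. That is the formal version of the problem: an impact-blind metric produces impact-blind forecasters.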
This drive for accuracy above all else encourages two secondary behaviors. The easiest way to be accurate is to predict that the status quo will continue: most years there is not a pandemic, and most nights Harlan wins money. But this creates a blindspot for black swans. They’re highly improbable and very difficult to predict. As such they’re not going to get any attention from superforecasters. (As evidenced by the mere 3% of superforecasters who correctly predicted the magnitude of the pandemic even when it was right at the doorstep.)
The second incentive is to increase the number of their predictions. This might seem unobjectionable; why wouldn’t we want more data to evaluate them by? The problem is that not all predictions are equally difficult and not all events are equally impactful. The criteria established by Tetlock at the beginning of this endeavor include no element which takes impact into account. A superforecaster will be judged to be very successful even if all the things they got right were inconsequential and the things they got wrong were hugely impactful.11 In fact, Tetlock encourages people to prioritize questions within a certain range of difficulty. Consider his “first commandment” of superforecasting:
Focus on questions where your hard work is likely to pay off. Don’t waste time either on easy “clocklike” questions (where simple rules of thumb can get you close to the right answer) or on impenetrable “cloud-like” questions (where even fancy statistical models can’t beat the dart-throwing chimp). Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most.
For instance, “Who will win the presidential election twelve years out, in 2028?” is impossible to forecast now. Don’t even try. Could you have predicted in 1940 the winner of the election, twelve years out, in 1952? If you think you could have known it would be a then-unknown colonel in the United States Army, Dwight Eisenhower, you may be afflicted by one of the worst cases of hindsight bias ever documented by psychologists.
The question that should immediately occur to everyone: are black swans more likely to be inside or outside the Goldilocks zone? Almost by definition, they’re going to be outside of this zone.12
All of this would appear to heavily incline superforecasting towards the streetlight effect: the old drunk looks for his keys under the streetlight, not because that’s where he lost them, but because that’s where the light is best. Now to be fair, it’s not a perfect analogy. With respect to superforecasting there are actually lots of useful keys under the streetlight, and the superforecasters are very good at finding them. But based on everything I have already said, it would appear that all of the really important keys are out there in the dark. And as long as superforecasters keep finding keys where the light is good, what inducement do we have to develop other methodologies, ones which actually have some chance of helping with the really important keys?13
Finally, this well-deserved reputation for accuracy leads to the worst misalignment of all: the misalignment between the superforecasters making the predictions and the people consuming them. Consumers aren’t looking for an overall record of accuracy; they’re trying to extract a recommended course of action for a very specific concern.
Should Harlan play poker?
Can we put off pandemic preparedness for another year?
In both cases it’s very easy to see how superforecasting would have led us to answer “yes”.
V- The Hubris of the Superforecaster
If Tetlock, or superforecasters more generally, were upfront about all of the limitations I’ve mentioned, then the discipline could still provide a lot of value. Instead, they have ended up going in the exact opposite direction—reaching for ever more territory.
Instead of acknowledging that superforecasting is ill-suited to the assessment of extreme events, they decided to turn their attention to the most extreme events of all: X-risks. Namely, those risks that might lead to the extinction of all of humanity.
In 2022, Tetlock partnered with Ezra Karger of the Federal Reserve Bank of Chicago to create the Existential Risk Persuasion Tournament (XPT). This exercise involved assessing the X-risk of four different threats: AI, nuclear weapons, bioterror, and climate change. In each area, two groups were gathered: superforecasters and domain experts. Both were tasked with coming up with a probability that the given X-risk would result in there being fewer than 5,000 humans sometime before 2100. Once the two groups had arrived at these assessments, they all met together and tried to bring their two probabilities into agreement through discussion and the sharing of data. This was the persuasion part of the tournament. Interestingly, agreement was eventually reached in every domain except AI risk. In that domain the “median AI expert gave a 3.9% chance to an existential catastrophe owing to AI by 2100. The median superforecaster, by contrast, gave a chance of 0.38%.”
This order-of-magnitude difference, while fascinating, is beside the point. The question, as always, is how is this information going to be used? There is currently a huge, ongoing debate about what should be done about AI risk. Will those wanting to downplay AI risk use the superforecasters’ assessment to argue against preparation? To push against laws and regulations which might be analogous to California’s pandemic preparation plan? Probably. When they do so, are they going to have any awareness of the very tight constraints put on the question? That superforecasters were being asked what the odds were that AI would end up killing 7.95 billion people? And that a whole universe of other potentially bad outcomes was outside the scope of their prediction? We are left in the dark about the odds of AI only killing four billion, or one billion, or even “just” 200 million people, all of which would be considered unimaginable tragedies. Or are they going to round things off to a general sense that the risk is “extremely low, almost not worth bothering about”?
Even if these numbers are never abused (which seems extremely unlikely) how are we to judge the success of these predictions? If AI doesn’t entirely wipe out humanity, were the superforecasters correct to offer a probability of 0.38% or were the domain experts correct with their probability of 3.9%? Or was the true probability 60% and we were just lucky?
By uncovering the poor track record of expert predictions, Tetlock provided a very valuable service. And we should keep track of the accuracy of people’s predictions. However, we should not imagine that accurate predictions have anything to do with surviving the future. When people take these predictions as recommendations, which they inevitably do, the real problems we need to confront, the black swans, become harder to deal with, not easier.
VI- What Should Harlan’s Wife Have Done?
We imagined a hypothetical where Harlan’s wife consulted superforecasters to determine if he should play poker. This ended up being a mistake. What she should have done, both hypothetically and in reality, was listen to thousands of years of accumulated wisdom on the dangers of gambling—warnings which go all the way back to Plato. Had she done this (and had Harlan listened) she would have avoided the gigantic catastrophe.
To be fair, the advice “Don’t gamble”, while straightforward, is fiendishly difficult to implement. And these difficulties have only increased. One major difficulty is that our world is very different from the world of a thousand years ago. In fact, it’s very different from how it was fifty years ago, and even ten years ago. We can imagine that there is some bit of traditional wisdom which speaks to our problems, similar to how warnings about the dangers of gambling would have helped Harlan. We can even confidently assert that this wisdom would be more helpful than knowing that something only has a 10% chance of happening. There is still the matter of translating that wisdom so that it’s applicable to the problems we actually face. But in the same sense that this wisdom was available on the subject of gambling, I would argue that it’s available for most of the difficulties we grapple with. We just might have to dig a little. A significant number of the supplementary chapters are dedicated to doing just that.14
The wisdom which should have led Harlan to avoid poker is not just available in the arena of gambling, it’s available to most people in most endeavors. It does require some effort to interpret, but the greatest difficulty is finding the courage to act on it.
The risk is that by forswearing gambling altogether it’s possible that they would have lost out on a lot of money. Before he went on tilt, Harlan had made half a million dollars, so yes, that’s certainly possible. It’s possible in the same way that California and the federal government saved some money and short-term headaches when they allowed their pandemic plans to wither away.
When we don’t use tools like superforecasting to increase our accuracy, we leave gains on the table. We’re less efficient. But the more we strive for efficiency, the more fragile the world becomes. With every penny we pluck from in front of the steamroller, we increase the risk that one of these times it’s going to catch us. This is where traditional wisdom comes into play. A significant fraction of what counts as a wise act is just avoiding this sort of fragility.
I’ll conclude with one final analogy: Superforecasting consists of placing odds on the pennies lying in front of the steamroller. The steamroller is all of the potential black swans, and in reality black swans are a lot harder to spot than an actual steamroller, but they’re just as close. By focusing on the pennies we draw attention away from the steamroller. This is bad. To return to my single sentence summation: Our key difficulty isn’t predicting the future; it’s surviving it.
This should be the lynchpin of the entire book. If you don’t think it works, well then you’re wrong. It’s amazingly concise, and undoubtedly correct in every particular. If you don’t believe me, become a paid subscriber and let me have it!
Speaking of the book and paid subscribers. I’m already thinking that an every other week release schedule for chapters might have been too ambitious, particularly since I spent all this week doing a road trip to Gary Con, where I hung out with some subscribers. That’s the sort of benefit you can’t put a price on! But if you think it’s worth at least $50/year then you know what to do!
For an expanded history of this sort of forecasting with all of its various precursors and offshoots, see Chapter Two, “The Quest for Accurate Forecasting”.
My criticisms of superforecasting don’t depend on defending the old habit of sensational predictions, but I do think the benefits of such predictions have been ignored, while the harms have been overstated. See Chapter Ten “In Defense of Sensational Predictions” for a deeper exploration of this.
It turns out, for perhaps obvious reasons, that overconfidence is far more common than underconfidence.
“Traditional” forecasters might predict company earnings, but not political results, for example. Superforecasters, on the other hand, routinely make predictions in areas as diverse as sports, economics, entertainment, and politics.
Much of what I say about superforecasting will also apply to prediction markets, and other systems where accuracy is rigorously defined and carefully tracked. I will get deeper into that in the supporting chapters. In particular see Chapter Five “But what about Prediction Markets?”
The wife isn’t hypothetical. Harlan (real name Houston Curtis) had a wife, and the night he lost $1.2 million he was supposed to be at her birthday party. A party he’d spent weeks planning.
Talebian philosophy is the major intellectual framework I’m using for my critique of superforecasting. If you’d like a deeper dive into this philosophy see Chapter Three.
The link between efficiency and fragility is one of those things that people either accept as intuitively obvious or vociferously deny. As such I’ve dedicated Chapter Four to explaining the linkage.
The GJP was an organization set up by Philip Tetlock to engage in and monetize superforecasting. For more information see Chapter One.
Events with a 3% probability do happen 3% of the time, so you could argue that this was an accurate estimate. But the problem is not with their accuracy; rather, it’s with how that “accuracy” leads people to be less prepared for black swans. It should also be mentioned that for this question they asked their stable of superforecasters, and 3% of them made this estimate, which was reported as a 3% chance. Whether that’s really the best methodology is an interesting question.
The GJP predictions for Brexit, Trump in 2016 and Trump in 2024 (as of 2023) were all incorrect. For a deeper dive here see Chapter Four.
If you want to get nitpicky, black swans can be defined with varying levels of rigor. Taleb wouldn’t consider something like the pandemic to be a true “black swan”, precisely because of how foreseeable it was. But as we’ve seen, superforecasters aren’t great at predicting even these lesser events, what might be termed “gray swans”. We’ll get into this distinction more in Chapter Three.
If you’re not entirely convinced that superforecasters have misaligned incentives, or if you want a deeper dive into this subject see Chapter Six.
See for instance Chapter Eight, “Traditional Wisdom? Really?” and Chapter Nine, “In Defense of Sensational Predictions”.