My Final Case Against Superforecasting (with criticisms considered, objections noted, and assumptions buttressed)
I.
One of my recent posts, Pandemic Uncovers the Limitations of Superforecasting, generated quite a bit of pushback. Given that in-depth debate is always valuable, and that this subject is, at least for me, a particularly important one, I thought I’d revisit it and attempt to further answer some of the objections that were raised the first time around, while also clarifying some points that people misinterpreted or gave insufficient weight to.
To begin with, you might wonder how anybody could be opposed to superforecasting, and what that opposition would be based on. Isn’t any effort to improve forecasting obviously a good thing? Well, for me it’s an issue of survival and existential risk. And while questions of survival are muddier in the modern world than they were historically, I would hope that everyone would at least agree that it’s an area that requires extreme care and significant vigilance, and that, even if you are inclined to disagree with me, questions of survival call for maximum scrutiny. Given that we’ve already survived the past, most of our potential difficulties lie in the future, and it would be easy to assume that being able to predict that future would go a long way towards helping us survive it. But that is where the superforecasters and I part company, and it is the crux of the argument.
Fortunately or unfortunately, as the case may be, we are at this very moment undergoing a catastrophe, one which at one point lay in the future, but not any more. It is a catastrophe we now wish our past selves and governments had done a better job preparing for. And here we come to the first issue: preparedness is different from prediction. An eventual pandemic was predicted about as well as anything could have been; prediction was not the problem. Alex Tabarrok made this point recently on Marginal Revolution:
The Coronavirus Pandemic may be the most warned about event in human history. Surprisingly, we even did something about it. President George W. Bush started a pandemic preparation plan and so did Governor Arnold Schwarzenegger in CA but in both cases when a pandemic didn’t happen in the next several years those plans withered away. We ignored the important in favor of the urgent.
It is evident that the US government finds it difficult to invest in long-term projects, perhaps especially in preparing for small probability events with very large costs. Pandemic preparation is exactly one such project. How can we improve the chances that we are better prepared next time?
My argument is that we need to be looking for the methodology that best addresses this question, and not merely how we can be better prepared for pandemics, but better prepared for all rare, high impact events.
Another term for such events is “black swans”, after the book by Nassim Nicholas Taleb, which is the term I’ll be using going forward. (Though Taleb himself would say that, at best, this is a grey swan, given how inevitable it was.) Tabarrok’s point, and mine, is that we need a methodology that best prepares us for black swans, and I would submit that superforecasting, despite its many successes, is not that method. In fact, it may play directly into some of the weaknesses of modernity that encourage black swans, and rather than helping to prepare for such events, it may actually discourage such preparedness.
What are these weaknesses I’m talking about? Tabarrok touched on them when he noted that, “It is evident that the US government finds it difficult to invest in long-term projects, perhaps especially in preparing for small probability events with very large costs.” Why is this? Why were the US and California plans abandoned after only a few years? Because the modern world is built around the idea of continually increasing efficiency. And the problem is that there is a significant correlation between efficiency and fragility. A fragility which is manifested by this very lack of preparedness.
One of the posts leading up to the one where I criticized superforecasting was built around exactly this point, and related the story of how 3M considered maintaining a surge capacity for masks in the wake of SARS, but it was quickly apparent that such a move would be less efficient, and consequently worse for them and their stock price. The drive for efficiency led to them being less prepared, and I would submit that it’s this same drive that led to the “withering away” of the US and California pandemic plans.
So how does superforecasting play into this? Well, how does anyone decide where gains in efficiency can be realized or, conversely, where they need to be more cautious? By forecasting. And if a company or a state hires the Good Judgment Project to tell them what the chances are of a pandemic in the next five years, and GJP comes back with the number 5% (i.e. an essentially accurate prediction), are those states and companies going to use that small percentage to justify continuing their pandemic preparedness, or are they going to use it to justify cutting it? I would assume the answer to that question is obvious, but if you disagree then I would ask you to recall that companies almost always have a significantly greater focus on maximizing efficiency/profit than on preparing for “small probability events with very large costs”.
Accordingly, the first issue I have with superforecasting is that it can be (and almost certainly is) used as a tool for increasing efficiency, which is basically the same as increasing fragility. Rather than being used as a tool for determining which things we should prepare for, it’s used as an excuse to avoid preparing for black swans, including the one we’re in the middle of. It is by no means the only tool being used to avoid such preparedness, but that doesn’t let it off the hook.
Now I understand that the link between fragility and efficiency is not going to be as obvious to everyone as it is to me, and if you’re having trouble making the connection I would urge you to read Antifragile by Taleb, or at least the post I already mentioned. Also, even if you find the link tenuous I would hope that you would keep reading because not only are there more issues but some of them may serve to make the connection clearer.
II.
If my previous objection represented my only problem with superforecasting, then I would probably agree with people who say that as a discipline it is still, on net, beneficial. But beyond providing a tool that states and companies can use to justify ignoring potential black swans, superforecasting is also less likely to consider the probability of such events in the first place.
When I mentioned this point in my previous post, the people who disagreed with me had two responses. First, they pointed out that the people making the forecasts had no input on the questions they were being asked to make forecasts on, and consequently no ability to be selective about the predictions they were making. Second, and more broadly, they claimed that I needed to do more research and that my assertions were not founded in a true understanding of how superforecasting worked.
In an effort to kill two birds with one stone, since that last post I have read Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner, which I have to assume comes as close to being the bible of superforecasting as anything. Obviously, like anyone, I’m going to suffer from confirmation bias, and I would urge you to take that into account when I offer my opinion on the book. With that caveat in place, here, from the book, is the first commandment of superforecasting:
1) Triage
Focus on questions where your hard work is likely to pay off. Don’t waste time either on easy “clocklike” questions (where simple rules of thumb can get you close to the right answer) or on impenetrable “cloud-like” questions (where even fancy statistical models can’t beat the dart-throwing chimp). Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most.
For instance, “Who will win the presidential election twelve years out, in 2028?” is impossible to forecast now. Don’t even try. Could you have predicted in 1940 the winner of the election, twelve years out, in 1952? If you think you could have known it would be a then-unknown colonel in the United States Army, Dwight Eisenhower, you may be afflicted by one of the worst cases of hindsight bias ever documented by psychologists.
The question which should immediately occur to everyone: are black swans more likely to be in or out of the Goldilocks zone? It would seem that, almost by definition, they’re going to be outside of this zone. Also, just based on the book’s description of the zone and all the questions I’ve seen both in the book and elsewhere, it would seem clear they’re outside of the zone. Which is to say that even if such predictions are not misused, they’re unlikely to be made in the first place.
All of this would appear to heavily incline superforecasting towards the streetlight effect, where the old drunk looks for his keys under the streetlight, not because that’s where he lost them, but because that’s where the light is the best. Now to be fair, it’s not a perfect analogy. With respect to superforecasting there are actually lots of useful keys under the streetlight, and the superforecasters are very good at finding them. But based on everything I have already said, it would appear that all of the really important keys are out there in the dark, and as long as superforecasters are finding keys under the streetlight what inducement do they have to venture out into the shadows looking for keys? No one is arguing that the superforecasters aren’t good, but this is one of those cases where the good is the enemy of the best. Or more precisely it makes the uncommon the enemy of the rare.
It would be appropriate to ask at this point: if superforecasting is good, then what is “best”? I intend to dedicate a whole section to that topic before this post is over, but for the moment I’d like to direct your attention to Toby Ord, and his recent book The Precipice: Existential Risk and the Future of Humanity, which I recently finished. (I’ll have a review of it in my month end round up.) Ord is primarily concerned with existential risks, risks which could wipe out all of humanity. Or to put it another way, the biggest and blackest swans. A comparison of his methodology with the methodology of superforecasting might be instructive.
Ord spends a significant portion of the book talking about pandemics. On his list of eight anthropogenic risks, pandemics take up 25% of the spots (natural pandemics get one spot and artificial pandemics get the other). On the other hand, if one were to compile all of the forecasts made by the Good Judgment Project since the beginning, what percentage of them would be related to potential pandemics? I’d be very much surprised if it wasn’t significantly less than 1%. While such measures are crude, one method pays a lot more attention than the other, and in any accounting of why we weren’t prepared for the pandemic, a lack of attention would certainly have to be high on the list.
Then there are Ord’s numbers. He provides odds that various existential risks will wipe us all out in the next 100 years. The odds he gives for that happening with a naturally arising pandemic are 1 in 10,000; the odds for an engineered pandemic are 1 in 30. The foundation of superforecasting is the idea that we should grade people’s predictions. How does one grade predictions of existential risk? Clearly compiling a track record would be impossible, they’re essentially unfalsifiable, and beyond all that they’re well outside the Goldilocks zone. Personally I’d almost rather that Ord didn’t give odds and just spent his time screaming, “BE VERY, VERY AFRAID!” But he doesn’t; he provides odds and hopes that by providing numbers people will take him more seriously than if he just yells.
From all this you might still be unclear why Ord is better than the superforecasters. It’s because our world is defined by black swan events, and we are currently living out an example of that: our current world is overwhelmingly defined by the pandemic. If you were to selectively remove knowledge of just the pandemic from someone trying to understand the world, absolutely nothing would make sense. Everyone understands this when we’re talking about the present, but it also applies to all past forecasting we engaged in. 99% of all superforecasting predictions lent nothing to our understanding of this moment, but 25% of Ord’s did. Which is more important: getting our 80% predictions about uncommon events to 95%, or gaining any awareness, no matter how small, of a rare event which will end up dominating the entire world?
III.
At their core all of the foregoing complaints boil down to the idea that the methodology of superforecasting fails to take into account impact. The impact of not having extra mask capacity if a pandemic arrives. The impact of keeping to the Goldilocks zone and overlooking black swans. The impact of being wrong vs. the impact of being right.
When I made this claim in the previous post, once again several people accused me of not doing my research. As I mentioned, since then I have read the canonical book on the subject, and I still didn’t come across anything that really spoke to this complaint. To be clear, Tetlock does mention Taleb’s objections, and I’ll get to that momentarily, but I’m actually starting to get the feeling that neither the people who had issues with the last point, nor Tetlock himself really grasp this point, though there’s a decent chance I’m the one who’s missing something. Which is another point I’ll get to before the end. But first I recently encountered an example I think might be useful.
The movie Molly’s Game is about a series of illegal poker games run by Molly Bloom. The first set of games she runs is dominated by Player X, who encourages Molly to bring in fish, bad players with lots of money. Accordingly, Molly is confused when Player X (reportedly based on Tobey Maguire) brings in Harlan Eustice, who ends up being a very skillful player. That is until one night when Eustice loses a hand to the worst player at the table. This sets him off, changing him from a calm and skillful player into a compulsive and horrible player, and by the end of the night he’s down $1.2 million.
Let’s put some numbers on things and say that 99% of the time Eustice is conservative and successful and he mostly wins. On average, conservative Eustice ends the night up by $10k. But 1% of the time, Eustice is compulsive and horrible, and during those times he loses $1.2 million. And so our question is: should he play poker at all? (And should Player X want him at the same table he’s at?) The math is straightforward: 99 winning nights at $10k bring in $990k, one losing night costs $1.2 million, and so his expected return over 100 average games is -$210k. It would seem clear that the answer is "No, he shouldn't play poker."
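The arithmetic behind that figure can be checked with a minimal sketch. The probabilities and dollar amounts are the illustrative ones from the paragraph above, not data from the movie or from real games, and the function and variable names are made up for this example.

```python
# Illustrative numbers taken from the paragraph above (assumptions, not data).
P_CALM = 0.99            # share of nights Eustice plays conservatively
WIN_CALM = 10_000        # average result of a conservative night, in dollars
LOSS_TILT = -1_200_000   # result of a compulsive night, in dollars


def expected_value_per_night(p_calm: float, win: float, loss: float) -> float:
    """Probability-weighted average result of a single night."""
    return p_calm * win + (1 - p_calm) * loss


per_night = expected_value_per_night(P_CALM, WIN_CALM, LOSS_TILT)
per_100_nights = 100 * per_night

print(f"Expected result per night:       {per_night:,.0f}")
print(f"Expected result over 100 nights: {per_100_nights:,.0f}")
```

With these inputs the expectation works out to roughly -$2,100 per night, which is where the -$210k over 100 games comes from.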
But superforecasting doesn't deal with the question of whether someone should "play poker". It works by considering a single question, answering that question, and assigning a confidence level to the answer. So in this case they would be asked the question, "Will Harlan Eustice win money at poker tonight?" To which they would say, "Yes, he will, and my confidence level in that prediction is 99%." That prediction is in fact accurate, and would result in a fantastic Brier score (the grading system for superforecasters), but by repeatedly following that advice Eustice eventually ends up destitute.
This is what I mean by impact, and why I'm concerned about the potential black swan blindness of superforecasting. When things depart from the status quo, when Eustice loses money, it's often so dramatic that it overwhelms all of the times when things went according to expectations. The smartest behavior for Eustice, the recommended behavior, should be to never play poker, regardless of the fact that 99% of the time he makes thousands of dollars an hour. Furthermore, this example illustrates some subtleties of forecasting which often get overlooked:
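To see how a forecast can grade well while the underlying decision loses money, here is a rough sketch using the same made-up numbers. It uses the simple binary form of the Brier score (the mean squared difference between the forecast probability and the 0/1 outcome, where 0 is perfect); scoring conventions vary between sources, and the 100-night record below simply assumes the long-run frequencies show up exactly. All names are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical record: 99 winning nights (1) and one losing night (0).
outcomes = [1] * 99 + [0]

# "Eustice wins tonight" forecast at 99% every night, versus a shrug of 50%.
confident_score = brier_score([0.99] * 100, outcomes)
coin_flip_score = brier_score([0.5] * 100, outcomes)

# Money actually won or lost over the same 100 nights.
total_winnings = sum(10_000 if won else -1_200_000 for won in outcomes)

print(f"Brier score, 99% forecast: {confident_score:.4f}")
print(f"Brier score, 50% forecast: {coin_flip_score:.4f}")
print(f"Total winnings:            {total_winnings:,}")
```

The 99% forecast scores far better than the 50% one, yet the score says nothing about the size of the one loss, which is the only number that matters to Eustice's bank account.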
If it's a weekly poker game you might expect the 1% outcome to pop up every two years, but it could easily take five years, even if you keep the probability the same. And if the probability is off by even a little bit (small probabilities are notoriously hard to assess) it could take even longer to see. Which is to say that forecasting during that time would result in continually increasing confidence, and greater and greater black swan blindness.
The benefits of wins are straightforward and easy to quantify. But the damage associated with the one big loss is a lot more complicated and may carry all manner of second order effects. Harlan may go bankrupt, get divorced, or even have his legs broken by the mafia. All of which is to say that the -$210k expected return is the best case. Bad things are generally worse than expected. (For example, it's been noted that even though people foresaw a potential pandemic, plans almost never touched on the economic disruption which would attend it, which ended up being the biggest factor of all.)
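The first subtlety above, how long a 1% outcome can stay hidden, can be made concrete with a short sketch. It assumes, purely for illustration, that each weekly game is independent and that the chance of a blow-up is the same every night; the 0.5% rate stands in for a probability that is "off by a little bit". The names are hypothetical.

```python
WEEKS_PER_YEAR = 52


def prob_no_blowup(p_blowup: float, n_games: int) -> float:
    """Chance of seeing zero blow-ups in n_games independent games."""
    return (1 - p_blowup) ** n_games


for p in (0.01, 0.005):  # the stated 1%, and a slightly lower hypothetical rate
    for years in (2, 5):
        games = years * WEEKS_PER_YEAR
        chance = prob_no_blowup(p, games)
        print(f"p={p:.3f}, {years} years: {chance:.0%} chance of no blow-up yet")
```

Under these assumptions, about a third of two-year stretches contain no blow-up at all at a 1% nightly chance, and a clean five-year run still happens about 7% of the time. If the true chance is 0.5%, roughly a quarter of five-year stretches look spotless, and a forecaster watching that record would only grow more confident.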
Unless you're Eustice, you may not care about the above example, or you may think that it's contrived, but in the realm of politics this sort of bet is fairly common. As an example, cast your mind back to the Cuban Missile Crisis. Imagine that, in addition to his advisors, Kennedy could also have drawn on the Good Judgment Project and superforecasting. Further imagine that the GJP comes back with the prediction that if we blockade Cuba the Russians will back down, a prediction they're 95% confident of. Let's further imagine that they called the odds perfectly. In that case, should the US have proceeded with the blockade? Or should we have backed down and let the USSR base missiles in Cuba? When you just look at that 95% the answer seems obvious. But shouldn’t some allowance be made for the fact that the remaining 5% contains the possibility of all out nuclear war?
As near as I can tell, that part isn't explored very well by superforecasting. Generally they get a question, they provide the answer, and they assign a confidence level to that answer. There’s no methodology for saying that, despite the 95% probability of success, such gambles are bad ideas because if we make enough of them eventually we’ll “go bust”. None of this is to say that we should have given up and submitted to Soviet domination because it's better than a full on nuclear exchange. (Though there were certainly people who felt that way.) More that it was a complicated question with no great answer (though it might have been a good idea for the US not to put missiles in Turkey). But by providing a simple answer with a confidence level of 95%, superforecasting gives decision makers every incentive to replace the true, and very difficult, questions of nuclear diplomacy with the easy question of whether to blockade. Rather than considering the difficult and long term question of whether Eustice should gamble at all, we're substituting the easier question of just whether he should play poker tonight.
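The "go bust" worry can be put in rough numbers. The sketch below assumes a series of independent gambles that each succeed 95% of the time, which is a simplification for illustration and not a claim about how real crises behave. The function names are hypothetical.

```python
import math


def prob_at_least_one_failure(p_success: float, n_gambles: int) -> float:
    """Chance that at least one of n independent gambles goes wrong."""
    return 1 - p_success ** n_gambles


def gambles_until_failure_likely(p_success: float, threshold: float = 0.5) -> int:
    """Smallest n at which the chance of at least one failure reaches threshold."""
    return math.ceil(math.log(1 - threshold) / math.log(p_success))


for n in (1, 5, 14, 50):
    chance = prob_at_least_one_failure(0.95, n)
    print(f"{n:>2} gambles at 95%: {chance:.0%} chance of at least one failure")

print("Gambles until a failure is more likely than not:",
      gambles_until_failure_likely(0.95))
```

On these assumptions it takes only 14 such gambles before at least one failure is more likely than not, even though every individual forecast was perfectly calibrated.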
In the end I don't see any bright line between a superforecaster saying there's a 95% chance the Cuban Missile Crisis will end peacefully if we blockade, or a 99% chance Eustice will win money if he plays poker tonight, and those statements being turned into a recommendation for taking those actions, when in reality both may turn out to be very bad ideas.
IV.
All of the foregoing is an essentially Talebian critique of superforecasting, and as I mentioned earlier, Tetlock is aware of this critique. In fact he calls it, "the strongest challenge to the notion of superforecasting." And in the final analysis it may be that we differ merely in whether that challenge can be overcome or not. Tetlock thinks it can; I have serious doubts, particularly if the people using the forecasts are unaware of the issues I’ve raised.
Frequently people confronted with Taleb’s ideas of extreme events and black swans end up countering that we can’t possibly prepare for all potential catastrophes. Tetlock is one of those people and he goes on to say that even if we can’t prepare for everything that we should still prepare for a lot of things, but that means we need to establish priorities, which takes us back to making forecasts in order to inform those priorities. I have a couple of responses to this.
It is not at all clear that the forecasts one would make about which black swans to be most worried about follow naturally from superforecasting. It's likely that superforecasting, with its emphasis on accuracy and making predictions in the Goldilocks zone, systematically draws attention away from rare impactful events. Ord makes forecasts, but his emphasis is on identifying these events rather than making sure the odds he provides are accurate.
I think that people overestimate the cost of preparedness and underestimate how much preparing for one thing makes you prepared for lots of things. One of my favorite quotes from Taleb illustrates the point:
If you have extra cash in the bank (in addition to stockpiles of tradable goods such as cans of Spam and hummus and gold bars in the basement), you don't need to know with precision which event will cause potential difficulties. It could be a war, a revolution, an earthquake, a recession, an epidemic, a terrorist attack, the secession of the state of New Jersey, anything—you do not need to predict much, unlike those who are in the opposite situation, namely, in debt. Those, because of their fragility, need to predict with more, a lot more, accuracy.
As Taleb points out, stockpiling reserves of necessities blunts the impact of most crises. Not only that, but even preparation for rare events ends up being pretty cheap when compared to what we end up spending afterwards. As I pointed out in a previous post, we seem to be willing to spend trillions of dollars once the crisis hits, but we won’t spend a few million to prepare for crises in advance.
Of course, as I pointed out at the beginning, having reserves is not something the modern world is great at, because reserves are not efficient. Which is why the modern world is generally on the other side of Taleb's statement, in debt and trying to ensure/increase the accuracy of its predictions. Does this last part not exactly describe the goal of superforecasting? I’m not saying it can’t be used in service of identifying what things to hold in reserve or what rare events to prepare for; I’m saying that it will be used far more often in the opposite way, in a quest for additional efficiencies and, as a consequence, greater fragility.
Another criticism people had about the last episode was that it lacked recommendations for what to do instead. I’m not sure that lack was as great as some people said, but still, I could have done better. And the foregoing illustrates what I would do differently. As Tabarrok said at the beginning, “The Coronavirus Pandemic may be the most warned about event in human history.” And yet, if we just consider masks, our preparedness in terms of supplies and even knowledge was abysmal. We need more reserves, we need to select areas to be more robust and less efficient in, we need to identify black swans, and once we have, we should have credible long term plans for dealing with them which aren’t scrapped every couple of years. Perhaps there is some place for superforecasting in there, but that certainly doesn’t seem like where you would start.
Beyond that, there are always proposals for market based solutions. In fact the top comment on the reddit discussion of the previous article was, “Most of these criticisms are valid, but are solved by having markets.” I am definitely in favor of this solution as well, but there are a lot of things to consider in order for it to actually work. A few examples off the top of my head:
What’s the market based solution to the Cuban Missile Crisis? How would we have used markets to navigate the Cold War with less risk? Perhaps a system where we offer prizes for people predicting crises in advance. So maybe if someone took the time to extensively research the “Russia puts missiles in Cuba” scenario, when that actually happens they get a big reward?
Of course there are prediction markets, which seem to be exactly what this situation calls for, but personally I’m not clear how they capture the impact problem mentioned above, and they’re still missing more big calls than they should. Obviously part of the problem is that overregulation has rendered them far less useful than they could be, and I would certainly be in favor of getting rid of most if not all of those regulations.
If you want the markets to reward someone for predicting a rare event, the easiest way to do that is to let them realize extreme profits when the event happens. Unfortunately we call that price gouging and most people are against it.
The final solution I’ll offer is the solution we already had. The solution superforecasting starts off by criticizing. Loud pundits making improbable and extreme predictions. This solution was included in the last post, but people may not have thought I was serious. I am. There were a lot of individuals who freaked out every time there was a new disease outbreak, whether it was Ebola, SARS or Swine Flu. And not only were they some of the best people to listen to when the current crisis started, we should have been listening to them even before that about the kind of things to prepare for. And yes, we get back to the idea that you can’t act on the recommendations of every pundit making extreme predictions, but they nevertheless provide a valuable signal about the kind of things we should prepare for, a signal which superforecasting, rather than boosting, actively works to suppress.
None of the above directly replaces superforecasting, but all of them end up in tension with it, and that’s the problem.
V.
It is my hope that I did a better job of pointing out the issues with superforecasting on this second go around. Which is not to say the first post was terrible, but I could have done some things better. And if you’ll indulge me a bit longer (and I realize if you’ve made it this far you have already indulged me a lot) a behind the scenes discussion might be interesting.
It’s difficult to produce content for any length of time without wanting someone to see it, and so while ideally I would focus on writing things that pleased me, with no regard for any other audience, one can’t help but try the occasional experiment in increasing eyeballs. The previous superforecasting post was just such an experiment, in fact it was two experiments.
The first experiment was one of title selection. Should you bother to do any research into internet marketing, they will tell you that choosing your title is key. Accordingly, while it has since been changed to “limitations”, the original title of the post was “Pandemic Uncovers the Ridiculousness of Superforecasting”. I was not entirely comfortable with the word “ridiculousness” but I decided to experiment with a more provocative word to see if it made any difference. And I’d have to say that it did. In their criticism of it, a lot of people mentioned that word, or the attitude implied in the title in general. But it also seemed that more people read it in the first place because of the title. This leads to the perpetual conundrum: saying superforecasting is ridiculous was obviously going too far, but would the post have attracted fewer readers without that word? If we assume that the body of the post was worthwhile (which I do, or I wouldn’t have written it), is it acceptable to use a provocative title to get people to read something? Obviously the answer for the vast majority of the internet is a resounding yes, but I’m still not sure, and in any case I ended up changing it later.
The second experiment was less dramatic, and one that I conduct with most of my posts. While writing them I imagine an intended audience. In this case the intended audience was fans of Nassim Nicholas Taleb, in particular people I had met while at his Real World Risk Institute back in February. (By the way, they loved it.) It was only afterwards, when I posted it as a link in a comment on the Slate Star Codex reddit, that it got significant attention from other people, who came to the post without some of the background values and assumptions of the audience I’d intended it for. This meant that some of the things I could gloss over when talking to Taleb fans were major points of contention with SSC readers. This issue is less binary than the last one, and other than writing really long posts it’s not clear what to do about it, but it is an area that I hope I’ve improved on in this post, and which I’ll definitely focus on in the future.
In any event the back and forth was useful, and I hope that I’ve made some impact on people’s opinions on this topic. Certainly my own position has become more nuanced. That said if you still think there’s something I’m missing, some post I should read or video I should watch please leave it in the comments. I promise I will read/listen/watch it and report back.
Things like this remind me of the importance of debate, of the grand conversation we’re all involved in. Thanks for letting me be part of it. If you would go so far as to say that I’m an important part of it consider donating. Even $1/month is surprisingly inspirational.