I.
As near as I can reconstruct, sometime in the mid-80s Philip Tetlock decided to conduct a study on the accuracy of people who made their living “commenting or offering advice on political and economic trends”. The study lasted for around twenty years and involved 284 people. If you’re reading this blog you probably already know what the outcome of that study was, but just in case you don’t, or need a reminder, here’s a summary.
Over the course of those twenty years Tetlock collected 82,361 forecasts, and after comparing those forecasts to what actually happened he found:
- The better known the expert, the less reliable they were likely to be.
- Their accuracy was inversely related to their self-confidence, and after a certain point their knowledge as well. (More actual knowledge about, say, Iran led them to make worse predictions about Iran than people who had less knowledge.)
- Experts did no better at predicting than the average newspaper reader.
- When asked to choose among three possible outcomes for a situation (status quo, getting better on some dimension, or getting worse), the experts’ actual predictions were less accurate than just naively assigning a ⅓ chance to each possibility.
- Experts were largely rewarded for making bold and sensational predictions, rather than making predictions which later turned out to be true.
For those who had given any thought to the matter, Tetlock’s discovery that experts are frequently, or even usually, wrong was not all that surprising. Certainly he wasn’t the first to point it out, though the rigor of his study was impressive, and he definitely helped spread the idea with his book Expert Political Judgment: How Good Is It? How Can We Know?, which was published in 2005. Had he stopped there we might be forever in his debt, but from pointing out that the experts were frequently wrong, he went on to wonder: is there anyone out there who might do better? And thus began the superforecaster/Good Judgement project.
Most people, when considering the quality of a prediction, only care about whether it was right or wrong, but in the initial study, and in the subsequent Good Judgement project, Tetlock also asked people to assign a confidence level to each prediction. Thus someone might say that they’re 90% sure that Iran will not build a nuclear weapon in 2020, or that they’re 99% sure that the Korean Peninsula will not be reunited. When these predictions are graded, the ideal is for 90% of the 90% predictions to turn out to be true, not 95% or 85%, in the former case they were under confident and in the latter case they were overconfident. (For obvious reasons the latter is far more common.) Having thus defined a good forecast, Tetlock set out to see if he could find such people, people who were better than average at making predictions. He did, and they became the subject of his next book, Superforecasting: The Art and Science of Prediction.
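To make that grading concrete, here is a minimal sketch of a calibration check (in Python, with invented predictions; this is not the Good Judgement project’s actual scoring code): group the predictions by stated confidence and compare each group’s hit rate to that confidence.

```python
# Minimal calibration-check sketch with invented example data; not the
# Good Judgement project's actual grading code.
from collections import defaultdict

# (stated confidence, did the predicted outcome actually happen?)
predictions = [
    (0.9, True), (0.9, True), (0.9, False), (0.9, True), (0.9, True),
    (0.99, True), (0.99, True), (0.99, True), (0.99, True), (0.99, False),
]

buckets = defaultdict(list)
for confidence, happened in predictions:
    buckets[confidence].append(happened)

for confidence, outcomes in sorted(buckets.items()):
    hit_rate = sum(outcomes) / len(outcomes)
    # Well calibrated: hit_rate roughly equals the stated confidence.
    # hit_rate < confidence -> overconfident; hit_rate > confidence -> underconfident.
    print(f"stated {confidence:.0%}: actual {hit_rate:.0%} over {len(outcomes)} predictions")
```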
The book’s primary purpose is to explain what makes a good forecaster and what makes a good forecast. As it turns out, one of the key findings was that superforecasters are far more likely to predict that things will continue as they have, while the forecasters who appear on TV, and who were the subject of Tetlock’s initial study, are far more likely to predict some spectacular new development. The reason for this should be obvious: that’s how you get noticed. That’s what gets the ratings. But if you’re more interested in being correct (at least more often than not) then you predict that things will basically be the same next year as they were this year. And I am not disparaging that, we should all want to be more correct than not, but trying to maximize your correctness does have one major weakness. And that is why, despite Tetlock’s decades-long effort to improve forecasting, I am going to argue that Tetlock’s ideas and methodology have actually been a source of significant harm, and have made the world less prepared for future calamities rather than more.
II.
To illustrate what I mean, I need an example. This is not the first time I’ve written on this topic, I actually did a post on it back in January of 2017, and I’ll probably be borrowing from it fairly extensively, including re-using my example of a Tetlockian forecaster: Scott Alexander of Slate Star Codex.
Now before I get into it, I want to make it clear that I like and respect Alexander A LOT, so much so that up until recently, and largely for free (there was a small Patreon), I read and recorded every post from his blog and distributed it as a podcast. The reason Alexander can be used as an example is that he’s so punctilious about trying to adhere to the “best practices” of rationality, and that is precisely the status Tetlock’s methods hold at the moment. This post is an argument against that status, but for now they’re firmly ensconced.
Accordingly, Alexander does a near perfect job of not only making predictions but assigning a confidence level to each of them. Also, as is so often the case, he beat me to the punch on making a post about this topic, and while his post touches on some of the things I’m going to bring up, I don’t think it goes far enough, or offers its conclusion quite as distinctly as I intend to.
As you might imagine, his post and mine were motivated by the pandemic, in particular the fact that traditional methods of prediction, the Superforecasters included, appeared to have been caught entirely flat-footed. Alexander mentions in his post that “On February 20th, Tetlock’s superforecasters predicted only a 3% chance that there would be 200,000+ coronavirus cases a month later (there were).” So by that metric the superforecasters failed, something both Alexander and I agree on, but I think it goes beyond just missing a single prediction. I think the pandemic illustrates a problem with this entire methodology.
What is that methodology? Well, the goal of the Good Judgement project and similar efforts is to improve forecasting and predictions specifically by increasing the proportion of accurate predictions. This is their incentive structure; it’s how they’re graded, and it’s how Alexander grades himself every year. It encourages two secondary behaviors. The first is the one I already mentioned: the easiest way to be correct is to predict that the status quo will continue. This is fine as far as it goes, the status quo largely does continue, but the flip side is a bias against extreme events. These events are extreme in large part because they’re improbable, so if you want to be correct more often than not, such events are not going to get any attention. Meaning the forecasters’ skill set and their incentive structure are ill-suited to extreme events (as evidenced by the 3% who correctly predicted the magnitude of the pandemic I mentioned above).
The second incentive is to increase the number of their predictions. This might seem unobjectionable, why wouldn’t we want more data to evaluate them by? The problem is that not all predictions are equally difficult. To give an example from Alexander’s most recent list of predictions (and again it’s not my intention to pick on him, I’m using him as an example more for the things he does right than the things he does wrong): out of 118 predictions, 80 were about things in his personal life, and only 38 were about issues the larger world might be interested in.
Indisputably it’s easier for someone to predict what their weight will be, or whether they will lease the same car when their current lease is up, than it is to predict whether the Dow will end the year above 25,000. Even predicting whether one of his friends will still be in a relationship is probably easier as well. But more than that, the consequences of his personal predictions being incorrect are much smaller than the consequences of his (or other superforecasters’) predictions about the world as a whole being wrong.
III.
The first problem to emerge from all of this is that Alexander and the Superforecasters rate their accuracy by considering all of their predictions, regardless of their importance or difficulty. Thus, if they completely miss the prediction mentioned above about the number of COVID-19 cases on March 20th, but are successful in predicting when British Airways will resume service to Mainland China, their success will be judged to be 50%. Even though for nearly everyone the impact of the former event is far greater than the impact of the latter! And it’s worse than that: in reality there are a lot more “British Airways” predictions being made than predictions about the number of cases. Meaning they can be judged as largely successful despite missing nearly all of the really impactful events.
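To make the arithmetic concrete, here is a minimal sketch (the predictions and impact weights below are invented for illustration, not anything the Good Judgement project actually uses) of how an unweighted hit rate can look respectable while an impact-weighted score collapses:

```python
# Hypothetical sketch: an unweighted hit rate can hide a high-impact miss.
# The predictions and impact weights are invented for illustration.

predictions = [
    # (description, was_correct, impact_weight)
    ("British Airways resumes service to Mainland China by June", True, 1),
    ("Dow ends the quarter above 25,000", True, 1),
    ("No change of government in Country X", True, 1),
    ("Fewer than 200,000 COVID-19 cases by March 20th", False, 100),
]

hits = sum(1 for _, correct, _ in predictions if correct)
unweighted_accuracy = hits / len(predictions)

total_weight = sum(w for _, _, w in predictions)
weighted_accuracy = sum(w for _, correct, w in predictions if correct) / total_weight

print(f"Unweighted accuracy: {unweighted_accuracy:.0%}")     # 75% -- looks fine
print(f"Impact-weighted accuracy: {weighted_accuracy:.1%}")  # ~2.9% -- the one miss dominates
```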
This leads us to the biggest problem of all: the methodology of superforecasting has no system for determining impact. To put it another way, I’m sure that the Good Judgement project and other people following the Tetlockian methodology have made thousands of forecasts about the world. Let’s be incredibly charitable and assume that out of all these thousands of predictions, 99% were correct. That out of everything they made predictions about, 99% of it came to pass. That sounds fantastic, but depending on what’s in the 1% they got wrong, the world could still be a vastly different place than what they expected. And that assumes that their predictions encompass every possibility. In reality there are lots of very impactful things which they might never have considered assigning a probability to. They could actually be 100% correct about the stuff they predicted and still be caught entirely flat-footed by the future, because something happened that they never even considered.
As far as I can tell there were no advance predictions of the probability of a pandemic by anyone following the Tetlockian methodology, say in 2019 or earlier. Or any list where “pandemic” was #1 on the “list of things superforecasters think we’re unprepared for”, or really any indication at all that people who listened to superforecasters were more prepared for this than the average individual. But the Good Judgement Project did try their hand at both Brexit and Trump and got both wrong. This is what I mean by the impact of the stuff they were wrong about being greater than the stuff they were correct about. When future historians consider the last five years or even the last 10, I’m not sure what events they will rate as being the most important, but surely those three (the pandemic, Brexit, and Trump) would have to be in the top 10. They correctly predicted a lot of stuff which didn’t amount to anything and missed predicting the few things that really mattered.
That is the weakness of trying to maximize being correct. While being more right than wrong is certainly desirable. In general the few things the superforecasters end up being wrong about are far more consequential than all things they’re right about. Also, I suspect this feeds into the classic cognitive bias where it’s easy to ascribe everything they correctly predicted to skill, while every time they were wrong gets put down to bad luck, which is precisely what happens when something bad occurs.
Both now and during the financial crisis, when experts are asked why they didn’t see it coming or why they weren’t better prepared, they are prone to retort that these events are “black swans”. “Who could have known they would happen?” And as such, “There was nothing that could have been done!” This is the ridiculousness of superforecasting: of course pandemics and financial crises are going to happen; any review of history would reveal that few things are more certain.
Nassim Nicholas Taleb, who came up with the term, has come to hate it for exactly this reason: people use it to excuse a lack of preparedness and inaction in general, when the concept is both more subtle and more useful. The people who throw up their hands and say “It was a black swan!” are making an essentially Tetlockian claim: “Mostly we can predict the future, except on a few rare occasions where we can’t, and those are impossible to do anything about.” The point of Taleb’s black swan theory, and to an even greater extent his idea of being antifragile, is that you can’t predict the future at all, and that when you convince yourself that you can, it distracts you from hedging against, lessening your exposure to, and preparing for the really impactful events which are definitely coming.
From a historical perspective, financial crashes and pandemics have happened a lot; businesses and governments really had no excuse for not making some preparation for the possibility that one or the other, or as we’re discovering, both, would happen. And yet they didn’t. I’m not claiming that this is entirely the fault of superforecasting. But superforecasting is part of the larger movement of convincing ourselves that we have tamed randomness and banished the unexpected. And if there’s one lesson from the pandemic greater than all others, it should be that we have not.
Superforecasting and the blindness to randomness are also closely related to the drive for efficiency I mentioned recently. “There are people out there spouting extreme predictions of things which largely aren’t going to happen! People spend time worrying about these things when they could be spending that time bringing to pass the neoliberal utopia foretold by Steven Pinker!” Okay, I’m guessing that no one said that exact thing, but boiled down this is their essential message.
I recognize that I’ve been pretty harsh here, and I also recognize that it might be possible to have the best of both worlds: to get the antifragility of Taleb with the rigor of Tetlock. Indeed, in Alexander’s recent post that is basically what he suggests: that rather than take superforecasting predictions as some sort of gold standard, we should use them to do “cost benefit analysis and reason under uncertainty.” That, as the title of his post suggests, this was not a failure of prediction but a failure of being prepared, suggesting that predicting the future can be different from preparing for the future. And I suppose they can be. The problem with this is that people are idiots, and they won’t disentangle these two ideas. For the vast majority of people and corporations and governments, predicting the future and preparing for the future are the same thing. And when combined with a reward structure which emphasizes efficiency/fragility, the only thing they’re going to pay attention to is the rosy predictions of continued growth, not preparing for the dire catastrophes which are surely coming.
To reiterate: superforecasting, by focusing on the number of correct predictions without weighing the greater impact of the predictions they get wrong, caring only that such missed predictions be few in number, has disentangled prediction from preparedness. What’s interesting is that while I understand the many issues with the system they’re trying to replace, of bloviating pundits making predictions which mostly didn’t come true, that system did not suffer from this same problem.
IV.
In the leadup to the pandemic there were many people predicting that it could end up being a huge catastrophe (including Taleb, who said it to my face) and that we should take draconian precautions. These were generally the same people who issued the same warnings about all previous new diseases, most of which ended up fizzling out before causing significant harm, for example Ebola. Most people are now saying we should have listened to them, at least with respect to COVID-19, but these are also generally the same worriers whose earlier warnings were dismissed as pessimism, panic, or outright craziness. It’s easy to see now that they were not crazy, and this illustrates a very important point. Because of the nature of black swans and negative events, if you’re prepared for a black swan it only has to happen once for your caution to be worth it, but if you’re not prepared, then in order for that to be a wise decision it has to NEVER happen.
The financial crash of 2007-2008 represents an interesting example of this phenomenon. An enormous number of financial models was based on the premise that the US had never had a nationwide decline in housing prices. And it was a true and accurate premise for decades, but the one year it wasn’t true made the dozens of years when it was true almost entirely inconsequential.
To take a more extreme example, imagine that I’m one of these crazy people you’re always hearing about. I’m so crazy I don’t even get invited on TV, because all I can talk about is the imminent nuclear war. As a consequence of these beliefs I’ve moved to a remote place, built a fallout shelter, and stocked it with a bunch of food. Every year I confidently predict a nuclear war, and every year people point me out as someone who makes outlandish predictions to get attention, because year after year I’m wrong. Until one year, I’m not. Just like with the financial crisis, it doesn’t matter how many times I was the crazy guy with a bunker in Wyoming and everyone else was the sane defender of the status quo, because from the perspective of consequences they got all the consequences of being wrong despite years and years of being right, and I got all the benefits of being right despite years and years of being wrong.
The “crazy” people who freaked out about all the previous potential pandemics are in much the same camp. Assuming they actually took their own predictions seriously and were prepared, they got all the benefits of being right this one time despite many years of being wrong, and we got all the consequences of being wrong, in spite of years and years of not only forecasts, but SUPER forecasts telling us there was no need to worry.
I’m predicting, with 90% confidence, that you will not find this closing message to be clever. This is an easy prediction to make because once again I’m just using the methodology of predicting that the status quo will continue. Predicting that you’ll donate is the high-impact rare event, and I hope that even if I’ve been wrong every other time, this time I’m right.
TL;DR: accurate forecasting usually means predicting things stay the same; accurately forecasting the outlier events we care about is nigh impossible, and the best you can do is prepare for the worst.
I think the same thing every year when I read Scott’s calibration and grading post. “Why are all these censored and meaningless predictions on here? He has the prediction of who will win the presidency right next to whether Bitcoin will be higher than other crypto-currencies. How are those remotely equivalent? Yet they’re in the same calibration calculation as though they mean the same thing. As though they are the same type of prediction.” But as you point out, there’s a qualitative difference between being able to predict an outlier event and being able to predict the absence of an outlier event. Maybe you can predict with 95% accuracy that something two standard deviations from the mean isn’t going to happen, but who cares? We really need to know about the 5% of cases that matter, but it’s exactly the kind of situation we most care about that’s also the kind of situation we can’t predict with any degree of accuracy.
Yet there are a lot of people using sophisticated mathematical techniques to fool themselves into believing they know more than they actually do. (https://xkcd.com/2295/)
(Note: the TL;DR was not for my comment, but for your post – which was one of your better offerings, by the way.)
Thanks for the compliment! And I’m glad I’m not the only one who noticed the strangeness of this methodology. I actually mentioned it to Scott the one time we had dinner, but I don’t think I presented it as well as I could have; in any case he seemed unconvinced.
And that XKCD post is timely, that’s for sure.
Hmmm, obviously a handicap might be helpful. The question that’s worth asking amounts to something like “will there be a lot of drama this year or not?”. Ebola under Obama, no drama. Covid under Trump, lots of drama. If correctly predicting drama was considered a touchdown (and incorrectly predicting drama a negative touchdown) but predicting status quo a field goal, you might uncover ‘useful superforecasters’ versus less useful ones.
Two things that might be helpful to also consider.
I got the sense early of ‘the ground being ripe’. It was a sense that we had a lot of dry seasons to date hence a fire was due and it would be a big one. Where did this come from and does it mean I’m a super-forecaster?
Well I think it came from the idea that we were getting lucky a bit too much. I see more and more cases of pure incompetence in gov’t being excused and papered over. In the corporate sector much the same. ‘Points of drama’ erupting and then going away. Recession signals in the market but then they reverse. Massive fires in Australia, but then they are out. None of this seems dealt with very well but the problem resolves for now…..
You have a kid. He starts driving drunk a lot. You worry and yell. He gets pulled over! Cop lets him off with an unofficial warning. He scraped the side of the car, no one saw him and he came home. He’s starting to steal stuff. He’s doing hard drugs.
I suspect you will start making some very dramatic forecasts ….and they’ll often be wrong:
– Dead in a car crash.
– Hurt in car crash.
– Arrested for hurting someone else.
– Arrested for killing someone else.
– Arrested for stealing.
– Shot by someone he’s stealing from
– Beaten up by others
Your score would go pretty low since at most only two or three of these predictions could turn out right….although you are quite rational to worry about all of them happening. But if you predict “serious drama coming” you would be right and you would score pretty high.
Second issue is how should you prepare? Obviously if you bet the farm on arrest, you might put a lawyer on a retainer coupled with legal insurance. That’s great but if the ‘drama’ is that he is seriously injured but not arrested, you’ve wasted your preparation.
I’ve written to you about my experience with Hurricane Sandy. The year before we had a freak snowstorm in October that knocked power out for a week. I dealt with it. Then in Sandy power was out about two weeks or so…but gas stations didn’t have gas. Didn’t happen before. Now there’s plenty of gas but toilet paper was the problem (not really anymore but let’s go with that). In retrospect if I could go back to December I would tell my old self to buy a few boxes of N95 masks and some big containers of sanitizer. I already got a big thing of TP so I’m fine there. To be honest, I’ve made out ok without those things.
I suppose then that one thing we could do is buy everything…..manage our shopping in such a way that we always have a 3+ week supply of everything needed. This doesn’t mean a ton of ’50 year food’ collecting mold in the basement. That’s an ok idea I guess but not as helpful as you might think.
I think the most valuable skill here is less predicting in advance what the ‘drama’ will be and more recognizing it when it comes, and immediately shifting off the comfortable, well-worn path to the uncomfortable one. As this thing was building up, I immediately saw it was going to be dramatic. I only needed the lowest fatality rate of 0.6% to realize that if you multiply that by the US population, this was drama.
As it got more serious, NJ started piecemeal changes. First asking people to wash hands and stay home. Then banning gatherings of 500. Then gatherings of 50. When I was in Morristown on a Tuesday night I noticed all the trendy bars were totally empty, with one or two couples alone. Smart people already sensed trouble was coming. On Facebook people would comment “Fuck Murphy” whenever a new policy was announced and I would get yelled at when I said we should close the schools (“We only have 20 cases. We only have 40 cases. We only have 80 cases etc.”).
Then shutdown, toilet paper panic and the rest is history. But it would have been better if we had gone right to hard shutdown immediately and snuffed it out. We didn’t recognize that drama was afoot and we didn’t recognize the right dramatic actions to take to deal with it. I remember chatting with a school board member, saying it was time to close the schools, and in the back of my head thinking “if no one dies from this we will never hear the end of it, they’ll say it was all staged to make Trump look bad.”
Globally the countries that are doing best appear to be S. Korea, Hong Kong, Taiwan, and perhaps Singapore. These were all places burned harder by SARS, so they acted quickly and intelligently. The US was *not* burned by these things, and critical tasks like refreshing stockpiles depleted by H1N1 were allowed to linger. It would be helpful if we could apply prediction not so much to super forecasting but to learning lessons without having to ‘burn our hand on the stove’, so to speak. But then that in itself might add ‘dry kindling’ to a future fire.
It might be that the reason Trump felt it was no big deal to dismantle the pandemic team, not refresh stockpiles, and simply not care was because, despite raking Obama over the coals on Ebola, he had assumed these things just always burn out before they have much impact on the US. Because SARS and MERS never reached the US in any serious way, because Obama did pretty well on Ebola, and because with respect to swine flu the presidents nagged everyone to get flu shots but then nothing happened of note on the ‘drama scale’, we felt it was no big deal spending most of 2019 talking about how much indulgence we should afford to vaccine deniers.
The Chinese saying that it’s three generations from poverty to wealth and three more back to poverty doesn’t offer any way out of that cycle.
No mention of East Asian success means you, sir, are racist. The failure was never in predicting the spread; it was in predicting that the Western institutional response would be a lot worse than that of the West of the 1910s and a lot worse than that of today’s China.
I assume you’re joking about being a racist? Maybe not, who can tell these days.
Did East Asia use superforecasting as part of their pandemic preparedness? My sense was they used the hard won experience of SARS to gain the knowledge/culture necessary to be prepared.
As far as where the failure occurred, it occurred at lots of levels, but I think one of the bigger failures was in a push for efficiency over inefficient hedging/preparation. See my other post on the topic:
https://wearenotsaved.com/2020/03/27/the-fragility-of-efficiency-and-the-coronavirus/
And I would argue that superforecasting is part of that trend. A trend that the East Asians, to their great credit, avoided. In part just by having so much local manufacturing of the things that ended up being important.
I suspect a big problem was our reluctance to deploy masks early. It is true the manufacturing supply lines cannot create them rapidly, but improvised masks made by the general public are highly effective when everyone wears them. After this, though, I suspect almost everyone will have a box of masks in their homes at all times, making this particular aspect of the crisis one that will not repeat.
1. If predictions are probabilistic, you can’t win by just predicting status quo, at least not forever. If the chance of financial crash is 5% per year, then the person who predicts 0% per year will (eventually) do worse than the person who predicts 5% per year. Although predicting 5% per year *including* the year when the crash actually happens sounds bad, we’ve gained important knowledge if we’ve legitimately figured out that the chance of a crash is 5% per year (as opposed to 1% or 10% or something). And also, the person who is a good predictor and increases their prediction the year of the crash will do better than the person who doesn’t (contra how the world would look if you always predicted status quo).
Another way of looking at this is that you never predict status quo (“0% chance of anything changing”), you usually predict your prior (eg “5% chance of financial crash per year”). The people who are genuinely better predictors will predict higher than their prior during years when P(financial crash) is genuinely higher, and lower than their prior during years when P(financial crash) is genuinely lower, and this will beat the person who always predicts 0% *or* who always predicts their prior. If you can’t do this (ie if someone making their best effort to predict a crash will always end up worse than the person who just sticks to 0% or just sticks to their prior) then you’re arguing that financial crashes are impossible to predict – which would mean superforecasters would be right to avoid trying to predict them and to focus on other, more predictable things (but I don’t think this is true).
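To put rough numbers on this, here is a toy Brier-score sketch (the 5% base rate, the 20-year horizon, and the “skilled” forecaster’s numbers are all invented for illustration; lower Brier scores are better):

```python
# Toy sketch: scoring three forecasting strategies with the Brier score
# (mean squared error between forecast probability and outcome).
# All numbers are invented for illustration.

def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

years = 20
outcomes = [0] * (years - 1) + [1]           # one crash, in the final year

always_zero = [0.0] * years                  # "the status quo will continue"
always_prior = [0.05] * years                # base rate, every year
skilled = [0.02] * (years - 1) + [0.40]      # raises the probability in the crash year

for name, forecasts in [("always 0%", always_zero),
                        ("always 5% (prior)", always_prior),
                        ("skilled", skilled)]:
    print(f"{name:>18}: Brier = {brier(forecasts, outcomes):.4f}")
# always 0% -> 0.0500, always 5% -> 0.0475, skilled -> 0.0184
```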
2. I think I’m correct to lump together boring personal predictions and important world news predictions. First and trivially, I’m clearly not cheating by including easy personal predictions, because I don’t do any worse or better on personal predictions than on world predictions. But more important, I make it clear that what I’m doing is a calibration exercise – learning to map subjective states of belief to numbers. As an analogy, consider testing a telescope. You want to see whether it’s a good telescope, and fiddle with the settings to try to make it as clear as possible. So you might test it on a distant asteroid and see if you can make out how many craters there are. And you might test it on an alien invasion fleet and try to make out how many troops they are bringing to invade Earth. One of these is much more important than the other, but they are both equally good as tests of your telescope machinery. If one telescope was able to count the craters on a certain asteroid more accurately than another, and then we could confirm that by reaching the asteroid and getting a gold standard crater count, that would make us more confident in using that telescope to analyze the alien invasion fleet. I’m not claiming you should care about my personal predictions because you care about my personal life. I’m claiming that my personal predictions are an equally good test of my calibration compared to my world news ones. If there were a person who could predict with perfect calibration how many leaves could fall from a given tree, I would argue that person was pretty likely to be a superforecaster in politics too. Calibration is not the same as predictive ability!
3. An individual who is both choosing what to predict and making predictions can trivially cheat – eg by predicting “the sun will come up January 1, 100%”, “the sun will come up January 2, 100%” and so on, and then also “Trump will win the election.” Then later they can say “I got 100 predictions in a row right at 100% confidence, surely you should believe me when I say Trump will win the election!”
Tetlock solves this by having the prediction-chooser be a different person than the prediction-makers. The prediction-chooser has no incentive to choose stupid things to predict, and the prediction-makers are all competing across the same set of questions – they only get recognized as a superforecaster if they beat others. Some scoring rules will give them benefits for making many predictions, others won’t.
I solve this problem by not doing the stupid dishonest thing that would ruin my experiment. Obviously people who aren’t me have to sort of take this on trust, but since I don’t win anything from my predictions, I don’t really care if you do or not. I guess you could also look at my predictions and see whether I’m doing something that has an obvious correct answer. I don’t think it does – even a relatively dumb prediction like “I will gain weight this year” doesn’t have a clear point at which I should be calibrated. IE I can’t “cheat” by knowing this will be a year I eat a lot, because the whole point of the exercise is to accurately express how much knowledge I have, so having more or less knowledge doesn’t give me an advantage. The main way for me to cheat would be to put down 90%, then roll a d10, eat a lot if it lands 1-9, and diet a lot if it lands 0. You have to trust I’m not doing that, but as long as you believe I’m not doing this specific pointless thing, the methodology stands on its own.
A lot of these seem to hinge on the difference between claiming strong predictive ability and claiming good calibration. For illustration, I claim no strong predictive ability on the question of what number a d20 will land on, but I claim perfect calibration on that same task – there’s a 5% chance it will land on each number. A person can be good at one without necessarily being good at another. I’m trying to measure calibration only in myself. Tetlock is trying to measure predictive ability (which naturally involves calibration) and so has to be slightly smarter about it – I think he meets that bar.
First, thanks for taking the time to read and respond. I appreciate it.
1- It’s possible we’re talking past each other, and if so it’s almost certainly my fault. Everything you say makes sense, and it goes on to give the impression of a well ordered world where iterating a 5% prediction over several decades ends up being the straightforward way to handle the problem and which further provides the payoff you expect (perhaps 5% of your portfolio is in way out of the money shorts which pay off in such a fashion that your investing return is consistent every year). But the world is not well-ordered, particularly when it comes to the future. And it’s not the 5% probability events I worry about; it’s the <1%, gigantic-impact events. First because I think this is an area where small errors in assumptions lead to large errors in probability, second because it’s an area that superforecasters largely ignore, and I worry that that sends a signal, even if it’s subtle, that other people can ignore these events as well. In essence I would basically agree that at the extreme tails of probability you can’t predict things very well. And part of my point is that an extreme emphasis on the accuracy of predictions means a shift of focus and incentives away from these high-impact, low-probability events.
2- Another fair point, and I think I mentioned (or meant to?) that it would be worth going back through your predictions to see whether there’s a difference in calibration between the personal and the political. Which is to say, you report your success (which is impressive) in the aggregate, but is there a significant difference between your success with the personal and your success with politics? Does your overall score look better because you’re really good at predicting personal stuff and only mediocre at predicting world events? And the argument is that the superforecasting community at large does the same thing: they report their success in the aggregate, and is it possible that in the areas where the impact is the highest their predictions are actually misleading? And the key point is that by misleading I mean: do people and politicians end up acting in a worse way because of the existence of the superforecasters? Is there any possibility, particularly if the methodology becomes more widespread, for someone to say, well, the superforecasters only give >200,000 COVID cases a 2% probability, I’m going to cancel my meeting with the CDC and focus more on this other issue? As (admittedly thin) data, Boris Johnson skipped a lot of the early virus briefings:
https://www.axios.com/boris-johnson-skipped-five-virus-briefings-in-early-days-of-pandemic-968d6e1a-b89a-4ff1-b32d-580b8d51e7db.html
If superforecasting or related attempts to make the world more legible contributed to his inaction I want to identify that and stop it.
3- Certainly I am not claiming that you are intentionally cheating. And if there was any insinuation of that I apologize. As far as the Good Judgement project, the problem is with the unknown unknowns. Or rather the very little known unknowns. The questions that get chosen are all about things that are obvious and pressing, Iran, China, North Korea, the Euro Zone. There is an inevitable selection bias to the questions. I have seen no evidence that they even consider low probability events. I looked, and I’d be happy to be proven wrong here, but I could find no evidence that they considered the pandemic question before 2020. In large part that’s because I don’t think it plays well with the methodology. And this is the crux of the issue:
I contend that we spend too little time preparing for and considering rare high impact events, and that superforecasting exacerbates that problem.
That’s my argument.
P.S. Also I agree that, in penance for whatever sins I may have committed, I will read Superforecasting and report back. Though I expect the issues I raised above will still stand.
P.P.S. Finally, as some inside baseball: everyone talks about how important titles are to posts, and this was an experiment in seeing if a more brazen title would generate more attention. And it did, though whether that’s a good thing is hard to say. I have actually changed “Ridiculousness” to “Limitations” and I expect I will avoid such headlines in the future, but it was an interesting experience.
You make a good point about the panicky people getting credit the one time that they are right and having everybody ignore the one time that they’re wrong. That said, you’re wrong about most everything else. Superforecasting is possible and fairly reliable: as proof I can simply point to the 8% stock gains I’ve already made in this so-called “unpredictable” market when everybody else is losing money. Or the fact that I regularly beat not only the market but also most hedge funds in my own investment strategy. The reason people don’t want to listen to superforecasters like me is because even though my analyses are far more accurate and profitable, they are based on the assumption that both sociology and economics are pseudoscience, and that’s a possibility that most people don’t want to consider. A lot of sociologists, economists, and the politicians and financial institutions who make policy based on these delusional sciences have too much status and reputation invested in these delusional pseudosciences to consider that they might be bullshit. Obviously you’re not going to be able to predict chemical reactions when you rely on pseudosciences such as alchemy rather than real sciences such as chemistry. That doesn’t mean that chemical reactions are unpredictable, it simply means that you’re using the wrong methodology. Likewise, you’re not going to be able to predict the future when you rely on pseudosciences such as sociology or economics rather than REAL sciences such as game theory or memetics.
Well, I wish you luck in your career as a hedge fund billionaire, and ask only that you stop back to let me know the fund you’re setting up which will consistently beat the market so I can put some money into it. A small amount; I’m betting you can’t beat the market, but I like to expose myself to positive black swans, and it sounds like that’s what you think you have.
Also, as far as being up when everyone else is down: I’m up 250% since March by following the Talebian investment strategy, so I’m not super impressed by 8%. But then I don’t beat the market all the time; I just try to use the market to hedge everything else. What I really don’t want is for everything to be cratering all at once: to lose my job (which I haven’t) and lose a bunch of money in the stock market at the same time.
Have to agree here; a ‘super forecaster’ isn’t someone who has just made a good call on one event. Consider that when the Dow drops 5,000 points, that represents people who sold and people who bought. Someone out there sold before the drop and someone purchased at the bottom. They made out big, but by itself that is no more interesting than knowing someone won the lottery.
I think the trigger here is less ‘super forecasting’. Asserting the risk of a pandemic rising each year would be good, but we had people who did that. We had Bill Gates, for example. We even had Trump appointees who said a ‘pandemic keeps them up at night’.
What would be highly useful would be:
* Getting the exact forecast down. Knowing this virus will be big while SARS/H1N1/MERS etc. were not.
* Maybe not predicting the virus is big but recognizing trouble when one sees it.
If you follow The Walking Dead you might have also followed their spinoff series Fear the Walking Dead. In the first season or two they depicted the world as the zombie outbreak slowly started. There was a character in there, a black guy with expensive tastes; he had no special knowledge of what was going to happen, but he did recognize the dynamics of trouble coming, including when to rely upon society and when to see that society’s days were numbered. That type of judgment would be very valuable but… not sure you can quantify it.
There are several claims in this post I disagree with, but I’ll focus just on this:
“As far as I can tell there were no advance predictions of the probability of a pandemic by anyone following the Tetlockian methodology, say in 2019 or earlier.”
I think this is mostly because probabilistic crowdcasting, including crowdcasting using highly selected forecasters such as Good Judgment’s “superforecasters” or HyperMind’s “champion forecasters,” has only rarely been applied to forecasts with >2yr time horizons so far, and almost nobody is paying to have such forecasts generated for dozens of high-consequence but low-annual-probability events in an ongoing way.
However, in this case, by sheer luck, there was in fact a probabilistic crowdcast of medium-range pandemic risk, on a platform built by some EA/longtermist Tetlockians, with many of the forecasters being EA/longtermist Tetlockians, and with the forecasting question crafted in partnership with the Center for Existential Risk and The Future of Life Institute: https://www.metaculus.com/questions/247/pandemic-series-a-major-naturally-originated-pandemic-by-2026/
As I explained in a LessWrong comment: “From 2016 through Jan 1st 2020, Metaculus users made forecasts about whether there would be a large pandemic (≥100M infections or ≥10M deaths in a 12mo period) by 2026. For most of the question’s history, the median forecast was 10%-25%, and the special Metaculus aggregated forecast was around 35%. At first this sounded high to me, but then someone pointed out that 4 pandemics from the previous 100 years qualified (I didn’t double-check this), suggesting a base rate of 40% chance per decade. So the median and aggregated forecasts on Metaculus were actually lower than the naive base rate (maybe by accident, or maybe forecasters adjusted downward because we have better surveillance and mitigation tools today?), but I’m guessing still higher than the probabilities that would’ve been given by most policymakers and journalists if they were in the habit of making quantified falsifiable forecasts.”
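For reference, the naive arithmetic behind that base rate (using the same unverified “4 qualifying pandemics in the previous 100 years” figure) looks roughly like this; the independent-years variant is just an illustrative assumption:

```python
# Naive base-rate arithmetic for "a major pandemic in the next decade,"
# using the unverified "4 qualifying pandemics in 100 years" figure.

pandemics, years = 4, 100
annual_rate = pandemics / years                       # 0.04 per year

# Crudest version: 4 events / 10 decades ~= 40% of decades contain one.
per_decade_naive = annual_rate * 10                   # 0.40

# Slightly less crude: treat each year as an independent 4% draw and ask
# for the chance of at least one event in ten years.
per_decade_independent = 1 - (1 - annual_rate) ** 10  # ~0.34

print(f"naive: {per_decade_naive:.0%}, independent-years: {per_decade_independent:.0%}")
```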
So in this case, a probabilistic crowdcast organized and forecasted by Tetlockians successfully predicted that a very high-consequence global pandemic was fairly plausible in the 2016-2026 decade. Moreover, while it’s unclear what reasoning was most salient to the participating forecasters, they at least *could* have produced something very similar to the Metaculus-aggregated forecast *precisely* by predicting the future would be the same as the past.
Thanks for producing the SSC Podcast–I found that service valuable.
My thoughts on this article are very critical.
I think the article got a lot of things wrong, such as:
> Accordingly, Alexander does a near perfect job of not only making predictions but assigning a confidence level to each of them.
Why do you think this? It seems to me that Scott Alexander (and every other human today) is far from being able to come close to making perfect predictions.
> as evidenced by the 3% who correctly predicted the magnitude of the pandemic I mentioned above
This is a common basic mistake in understanding what 3% means. “Hence, the percentages refer not to what portion of Superforecasters think a given bucket of outcomes is most likely, but how likely they think each bucket is to occur” (https://goodjudgment.io/covid/dashboard/).
> Well, the goal of the Good Judgement project and similar efforts is to improve forecasting and predictions specifically by increasing the proportion of accurate predictions.
I don’t actually know that this is definitely wrong, but given your other errors your saying this does not cause me to believe you. Could you provide a citation for this? What does it even mean to “increase the proportion of accurate predictions”? What qualifies as an “accurate prediction”?
> So by that metric the superforecasters failed, something both Alexander and I agree on
Why do you think that meant the superforecasters failed? It sounds to me like you are committing what Tetlock calls the wrong-side-of-maybe fallacy. (I’m not sure whether Scott Alexander actually agrees with you or not. When he said “they got it wrong” that may just be his way of saying “their forecast was on the “wrong-side-of-maybe”, which doesn’t necessarily make it a bad/wrong forecast.)
> When these predictions are graded, the ideal is for 90% of the 90% predictions to turn out to be true, not 95% or 85%, in the former case they were under confident and in the latter case they were overconfident.
What you’re describing is what it means to be well-calibrated. I don’t think what you said is an accurate description of what the ideal is–either (a) the ideal when forecasting, or (b) the ideal outcome for a forecaster given a forecast that they have already made. (If the ideal were really what you say, then one could achieve the “ideal” by forecasting 50% on all binary questions every time, assuming there was no bias in the questions.) Re (a): The ideal when forecasting is to identify what is going to actually happen with certainty. Or, given that there is what Tetlock calls “irreducible uncertainty” in the world, the ideal is to identify what is going to happen as accurately as possible. E.g., for an event that is 90% likely to happen, the ideal would be to ascertain this fact and make a forecast of 90%. Re (b): The ideal is for 100% of the events one forecast at 90% to actually happen (this would yield the lowest/best Brier score).
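A toy illustration of that parenthetical (assuming, hypothetically, an unbiased set of ten binary questions where half resolve “yes”): the 50%-on-everything forecaster is perfectly calibrated yet uninformative, while a forecaster who actually discriminates gets a far better Brier score.

```python
# Toy illustration: perfect calibration is not the same as a good forecast.
# Hypothetical unbiased set of 10 binary questions, half resolving "yes" (1).

outcomes = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

lazy  = [0.5] * len(outcomes)                  # 50% on everything
sharp = [0.9 if o else 0.1 for o in outcomes]  # actually discriminates

def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

# The lazy forecaster is perfectly calibrated (exactly half of its "50%"
# events happen) yet conveys no information; its Brier score is 0.25.
# The sharp forecaster's Brier score is 0.01.
print(brier(lazy, outcomes), brier(sharp, outcomes))
```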
> That is the weakness of trying to maximize being correct. While being more right than wrong is certainly desirable. In general the few things the superforecasters end up being wrong about are far more consequential than all things they’re right about.
Again, it’s still unclear to me what your argument is for thinking that superforecasters got the things you mentioned wrong (or less wrong than others).
The title of this article felt like click bait to me and I don’t think the article itself provides justification for the title.
I only briefly skimmed from section 3 to the end. Something that stood out:
> SUPER forecasts telling us there was no need to worry.
Forecasts don’t include statements about whether or not it’s appropriate to worry about risks; only estimates of the probability of events occurring.
> In the leadup to the pandemic there were many people predicting that it could end up being a huge catastrophe (including Taleb, who said it to my face)
What probability is “could”? Suppose it was higher than 3% for the precise question superforecasters gave 3% on at the one point in time that you mentioned: How do you know Taleb was not overconfident? What if he always gives a higher probability than superforecasters for there being lots of pandemic deaths a month from the time the question is asked of him? He could very easily have a worse average Brier score long-term by doing that than the superforecasters. In other words, you’re not establishing that he has better judgment than the GJ superforecasters in aggregate, or that we ought to believe his forecasts, etc.
And also there’s plenty of room to agree or disagree about what actions are appropriate to take given various believed probabilities of large numbers of pandemic deaths in a month.