Philip Tetlock's Alpha Pundit Challenge


In 2015, Philip Tetlock (of Superforecasting fame) asked the Open Philanthropy Project to fund two projects, one of which was called the Alpha Pundit Challenge:

"[The “Alpha Pundit Challenge”] would systematically convert vague predictions made by prominent pundits into explicit numerical forecasts. Both of these projects are in the early stages of planning, so the details have not been worked out, but they share the goal of encouraging public figures with strong positions on important issues to convert those positions into concrete forecasts" (Open Philanthropy Project, 2016).

I find this idea interesting. Here, I discuss why and how we might go about making such a project a reality.

Project status

I've reached out to both the Open Philanthropy Project and Tetlock's Good Judgment, Inc., but neither has been able to share any updates.

Why should we improve forecasting?

Our public discussions are in one sense stuck in the pre-scientific era – we don't systematically experiment with and track forecasts the way we do when testing scientific hypotheses.

The aim of Tetlock's proposals is to improve public forecasts, and thereby improve institutional decision-making. Improving institutional decision-making is a worthwhile goal to pursue: "Governments and other important institutions frequently have to make complex, high-stakes decisions based on the judgement calls of just a handful of people. […] It seems like developing and applying strategies that improve human judgement and decision-making could be very valuable" (Whittlestone, 2017).

Why quantify and keep score?

We need accurate feedback in order to improve. We also need incentives, for example in the form of payments or public scorekeeping. In the words of Tetlock & Gardner (2015), "[w]hat would help is a sweeping commitment to evaluation: Keep score. Analyze results. Learn what works and what doesn't".

In practice

How could we do this in practice? In essence, the process would consist of the following six steps.

  1. Gather
  2. Establish yardstick
  3. Quantify probability
  4. Wait
  5. Record after the fact
  6. Score
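
To make the workflow concrete, the six steps could be tracked in a simple data structure. The sketch below is my own illustration, not part of Tetlock's proposal; the field names and the example yardstick are invented for the purpose:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ForecastRecord:
    """One converted forecast, followed through the six steps."""
    forecaster: str                 # step 1: who made the prediction
    statement: str                  # step 1: the original vague wording
    yardstick: str                  # step 2: what counts as the forecast coming true
    probability: float              # step 3: our numerical conversion, in [0, 1]
    resolution_date: date           # step 4: when the forecast can be judged
    outcome: Optional[bool] = None  # step 5: recorded after the fact
    brier: Optional[float] = None   # step 6: score, computed once the outcome is known

# Illustrative example, building on the "war in the next two years" case below;
# the yardstick wording here is hypothetical.
example = ForecastRecord(
    forecaster="Example Pundit",
    statement="There will be a war in the next two years",
    yardstick='Armed conflict meeting a stated definition of "war" before 2018-01-01',
    probability=0.60,
    resolution_date=date(2018, 1, 1),
)
```

The `outcome` and `brier` fields start out empty and are only filled in once the resolution date has passed, mirroring the waiting and scoring steps.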


Gather

As the first step, we would need to find forecasts. These would typically appear in news media, corporate publications, etc.

Establish yardstick

What would have to happen for the forecast to be correct? For example, if a forecaster predicts a war in the next two years, we would need to define "war" and set a target date.


Quantify probability

Next, we convert the forecast into an unambiguous numerical one:

"For instance, former Treasury Secretary Larry Summers recently published an important essay on global secular stagnation in the Washington Post which included a series of embedded forecasts, such as this prediction about inflation and central bank policies: "The risks tilt heavily toward inflation rates below official targets." It is a catchy verbal salvo, but just what it means is open to interpretation. Our panel assigned a range of 70–99% to that forecast, centering on 85%. When asked that same question, the Superforecasters give a probability of 72%. These precise forecasts can now be evaluated against reality" (Tetlock, 2015).


Wait

Some forecasts may resolve over a short timespan (e.g. close to an election), while others reach farther out. In either case, we have to endure some waiting time.

Record after the fact

When we have the correct answer, we can evaluate the forecast.


Score

The last step would be to recalculate the forecaster's Brier score:

"[…] Brier scores measure the distance between what you forecast and what actually happened. So Brier scores are like golf scores: lower is better. Perfection is 0. A hedged fifty-fifty call, or random guessing in the aggregate, will produce a Brier score of 0.5. A forecast that is wrong to the greatest possible extent—saying there is a 100% chance that something will happen and it doesn't, every time—scores a disastrous 2.0, as far from The Truth as it is possible to get" (Tetlock & Gardner, 2015).
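
The two-outcome Brier score Tetlock describes (0 for perfection, 0.5 for a fifty-fifty hedge, 2.0 for a maximally wrong call) can be computed as a small sketch:

```python
def brier_score(p, happened):
    """Two-outcome Brier score: squared error summed over both outcomes.

    p: forecast probability that the event happens, in [0, 1]
    happened: True if the event occurred, False otherwise
    """
    actual = 1.0 if happened else 0.0
    return (p - actual) ** 2 + ((1.0 - p) - (1.0 - actual)) ** 2

def mean_brier(scored):
    """A forecaster's overall score: the mean over resolved forecasts.

    scored: iterable of (probability, happened) pairs
    """
    pairs = list(scored)
    return sum(brier_score(p, h) for p, h in pairs) / len(pairs)
```

Since the error on the "happens" side always mirrors the error on the "doesn't happen" side, this equals twice the squared error on a single outcome, which is why the worst possible score is 2.0 rather than 1.0.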

Success criteria

I've thought a bit about what it would take to succeed in the sense of improving the public debate.

Good conversions

Whenever we encounter a vague forecast, we need to create a yardstick and a quantitative probability that is close to what the forecaster meant when making the prediction. We could also allow forecasters to adjust our conversions, effectively achieving the goal of making them commit to testable forecasts.

Choosing the right forecasts at the right times

Some criteria for choosing forecasts could be:

  1. Timeframes (shorter is better)
  2. Public interest (more is better)
  3. Ease of conversion (easier is better)
  4. Importance (more is better)


We would have to get some attention from the public and decision-makers in order to make an impact.


Allow underdogs to participate

We would only be able to convert so many forecasts manually, so we would have to prioritize. Forecasters no one has heard of are unlikely to have their forecasts converted. But we could allow them to input their forecasts directly, competing with the famous forecasters.

This way, we could expose some diamonds in the rough, like the so-called superforecasters (Tetlock & Gardner, 2015, p. 3).

Aggregate predictions

A fairly obvious opportunity to get value from the predictions is to aggregate them to form a "consensus" forecast which decision-makers could use.
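
A minimal sketch of such a consensus, assuming we pool with a simple unweighted mean (the median, or weighting by past Brier scores, are common refinements):

```python
from statistics import mean

def consensus_forecast(probabilities):
    """Pool individual probabilities into one consensus number.

    A plain unweighted mean is the simplest rule; a median is
    more robust to outlier forecasters.
    """
    return mean(probabilities)

# The panel range for the Summers forecast quoted above was 70-99%,
# centering on 85%; pooling collapses such a spread into one number.
panel = [0.70, 0.85, 0.99]
consensus = consensus_forecast(panel)   # roughly 0.85
```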

Make data easy to use

Others could build on our data. News media could, for example, include forecasters' scores in their coverage of them.

Conditionals can provide recommendations

We could also record conditional forecasts, in the form of "if X, then the probability of Y is Z." For example, someone may suggest additional teacher education. What's the probability of increased test scores after such a measure? (A system for governing based on such forecasts – only with prediction markets – has been proposed by Hanson (2013).)
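
One common scoring convention, assumed here for illustration, is that a conditional forecast counts only if its condition actually occurs, and is voided otherwise:

```python
def score_conditional(p_y_given_x, x_happened, y_happened):
    """Score "if X, then P(Y) = p" with the two-outcome Brier score.

    Returns None when X did not happen: the forecast is void, since
    it said nothing about worlds where X fails to occur.
    """
    if not x_happened:
        return None
    actual = 1.0 if y_happened else 0.0
    return (p_y_given_x - actual) ** 2 + ((1.0 - p_y_given_x) - (1.0 - actual)) ** 2

# "If we add teacher education, P(test scores rise) = 0.7" is scored
# only in the world where the education measure was actually taken.
```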

Forecasting as a service

Good forecasts can be valuable to actors with deep pockets. For example, European banks may be interested in knowing how and when the European Union's payment services directive will be enforced.

Conclusion and next steps

I'm entertaining the idea of trying such a project myself, most likely in my home country of Norway – I think media access and knowledge of the forecasted questions are essential components.

A possible next step could be, in lean startup fashion, to figure out what our critical hypotheses are. For example, would people care enough? Are we able to find or build the necessary software quickly and cheaply enough?