Modular empirical science

Published 2020-04-23.

In this article, I suggest a process for empirical science. It is an alternative to the peer-reviewed journal. The overarching goal is scientific progress, and I think this is one way to accelerate it. My suggestion is inspired by modular design, the open science movement and the success of free and open-source software.

Epistemic status

The problem

Imagine that you are a scientist, and you have dreamed up an experiment to answer a question. To illustrate, let us use the hilarious Stanford marshmallow experiment:

The question is, is there a relation between children’s ability to delay gratification and how well their lives turn out?

The current process for doing this research is approximately:

  1. Design the experiment
  2. Conduct it
  3. Analyze the results
  4. Write an article
  5. Submit the article to a journal for peer review
  6. Make changes to the article based on reviewer feedback
  7. Re-submit to the journal
  8. Repeat steps 5-7 with different journals until one decides to publish (or you give up, in which case the process ends here)
  9. Others make reproductions, replications and variations on the original study (these involve some or all of the steps outlined above)
  10. Meta-analyses are made of all the studies
  11. More reproductions, replications, variations and meta-analyses follow

(There may be additional steps, such as pre-registration.)

This process has helped us make a lot of progress. But, as you may be aware, it has quite a few problems:

  1. In order to publish a (non-meta) paper, you need to do a range of different tasks — experiment design and execution, data analysis, writing, and so on. If you aren’t great at all of them (few are), you need to collaborate with others or risk doing shoddy work. Collaboration can be rewarding, but also time-consuming and frustrating. Also, it is hard to keep track of who did exactly what.
  2. Making a meta-analysis is also a lot of non-trivial work: You have to find, read and make sense of all the studies. You need to make non-obvious judgment calls about the relevance and quality of different papers.
  3. Peer review as a quality assurance process has a lot of merit. But it can also be extremely frustrating and time-consuming, and it can prevent novel and valuable ideas from being published. And it is usually kept in the dark, making it anywhere from hard to impossible for curious minds to learn about the discussion.
  4. Wanting to publish only interesting results, journals introduce publication bias, encourage shady statistical methods, etc.

All in all, the process is slow, cumbersome and error-prone, and it makes for a lot of duplicated effort. It is not disastrously bad, but I think we can do a lot better.

An improvement

As mentioned above, imagine that you have invented the marshmallow experiment, and you want to try it. Instead of the traditional process described above, you follow this procedure:

  1. Design the experiment
  2. Establish a database
  3. Prepare the analysis
  4. Conduct the experiment and add the data to the database
  5. Analyze the results
  6. Repeat and refine any subset of steps 1-4

Design the experiment

As a proxy for life outcomes, you decide to use SAT scores at age 17. (You could use e.g. educational attainment instead, but waiting until the kids are 17 is bad enough.) You

  1. write up the details of how to conduct the experiment in a plain text file;
  2. publish it online in a version control repository; and
  3. share it with others you think might be interested.

Jane, a colleague of yours, reads your description. She notices that some detail is missing from the instructions for the experimenters. Jane writes up her suggested changes. (In software development, this is usually called a pull request.) You take a look, and agree that the changes would improve the design. Therefore, you update it to include her suggestions.

Establish a database

At the very least, you need a table of observations. To start, it could have the following columns.

  1. Subject
  2. Marshmallow experiment result (pass or fail)
  3. SAT score at age 17

You want to put the database online, too. But the data is sensitive. So you make a second table of subjects, which only you will have access to. Then, in the table of observations, you reference the subject by ID.

At this point, it makes sense to make a simple website with the experiment design and the database. You put the (currently empty) database online and tell others about it.

Bob, a researcher at a different university, comes across your published experiment design and database. He thinks that the way you encode experiment results is needlessly limited. Instead of a pass or fail, he suggests that you use the number of seconds before the first bite of the marshmallow. Again, you agree and accept his suggestion, updating both the experiment design and the database design.
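To make this concrete, here is one way the two tables could look after Bob's revision, sketched with SQLite as a stand-in. The choice of database, the table and column names, and the file name are illustrative assumptions on my part; the design above only describes the columns informally.

```python
# Illustrative schema only; names and the use of SQLite are placeholders.
import sqlite3

conn = sqlite3.connect("marshmallow.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS subjects (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL              -- kept private; never published
);

CREATE TABLE IF NOT EXISTS observations (
    subject_id     INTEGER REFERENCES subjects(id),
    seconds_waited INTEGER,         -- Bob's suggestion: seconds, not pass/fail
    sat_score      INTEGER          -- NULL until the subject takes the SAT
);
""")
conn.commit()
```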

Prepare the analysis

Now that you know how the data is going to look, it makes sense to prepare the analysis. You write a small program to make a scatter plot and do a linear regression. You share that, too.
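A minimal sketch of such an analysis program could look like the following, continuing the illustrative schema above. The column names, file names and choice of libraries are assumptions, not part of the design.

```python
# Sketch of the analysis: a scatter plot plus a simple linear regression.
import sqlite3
import matplotlib.pyplot as plt
from scipy import stats

conn = sqlite3.connect("marshmallow.db")
rows = conn.execute(
    "SELECT seconds_waited, sat_score FROM observations WHERE sat_score IS NOT NULL"
).fetchall()
seconds = [r[0] for r in rows]
sat = [r[1] for r in rows]

# Linear regression: SAT score as a function of seconds waited.
fit = stats.linregress(seconds, sat)
print(f"slope={fit.slope:.2f}  r={fit.rvalue:.2f}  p={fit.pvalue:.3f}")

plt.scatter(seconds, sat)
plt.xlabel("Marshmallow test result (seconds)")
plt.ylabel("SAT score at age 17")
plt.savefig("marshmallow_vs_sat.png")
```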

John, a statistics student, notices a bug: You made an off-by-one error which would leave out the last observation. He writes a change request, and you gratefully accept it.
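The story does not say what the bug looked like, but a purely hypothetical version of such an off-by-one error could be:

```python
rows = [(600, 1000), (87, 600), (341, 1100), (600, 1400)]

# Buggy: the "- 1" silently drops the last observation.
included_buggy = [rows[i] for i in range(len(rows) - 1)]

# Fixed: simply iterate over every row.
included_fixed = list(rows)

print(len(included_buggy), len(included_fixed))  # 3 vs 4
```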

Conduct the experiment and add the data to the database

Finally, you get to the fun part of seeing kids wrestle with the marshmallow dilemma. When you are done, you put the data into the database. The first four observations might look something like this:

| Subject ID | Marshmallow test result (seconds) | SAT score at age 17 |
|------------|-----------------------------------|---------------------|
| 1          | 600                               | NULL                |
| 2          | 87                                | NULL                |
| 3          | 341                               | NULL                |
| 4          | 600                               | NULL                |

It is not very useful yet, but you figure that it doesn’t hurt to put the data out there. Next, you need to wait about 12 years to get the kids’ SAT scores. That’s a long time, but you are a patient scientist.
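Adding those first rows could be as simple as this, continuing the illustrative SQLite sketch from above; the SAT scores stay NULL (None) for now.

```python
import sqlite3

conn = sqlite3.connect("marshmallow.db")
conn.executemany(
    "INSERT INTO observations (subject_id, seconds_waited, sat_score) VALUES (?, ?, ?)",
    [(1, 600, None), (2, 87, None), (3, 341, None), (4, 600, None)],
)
conn.commit()

# Twelve years later, the missing scores are filled in with updates like:
# UPDATE observations SET sat_score = 1000 WHERE subject_id = 1;
```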

Fast-forward 12 years. You finally have all the data you need, and you add the missing SAT scores to the database:

| Subject ID | Marshmallow test result (seconds) | SAT score at age 17 |
|------------|-----------------------------------|---------------------|
| 1          | 600                               | 1000                |
| 2          | 87                                | 600                 |
| 3          | 341                               | 1100                |
| 4          | 600                               | 1400                |

Analyze the results

You run the analysis program that John helped you perfect. You find a moderate positive correlation, and put the results on the project website.

Repeat and refine

So far, you have conducted a study in an unusually open way. Also, you have used some modern tools — a relational database, a version control system and a website. They make the work easier and more structured. They also give everyone a fairly clear and transparent record of who contributed what. But so far, the process you have followed isn’t that novel. The magic happens when the next scientist comes along.

A reproduction

Her name is Alice. She wants to reproduce the experiment. You think this is a great idea. I mean, of course you would, right? Riiight? Anyway, you point her to the experiment design, and she starts her work.

In the meantime, you reconsider your database design. As long as there was only one experiment, you could get away with some very minimal tables. But now that a second researcher is going to contribute, you need to step up your game.

First, you configure the database so that Alice has access to add rows to the tables of observations and subjects. She can’t, however, see the subjects you have added.

Second, you need to add some more columns to the table of observations. With only your experiment, a lot of information was similar for all observations. Therefore, it was fine to leave a lot of it out of the database. Now, you need columns for, e.g., the experiment date, what researcher was responsible, the location, and so on.
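Neither step is spelled out in detail here, but if the database were, say, PostgreSQL with one account per researcher, both could be sketched roughly as follows. All names, and the use of row-level security, are assumptions on my part.

```python
import psycopg2  # assuming a PostgreSQL database with per-researcher accounts

conn = psycopg2.connect("dbname=marshmallow")
cur = conn.cursor()

# Let Alice add rows to both tables...
cur.execute("GRANT SELECT, INSERT ON observations TO alice")
cur.execute("GRANT SELECT, INSERT ON subjects TO alice")

# ...but use row-level security so she only sees the subjects she added herself.
cur.execute("ALTER TABLE subjects ADD COLUMN IF NOT EXISTS owner TEXT DEFAULT current_user")
cur.execute("ALTER TABLE subjects ENABLE ROW LEVEL SECURITY")
cur.execute("CREATE POLICY subjects_owner ON subjects USING (owner = current_user)")

# New columns, now that more than one experiment will feed the table.
cur.execute("ALTER TABLE observations ADD COLUMN IF NOT EXISTS experiment_date DATE")
cur.execute("ALTER TABLE observations ADD COLUMN IF NOT EXISTS researcher TEXT")
cur.execute("ALTER TABLE observations ADD COLUMN IF NOT EXISTS location TEXT")

conn.commit()
```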

You also set up your analysis to update automatically whenever new observations are added.
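How the automatic update is wired up is left open; one simple way would be a small watcher script along these lines, where analysis.py is a placeholder for the program above. A cron job or a database trigger would work just as well.

```python
import sqlite3
import subprocess
import time

last_count = None
while True:
    conn = sqlite3.connect("marshmallow.db")
    (count,) = conn.execute("SELECT COUNT(*) FROM observations").fetchone()
    conn.close()
    if count != last_count:                        # new observations have arrived
        subprocess.run(["python", "analysis.py"], check=True)
        last_count = count
    time.sleep(3600)                               # check again in an hour
```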

Alice conducts the first part of the experiment, and adds her data. While she is waiting for the kids to turn 17, you add a second analysis program to see if there are any changes in marshmallow experiment scores over time. This opportunity is one benefit of adding data as soon as it is available.
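That second program might, for example, average the marshmallow results per year of the experiment_date column added above. Again, the names are illustrative, and SQLite is used only to keep the sketch self-contained.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("marshmallow.db")
df = pd.read_sql_query("SELECT experiment_date, seconds_waited FROM observations", conn)

# Mean seconds waited, grouped by the year the experiment was run.
per_year = df.groupby(pd.to_datetime(df["experiment_date"]).dt.year)["seconds_waited"].mean()
print(per_year)
```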

Fast-forward another 12 years, and Alice’s part of the data set is complete. The updated analysis shows a slightly smaller but more statistically significant effect size. This is great. You are now more confident in the finding.

Controlling for things

A while later, John shares a thought with you: Maybe the effects you have been seeing aren’t entirely due to the subjects’ ability to delay gratification? For example, some kids get more SAT tutoring than others. Both you and Alice agree that John might be right. So you add yet another column, this time to account for the amount of tutoring. While you and Alice gather data on SAT tutoring, John makes a new version of the analysis script, controlling for the new variable.
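John's revised script would then fit something like a multiple regression, with tutoring as an extra regressor. A hedged sketch, assuming a tutoring_hours column and the illustrative names from before:

```python
import sqlite3
import pandas as pd
import statsmodels.api as sm

conn = sqlite3.connect("marshmallow.db")
df = pd.read_sql_query(
    "SELECT seconds_waited, tutoring_hours, sat_score FROM observations "
    "WHERE sat_score IS NOT NULL",
    conn,
)

# Ordinary least squares: SAT score on seconds waited, controlling for tutoring.
X = sm.add_constant(df[["seconds_waited", "tutoring_hours"]])
model = sm.OLS(df["sat_score"], X).fit()

# The coefficient on seconds_waited is the effect size after controlling for
# tutoring; compare it with the simple regression from before.
print(model.summary())
```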

When the data is in, the analysis is re-run automatically. It turns out that John was right again: The effect size is smaller when controlling for the amount of SAT tutoring.

The saga continues…

As you can probably tell by now, this is a never-ending story.

If scientists worked like this, I think we would see less duplication of work — e.g. more shared data models and analysis programs — and more specialization among researchers. For example, Alice might be great at conducting experiments, while John might have an edge in data analysis. With this process, they can contribute to scientific progress in much smaller pieces. It is, in other words, more modular than the journal-driven process.

As may be obvious by now, the meta-analysis is an integrated part of this process: All the studies are continuously aggregated. However, you might still want to do a meta-analysis of all the research that looks at the ability to delay gratification and life outcomes. After all, there are other ways to measure this ability. (For example, you could ask teachers to rate students.) The meta-analysis would, however, be easier because of the database and analysis programs already available.

The need for a meta-analysis raises an interesting question: Are you using the right entity for your database? In other words, should the database encompass only one type of experiment, or should it encompass an entire hypothesis? If the latter, you would add more independent variables, such as teachers' ratings of their pupils' ability to delay gratification. (As mentioned above, you have already added a control variable: the amount of SAT tutoring.) We can even go further by including data for other hypotheses, such as the relation between IQ scores and life outcomes.

There are probably many trade-offs involved. For example, having an all-encompassing database would enable some interesting analyses. But it would also most likely be hard to navigate and a nightmare to manage. It is not obvious to me where the cut-off should be.

Also, this won’t magically solve all the problems with the existing process mentioned above. For example, someone would need to have the final say in what contributions are accepted. People will disagree, and sometimes the gatekeepers will be wrong. But, if implemented properly, the disagreements would at least be open for everyone to see. That way, the public can make up their own minds, and unreasonable gatekeepers run a greater risk of being exposed.

If this is so smart, why isn’t everyone doing it?

“Okay,” you might be thinking. “This sounds like a good idea and all, but why isn’t everyone doing it? Scientists tend to be smart people who really care about scientific progress. Surely, they must be doing all the right things already?”

Scientific progress is, at least to a large extent, a public good:

Public good, in economics, a product or service that is non-excludable and nondepletable (or “non-rivalrous”).

A good is non-excludable if one cannot exclude individuals from enjoying its benefits when the good is provided. A good is nondepletable if one individual’s enjoyment of the good does not diminish the amount of the good available to others. For example, clean air is (for all practical purposes) a public good, because its use by one individual does not (for all practical purposes) deplete the stock available to other individuals, and there is no way to exclude an individual from consuming it, if it exists.

What does it mean for supply when a good is public?

Public goods (and bads) are textbook examples of goods that the market typically undersupplies (or oversupplies in the case of public bads).

To correct for this problem, governments and non-profits have established a kind of pseudo-market for scientific progress. But there are many reasons to expect such a market to not be perfect:

  1. Scientific progress is really hard to measure. Therefore, funders struggle to incentivize researchers optimally.
  2. Even governments and non-profits can’t entirely escape the public good problem. For example, countries will be tempted to free-ride on other countries’ science funding. As a result, we should expect science to be underfunded.
  3. People in charge of funding have their own incentives. For example, the leader of a non-profit may need to impress donors in the short term.

Even if the incentives were all aligned, not all opportunities to make progress would be exploited, at least not immediately:

… [O]verly optimistic economic models have often assumed that demand and incentives are enough to stimulate the production of any product. Incentives work to motivate intermediaries and traders, but makers, who are the ones that provide the substance of what is traded, need more than an incentive to make something. They need to know how to do it. (emphasis mine)

— César Hidalgo, Why Information Grows, pp. 77-78.

In other words, for modular empirical science to happen, the people involved would not only need the motivation to do it. They would also need to know how to do it — how to work with databases, websites and version control systems. They would also need their data to be of the kind that can be shared. And all of this would have needed to happen in the fairly short time that the necessary tools have been available and of sufficient quality.

For all these reasons, I am confident that there are many opportunities out there for accelerating scientific progress. And I think the idea we have been discussing here is one of them. (For a thorough discussion of how to tell if you can do something unusually well, see Eliezer Yudkowsky’s Inadequate Equilibria.)

Closing thoughts and request for comments

In this article, I have presented a suggestion for how we can improve experimental science by moving from a journal paradigm to a modular paradigm of open-source software and (at least partially) open databases.

There are many questions I have left out, both for brevity and because I don’t have all the answers. For example:

  1. Why hasn’t this been done (to a larger extent) already?
  2. How can a transition be made? One dimension to consider is bottom-up versus top-down. I think bottom-up would be better. But we can’t ignore incentives — and they are set from the top.
  3. What aspects of the existing paradigm will we miss if we change? (For a full discussion, see Chesterton’s Fence.)
  4. Details will differ by field and method. Randomized controlled trials, for example, are different from studies such as the marshmallow experiment.
  5. There are clearly disadvantages to the approach I am suggesting. For example, we have learned from open-source software development that maintaining projects can be very hard work, and many burn out.

I would love to hear from you:

  1. Have I made any mistakes or misunderstood anything?
  2. Is anything unclear, or do you have suggestions for improvement?
  3. Who, if anyone, is working like this already?
  4. Who could be interested in working on or funding work on this? For example, I could help willing scientists with the necessary training, software development and infrastructure operations.

Thank you for reading. You can comment on