An experiment from

We found this cool study about randomness that we wanted to show you.

But…there is something we saw in the data that made us question the results. Can you help us reproduce the study and figure things out?

Created by
Russell Samora
Arjun Kakkar

This experiment investigates how humans produce and perceive randomness. It consists of three very short tasks related to chance. A suggested pace is one tap per second. ⚠️ You can’t undo or go back.

Challenge 1 of 3

Tap a sequence of 12 coin flips. Make it look as random as possible; another person should not be able to tell if you made it up, or if it was from real coin flips.

Challenge 2 of 3

Tap a sequence of 10 dice rolls. Make it look as random as possible; another person should not be able to tell if you made it up or if it was from real dice rolls.

Challenge 3 of 3

Tap a sequence of 10 spots. Make it look as random as possible; another person should not be able to tell if you made it up or if it was randomly generated.

Thanks for playing! Based on your answers we think you are.... How old are you really? For science.*

*We are no longer storing user responses since we have received enough! Tap for more details.

Previously, only your results and age were saved to a database. No personally identifiable info will be collected or stored. Still have doubts? Check out our privacy policy (we don’t use any trackers). And if you have access, ask your favorite “techie” to confirm this isn’t sketchy.

The study received coverage across dozens of outlets. The headline: A person’s ability to be random peaks around 25 years old and declines after 60.

The researchers made their data and methods public, so we explored the idea of making an age-guessing game. That raised some questions for us, and the story became one about the replication crisis: the ongoing concern that many scientific studies are hard to reproduce.

One 2015 attempt to reproduce 100 psychology studies was able to replicate only 39 of them.

—Kelsey Piper, Vox

We think the findings of the study are at the mercy of a single decision the researchers made: not to filter out questionable responses. To us, a questionable response is one where the participant either misunderstood the instructions or intentionally subverted the experiment.

This filtering decision was the difference between results that lead to intriguing headlines and results that show no trend and produce no paper. To be clear, this wasn’t about negligence or malicious intent, but rather the unavoidable personal opinions that inform methodologies. Our opinion diverges from the researchers’ on this point, so we wanted to break down the consequences of that decision in more detail.

Let’s dive in. Here is the wording of the instructions from the original study:

Click on a number between one and six as randomly as possible to produce the kind of sequence you’d get if you really rolled a die, so that if another person is shown your sequence of digits from 1 to 6, he/she should not be able to tell whether these numbers were produced by a real die or just “made up” by somebody.

And here are examples of a good and a questionable dice roll sequence, based on our understanding of their instructions.

Good: 3 1 5 6 2 6 3 4 4 1

Questionable: 1 1 1 1 1 1 1 1 1 1

Do you think the questionable response looks genuinely random and satisfies the instructions? We don’t. The researchers behind the study did (and confirmed it via email). And this disagreement has a big statistical impact.

Here is the key chart from the study that shows the developmental curve of complexity for the coin toss experiment.

Coin Toss Complexity Scores

Think of the complexity score as a proxy for randomness; higher up means more random, lower down means less random. Here are a few examples (H for heads, T for tails).

When we add a trend line, we can see the age drop-off finding take shape.

At the bottom are the questionable responses. We even took a conservative approach to defining questionable: only the sequences that were uniformly heads or uniformly tails.

When we exclude the questionable responses, the declining trend line disappears.

That breakdown was just for the coin toss, but the downward curve looks the same for the aggregate results across all three tasks.

All Tasks Complexity Scores

The study was peer-reviewed, but to our knowledge it hasn’t been reproduced. This is typical, since reproduction takes a significant amount of time and effort on the part of other scientists busy conducting their own research.

We aren’t peers, but here is our review. There are two camps around the decision to filter out questionable responses. We believe that some responses are obvious candidates for removal, and that removing them makes the data more trustworthy. The researchers believe that you can only analyze the raw responses because, statistically, any sequence is equally likely to occur, so where do you draw the line? To them, filtering any data would be tampering with the results.
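The "equally likely" argument can be made concrete with a little arithmetic. The sketch below is ours, for illustration only: the function names are hypothetical, and it assumes a fair die or coin with independent draws.

```python
from fractions import Fraction


def prob_specific_sequence(sides: int, length: int) -> Fraction:
    """Probability that fair, independent draws produce one particular sequence."""
    return Fraction(1, sides) ** length


def prob_any_uniform_sequence(sides: int, length: int) -> Fraction:
    """Probability that every draw shows the same face, whichever face it is."""
    return sides * prob_specific_sequence(sides, length)


# Ten rolls of a six-sided die, as in the dice task.
print(prob_specific_sequence(6, 10))     # 1/60466176
print(prob_any_uniform_sequence(6, 10))  # 1/10077696

# Twelve coin flips, as in the coin task.
print(prob_any_uniform_sequence(2, 12))  # 1/2048
```

Any one ten-roll sequence, whether mixed or all ones, has the same 1 in 60,466,176 chance, which is the researchers' point. A real die producing any uniform run of ten is about a 1 in 10 million event, which is why we read a uniform answer as something other than an honest attempt.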

What do you think about these two takes on the decision? Regardless of your opinion (or ours or theirs), we wanted to make it explicit that no study is black-and-white, and it is always worth exploring those gray areas.

The experiment at the start of this story was our attempt at our own abridged reproduction. Here are results of that process from reader responses.

All Tasks Complexity Scores (Ours)

Exclude users with 1+ uniform responses

It looks like our hypothesis was right: the trend line is basically flat whether or not we filter out the uniform responses. In our experiment, we aren’t seeing evidence of a correlation between age and randomness.

As for our initial idea to make an age-guessing game, we have guessed right 0% of the time. Not as good as we had expected 😔.

The moral of the story: just because you read a New York Times headline about a study doesn’t mean that its findings are replicable, so stay curious. If you enjoyed this breakdown, consider supporting The Pudding and more stories like this on Patreon.

Our Methods

We stopped collecting data on August 1, 2022 after receiving over 50,000 responses.

We calculate algorithmic complexity using the Coding Theorem Method (CTM). In practice, we use a custom implementation of the acss R package to obtain the complexity of each user input and compute an overall score.
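For readers who haven't met CTM before, the idea as we understand it is to estimate the Kolmogorov complexity K(s) of a short string s from its algorithmic probability m(s), roughly the chance that a randomly chosen program outputs s. The relation below is the standard coding theorem approximation, not a formula copied from the paper.

```latex
K(s) = -\log_2 m(s) + O(1)
```

Strings that many short programs produce, such as a uniform run, have a high m(s) and therefore a low complexity score.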

For the task-level data, we use the study’s original dataset and apply a complexity cutoff to remove questionable responses. Since there are many kinds of uniform responses, we pick a task-specific z-score of complexity below which every response is uniform, and drop the responses that fall below it.
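A minimal sketch of that cutoff step, in Python rather than the R we used. The function names are hypothetical, and the complexity scores are invented toy values, not numbers from the study or from acss.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


def is_uniform(sequence):
    """True when every symbol in the sequence is the same."""
    return len(set(sequence)) == 1


def drop_below_cutoff(responses, scores, cutoff):
    """Keep only responses whose complexity z-score is at or above the cutoff."""
    return [r for r, z in zip(responses, z_scores(scores)) if z >= cutoff]


# Toy coin-toss responses with made-up complexity scores (illustration only).
responses = ["HHHHHHHHHHHH", "TTTTTTTTTTTT", "HTTHHTHTTHTH", "THHTHTTHHTHT", "HHTHTTTHHTHH"]
scores = [1.0, 1.0, 9.0, 9.5, 8.5]

kept = drop_below_cutoff(responses, scores, cutoff=-1.0)
print(kept)
print(any(is_uniform(r) for r in kept))
```

The cutoff has to be chosen per task by checking that nothing below it is a non-uniform response; the value of -1.0 here only works for this toy data.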

We then overlay this data with a trend line calculated using the LOESS (locally estimated scatterplot smoothing) method. The analysis presented in the paper used the LOESS function from R with no robustness iterations (the default), which makes the trend line sensitive to outliers. Using a LOESS function with even one robustness iteration flattens the trend line in much the same way as the filtering presented in this article.
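To show what a robustness iteration does, here is a simplified, from-scratch LOESS-style smoother in Python. It is a sketch under stated assumptions, not R's implementation and not the code behind our charts: it fits local straight lines with tricube distance weights, then optionally re-fits with bisquare weights that shrink the influence of points with large residuals. The name `loess_sketch` and the toy data are ours.

```python
from statistics import median


def _tricube(u):
    """Distance kernel: weight falls from 1 at u = 0 to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def _bisquare(u):
    """Robustness kernel: points with large scaled residuals get weight 0."""
    u = abs(u)
    return (1 - u ** 2) ** 2 if u < 1 else 0.0


def _weighted_line_at(x0, xs, ys, ws):
    """Fit a weighted least-squares line and evaluate it at x0."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        return my  # no spread in x: fall back to the weighted mean
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


def loess_sketch(xs, ys, span=0.75, robust_iters=0):
    """Smooth ys over xs; robust_iters=0 means no robustness iterations."""
    n = len(xs)
    k = min(n, max(2, int(span * n)))  # neighbours used in each local fit
    robust = [1.0] * n
    fitted = list(ys)
    for iteration in range(robust_iters + 1):
        for i, x0 in enumerate(xs):
            dists = [abs(x - x0) for x in xs]
            h = sorted(dists)[k - 1] or 1.0  # bandwidth; avoid dividing by 0
            ws = [_tricube(d / h) * r for d, r in zip(dists, robust)]
            if sum(ws) > 0:
                fitted[i] = _weighted_line_at(x0, xs, ys, ws)
        if iteration == robust_iters:
            break
        resid = [y - f for y, f in zip(ys, fitted)]
        s = median(abs(r) for r in resid)
        if s == 0:
            break
        robust = [_bisquare(r / (6 * s)) for r in resid]
    return fitted


# Toy data (invented): flat scores with small noise, plus three very low
# scores at the oldest ages standing in for uniform responses.
ages = list(range(20, 50))
noise = [0.3, -0.2, 0.1, -0.3, 0.2]
scores = [noise[i % 5] for i in range(len(ages))]
scores[-3:] = [-10.0, -10.0, -10.0]

plain = loess_sketch(ages, scores, robust_iters=0)
robust = loess_sketch(ages, scores, robust_iters=1)
print(round(plain[-1], 2), round(robust[-1], 2))
```

In this toy data the plain fit dips at the oldest ages because of the three low scores, while a single robustness iteration down-weights them and the curve stays close to flat.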

The original study asked users to perform five tasks, but since the results are basically the same across the board, we decided to reduce our experiment to three tasks because of attention spans (not yours, it is exceptional if you are reading this).

Here is the data from our experiment if you want to explore it yourself.