Science consists of standing on the shoulders of giants: we run experiments, submit the results to expert peers, and publish the summaries so that others can use them as foundations for their own research. We build our careers on the validity of our results and the trust others place in them.
But what happens when trust in science erodes?
Psychology, my field of study for nearly a decade, is facing precisely such a crisis.
The popular concept of power poses? Highly questionable.
Established findings such as implicit bias? Thoroughly suspect.
So what can we do to foster others’ trust in our findings?
One solution would be to give researchers access to the relevant raw data to analyze for themselves.
Unfortunately, that’s easier said than done.
In 2013, a group of researchers led by UBC’s Timothy Vines found that the availability of research data declines rapidly with article age.
[Chart: likelihood that a study’s underlying data is still accessible, by publication year (1991–2011). Values range from roughly 3% for the oldest studies to 38% for the most recent. Annotation: studies from 1993 or prior had a lower than 10% likelihood of having accessible data.]
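For readers who want to play with the trend themselves: Vines and colleagues report that the odds of a data set still being extant fell by roughly 17% for each year since publication. Below is a minimal sketch of that kind of decay in Python, not their actual model; the baseline odds for a brand-new article and the exact per-year decline are illustrative assumptions, chosen only to show the shape of the curve.

```python
# A minimal sketch (not the paper's actual model) of how data availability
# might decay with article age, assuming the odds of a data set being extant
# fall by a fixed fraction each year. The baseline odds and the per-year
# decline are illustrative assumptions for this sketch.

REFERENCE_YEAR = 2013          # year the thesis and the availability paper appeared
ODDS_DECLINE_PER_YEAR = 0.17   # assumed yearly drop in the odds of data being extant
BASELINE_ODDS = 1.0            # hypothetical 1:1 odds (50%) for a brand-new article


def estimated_availability(pub_year: int) -> float:
    """Rough probability that a study's raw data is still obtainable."""
    age = max(REFERENCE_YEAR - pub_year, 0)
    odds = BASELINE_ODDS * (1 - ODDS_DECLINE_PER_YEAR) ** age
    return odds / (1 + odds)


if __name__ == "__main__":
    for year in range(1991, 2013, 2):
        print(f"{year}: ~{estimated_availability(year):.0%} chance the data is still available")
```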
But what does that mean in concrete terms? How much of our work, as scientists, consists of research whose data foundations have disappeared?
Let’s take a look.
Here’s a portion of my grad school thesis, submitted in 2013—the same year as Vines’s paper on data availability. Based on those findings, let’s see how likely I would be to obtain the original data used in these studies.
Nearly every sentence in a literature review has a citation to some older study that readers can consult for evidence regarding the claim I’m making. I’ve color-coded each one: darker passages are more likely to have data behind them; lighter ones are likely missing the raw data which led to the findings in the first place.
Let’s remove the studies that have a high likelihood of having inaccessible data. The result? Much of what I’ve used as a foundation for two years of research disappears. Most of the data behind journal articles—in essence, short summaries of experiments—has, for all intents and purposes, vanished.
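And here is roughly how the color-coding and the removal step could look in code. This is a hypothetical sketch, reusing the same illustrative decay as above; the citation years, the opacity mapping, and the 10% cut-off are invented for demonstration rather than taken from the actual thesis or visualization.

```python
# A hypothetical sketch of the color-coding and filtering steps.
# estimated_availability() repeats the illustrative decay from the sketch
# above; the citation years, opacity mapping, and 10% cut-off are made up
# for demonstration and are not the thesis's real citation list.

REFERENCE_YEAR = 2013
ODDS_DECLINE_PER_YEAR = 0.17   # assumed yearly drop in the odds of data being extant
BASELINE_ODDS = 1.0            # hypothetical odds for a brand-new article


def estimated_availability(pub_year: int) -> float:
    age = max(REFERENCE_YEAR - pub_year, 0)
    odds = BASELINE_ODDS * (1 - ODDS_DECLINE_PER_YEAR) ** age
    return odds / (1 + odds)


def availability_to_opacity(p: float) -> float:
    """Darker (more opaque) passages are more likely to have data behind them."""
    return round(0.15 + 0.85 * p, 2)   # keep even the shakiest passages faintly visible


citation_years = [1994, 1999, 2003, 2008, 2011]   # hypothetical citations from the thesis

for year in citation_years:
    p = estimated_availability(year)
    print(f"{year}: availability ~{p:.0%}, text opacity {availability_to_opacity(p)}")

# The "removal" step: drop citations whose data has less than a 10%
# estimated chance of still being obtainable.
surviving = [y for y in citation_years if estimated_availability(y) >= 0.10]
print("Citations likely to still have accessible data:", surviving)
```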
So what’s the fix for restoring some faith in scientific discovery? Apart from heightening researchers’ own awareness of data erosion, we should push to make the submission of raw, anonymized data (when ethically possible) to journals the default. In recent years, prestigious publications like the open-access Public Library of Science journals have played a key role in encouraging researchers to share their raw findings, giving anyone a chance to analyze them.
Making our data available will not only encourage other researchers to check our work but will also provide more material for anyone who wants to incorporate previous findings into their own scientific investigations.
Findings in the 2013 paper on data availability pertain to biological data in biology journals; the specific numbers will likely differ by discipline. Additionally, the likelihoods of data availability should not be taken as an indicator that any specific study lacks sufficient data-storage protocols, but rather as a rough indicator of the directional relationship between a citation’s age and its data’s availability. You may find the thesis cited in this project here. Big ups to Jan Diehm for additional design and development.