Continue, Pivot, or Put It Down

The Pudding’s process to go from idea to data story

When I started working at The Pudding, I was introduced to the concept of an “idea backlog”, or a place to keep track of the random thoughts, questions, and inspirations about the world that I notice in my day-to-day life. Every so often, I combine my new ideas with my teammates’ in our team backlog. Many of these ideas will never lead anywhere, but some of them we’ll dust off, explore, shape, and just maybe, develop into a full-blown visual essay on The Pudding.

Just because an idea has been plucked from our backlog doesn’t mean it will ever see the light of day. Some ideas are abandoned before they’re even really started, and others are left on the cutting room floor much later in the process. Still others are not scrapped entirely, but change so drastically somewhere between conception and publication that the finished project bears little resemblance to its starting point.

For much of my life, changing something or putting it down, even if it no longer felt right, seemed like a form of failure. Of course, it wasn’t. Making conscious decisions about when to continue in one direction, when to pivot, and when to leave an idea behind makes my published stories so much better. This is a process we practice with every story we write at The Pudding, and one we want to share. So here is a rough idea of the crossroads we reach between idea and published visual essay.

[Flowchart: "Writing a data-driven story." Decision points: Do you have a unique question? Can it be answered with data? Do the data exist? Can you collect the data yourself? Is it ethical to collect or use these data? Are the results of the analysis interesting? Are you the right person to tell this story? Create a plan for your data story: is it still interesting? Make the thing: is it still interesting? Each path ends in either PUBLISH IT or PUT IT DOWN* (*either temporarily or permanently).]

You may notice that there are a lot more paths that lead to put it down than there are that lead to publish it. That doesn't mean that we actually scrap more stories than we publish. It just means that saying goodbye (even for a period of time), pivoting directions, and continuing forward are all acceptable options at any of these junctures.
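For readers who think in code, the checkpoints can be sketched as a loop. This is only a simplified, linear reading of the junctures discussed in this piece, not a faithful copy of the flowchart's branching; the checkpoint labels, the `walk_checkpoints` function, and the `can_pivot` callback are all hypothetical names invented for illustration.

```python
from enum import Enum


class Outcome(Enum):
    PUBLISH = "publish it"
    PUT_DOWN = "put it down"  # either temporarily or permanently


# Hypothetical labels for the junctures, in the order this piece discusses them.
CHECKPOINTS = [
    "unique question",
    "answerable with data",
    "data exist or can be collected",
    "ethical to collect or use",
    "analysis results interesting",
    "plan still interesting",
    "finished piece still interesting",
]


def walk_checkpoints(answers, can_pivot=lambda checkpoint: False):
    """Return (outcome, checkpoint where the story stopped or None).

    answers maps each checkpoint label to True ("yes") or False ("no").
    can_pivot says whether a "no" at a checkpoint can be rescued by a
    pivot (a new question, a new data source, a new frame).
    Simplification: a successful pivot just continues to the next
    checkpoint instead of looping back to an earlier one.
    """
    for checkpoint in CHECKPOINTS:
        if answers.get(checkpoint, False):
            continue  # "yes": keep moving forward
        if can_pivot(checkpoint):
            continue  # "no", but a pivot keeps the story alive
        return Outcome.PUT_DOWN, checkpoint
    return Outcome.PUBLISH, None
```

Every checkpoint has its own exit to `PUT_DOWN` but there is only one way to reach `PUBLISH`, which is why the chart shows so many more arrows pointing at "put it down".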

Let's go through some examples of decisions our team has made when stories reach each crossroads.

Do you have a unique question & can it be answered with data?

Since these first two junctures go hand in hand, I'll address them together. At The Pudding, we tend to start our process of creating a data story with a question. This helps us scope the story, keeps us from getting buried under the size of it, and sets up the next step: finding data to help us answer that particular question. Every once in a while, we move forward with a story that didn't start with a specific question (like our Internet Boy Band Database), but when we do, we make a conscious decision to forgo the question.

Once you've got a question in mind, figure out if it's unique. Has someone asked and answered this question? If not, continue on. If they have, can you tell the story in a new or different way? For instance, a few years ago we wrote a story about access to abortion clinics throughout the US. This is a story that has been told before, but we wanted to tell the story using new data: driving times to abortion clinics. Because this was a new angle, we considered it a "unique" question and moved forward.

Next, our question needs to be answerable with data. In our story about the ineffectiveness of women's pockets, Jan and I could have started with a question like "why are women's pockets so small?" That is technically a question, but it is hard to answer with data. Reframing it as "how much smaller are women's pockets than men's?" or "are women's jean pockets smaller than men's across the board?" gives us a question that can be answered with data and allows us to move forward.

Stories we’ve put down at this stage:

  • A few years ago, Matt and Jan were mesmerized by the YouTube videos that Summoning Salt was creating about video game speedruns (i.e., trying to beat a level of a video game, or the entire game, as fast as possible) and thought there might be a way to collaborate to make something even more data viz-y. But after collecting the data and doing some storyboarding, they couldn’t find a way to tell the story that added to the conversation: kind of an “if it ain’t broke, don’t fix it” dead end.

Do the data exist?

There are many sources of open, available data on the internet. Sites like Kaggle and data.world provide a wide variety of datasets. There are also government sources, such as the CDC in the US, or the equivalent data for your country, state, or city. Sometimes these sources work well and can be used to answer the question we had (check out our use of CDC data in this story about birth control or our use of Chicago prosecution data in a story about Kim Foxx).

But sometimes pre-existing data sources don't help you answer your question. When you use data for a purpose other than the one they were collected for, you need to assess whether your use and interpretation of the data are accurate. Sometimes you can change your question to fit the data. Otherwise, in cases where pre-existing data won't work for you, you may be able to collect your own. We create our own datasets frequently. Sometimes, we can create datasets from open sources like Wikipedia (check out several of our stories that utilize Wiki data), TripAdvisor, or PetFinder.

Other times we need to collect the data manually. To find out how much smaller women's pockets are than men's, Jan and I physically went to stores to measure pockets and recorded our results. To track the appearance of each boy band member in their band's music videos, we enlisted the help of several volunteers to watch each video and record specific details about each person. Two other editorial assistants and I read through hundreds of high school dress codes by hand to figure out which items were prohibited in schools nationwide. Manually collecting data can be time-consuming, and the data may not pan out the way you expect, so be prepared to sink a lot of time here if you go this route.

Stories we’ve put down at this stage:

  • Russell had pitched an idea about the use of bananas in cookbook recipes over time. Unfortunately, after lots of digging, the only digitized cookbooks we could find were either modern (from the last 10 years or so) or very old (pre-1900). Many of these didn't have data in a usable or structured format, and we were missing an entire century of data. We put the story down for a few months but have recently found another data source that may help us answer this question. Perhaps we’ll pick it back up again!

Is it ethical to collect or use these data?

This is a tricky one to define. Legally, data that are publicly available can generally be scraped or collected. There are some caveats (e.g., not for unlimited commercial use, not on sites that require authentication, not copyrighted creative content like photos or videos, etc.), but otherwise, in the US, this is typically considered fair use.

That being said, there are plenty of data that can be collected and used legally but, ethically, probably shouldn't be. For instance, as the COVID-19 pandemic continues worldwide, Amanda Makulec, a data visualization designer and public health professional, has asked the data visualization community to "#vizresponsibly - which may mean not publishing your visualizations in the public domain at all". Even though a lot of data on this topic are publicly available, it is very easy to misunderstand such complex data, or to mislead readers with them, in an ever-evolving situation. And if you do move a story like this forward, keep in mind the ethics of your story and presentation choices, as your work may have unintended consequences.

Similarly, in 2018, data science consultant Lynn Cherny reflected on a data science project she had started, spent a lot of time on, and then decided to stop working on. She set out to make a story generator based on William Wallace Cook's 1928 writer’s manual, “Plotto”. Technically, she was able to do this. The data were freely accessible and she had the skills necessary to use them to create a generator. But upon analyzing the data, she found them to be antiquated at best, and both sexist and racist at worst. Even though she had already done a lot of work on this project, she decided to put it down and not release any of her materials publicly. In her words: “As a technologist, I’m choosing not to amplify Cook’s old-fashioned voice and viewpoint by not publishing the derived dataset or the code I created to tell his stories.”

Stories we’ve put down at this stage:

  • Florida Man. The idea that there is always someone in Florida doing something seemingly silly is a strong one in the US and we had a few ideas that we considered pursuing along this topic. The data (news headlines & articles) were open and available to us for use. But after investigating the issue and finding that the Florida Man stereotype only existed because of transparency laws in Florida and is at its core an "anti-poor, mental health-shaming" meme, we decided to scrap the story. We have no intention of picking this story back up.

Are the results of the analysis interesting?

So, at this point, you have a question that can be answered by data. You've collected the data, analyzed them, and you (hopefully) have some sort of answer to your question. Here comes the tricky part: is this interesting? Was the answer surprising? Or was it what you expected? Would someone else be interested in hearing you talk about what you found? (If you're unsure, try talking to someone about it.) Here are a few examples of how we have treated this crossroads:

Stories we pivoted at this stage:

  • Our Hipster Summer Reading List originally started as a question about which library books are checked out the most. We wanted to know if the most popular 1% of books accounted for 90% of checkouts or something similar. We used open checkout data from the Seattle Public Library, analyzed it, and found that the most popular 1% of books accounted for 20% of checkouts. Which... is kinda interesting? We played with several questions here and almost put down the story before deciding to flip the question on its head and ask "which books haven't been checked out in decades?" Framing it as a reading list for people who want the books no one is reading gave it the spark that we were looking for.
  • One of our most engaging stories, a deep dive into Ali Wong's comedy routine, was actually put down for a few months after the manual data collection was completed. The original story was more focused on comparing the structure of different comedians’ stand-up routines. Russell (the original creator of the idea) said he set it aside “purely because [he] had been working on it for too long, so [he] decided to give it some breathing room”. When our colleague Matt came up with a new angle for the story (focusing it around the laughter climax), the two picked the story back up and developed it. (Hear Matt and Russell give a behind-the-scenes look at this project here.)
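The two library questions above (how concentrated are checkouts, and which books has no one borrowed in decades?) can each be sketched in a few lines. This is a minimal illustration rather than the analysis we actually ran: it assumes the checkout records have already been reduced to one total and one most-recent checkout year per title, and the function names and the 20-year threshold are made up for the example.

```python
import math


def top_share(checkouts_per_title, top_fraction=0.01):
    """Share of all checkouts that come from the most popular titles.

    checkouts_per_title: dict mapping title -> total number of checkouts.
    top_fraction: 0.01 means "the most popular 1% of titles".
    """
    counts = sorted(checkouts_per_title.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one title, even in tiny collections.
    n_top = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:n_top]) / total


def long_unread(last_checkout_year, current_year, min_years=20):
    """Titles whose most recent checkout was at least min_years ago.

    last_checkout_year: dict mapping title -> year of most recent checkout.
    """
    return sorted(
        title
        for title, year in last_checkout_year.items()
        if current_year - year >= min_years
    )


# Toy data, invented for illustration: 100 titles, one runaway favorite.
toy_counts = {"Popular Book": 99, **{f"Book {i}": 4 for i in range(99)}}
print(f"Top 1% share of checkouts: {top_share(toy_counts):.0%}")
```

The first function answers the original question; the second is the flipped question that became the reading list. The same data support both, which is why a pivot at this stage can be cheap.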

Stories we put down at this stage:

  • I often tease my partner about how his wardrobe is like that of Lego Batman: it consists only of black and very, very dark gray. After stumbling upon a Twitter thread lamenting the blandness of men's wardrobes, I was curious about whether this was an issue in the pipeline (i.e., stores don't sell colorful clothes for men) or with the buyer (i.e., men don't buy colorful clothes, even when offered). I decided to find the color options for the top 600 best-selling t-shirts for men and women on Macys.com and found... really no huge difference in what is being offered. The findings felt bland and uninteresting, so we said goodbye to this piece. Unless I come up with a new twist on this question, I don’t intend to pick this story back up.

Create a plan for your story; is it still interesting?

Sometimes the results that you've found are interesting, but it's hard finding the right way to present them. At The Pudding, we often use storyboards to roughly sketch out how we want our story to go. Then, we share the storyboards with our team, seeing if other people find the flow of information and our planned presentation of the data engaging. Sometimes, the frame just isn't right, but it can be fixed. Other times, there may not be a good frame for it.

Stories we’ve pivoted at this stage:

  • In telling a story about the weather on Mars, my initial storyboard was designed as a faux welcome packet for humans hoping to live there. But when I presented it to the team, the idea fell flat. Wouldn't residents have already known about the weather before going there? And without a point of comparison, the numbers lose their interest. Instead, I decided to frame the same data around imagined postcards sent to Earth from the Curiosity rover on Mars. Curiosity could then explain the weather on Mars using the reader's location as a point of comparison. Even though the data and question remained the same, the pivot in the story’s frame saved this story from being scrapped.

Make the thing; is it still interesting?

This is our last checkpoint before a story gets published. It's actually pretty rare for us to put a story down at this phase because, in order to get here, a story has gone through several other checkpoints, and each time we decided to keep moving it forward. In short, this stage involves reflecting on your final piece and ensuring that your question is answered clearly, your data are robust, and the flow and presentation of information make sense and draw the reader in. If they don't, hopefully it's something that can be fixed.

Stories we’ve put down at this stage:

  • Shortly after quarantine began in March, I started noticing that Netflix very frequently uses the word "irreverent" to describe shows. It piqued my curiosity enough that I collected the descriptive tags used for over 100 titles on the platform and found that "irreverent" was the second most common tag. A small story, certainly, but it was interesting and intriguing enough for me to move it forward. My problem was that I took too long to publish it. I sat on the story for various reasons, and by the time I felt OK publishing it in June, it had lost all of the things that made it interesting. People were still quarantining, but not as much, nor as vigilantly. Netflix had stopped promoting as many comedies and was promoting more "heartfelt" stories. Our stories don't typically rely on timeliness, but this one did, and it missed its window. At the moment, I don’t see a way to pick this story back up.

Some ideas are simple, and all of the pieces needed to turn them into a story seem to fall into place fairly effortlessly. Other stories require more work, more consideration, and many more twists, turns, and repeated attempts before we get them quite right. This process works for us, and we hope that it may work for you too.
