Bad Data Visualization in the Time of COVID-19

These three data viz pitfalls, which have proliferated during the pandemic, are instructive for data designers and consumers alike

Danny D. Leybzon
Published in Nightingale
May 29, 2020


To paraphrase a popular idiom: there are lies, damn lies, and data visualizations. This saying holds especially true in times of high pressure, such as in the face of the global COVID-19 pandemic. In these uncertain times, rational people turn to data to help inform their opinions and decisions. Unfortunately, even intelligent people can fall victim to logical fallacies, cognitive biases, and creative misrepresentations, especially when the stakes are high.

And data visualizations are especially prone to misrepresentation and abuse. Data visualizations are powerful tools for communicating about data, and with great power comes great responsibility. That’s why it’s incredibly important to be thoughtful and mindful about the data visualizations we create and consume, especially when the conclusions we draw from these data visualizations inform life and death decisions. Nightingale has already published two great pieces on the topic: one about ethical design considerations and one about interpreting data visualizations. These posts do a great job of providing general recommendations about data visualization and I wanted to write a complementary post that dives deeper by dissecting some specific data visualizations.

My career in data science has always centered heavily on data visualization. My most-read Medium post relies heavily on data visualizations (as do all of my published data analyses), and in 2017 my team won Best Data Visualization at UCLA’s data science hackathon. As a result, I pay especially close attention to the data visualizations I see, both in social and news media. I’ve compiled a list of three data visualizations that I think succinctly capture most of the data viz errors I’ve seen during this pandemic, and I analyze each of them below. I provide these critiques not as an indictment of the visualizations’ creators but rather as a tool for education. Hopefully, we can look at these charts and graphs and all become more thoughtful about the data visualizations that we create and consume.

1. Distorted Y Axes

From Reddit

The first example of misleading data visualization comes to us courtesy of Reddit but was originally propagated by Fox News. It displays a line chart, with the X axis representing days (between March 18 and April 1) and the Y axis representing the number of new cases reported each day (presumably new cases of coronavirus in the US, though the chart comes without a source or legend). The graph seems to indicate more or less linear growth between March 18 and March 24, a drop on March 25, a steep jump on the 26th, and then more or less random variation after the 26th.

You, like me, might first scratch your head about what makes this a bad data visualization. Sure, it’s not exactly pretty; it’s missing a legend and a data source, and the circles indicating data points are a bit large, making their exact positions hard to interpret. But really, these are minor sins in the world of data viz. I’ll admit that it wasn’t until I read the title of the Reddit post that I realized what’s so wrong with this visualization: it has an incredibly misleading Y axis.

Rather than increasing by a fixed increment, as we would expect, the values on this chart’s Y axis jump around sporadically. The gridlines are evenly spaced, but the y-values they mark don’t increase by a fixed amount, as they would on a normal graph. Rather, the differences between consecutive y-values go: 30, 30, 10, 30, 30, 30, 50, 10, 50, 50, 50. Wow, that’s kinda crazy. Why would somebody do that?

To be fair, it can sometimes be useful to apply the logarithm function to our y-variable, thereby effectively “scaling down” the Y axis and making exponential trends easier to examine. However, graphs using logarithmic scales on the Y axis should always indicate that fact (in order to prevent misleading people into believing that an exponential curve is linear) and, more importantly, this isn’t even a logarithmic scale! The difference between y-values drops from 30 to 10, returns back to 30, up to 50, drops dramatically to 10, and jumps to 50 again! No wonder this graph looks so all over the place …
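To make the trick concrete, here is a minimal matplotlib sketch (using made-up case counts, not the chart’s actual data) that plots the same series twice: once against gridlines that are evenly spaced on screen but labeled with the uneven increments listed above, and once against an honest linear axis.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical daily case counts -- not the actual data behind the Fox News chart
cases = np.array([35, 62, 90, 118, 125, 95, 190, 240, 255, 195, 230, 270, 215, 260])
days = np.arange(len(cases))

# Tick labels reproducing the uneven increments described above
# (differences of 30, 30, 10, 30, 30, 30, 50, 10, 50, 50, 50)
tick_values = np.array([30, 60, 90, 100, 130, 160, 190, 240, 250, 300, 350, 400])
tick_positions = np.arange(len(tick_values))  # evenly spaced on screen

# Map each data point onto the distorted axis
distorted = np.interp(cases, tick_values, tick_positions)

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(10, 4))

# Misleading version: even spacing, uneven increments
ax_bad.plot(days, distorted, marker="o")
ax_bad.set_yticks(tick_positions)
ax_bad.set_yticklabels(tick_values)
ax_bad.set_title("Distorted Y axis (uneven increments)")

# Honest version: a plain linear axis
ax_good.plot(days, cases, marker="o")
ax_good.set_title("Honest Y axis (even increments)")

for ax in (ax_bad, ax_good):
    ax.set_xlabel("Day")
    ax.set_ylabel("New cases")

plt.tight_layout()
plt.show()
```

Because jumps of 50 get exactly as much vertical space as jumps of 10, the upper part of the distorted chart is visually compressed, which is precisely the effect we see in the original.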

In order to answer the question of intention, let’s take a look at what this graph would look like with a normal Y axis:

Created using Google Sheets

Without comparing the two graphs side by side, the difference may be small enough to squeeze past a casual viewer’s eyes unnoticed. The new graph retains the same basic shape as the old one, with peaks and valleys in the same place. But as soon as the graphs are put next to each other, it becomes obvious that the second half of the graph has been subtly “squeezed” or compressed, making the increases in daily cases look less significant.

So, if it’s not an intentional use of logarithms, what could explain this graph’s craziness? The answer might be incompetence or it might be malice. We can give some Fox News editor the benefit of the doubt and assume that they were merely trying to make the graph look pretty or something. However, I think a very good case could be made that the creator of this data visualization was intentionally obscuring trends by messing with the scale of the Y axis. By subtly squishing the top half of the chart, the creator makes the coronavirus outbreak look like it’s spreading more slowly than it actually is. I won’t comment on potential motivations for such a misdirection, but you are free to draw your own conclusions.

2. Correlations and Causations

From Medium

There’s a good chance that you’ve seen the data visualization I picked for my second example. There’s even a good chance that you’ve shared it, and, honestly, I can’t blame you. It makes a very compelling, simple case for the use of masks, and I am decidedly pro-mask. Even so, it’s an example of bad data visualization. The original chart was created by John Burn-Murdoch for the Financial Times and is incredibly useful and well made; check out the updated version. The problem lies in the annotations that were added to it, which lead people to draw tenuous conclusions.

The graph shows the growth in the cumulative number of cases by country, with the X axis representing the number of days since that country’s 100th case, the Y axis representing the cumulative number of cases (notice the appropriate use of logarithmic scaling, so that straight lines actually represent exponential growth), and each curve representing a different country. So far so good. The issue comes when another commentator, Joseph Perla, added a pair of hand-drawn circles. One circle, in the upper left corner (representing fast growth in cases), includes countries such as the US, Spain, Italy, and China, and is labeled simply “No Masks.” The other circle, in the lower right corner (representing slow growth in cases), includes countries such as Singapore, South Korea, and Japan. The implicit argument being made is that masks help “flatten the curve” (i.e., lower the rate of growth of the cumulative case count), as evidenced by the fact that countries with mask usage had lower growth rates than countries without mask usage.
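If you’re curious how this style of chart is constructed, here is a minimal sketch with entirely made-up numbers (not the FT’s data): each country’s cumulative case count is plotted against days since its 100th case, on a clearly labeled logarithmic Y axis.

```python
import matplotlib.pyplot as plt

# Hypothetical cumulative case counts per country, starting at each country's 100th case
data = {
    "Country A": [100, 140, 200, 290, 430, 640, 950],   # fast, roughly exponential growth
    "Country B": [100, 115, 130, 150, 170, 195, 225],   # slower growth
}

fig, ax = plt.subplots()
for country, cumulative in data.items():
    days_since_100 = range(len(cumulative))  # X: days since the 100th case
    ax.plot(days_since_100, cumulative, marker="o", label=country)

ax.set_yscale("log")  # log Y axis: straight lines correspond to exponential growth
ax.set_xlabel("Days since 100th case")
ax.set_ylabel("Cumulative cases (log scale)")
ax.legend()
plt.show()
```

The chart itself is sound; the trouble starts with the circles drawn on top of it.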

To understand why it’s a bad data visualization, we must divorce ourselves from whether we agree with the conclusion being drawn from it, and objectively evaluate the premise we are presented with. It is possible to come to a correct conclusion from incorrect premises and it is possible to accept a conclusion to be true even if you don’t accept each argument for it.

The data visualization is flawed because it fundamentally relies on a flaw in implicit human reasoning. Without saying as much, the visualization is presenting the argument that “correlation between mask usage and lower growth rates implies that mask usage causes lower growth rates.” It’s possible that alarm bells just went off in your head if you noticed the parallel between this articulation and the common phrase “correlation does not imply causation.”

Still, it’s useful to dive deeper into the visualization (and the implied argument that it presents) in order to understand why it’s faulty. After all, even people who could complete the phrase “correlation does not __ ____” in a heartbeat might find themselves falling prey to the logical trap. They might find themselves thinking “I mean, it makes a lot of intuitive sense that masks slow the spread of COVID-19, right?”

It definitely makes a lot of sense and this Atlantic article does a great job of summarizing why mask-wearing is effective, but this data visualization does not prove that, and no argument from correlation can. Variable a being correlated with variable b, no matter how strong the correlation, does not mean that variable a causes variable b. It’s possible that variable b causes variable a. It’s possible that both variables a and b are caused by some third, confounding variable c.
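A quick simulation makes the confounding scenario concrete. In this sketch (purely synthetic numbers, with hypothetical variable names), a and b never influence each other; both are driven by a hidden confounder c, yet they come out strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hidden confounder c (think: some underlying country-level factor)
c = rng.normal(size=n)

# a and b each depend on c plus independent noise, but not on each other
a = 2.0 * c + rng.normal(size=n)
b = 1.5 * c + rng.normal(size=n)

# Strong correlation (~0.7) despite zero causal link between a and b
print(np.corrcoef(a, b)[0, 1])
```

The correlation is real, but concluding that a causes b from it would be exactly the mistake the annotated chart invites.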

An easy and humorous way to discredit this argument-by-visualization is by replacing the “mask” and “no mask” labels with other labels. This virologist’s critique of the visualization on Twitter spawned a small thread of these silly relabelings. The regions in the lower circle (Japan, Hong Kong, South Korea, and Singapore) are all small, highly-developed, and East Asian. They have a lot of things in common other than wearing masks; so how do we know it’s not something else that they all have in common that causes them to have had lower infection rates than the other category?

In fact, we do know that some of the things that they have in common caused their lower rate of increase in new cases. One particularly compelling theory, as explored in this Vox article, is that high state capacity is the cause of the disparate abilities of governments to react to this crisis. As the Vox article explains, there is strong evidence to support the conclusion that state capacity is a strong predictor of the rate of increase in new cases. Additionally, these territories all had practice preparing for a severe acute respiratory syndrome pandemic, as they were all hit by the original SARS coronavirus epidemic in the early 2000s; here’s some personal testimony from an Amnesty International employee about how the first SARS epidemic shaped Hong Kong’s response to COVID-19.

Now, these arguments don’t disprove the efficacy of masks, nor am I advocating that people abandon the use of cotton masks — please save N95s and surgical masks for our healthcare workers who need them the most. All they do is disprove the argument that this graph implicitly makes, namely that a correlation between two variables necessarily implies a causal relationship between them.

3. A Failure to Normalize

From Chris Holland’s blog post

Our third and final bad data visualization comes courtesy of Chris Holland, who says that he saw the original visualization on Facebook. This is an example of an error that I see frequently, one that I railed against in my blog post “The Best Time to Post on Reddit” and which I think is best explained in this xkcd comic.

From xkcd

The original, erroneous visualization actually has two components, both maps. On top, there is a map of 5G coverage, with the highest densities in the population centers of the US, such as the California coast, Portland, Seattle, and a giant mass covering the East Coast, the Gulf Coast, and part of the Midwest. Below, there is a map of coronavirus cases, which looks eerily similar, showing cases concentrated in America’s urban centers and a lack of cases in the rural parts of the country. Neither map cites a source for its data, which is another issue entirely.

Fortunately for us, Chris already pointed out the errors with the original argument but it’s still worthwhile to dive a little deeper into the logical errors that go into creating this visualization. Although he, and Randall Munroe in the xkcd comic that I linked to, do a good job of explaining the ridiculousness of this argument by extending the logic with humorous examples (noticing a trend here?), it can be helpful to verbalize the arguments being made in order to elucidate them.

Chris identifies the error in this situation as the same issue that we saw above: “correlation vs. causation.” I would characterize it slightly differently and more specifically as a failure to normalize by population. This is actually an incredibly common error, not just when we compare two maps. Many maps and datasets present “raw statistics” for figures of interest, like the number of heart attacks, number of bear attacks, number of ice cream cones sold, or, in this case, number of 5G towers and number of coronavirus cases. These raw statistics can sometimes be useful on their own; an example is comparing the spread of coronavirus infections between countries, as John Burn-Murdoch illustrates here. To get a full understanding of the data, it is important to examine both the per capita rates and the raw counts, and then decide which is the correct metric to use. In this case, the raw counts grossly misrepresent potential relationships in the data.
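Here’s a minimal pandas sketch of what that normalization looks like in practice, using a handful of hypothetical regions (made-up names and numbers): the raw counts simply track population size, while the per capita rates tell a different story.

```python
import pandas as pd

# Hypothetical regions: raw case counts largely track population size
df = pd.DataFrame({
    "region":     ["Metro A", "Metro B", "Town C", "Town D"],
    "population": [8_000_000, 4_000_000, 90_000, 60_000],
    "cases":      [16_000,    9_000,     450,     150],
})

# Normalize by population to get a rate per 100,000 residents
df["cases_per_100k"] = df["cases"] / df["population"] * 100_000

# Ranked by raw counts, the big metros dominate; ranked per capita, Town C leads
print(df.sort_values("cases", ascending=False))
print(df.sort_values("cases_per_100k", ascending=False))
```

Ranked by raw counts, the chart (or map) is mostly a picture of where people live; ranked per capita, it actually says something about the variable of interest.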

By failing to normalize by population, when there is a strong correlation between a particular variable and the population, we might find that we are actually just creating population maps rather than truly visualizing the variable in question. That’s what happened in the xkcd comic, in Chris’s examples, and in the original visualizations. And when we visualize the population instead of the variable we care about, we end up attributing to that variable conclusions that really only hold for the population.

Conclusion

In this post, I’ve identified a few bad data visualizations that I’ve seen floating around the interwebs. I chose these visualizations not because I bear malice toward their creators or disdain for their distributors, but because I felt they could be good educational tools for explaining how important data visualization is and how easily it can be abused. Data visualization is a subtle art and science, combining elements of statistics and mathematics, graphic design, and visual art. Visualizations are incredibly powerful tools for helping humans make sense of the world, acting as a vehicle for digesting and integrating data that might otherwise be too overwhelming to understand. It’s important that we continue to use these tools to help people understand the world better, not to further obfuscate it.

If you enjoyed this writing and want to read more of my work, check out my Medium or my personal website. If you’re interested in connecting, feel free to reach out via my Twitter or LinkedIn. If you have any questions, comments, concerns, or other bad data visualizations that you want to call out, please feel free to do so in the comments or by emailing me at dannyleybzon@gmail.com.
