Lies, Damned Lies, and Statistics

“There are three kinds of lies: lies, damned lies, and statistics.” This quote, of indeterminate origin and often incorrectly attributed to Mark Twain, who made it famous, came to mind when I received questions from friends and associates about an opinion piece recently published in the New York Times. The questions raised by “Just Because You Test Positive for Antibodies Doesn’t Mean You Have Them” included:

  • Does this mean that 70% of the people tested will test positive for COVID-19 antibodies?
  • Do you believe these numbers are correct?

My response was no and no.

Reporting only a parameter estimate to a general audience can generate confusion, as the questions above confirm. The article's premise is correct: the base rate in a population affects the outcome of a testing effort. Applying a statistical tool (Bayes' Theorem) together with the base rate of COVID-19 antibodies in the population shows how large that effect is. One possible test result is a false positive. The article's figure means that about 70% of the people who test positive for COVID-19 antibodies will not have them, NOT that 70% of the people tested will have COVID-19 antibodies.

With the information provided, I cannot reproduce the numbers indicated in the article, specifically –

“An antibody test with 90 percent accuracy could be as low as 32 percent if the base rate of infection in the population is 5 percent. Put another way, there is an almost 70 percent probability in that case that the test will falsely indicate a person has antibodies.”

Taking the high ground, I will assume Hanlon’s razor, or that adjustments to the numbers were made for journalistic effect. Below I share my calculations, applying Bayes’ Theorem. I do this only to show my approach; if this makes your head hurt, jump down past the calculation.

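Population                                                          1.000
Base rate (has COVID-19 antibodies)                                 0.050
Population does not have COVID-19 antibodies                        0.950
Test accuracy                                                       0.900
1 - Test accuracy (false positive rate)                             0.100
P(Test positive | Has antibodies) x P(Has antibodies) = 1 x .05     0.050
P(Test positive) = (1 x .05) + (.95 x .10)                          0.145

Note that the calculation treats the test as positive for everyone who has antibodies (the factor of 1), so the 10% error applies only to people without antibodies.

$$P(\text{Has antibodies} \mid \text{Test positive}) = \frac{P(\text{Test positive} \mid \text{Has antibodies}) \times P(\text{Has antibodies})}{P(\text{Test positive})} = \frac{1 \times 0.05}{0.145} \approx 0.345$$

$$P(\text{No antibodies} \mid \text{Test positive}) = 1 - 0.345 = 0.655 \approx 66\% \text{ (rounded)}$$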

This means that, of the people who test positive for COVID-19 antibodies, about 34% (rounded) will have them and about 66% (rounded) will not. Few people, except practitioners of data analytics, will go through the effort of completing this analysis and understanding what the statistic (the false positive share) and the formula mean.
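For readers who prefer code to formulas, the calculation above can be sketched in a few lines of Python. This is only an illustration: the function name and parameter names are my own, and it assumes, as the calculation does, that the test is positive for everyone who has antibodies and wrong for 10% of those who do not.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """P(has antibodies | tested positive), via Bayes' Theorem."""
    # Share of the population that has antibodies AND tests positive.
    true_positive_share = sensitivity * base_rate
    # Share of the population that lacks antibodies but tests positive anyway.
    false_positive_share = false_positive_rate * (1.0 - base_rate)
    # Everyone who tests positive is in one of those two groups.
    return true_positive_share / (true_positive_share + false_positive_share)


# Assumed inputs: 5% base rate, sensitivity of 1, 10% false positive rate.
ppv = positive_predictive_value(base_rate=0.05, sensitivity=1.0,
                                false_positive_rate=0.10)

print(f"Has antibodies given a positive test: {ppv:.3f}")      # 0.345
print(f"No antibodies given a positive test:  {1 - ppv:.3f}")  # 0.655
```

Changing `base_rate` is a quick way to see how strongly the result depends on how common antibodies are in the population being tested.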

Rather than just communicating 34% or 66%, a thought experiment built around a concrete example is more effective in conveying the testing outcome of a 90% accurate test when 5% of the population has COVID-19 antibodies. What would happen if 1,000 people were tested in this situation? Applying the same approach, we get the following results:

                                        Have COVID-19 Antibodies    Do Not Have COVID-19 Antibodies    Share of Positives Without Antibodies
Test Positive for COVID-19 Antibodies   50                          95                                 95/145 ~ 66%
Test Negative for COVID-19 Antibodies   0                           855

This example allows a clearer explanation:

  • 855 tested negative for COVID-19 antibodies
  • 145 tested positive for COVID-19 antibodies
  • 95 of the 145 who tested positive for COVID-19 antibodies DO NOT HAVE THEM
  • From the test alone, it is not possible to know which of the 145 who tested positive are the 95 who do not have COVID-19 antibodies.

What is the advantage? Simply reporting a statistical parameter runs the risk of generating responses ranging from confusion to panic. Except for data analytics practitioners, formulas and parameters are poor guides to what action should be taken. What actions could be taken based on the table above?

  • The 855 who tested negative should be notified that they do not have COVID-19 antibodies, should continue safe practices, and are candidates for a vaccine when it becomes available.
  • The 145 who tested positive should be notified that they tested positive for COVID-19 antibodies but should be retested to confirm the finding.

The key lesson for a data scientist is this: it is not about a parameter that many will not understand or a calculation that they could not care less about; it is about delivering a business outcome that management can relate to and act on.