Dispelling modeling misconceptions in the Coronavirus era

Data Science
Published September 14, 2020

A recent saga between Rhode Island and Massachusetts over COVID-19 case levels and travel bans has confused and inconvenienced New England travelers. But the regional drama is just one part of a larger problem. With no agency at the federal level collecting and sharing state-level data in a uniform way, states and the general public are left to judge the accuracy of data and models themselves.
Those who work with big data on a daily basis are constantly working to read the subtext of existing models or create new, more accurate ones. As the pandemic wears on, however, it’s increasingly clear that modeling misconceptions now affect the everyday lives of both big data practitioners and consumers at large.
From the vantage point of a modeling expert, there are some valuable concepts that both of these groups should keep in mind when working to understand models.

Are All Models Wrong?

At a basic level, we often hear the adage “all models are wrong, but some are useful.” Pundits and scientists alike are echoing it in their efforts to inform the public about the nuances of modeling. But is it an accurate statement? Are models intrinsically wrong? Or are we misunderstanding the goals of modeling?
In my experience, most models aren’t wrong. Rather, they are always improving. At any given point in time, they are only as “right” as the data and assumptions that we feed into them. As statisticians and data scientists, it is our job to check our assumptions and our data sources and update models constantly as new data comes in.
This process of tuning, validating, monitoring and retraining models is continuous.

How Should We Read Models?

From a consumer perspective, when evaluating the utility of a good model, it is important to use great care. Especially in a situation with a novel disease, when so much information is unknown, we have to explain the usefulness and the limitations of our models. And we must be prepared to update the model often.
To better explain how modeling works, let me outline some important concepts.
A model is a simplified representation of reality or real-world phenomena. It’s a mathematical equation that helps us understand or predict real-life events using a sample of data available at hand. These models represent different aspects of the novel coronavirus and the spread of disease, including:

  • How many people are projected to be susceptible?
  • How many people are projected to be infected?
  • How many people might die?
  • How many people are projected to recover?
  • How many hospital beds will be needed, including acute, ICU and ventilator beds?
  • What are resource use estimates?
  • What is the impact of R0?
  • What are the impacts of various interventions, such as social distancing?

One common epidemiological model that takes many of these questions into consideration is called the SEIR model, which stands for susceptible, exposed, infectious and recovered.
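For readers who want to see the equations, the textbook form of the SEIR model can be written as below. This is the standard formulation, not necessarily the exact variant any particular COVID-19 model uses. The symbols are assumptions introduced here for illustration: N is the total population, β is the transmission rate, σ is the rate at which exposed people become infectious, and γ is the recovery rate. In this basic form, R0 works out to β divided by γ.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N} \\
\frac{dE}{dt} &= \beta \frac{S I}{N} - \sigma E \\
\frac{dI}{dt} &= \sigma E - \gamma I \\
\frac{dR}{dt} &= \gamma I \\
R_0 &= \frac{\beta}{\gamma}
\end{aligned}
```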
Complex models are dynamic. Models are made up of multiple, different equations. And different variables in the model change in response to each other. For example, if we change our behaviors by social distancing, this will adjust how many people are infected by each person with the disease. This, in turn, will affect the results of the models that estimate how many ventilators or hospital beds we need.
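The chain of effects described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the textbook SEIR equations and a simple fixed-step solver. Every number in it (the transmission rates, the incubation and recovery periods, the population and the hospitalization fraction) is a hypothetical value chosen for illustration, not an estimate for COVID-19.

```python
def simulate_seir(beta, sigma=1 / 5, gamma=1 / 7, population=1_000_000,
                  initial_infectious=10, days=365, steps_per_day=10):
    """Integrate the textbook SEIR equations with a fixed-step Euler scheme.

    Returns the number of infectious people at the end of each day,
    starting with day 0. All parameter defaults are illustrative.
    """
    dt = 1.0 / steps_per_day
    s = float(population - initial_infectious)
    e = 0.0
    i = float(initial_infectious)
    r = 0.0
    infectious_by_day = [i]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        infectious_by_day.append(i)
    return infectious_by_day


def peak_beds(infectious_by_day, hospitalization_fraction=0.05):
    """Rough peak bed demand: a fixed (hypothetical) share of the infectious."""
    return max(infectious_by_day) * hospitalization_fraction


# Social distancing is modeled here as a lower transmission rate (beta).
baseline = simulate_seir(beta=0.4)
distanced = simulate_seir(beta=0.2)

print(f"Peak beds without distancing: {peak_beds(baseline):,.0f}")
print(f"Peak beds with distancing:    {peak_beds(distanced):,.0f}")
```

Changing a single behavioral input, the transmission rate, flows through to the infection curve and from there to the estimate of beds needed, which is the sense in which the variables respond to each other.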
Models are “fit to data” to test hypotheses, provide insights, assess the magnitude of effects and generate predictions about the near future. Fitting the model is done through a trial-and-error process of changing parameters from one extreme to another to understand how those changes affect the result. In the case of disease modeling, we can adjust how many people we think are already infected, how many new people are infected, how many people have built up immunity, and so on.
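One simple way to picture this trial-and-error process is a parameter sweep: try a range of values, run the model for each, and keep the value whose output sits closest to the observed data. The sketch below assumes a basic SEIR simulator and a sum-of-squared-errors score. The "observed" series is synthetic, generated by the same simulator with a known transmission rate, so it is not real case data. The function names and parameter values are hypothetical.

```python
def simulate_infectious(beta, sigma=0.2, gamma=1 / 7, population=100_000,
                        initial_infectious=10, days=60):
    """Daily-step SEIR simulation; returns the infectious count per day."""
    s = float(population - initial_infectious)
    e = 0.0
    i = float(initial_infectious)
    series = []
    for _ in range(days):
        new_exposed = beta * s * i / population
        new_infectious = sigma * e
        new_recovered = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        series.append(i)
    return series


def sum_squared_error(predicted, observed):
    """Score a candidate: smaller means the model is closer to the data."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def fit_beta(observed, candidates):
    """Sweep the candidate transmission rates and keep the best-scoring one."""
    return min(
        candidates,
        key=lambda beta: sum_squared_error(
            simulate_infectious(beta, days=len(observed)), observed
        ),
    )


# Synthetic "observations" produced with a known rate (an assumption for the demo).
true_beta = 0.30
observed = simulate_infectious(true_beta)

# Sweep from one extreme to the other: 0.05, 0.10, ..., 1.00.
candidates = [round(0.05 * k, 2) for k in range(1, 21)]
best_beta = fit_beta(observed, candidates)
print(f"Best-fitting transmission rate: {best_beta}")
```

Real data is noisy and incomplete, so in practice the sweep would cover several parameters at once and would typically be handed to a numerical optimizer rather than a fixed grid.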
Models are only as good as the assumptions they begin with and the data they are fit to. Data quality matters. We assume sample data to be representative of the population we draw it from. We should use real data where possible and align our expectations with the intended purpose. For example, it is problematic to take a model originally designed for a health provider system’s specific market and then run that model across the entire state by expanding the population of the market to the whole state.
A better practice is to take a regional-level model, adjust for things like age and population density for that area, and then aggregate those regional numbers in a “super region.” This ensures that we are not introducing unnecessary bias based on regional differences into the model. New York, for example, has rolled up its counties to 11 regions. Each region is considered an independent population in which models are run to provide better projections.
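The roll-up approach can be sketched as follows: run the model separately for each region with its own inputs, then add the regional curves together. This is a minimal illustration assuming a basic SEIR simulator. The three regions, their populations and their transmission rates are hypothetical stand-ins (they are not New York's regions), and using a higher rate for the denser region is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    population: int
    beta: float  # transmission rate; assumed higher where density is higher
    initial_infectious: int


def simulate_region(region, sigma=0.2, gamma=1 / 7, days=120):
    """Daily-step SEIR simulation for one region treated as independent."""
    s = float(region.population - region.initial_infectious)
    e = 0.0
    i = float(region.initial_infectious)
    series = []
    for _ in range(days):
        new_exposed = region.beta * s * i / region.population
        new_infectious = sigma * e
        new_recovered = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        series.append(i)
    return series


def aggregate(regions, days=120):
    """Run each region on its own, then sum the daily counts."""
    per_region = {r.name: simulate_region(r, days=days) for r in regions}
    total = [sum(day_values) for day_values in zip(*per_region.values())]
    return per_region, total


# Hypothetical regions with illustrative inputs.
regions = [
    Region("Urban", population=2_000_000, beta=0.35, initial_infectious=50),
    Region("Suburban", population=800_000, beta=0.25, initial_infectious=10),
    Region("Rural", population=200_000, beta=0.18, initial_infectious=2),
]

per_region, super_region = aggregate(regions)
print(f"Peak infectious across the super region: {max(super_region):,.0f}")
```

Because each region keeps its own population and transmission rate, the combined curve reflects regional differences instead of averaging them away, which is what stretching a single local model over a whole state would do.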
Models need constant refinement and data feeds and, like weather models, are most accurate for the coming weeks and less so for the coming months. In other words, we need a surveillance capability to keep monitoring the situation with a range of analytical techniques. Currently, many models offer estimates for daily cases during this wave of the pandemic. Past disease epidemics have taught us that we can expect a second wave, possibly in the fall or winter. We will use everything we learn from this wave, including the rate of spread and the number of asymptomatic cases, to refine models that are built to estimate a second wave. And then there is the biggest unknown: how the virus itself will behave in the summertime when exposed to dynamic social interaction networks and mutational effects. As we learn more about seasonality and mutations, we will be able to refine models with that information too.
Ultimately, models can help us test ideas about the real world using statistics. They aren’t designed to give us an absolute answer. But they can project possible outcomes, depending on the decisions we make.
In other words, models don’t tell the whole story, and they don’t predict the future. But they can help assess whether relationships observed in the data are likely to occur in the real world. They allow us to ask “what if?” and to forecast the near future based on what is happening now. By understanding how to read models ourselves, we can all be better equipped to dispel misconceptions when we encounter them.