Let’s start with dashboards, which are common in people
analytics. Here is a COVID-19 dashboard from Johns Hopkins:
https://coronavirus.jhu.edu/map.html
It’s impressive in the amount of data it brings together and in the range of views it lets the user explore. You can easily grasp the major metrics, numerically as well as graphically, which is the purpose of a dashboard, whether it pertains to COVID-19 or to HR metrics such as employee headcounts. But as with all dashboards, there are at least three major questions.
- Are these the right metrics for what you are trying to understand? It’s easy, for example, to find Twitter threads debating whether total deaths or deaths adjusted for country population is the better measure. But as with many debates over metrics, it is more productive to treat the various measures as complements that capture different aspects than to argue over which one is better (e.g., total cases reflects the pace at which an outbreak is growing; per capita cases indicates the strain on a health care system).
- Are the data accurate or comparable, especially when collected from diverse sources? Do individuals within an organization have a self-interest in reporting data in certain ways? Or are there different capabilities that produce different measures? As Ryan Lamare reminds me, dashboards and data visualizations work best when there is a common baseline. Otherwise, users need to think carefully about what they're actually seeing and how they're interpreting it. In the COVID-19 case, for example, how should we interpret national comparisons of total tests when testing capacity differs? A similar example in HR might be a comparison of training numbers across units with different training capacities.
- Beyond seeing the scope of a current situation, what actions can you take from metrics that are largely descriptive? This dashboard, for example, shows which areas have the most cases of COVID-19, but how do we act upon that information? An HR dashboard might reveal areas of an organization with low employee engagement, but it is unlikely to help reveal why.
Next, here is a visualization from John Burn-Murdoch of the Financial Times that
has also frequently been spotlighted:
https://www.ft.com/coronavirus-latest
This is a great visualization for seeing trends within countries, and across them, too, if you carefully remember what's being compared. In a people analytics context, this could be seen almost as a
scorecard to see how your organization stacks up against others, or how areas
within your organization compare to each other. But there are at least three
things to be cautious about. First, there are the same concerns as with a dashboard: are these the right measures and the right comparisons, is there a common baseline, and so on (in fact, the data for this visualization come from the Johns Hopkins dashboard, so the same concerns apply). Second, the nature of the visualization tempts you to forecast into the future. But what’s the basis for that forecast?
For example, let’s go back to the March 15 version of the
same visualization:
https://twitter.com/jburnmurdoch/status/1239276487062233089?s=20
It is tempting to mentally extend those trend lines forward, but nothing in the chart itself tells you whether, or when, the lines will bend.
A third caution for people analytics that we can take away from this visualization is a reminder that this metrics-focused approach doesn’t inquire into what factors influence the trends portrayed. Note that it doesn’t claim to, so this isn’t a criticism per se. Rather, it’s a reminder
that if you want to act upon information by, for example, implementing new HR
initiatives, you should always be asking what’s influencing the metrics. What
levers can you nudge that will change the metrics in the desired ways? Even if
you can’t estimate an actual regression, it can be helpful to approach problems
with that mindset—what variables would you like to include in a regression to
explain the metric? In the absence of a regression, is there other evidence to
support the importance of these factors? What’s missing from your (mental)
model?
Thinking about factors that influence a trend or a metric represents a shift from a metrics approach to more of a predictive
analytics approach. In the COVID-19 pandemic, this is reflected in the importance of statistical models for policy-making—for example, using predictions from models for implementing
stay-at-home orders. Let's consider two broad approaches.
One approach to modeling the spread of COVID-19 essentially tries to figure out the shape of the curves in the above visualizations by fitting statistical parameters to the curves that are the most complete (e.g., China, Italy). If you then assume that the lagging countries (or other geographical units) are on an earlier part of that same curve, then you can predict where those countries are headed. This is the approach of the Institute for Health Metrics and Evaluation (IHME):
Importantly, note the shaded area in the IHME charts, which reflects a 95% confidence interval. And note that it’s quite large for the immediate future. This is a
good reminder for people analytics that estimates are just estimates. There is
always uncertainty, and it’s important to understand the magnitude of that
uncertainty before making decisions.
But note that this curve-fitting approach is akin to a data
mining exercise. There is no epidemiological model that underlies these forecasts. In HR, this would be like observing the retirement ages of previous
workers, and predicting a particular worker’s retirement probability based
solely on their age. There’s no accounting for that person’s individual characteristics or for changes in their particular environment.
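To make the curve-fitting idea concrete, here is a minimal sketch in Python. To be clear, this is not IHME's actual model, which is considerably more sophisticated; all of the numbers below are invented for illustration. The point is the logic of fitting a nearly complete curve and placing a lagging series on it:

```python
import numpy as np
from scipy.optimize import curve_fit

np.random.seed(0)

def logistic(t, K, r, t0):
    """Logistic curve: K = final size, r = growth rate, t0 = inflection day."""
    return K / (1 + np.exp(-r * (t - t0)))

# Hypothetical cumulative case counts for a location whose outbreak is nearly complete
days = np.arange(60)
complete_series = logistic(days, 80_000, 0.22, 30) + np.random.normal(0, 500, 60)

# Fit the curve's shape parameters to the (nearly) complete series
(K, r, t0), _ = curve_fit(logistic, days, complete_series, p0=[50_000, 0.1, 25])

# Assume a lagging location sits on an earlier point of the SAME curve:
# find where its current count falls on the fitted curve, then project forward
current_cases = 5_000  # hypothetical current count for the lagging location
t_now = t0 - np.log(K / current_cases - 1) / r  # invert the logistic
print(f"Projected cases 30 days out: {logistic(t_now + 30, K, r, t0):,.0f}")
```

Notice how much the projection leans on the assumption that the lagging location will trace the same fitted curve; that assumption is exactly what the data-mining critique above is about.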
As an alternative modeling strategy, a long-standing epidemiological approach is the susceptible (S), exposed (E), infected (I), resistant (R) model (SEIR, for short); a related alternative is the SIR model, with three classes: susceptible, infected, and recovered individuals. A SEIR model starts with the number of susceptible, exposed,
infected, and resistant individuals, and then sets up a formulaic relationship
across the categories based on estimates of incubation periods, frequency of
contact across individuals, the probability of being infected after exposure,
and the like. The spread of COVID-19, hospitalization usage, and other outcomes
can then be simulated by projecting out what happens as exposure and infection increase. And by changing key parameters, you can also forecast alternative
scenarios, such as the impact of various social distancing measures. This type
of model is being used to guide public policy in Minnesota.
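For readers who want to see the mechanics, here is a minimal SEIR sketch. The Minnesota model is far richer, with age structure, hospitalization, and detailed intervention scenarios, so treat this as a toy illustration whose parameter values are invented, not calibrated to COVID-19:

```python
def seir(S0, E0, I0, R0, beta, sigma, gamma, days):
    """Minimal discrete-time SEIR simulation.
    beta  = daily contacts x probability of infection per contact
    sigma = 1 / average incubation period (E -> I)
    gamma = 1 / average infectious period (I -> R)
    """
    N = S0 + E0 + I0 + R0
    S, E, I, R = [S0], [E0], [I0], [R0]
    for _ in range(days):
        new_exposed   = beta * S[-1] * I[-1] / N  # susceptibles who become exposed
        new_infected  = sigma * E[-1]             # exposed who become infectious
        new_recovered = gamma * I[-1]             # infectious who become resistant
        S.append(S[-1] - new_exposed)
        E.append(E[-1] + new_exposed - new_infected)
        I.append(I[-1] + new_infected - new_recovered)
        R.append(R[-1] + new_recovered)
    return S, E, I, R

# Compare a baseline scenario to a social-distancing scenario (beta halved).
for label, beta in [("baseline", 0.5), ("distancing", 0.25)]:
    S, E, I, R = seir(S0=5_000_000, E0=100, I0=10, R0=0,
                      beta=beta, sigma=1/5, gamma=1/10, days=180)
    peak = max(I)
    print(f"{label}: peak infectious = {peak:,.0f} on day {I.index(peak)}")
```

Changing beta across the two runs is the simulation analogue of asking what social distancing buys you: same formulaic relationships, different scenario.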
An analogous people analytics example would be a workforce
planning model where you start with the current number of employees and make
assumptions about retention rates, mobility, hiring rates, and future needs.
This produces forecasts, and by varying the assumptions, you can model alternative scenarios, forecast shortfalls, and infer needed responses.
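A stripped-down sketch of such a workforce flow model might look like this. The levels and rates are invented for illustration; a real model would estimate them from historical HR data and include many more employee states:

```python
# Toy workforce planning flow model: project headcount by level under
# assumed retention, promotion, and hiring rates (all numbers hypothetical).
levels = ["junior", "senior", "manager"]
headcount = {"junior": 400.0, "senior": 250.0, "manager": 80.0}
retention = {"junior": 0.85, "senior": 0.90, "manager": 0.93}  # stay at same level
promotion = {"junior": 0.05, "senior": 0.03, "manager": 0.00}  # move up one level
hires     = {"junior": 60,   "senior": 10,   "manager": 2}     # external hires/year

for year in range(1, 6):
    projected = {}
    for i, level in enumerate(levels):
        stay = headcount[level] * retention[level]
        promoted_in = (headcount[levels[i - 1]] * promotion[levels[i - 1]]
                       if i > 0 else 0.0)
        projected[level] = stay + promoted_in + hires[level]
    headcount = projected
    print(f"year {year}:", {k: round(v) for k, v in headcount.items()})
```

Re-running the projection with different retention or hiring assumptions is the workforce analogue of changing beta in the SEIR sketch: same simulation logic, different scenario.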
Note that there are expert judgement and past empirical trends built into this model; it’s not just curve fitting. And a realistic recognition of the range of uncertainty around the underlying assumptions yields confidence intervals that help inform how strongly you should interpret the results. These confidence intervals, or estimates of uncertainty, are shown (in red) in the Minnesota COVID-19 modeling, which also presents different scenarios (rows) and their estimated impact on different metrics (columns).
Unfortunately, the IHME's curve-fitting model and Minnesota's SEIR model give very different predictions of where we're headed. Both approaches contain significant unknowns. For the curve-fitting approach: how well (or poorly) do states or countries fit the earlier experiences of China (which had much stricter social distancing) and Italy, given the many variables that presumably affect how an outbreak spreads? For the SEIR approach: how accurate are the key parameters, given that COVID-19 is a new virus? This highlights the importance of understanding the nature and limitations of any kind of statistical model, and of paying attention to the sensitivity of the results.
The starkly different projections of these particular models are also a reminder that actions based on statistical models will only be as good as the explanatory power or fit of those models. Ideally, imprecision in the degree of fit will translate into margins of error and confidence intervals, but if a model is being applied to a new situation, then purely statistical margins of error may understate the true uncertainty. The onus is always on the decision-maker to use their subject-matter expertise when interpreting and applying statistical results. But what to do when you have to make a decision? Explicitly recognize your decision rules, and include the costs of making different types of inferential errors in the decision calculus.
Putting all of this together, then, a good people analytics person is always skeptical, or at least probing: Where did the data and assumptions come from? How do we know they are accurate? How sensitive are the results to particular assumptions? How much uncertainty is there? What are the decision-making criteria? What’s missing? And notice that this is as much about subject-matter expertise, whether that's infectious diseases or human resources, as it is about statistical sophistication. It's not just data mining.
It might also be useful to note that neither of these
modeling strategies (curve-fitting or simulation based on parameterized flow
models) match the dominant predictive approach in HR, especially in HR research
(I don’t say this as a critique, just as another point of comparison). From a
social sciences perspective, it’s much more common to predict outcomes in a
regression framework where an outcome variable is modeled as a statistical
function of a set of explanatory variables. For example, if employees’ level of
engagement with their supervisor (inversely) predicts an intention to quit,
then if an organization can increase engagement, we’d expect that quit
probabilities would decrease, albeit imperfectly and with variation. This is a
reminder that, analytically, some issues are best modeled as societal phenomena, some at an organizational level, and some at an individual level. Each involves unique measures and its own analytical challenges. A good people analytics person matches the methods and data to the problem, while still being probing in the sense defined above.
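For contrast with the two modeling strategies above, here is a minimal sketch of that regression framework on simulated data. The variables, effect sizes, and the engagement-quit relationship are all invented purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical employee data: engagement score (1-5), tenure in years,
# and whether the employee quit. The "true" relationship is made up.
engagement = rng.uniform(1, 5, n)
tenure = rng.uniform(0, 20, n)
log_odds = 1.0 - 0.8 * engagement - 0.05 * tenure  # higher engagement -> fewer quits
quit = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Logistic regression: quitting modeled as a function of explanatory variables
X = sm.add_constant(np.column_stack([engagement, tenure]))
result = sm.Logit(quit, X).fit(disp=0)
print(result.summary(xname=["const", "engagement", "tenure"]))
```

The estimated coefficient on engagement is the "lever" discussed earlier: it suggests how much quit probabilities might move if engagement changes, though only if the relationship is causal rather than merely correlational.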
Lastly, COVID-19 dashboards and modeling raise challenging
ethical questions. What data are being collected and how are they being used?
Are metrics and results being presented in sensationalized or inaccurate ways?
What’s the role of modeling in determining public policy decisions? There are
no easy answers to these and other ethical challenges, but they are a good
reminder that people analytics also involves important ethical challenges. How
is employee data being used? What kind of consent should be required? How
transparent is the decision-making? Are implicit biases embedded in modeling decisions, furthering rather than redressing historical inequalities? Throughout
the people analytics process, it’s essential to remember that most data, and
certainly most decisions, pertain to real people, not data points in a database
or costs on an income statement. The science of people analytics is important,
but so is the humanity. As for presenting data in skewed ways, this has long been recognized as a danger with statistics, and perhaps the best defense is to be a wise consumer of statistics who doesn't naively take everything at face value (see "probing" above).
In closing, it’s nice to see data visualization, dashboards,
and statistical modeling getting such public attention, but it’s obviously
unfortunate that this is because of a global pandemic that has harmed so many
people and communities. While not losing sight of what’s most important, there
are also lessons here for people analytics.