The REAL Correct Way to Handle Missing Data
An important part of my job as the principal data scientist of a consultancy is interviewing and hiring data science talent to meet the needs of our ever-growing client base. Our data science team is structured in a way that would certainly seem familiar to those in the medical profession. When you’re sick or injured, you usually go to the hospital. The hospital has many patients at any given time, and so the practitioner who sees you is almost certainly not going to be the attending physician or the chief of medicine. Rather, you’re more likely to be seen first by a resident or intern, or these days a physician’s assistant or a nurse practitioner.
These professionals are the first line of defense and must be trained to handle most of the basic and mundane cases. The attending oversees the patient care and is only called in to see a specific patient if the intern or resident is unsure of how to proceed. Similarly, our consulting practice has a lot of clients at any given time, and I oversee all our projects. However, because I can’t be everywhere at once, I’m usually not the consultant deployed to the client site. I’ll send one of our consultants, because I know he or she is capable of handling at least most tasks the client needs addressed. My hope is that consultants will be able to handle projects on their own and will only need to call me in for the most challenging problems.
Because this is our business model, I need to make sure we only bring in top-quality talent. I am thorough in my screening process. If consultants aren’t skilled and knowledgeable, they end up coming to me for help about almost everything under the sun. Then those consultants’ projects take up all my time. I end up working extra hours just to keep up with the project load and to make sure our deliverables are top quality and on time.
In my screening efforts, I’ve begun to notice a trend in the candidates I interview. This trend is in the answer I’m getting to one specific question I ask: “How do you handle missing data?” Every time I ask this question lately, I invariably get this answer back: “You either remove the observations with missing information, or you fill in the missing values with some measure of central tendency, such as the mean for numerical variables or the mode for categorical variables.” This answer is given so consistently that I feel the need to address it.
You might be asking yourself, “What’s wrong with this answer? I’ve done that before.” Me too. However, it’s usually not what I’ve done on projects, and when I have, it was a last resort. This answer is what I would call an Ivory Tower answer, because it comes from a place of pure theory. It demonstrates absolutely no practical knowledge of being a data scientist in industry or of dealing with a client on a project.
So, what’s the REAL answer? Simple. You talk to your client. Now, this answer is so short, and so simple, that for some of you this answer may even be downright disappointing. Then please allow me to elaborate by sharing an experience I once had.
Several years ago, I was part of a team that built a model for a client whose business was built largely around giving quotes on large orders. The model provided recommendations to quote specialists on how to discount the items in the quote based on what was in the order, the quantities being ordered, and previous purchasing behavior exhibited by the customer, among other things. The idea was to build a model that could strike the ideal balance where the items were discounted just enough that the customer would purchase the order, but not any more than necessary, so that profits were still maximized.
In addition to all the Bayesian statistics and machine learning algorithms you might imagine went into the design of this model, it was also important to the client that we incorporate certain business rules. For example, they wanted to maintain a minimum profit margin of 25 percent on any item sold. The client also wanted to honor the published list price. That is, the client never wanted to bid above the list price, even if the data indicated that a customer might pay a higher price. These constraints on the model required us to know both the cost and list price of every item in the client’s catalogue.
It certainly seemed like a given that the client would have this data. However, once executives sent us the data set for all items in their catalogue, we found that we were missing either the cost or the list price for a significant portion of the items. To give some context, this client sold items priced from as little as 10 cents to as much as $4,000. With a range that wide, setting all the missing costs or list prices to the mean was very unlikely to give sensible values. So, what did we do?
We had a five-minute phone call with the client company’s head of sales. We explained the situation, and he told us that they almost always structured their prices so that cost was around 40 percent of the list price. So, anywhere we had list price, but no cost, we could impute cost at 40 percent of the list price. For any item that had a cost, but no list price, we could impute the list price so that cost was 40 percent of the imputed list price. It was that easy!
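To make the rule concrete, here is a minimal Python sketch of it. This is an illustration and not the code we used on the project: the field names cost and list_price, the constant COST_TO_LIST_RATIO, the function name, and the sample rows are all hypothetical.

```python
# Ratio quoted by the head of sales: cost is around 40 percent of list price.
COST_TO_LIST_RATIO = 0.40


def impute_cost_and_list_price(item: dict) -> dict:
    """Return a copy of item with a missing cost or list price filled in.

    Items missing both values are returned unchanged, because the rule
    needs one of the two values to work from.
    """
    filled = dict(item)
    cost = filled.get("cost")
    list_price = filled.get("list_price")

    if cost is None and list_price is not None:
        # List price known, cost missing: cost = 40% of list price.
        filled["cost"] = round(list_price * COST_TO_LIST_RATIO, 2)
    elif list_price is None and cost is not None:
        # Cost known, list price missing: choose list price so cost is 40% of it.
        filled["list_price"] = round(cost / COST_TO_LIST_RATIO, 2)
    return filled


# Hypothetical catalogue rows, for illustration only.
catalogue = [
    {"sku": "A1", "cost": None, "list_price": 0.10},
    {"sku": "B2", "cost": 1600.00, "list_price": None},
    {"sku": "C3", "cost": None, "list_price": None},
]
imputed = [impute_cost_and_list_price(item) for item in catalogue]
```

Note that an item missing both values stays missing. The rule cannot help there, so those items would still need a follow-up question to the client.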
There was no measure of central tendency or mathematical theorem that would’ve given us the correct answer in this situation. We only knew what to do by talking to our client, so we could understand the business logic around the variables in question. Some readers might be thinking perhaps we could’ve built a simple model to infer the relationship between cost and list price. While this is probably true, this is still very much an Ivory Tower answer. Using a model to infer the relationship still doesn’t beat getting a direct insight into your client’s business logic, and, more importantly, getting the client’s blessing on an agreed-upon path forward.
This brings us to the topic of “stakeholder buy-in.” To have a successful data science project, you need the stakeholders to approve of what you’re doing, how you’re doing it, and with which data. I’ve seen entire avenues of analysis and modeling within a project be slowed or halted completely, because, while the mathematics and theory behind the analysis were sound, the stakeholders just couldn’t or wouldn’t get on board. Sometimes, this can be resolved by simply finding a clearer and more concise way to explain the analysis to the stakeholders. Sometimes, however, this isn’t possible, because the reason the stakeholders are objecting has to do with their business logic, values or goals.
The importance of this simply can’t be overstated. We should always be considering the business logic, values and goals of our clients, ensuring that our clients’ objectives are our objectives. Brute force methods, such as dropping observations or assuming all the missing values can be replaced with the mean, should only be used as a last resort in cases where there is no clear guidance from the client. Even in these cases, it’s only something you should do with the approval of your client. You definitely don’t want to find out that company leaders find this method a little too rough or haphazard after you’ve already moved forward with it. Rather, it’s always best to make sure everyone is comfortable with how the project is moving forward at every stage.
This may seem inefficient and slow, and deadlines are a very real part of any consulting engagement. It can be hard to feel good about spending extra time on analyses that compare the results of different measures of central tendency, determine the best one to use, and prepare a clear explanation for your client of the pros and cons of each choice. However, your project’s time constraints are exactly why you need to approach the issue this way. It ultimately saves time, because you don’t waste it pursuing something your stakeholders are going to second-guess and probably never agree to.
A final thought on how to proceed if your client is unable to provide any useful guidance about how to handle missing data: See if you can make solving the problem as small and simple as possible for them. Let’s say that our conversation with the head of sales hadn’t gone so swimmingly. Say they didn’t have a clear, consistent strategy around the relationship between cost and list price. What then? As part of the process of brainstorming a solution, I would focus on methods to shrink what we need to ask of them in order to fix the problem.
In this particular case, we got lucky. The items were organized into departments, and also divided into item categories within each department. Rather than simply taking the mean cost or list price of the entire data set, I would’ve tried grouping the items by department and item category before taking the measure of central tendency. This way, I could know, and tell my client, that the measure used only involved similar items, and I wouldn’t be asking the stakeholders to accept something so rough. Also, if there were certain cases where they weren’t comfortable with the result, there would be far fewer data points for which I would need a cost or list price. My ask of the client is likely to be much smaller and simpler with this approach than with a single mean across all the items.
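The fallback described above might look something like the following Python sketch. Several things here are my assumptions for illustration: the field names department and category, the function name impute_by_category, the choice of the median as the measure of central tendency, and the min_group_size threshold below which a category is left unfilled. The sample rows are made up.

```python
from collections import defaultdict
from statistics import median


def impute_by_category(items, field, min_group_size=3):
    """Fill missing `field` values with the median of the same item category.

    Returns (filled_items, unresolved_items). An item whose category has
    fewer than `min_group_size` known values is left missing and also
    returned in unresolved_items, so it can be raised with the client.
    """
    # Collect the known values for each (department, category) group.
    known = defaultdict(list)
    for item in items:
        if item[field] is not None:
            known[(item["department"], item["category"])].append(item[field])

    filled, unresolved = [], []
    for item in items:
        row = dict(item)  # copy, so the input rows are not modified
        if row[field] is None:
            values = known.get((row["department"], row["category"]), [])
            if len(values) >= min_group_size:
                row[field] = median(values)
            else:
                unresolved.append(row)
        filled.append(row)
    return filled, unresolved


# Hypothetical rows, for illustration only.
items = [
    {"sku": "P1", "department": "Office", "category": "Pens", "cost": 0.10},
    {"sku": "P2", "department": "Office", "category": "Pens", "cost": 0.12},
    {"sku": "P3", "department": "Office", "category": "Pens", "cost": 0.20},
    {"sku": "P4", "department": "Office", "category": "Pens", "cost": None},
    {"sku": "S1", "department": "IT", "category": "Servers", "cost": 3000.00},
    {"sku": "S2", "department": "IT", "category": "Servers", "cost": None},
]
filled, unresolved = impute_by_category(items, "cost")
```

The unresolved list is the point of the exercise. It is the short, specific set of items you take back to the client, instead of asking them to sign off on one number for the whole catalogue.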
I’m sure there will be readers who can come up with all kinds of “what ifs” about different specific scenarios in which one may find oneself. I’m not going to be able to cover every possible scenario in this article. These examples are simply meant to illustrate a general approach and way of thinking when dealing with this issue. I’ll attempt to sum up the ideas with a few simple rules to follow.
Always talk to your client about missing values in the data.
If they have clear guidance to give, take it. If they don’t, see if you can gain an understanding of the process being described by the data, and how the details translate into the specific representation you see in the data. Try to gain an understanding of their business logic, values and goals around the process represented in the data. You should try to understand all of this even if they have clear guidance for you. If they don’t, use that understanding to work out clear guidelines you can follow.
If all else fails, go ahead and use simple models to infer relationships, or measures of central tendency.
Always get your client’s approval for the method you think is best before proceeding.
Following these guidelines will always make your project run more smoothly and your clients happier than if you just resort first to the above-mentioned Ivory Tower answers.