Why Being a Data Scientist is Like Being a Plumber
Why This is Important to Businesses Looking to Leverage the Power of Data Science
Like many professionals, I spend a lot of my time, though not as much as it sometimes feels like, in calls, meetings, and lunches with clients and potential clients. Most of the time, the finer details of the conversation are almost instantly forgotten, and everyone goes away with their specific bullet points and action items. Last month I was in one of these meetings with an executive from a potential client company, and he said something that has stuck in my head ever since.
It stuck because it resonated so deeply with one of my core principles as a data scientist, and it expressed the point so clearly, so completely, and yet so concisely, that it cannot possibly be misunderstood or denied. Every time I reflect on it, I wish every person in the world had been sitting at that table in that moment, so they could’ve heard what this executive said to me. Here’s what happened:
We had been discussing this potential client’s goals for transforming their business through the application of data science. Several potential use cases for statistical modeling and machine learning were identified, and a few specific modeling approaches were even proposed, with the caveat that we would need an appropriate set of training examples for each approach to work. At this point, the executive started asking a string of questions. He asked, “Do you use Python to develop these models?” I replied by explaining there are a number of tools we can use to build and deploy models, and that Python is certainly a very prevalent tool that we use all the time for this.
He followed up with, “Do you use scikit-learn to build your models?” Again, I replied by explaining there are a lot of really great frameworks for designing, building, and deploying machine learning systems. Scikit-learn is certainly popular, and for good reason, but different situations call for different approaches, which in turn call for different tools. I finished by assuring him that we would use the most appropriate framework for each situation.
Finally, he asked, “Do you understand and use the mathematics behind these models, or do you only know how to use these frameworks to build them?” I don’t get asked this question often. I replied by stating that I am a mathematician by training, and I have made it a point to obtain a deep understanding of the theory behind each model I use, because I feel it is extremely important to do so. He asked me to describe the theory behind one of the proposed models, and I did so.
When I finished, he shook my hand and awarded the contract on the spot. He then proceeded to tell me about the frustrations his company has faced in trying to successfully bring data science to bear on their business processes. He expressed how hard it has become to find legitimate data science talent, as they are finding that more and more professionals are claiming to be data scientists when all they know is how to use the scikit-learn API.
Interestingly, he also expressed that he doesn’t feel these people are being intentionally dishonest. He cited all of the data science bootcamps out there that are basically just glorified scikit-learn tutorials, which might give many people the impression that this is all there is to being a data scientist. Then he said this:
“I can go to any hardware store right now and buy all the tools that plumbers use…
…but that doesn’t make me a plumber.”
And there it was. The perfect distillation of what makes the difference between a data scientist and someone who just knows how to call an elegant machine learning API.
Why is this important to businesses looking to leverage the power of data science?
Allow me to answer this question by sharing yet another interesting recent consulting conversation. I was having a call with a product owner and the lead data scientist on the product team from a company in the healthcare technology sector. This company is responsible for the safekeeping of millions of medical and prescription records, and was attempting to implement a machine learning solution to detect network intrusions.
Nearly a year into the project, they had hit a wall. Their lead data scientist had been hitting his head against this wall for weeks and wasn’t having any breakthroughs, so they wanted a fresh set of eyes on the problem. He was positively gushing with excitement as he told me he had built “an ensemble of 400 artificial neural networks” which were “simultaneously monitoring the network for intrusions in real-time.” He went on to tell me this ensemble had an accuracy of “over 96%” in cross-validation tests.
Then he told me, with much less enthusiasm, that it actually didn’t work. This solution had been put through a more rigorous test, which involved having penetration testers, “pen testers” for short, break into the company’s network to see whether the system detected these simulated intrusions. It didn’t. In fact, it failed to detect any of the simulated attacks across multiple tests.
Obviously disappointed with this outcome, the lead data scientist went back to the drawing board to see what could be done. After several weeks, he was still stuck. This is when they decided to reach out. He reported that his system was still testing with an accuracy “in the very high 90s,” but it couldn’t detect even one intrusion when put through the penetration test.
At this point, I had not been shown any of the training data, so I asked whether network intrusions were rare occurrences. Both the product owner and the lead data scientist assured me they are very rare. I asked what percentage of the training observations were legitimate intrusion attempts. The lead data scientist replied that intrusion attempts only made up “between 3% and 4% of the data.”
I then asked about the recall of the system in the cross-validation tests. Recall is a metric that tells you what percentage of “positive cases” (in this case, read “intrusion attempts”) were correctly identified by a model. He replied that he hadn’t checked this. At this point, I tried to get my point across by asking a question. I said, “If we were trying to detect a disease that affects only 1 person in every 10,000, what’s one way we could immediately get an accuracy of 99.99% in guessing whether a person has the disease?” After a moment of thinking, he replied, “I guess we could just always predict that no one has the disease, but that wouldn’t be very useful.”
To make a long story short, when we checked the model’s recall, we found that the model was simply labeling every observation as a non-threat. Since only 3% to 4% of the observations were intrusion attempts, this was a quick and easy way to minimize the prediction error and achieve an accuracy of over 96%. I also found that this impressive-sounding ensemble of neural networks had been built using Keras, another popular high-level API for implementing machine learning solutions, which requires no real understanding of how neural networks actually work.
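For readers who want to see the arithmetic, here is a minimal sketch in plain Python of a “model” that labels everything as negative. It is not the client’s code. The helper names (accuracy, recall, always_negative_scores) are made up for illustration, and the figure of 35 intrusions per 1,000 observations is an assumed value inside the “between 3% and 4%” range quoted above.

```python
def accuracy(y_true, y_pred):
    """Fraction of all observations labeled correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    positives = sum(1 for t in y_true if t == 1)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


def always_negative_scores(n_total, n_positive):
    """Score a 'model' that labels every observation 0 (healthy / non-threat)."""
    y_true = [1] * n_positive + [0] * (n_total - n_positive)
    y_pred = [0] * n_total
    return accuracy(y_true, y_pred), recall(y_true, y_pred)


# Disease example from the conversation: 1 case in 10,000 people.
disease_acc, disease_rec = always_negative_scores(10_000, 1)

# Intrusion example: 35 per 1,000 is an assumed, illustrative rate.
intrusion_acc, intrusion_rec = always_negative_scores(1_000, 35)

print(f"disease:   accuracy={disease_acc:.4f}, recall={disease_rec:.1f}")
print(f"intrusion: accuracy={intrusion_acc:.3f}, recall={intrusion_rec:.1f}")
```

In both cases the accuracy looks excellent while the recall is zero, which is exactly the gap between a 96% cross-validation score and a system that misses every simulated attack.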
I have to cringe every time I think about this. Think about the time, money, and other resources wasted by a solution nearly a year in development that was completely incapable of detecting the very thing it was designed to detect, and all because the lead data scientist was unaware of the potential to be misled by the accuracy of a model in an imbalanced learning scenario. Now think about this happening to your business. Like my executive client from the introductory story, this product owner was not a data scientist. He didn’t know how machine learning and deep learning systems work, so he had to blindly trust that his lead data scientist knew what he was doing.
This is why it’s absolutely crucial that anyone you hire to work as a data scientist has the necessary mathematical chops to do the job properly. A lot of people are interested in the field of data science because of articles like this one, which name data scientist as the “hottest job of the year” and report a median salary of $116,840. The reality of the situation is that most people never see the level of mathematics involved in machine learning in their lives, and those who do usually don’t until graduate school. With more universities offering degree programs in data science, this seems to be slowly changing, but it’s still the case that the only people who encounter this kind of math are the people who specifically study in a machine learning program. However, I’ve looked at the curriculum in many of these programs, and I have to say I’m sorely disappointed by the lack of depth most programs seem to offer, so there is still no guarantee that even these people will have had exposure to the mathematical ideas.
I often hear people talk about the “democratization of data science,” which refers to the fact that the tools of the data science trade are becoming more and more widely available to more and more users. It always seems to be cast in a positive light that more and more people with little to no understanding of mathematics and statistics are being given more and more access to algorithms they don’t understand. I’m not sure why this is. Why has this perspective been taken with the field of data science, specifically? When the first hardware store opened and made it possible to buy plungers and drain snakes, did people flood the streets in celebration of “the democratization of plumbing”?
I definitely know what a plunger is. In fact, I own one! I’ve plunged a few toilets in my day, and I even poured Drano down the drain of my kitchen sink one time. The next time you need to install a sink, or see water dripping from the ceiling below your upstairs bathtub, would you consider calling me? What if I said I’d handle it for less than all those fancy shmancy “licensed plumbers” would charge?
No?… You don’t need to answer right away. Think about it?…
Don’t worry. I got it.