There’s a huge difference between the purely academic exercise of training machine learning (ML) models and building end-to-end data-science solutions that help solve real enterprise problems.
Editor's Note: This article is excerpted from chapter 5 of Artificial Intelligence: Evolution and Revolution.
This chapter summarizes the lessons learned after two years of our team engaging with dozens of enterprise clients from different industries, including manufacturing, financial services, retail, entertainment, and healthcare, among others.
6. Data Is Often Unbalanced
Say you have a dataset with labeled credit-card transactions and 0.1% of those transactions turn out to be fraudulent, whereas 99.9% of them are good/normal. If we create a model that says that there’s never fraud, guess what? The model will give a correct answer in 99.9% of the cases, so its accuracy will be 99.9%! This common accuracy fallacy can be avoided by considering different metrics such as precision and recall.
These are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
- TP = Total number of instances correctly predicted as positive
- TN = Total number of instances correctly predicted as negative
- FP = Total number of instances incorrectly predicted as positive
- FN = Total number of instances incorrectly predicted as negative
In a typical anomaly-detection scenario, the primary goal is to minimize false negatives (for example, missing a fraudulent transaction, failing to recognize a defective chip, or diagnosing a sick patient as healthy) without incurring a large number of false positives.
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
Note that precision penalizes FP while recall penalizes FN. A model that never predicts fraud will have zero recall and undefined precision. Conversely, a model that always predicts fraud will have 100% recall but a very low precision due to a high number of false positives.
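The two extreme models described above can be checked with a few lines of Python. This is a minimal sketch, assuming a hypothetical dataset of 100,000 transactions with a 0.1% fraud rate; the function name `classification_metrics` is illustrative and not part of any library.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Precision is undefined when the model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical data: 100 fraudulent transactions out of 100,000 (0.1%).
y_true = [1] * 100 + [0] * 99_900

never_fraud = classification_metrics(y_true, [0] * len(y_true))
always_fraud = classification_metrics(y_true, [1] * len(y_true))

print("never fraud: ", never_fraud)
print("always fraud:", always_fraud)
```

The "never fraud" model scores 99.9% accuracy with zero recall, while the "always fraud" model reaches 100% recall with a precision of only 0.1%.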
The use of receiver operating characteristic (ROC) curves in anomaly detection is discouraged. This is because the false positive rate (FPR), which ROC curves rely on, is heavily biased by the number of negative instances in the dataset (i.e., FP + TN), leading to a potentially small FPR even when there’s a huge number of FPs.
FPR = FP/(FP + TN)
Instead, the false discovery rate (FDR) gives a better picture of the impact of FPs in an anomaly-detection model:
FDR = 1 – Precision = FP/(TP + FP)
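To make the contrast between FPR and FDR concrete, here is a small sketch. The confusion-matrix counts are made up for illustration (100 frauds among 100,000 transactions), and the helper names are hypothetical.

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN): share of negatives that were flagged."""
    return fp / (fp + tn)


def false_discovery_rate(tp, fp):
    """FDR = FP / (TP + FP): share of flagged instances that were wrong."""
    return fp / (tp + fp)


# Hypothetical counts for a fraud model on 100,000 transactions.
tp, fn, fp, tn = 80, 20, 920, 98_980

fpr = false_positive_rate(fp, tn)
fdr = false_discovery_rate(tp, fp)

print(f"FPR = {fpr:.4f}")  # FPR = 0.0092
print(f"FDR = {fdr:.4f}")  # FDR = 0.9200
```

With these numbers the FPR stays below 1% because the negatives dominate the denominator, yet 92% of the raised alerts are false alarms.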
7. Don’t Predict. Just Tell Me Why!
We have come across several projects in which the goal is not to create a model to make predictions in real time but rather to explain a hypothesis or analyze which factors explain a certain behavior. Such explanations should be taken with a grain of salt, given that most machine-learning algorithms are based upon correlation, not causation. Some examples are:
- Which factors make a patient fall into high risk?
- Which drug has the highest impact on blood test results?
- Which insurance-plan parameter values maximize profit?
- Which characteristics of a customer make him or her more prone to delinquency?
- What’s the profile of someone involved in customer churn (a “churner”)?
One way to approach these questions is by calculating feature importance, which is provided by algorithms such as random forests, decision trees, and XGBoost. Furthermore, techniques such as Local Interpretable Model-agnostic Explanations (LIME) or SHapley Additive exPlanations (SHAP) are helpful to explain models and predictions, even if they come from neural networks or other “black-box” models.
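As a rough illustration of model-agnostic importance, the sketch below uses permutation importance: shuffle one feature at a time and measure how much accuracy drops. Note that this is a simpler technique than the tree-based importances, LIME, or SHAP named above, and it is swapped in here only because it needs no external library. The data, the `predict` function (a stand-in for any trained black-box model), and all names are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop observed when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only feature j replaced by its shuffled value.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, shuffled, labels))
    return importances


# Hypothetical data: feature 0 drives the label, feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(1000)]
labels = [int(r[0] > 0.5) for r in rows]


def predict(row):
    """Stand-in for a trained model; it only looks at feature 0."""
    return int(row[0] > 0.5)


importances = permutation_importance(predict, rows, labels, n_features=2)
print(importances)
```

Shuffling the feature the model relies on hurts accuracy noticeably, while shuffling the noise feature leaves it unchanged. As with any such measure, a large importance indicates association with the model's output, not causation.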
8. Tune Your Hyperparameters
Machine-learning algorithms have both parameters and hyperparameters. They differ in that the former are directly estimated by the algorithm (for example, the coefficients of a regression or the weights of a neural network), whereas the latter are not and need to be set by the user. Examples of hyperparameters are the number of trees in a random forest, the regularization method in a neural network, and the kernel function of a support vector machine (SVM) classifier.
Setting the right hyperparameter values for your ML model can make a huge difference. For instance, a linear kernel for an SVM won’t be able to classify data that is not linearly separable. A tree-based classifier may overfit if the maximum depth or the number of splits is set too high, or it may underfit if the maximum number of features is set too low. Finding the optimal values for hyperparameters is a very complex optimization problem. Here are a few tips:
- Understand the priorities for hyperparameters. In a random forest, the number of trees and the max depth may be the most relevant hyperparameters, whereas for deep learning, the learning rate and the number of layers might be prioritized.
- Use a search strategy like grid search or random search. Random search is usually preferred because, for the same budget, it tries more distinct values of each hyperparameter.
- Use cross-validation: set aside a separate test set, split the remaining data into k folds, and iterate k times, each time using one fold for validation (that is, to tune hyperparameters) and the remaining folds for training. Finally, compute average quality metrics over all folds.
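The last two tips can be combined in a short sketch: random search over a single hyperparameter, scored with k-fold cross-validation, with a test set held out and used only once at the end. Everything here is an assumption made for illustration: the one-dimensional k-nearest-neighbors model, the synthetic noisy data, and the helper names. A real project would more likely rely on a library implementation.

```python
import random


def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > len(nearest))


def accuracy(train, evaluation, n_neighbors):
    """Accuracy on `evaluation` of a k-NN model built from `train`."""
    hits = sum(knn_predict(train, x, n_neighbors) == y for x, y in evaluation)
    return hits / len(evaluation)


def cross_val_score(data, n_neighbors, k=5):
    """Average validation accuracy over k folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        validation = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(train, validation, n_neighbors))
    return sum(scores) / k


rng = random.Random(0)

# Hypothetical noisy data: label is 1 when x > 0.5, flipped 10% of the time.
data = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# The test set is set aside and never used for tuning.
test_set, dev_set = data[:60], data[60:]

# Random search: sample 8 odd values of the hyperparameter n_neighbors.
candidates = rng.sample(range(1, 40, 2), 8)
cv_scores = {n: cross_val_score(dev_set, n) for n in candidates}
best_n = max(cv_scores, key=cv_scores.get)

# Final, one-off estimate on the held-out test set.
test_accuracy = accuracy(dev_set, test_set, best_n)
print(best_n, round(cv_scores[best_n], 3), round(test_accuracy, 3))
```

Here `n_neighbors` is a hyperparameter chosen by the search, whereas the stored training points play the role of what the model "learns" from the data.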
9. Deep Learning May Not Be a Panacea
During the past few years, deep learning (DL) has been an immense focus of research and industry development. Frameworks such as TensorFlow, Keras, and Caffe now enable rapid implementation of complex neural networks through a high-level application programming interface (API). Application types are countless, including computer vision, chatbots, self-driving cars, machine translation, and even games (including one that can beat the top chess computer in the world).
One of the main premises behind DL is its ability to continue learning as the amount of data increases, which is especially useful in the era of big data. This, combined with recent developments in hardware (e.g., graphics processing units, or GPUs), allows the execution of large deep-learning jobs, which used to be prohibitive due to resource limitations.
So, does this mean that DL is always the way to go for any machine-learning problem? Not really. Here’s why:
- Simplicity: The results of a neural network model are very dependent on the architecture and the hyperparameters of the network. In most cases, you’ll need some expertise on network architectures to correctly tune the model. There’s also a significant trial-and-error component in this regard.
- Interpretability: As we saw earlier, a number of use cases require not only predicting but also explaining the reason behind a prediction. Why was a loan denied? Or why was an insurance policy price increased? While tree-based and coefficient-based algorithms directly allow for explainability, this is not the case with neural networks.
- Quality: In our experience, for most structured datasets, the quality of neural-network models is not necessarily better than that of random forests and XGBoost. Where DL excels is actually when there’s unstructured data involved, that is, images, text, or audio.
The bottom line: Don’t use a shotgun to kill a fly. ML algorithms such as random forest and XGBoost are sufficient for most structured supervised problems, and they are also simpler to tune, run, and explain. Let DL speak for itself in unstructured data problems or for reinforcement learning.
10. Don’t Let the Data Leak
While working on a project to predict the arrival delay of flights, we noticed that the model suddenly reached 99% accuracy when using all the features available in the dataset. This was due to using the departure delay as a predictor for the arrival delay. This is a typical example of data leakage, which occurs when any of the features used to create the model will be unavailable or unknown at prediction time. So be warned.
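One lightweight guard against this kind of leakage is to record, for every feature, whether its value is known at the moment the prediction is made, and to filter the training columns accordingly. The sketch below assumes predictions are made before the flight departs; the feature names, the catalogue, and the helper are all hypothetical.

```python
# Hypothetical feature catalogue for the flight-delay example: each feature
# is tagged with whether its value is known when the prediction is made
# (assumed here to be before departure).
FEATURES_KNOWN_AT_PREDICTION = {
    "scheduled_departure_hour": True,
    "origin_airport": True,
    "carrier": True,
    "departure_delay_minutes": False,  # only known once the flight has left
}


def split_leaky_features(feature_names, known_at_prediction):
    """Split features into (safe, leaky) based on prediction-time availability."""
    safe, leaky = [], []
    for name in feature_names:
        # Features missing from the catalogue are treated as leaky by default.
        if known_at_prediction.get(name, False):
            safe.append(name)
        else:
            leaky.append(name)
    return safe, leaky


columns = [
    "scheduled_departure_hour",
    "origin_airport",
    "carrier",
    "departure_delay_minutes",
]
safe, leaky = split_leaky_features(columns, FEATURES_KNOWN_AT_PREDICTION)
print("train on:", safe)
print("exclude: ", leaky)
```

Treating uncatalogued features as leaky is a deliberately conservative choice: a new column has to be reviewed before it can reach the model.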
Open Source Gives Us Everything. Why Do We Need a Platform?
It has never been easier to build a machine-learning model. A few lines of R or Python code will suffice for such an endeavor, and there are plenty of resources and tutorials online to train even a complex neural network. For data preparation, Apache Spark can be really useful, even scaling to large datasets. And tools like Docker™ (containerization software) and Plumber (an R package that exposes R code as a web API) ease the deployment of machine-learning models through HTTP requests. So, it looks like one could build an end-to-end ML system purely using the open-source stack. Right?
This may be true for building proofs of concept. A graduate student working on a dissertation would certainly be covered under the umbrella of open source. However, for the enterprise, the story is a bit different.
We are big fans of open source, and many open-source tools are available. But at the same time, there are also quite a few gaps. Here are some of the reasons why enterprises choose data science platforms:
- Open-source integration: Up and running in minutes, support for multiple environments, and transparent version updates
- Collaboration: Easy sharing of datasets, data connections, code, models, environments, and deployments
- Governance and security: Not only over data, but over all analytics assets
- Model management, deployment, and retraining
- Model bias: Detect and correct a model that’s biased by things like gender or age
- Assisted data curation: Visual tools to address the most painful task in data science
- Graphics processing units (GPUs): Immediate provisioning and configuration for optimal performance of deep-learning frameworks (e.g., TensorFlow)
- Codeless modeling: For statisticians, subject matter experts, and even executives who don’t code but want to build models visually
An integrated data science platform should be able to provide all of the above and more so that the end-user does not have to be a systems integrator.
Look for another AI: Evolution and Revolution excerpt in an upcoming issue of MC Systems Insight. Can't wait? Pick up your copy of Artificial Intelligence: Evolution and Revolution at the MC Press Bookstore today!