Introduction to Overfitting

By RankMiner Team Blogs | November 1, 2018

We have often found that potential clients are apprehensive about implementing a predictive model because of a previous bad experience with the modeling process. Often the cause of these bad experiences is a model that has been overfitted to the historical data on which it was trained.

These overfitted models report a deceptively high success rate during testing, but when evaluated on new data or put into production they perform, at best, no better than random guessing. In the case of call centers, this leads to misinformed decisions in areas such as Quality Assurance, Customer Attrition, and Agent Attrition. All three areas are major profit drivers in a call center, and relying on the output of an overfitted model would make anyone hesitant about a future foray into predictive modeling.

However, predictive modeling is not something to be wary of: when a model is correctly tested and validated, the valuable insights it yields are innumerable. The key to successful predictive modeling is thorough testing, and any company looking to implement a predictive modeling solution should ask as much about the testing process as about the training process.

Why Some Models Don’t Perform as Expected

Good models are able to generalize from the training data to provide insight into patterns that may exist in future data. However, there is a problem if your model works much better on the training set than on the validation (or test) set: the model has, to some extent, memorized individual training examples rather than learning generalizable patterns. This phenomenon is called overfitting, and it is one of the biggest challenges a machine learning engineer must combat.

This becomes an even more significant issue in deep learning, where neural networks have large numbers of layers containing many neurons. The number of connections in these models is astronomical, reaching into the millions. As a result, overfitting is commonplace.
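
To make this concrete, here is a minimal sketch (using scikit-learn on a synthetic dataset, not RankMiner's actual pipeline) of a model memorizing its training data:

    # A deliberately unconstrained decision tree: with no depth limit it can
    # memorize the training set, scoring near-perfectly there while doing
    # far worse on held-out data. Dataset and model are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("train accuracy:", tree.score(X_train, y_train))  # ~1.00, memorized
    print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower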

Each of the following graphs depicts the same set of data. If you were to model this data, you would prefer the Good Fit / Robust chart. In this case, the line representing the model gives a clear indication of the trends in the actual data.

The underfitted graph does not give an accurate indication of the trends in your data at all; it provides only a vague hint of the overall trend.

The overfitted graph has almost the opposite problem: the trend line tries to recognize and explain every small variation, and the end result looks more like a rollercoaster than an accurate representation of the trends in the data.
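
Although the charts themselves are not reproduced here, the same behavior is easy to demonstrate numerically. The sketch below (with polynomial degrees chosen arbitrarily for illustration) fits the same noisy data with an underfit, a reasonable, and an overfit model:

    # Fit polynomials of increasing degree to the same noisy sine curve.
    # Degree 1 underfits, degree 3 tracks the trend, degree 15 chases noise.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

    for degree in (1, 3, 15):
        coeffs = np.polyfit(x, y, degree)
        mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
        print(f"degree {degree:2d}: training error (MSE) = {mse:.3f}")
    # Training error keeps shrinking as the degree grows, but the degree-15
    # curve is the "rollercoaster": it explains the noise, not the trend.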

A good rule of thumb (sketched in code after this list) is:

  • if training error is high and cross-validation error is also high, your model is underfitted
  • if training error is low and cross-validation error is high, your model is overfitted
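
In code, that rule of thumb might look like the following sketch. The 0.80 accuracy floor and 0.10 train/CV gap are illustrative thresholds, not universal constants:

    # Diagnose a model by comparing training accuracy with
    # cross-validation accuracy. Thresholds here are arbitrary examples.
    from sklearn.model_selection import cross_val_score

    def diagnose(model, X, y, cv=5):
        model.fit(X, y)
        train_acc = model.score(X, y)
        cv_acc = cross_val_score(model, X, y, cv=cv).mean()
        if train_acc < 0.80 and cv_acc < 0.80:    # both errors high
            verdict = "likely underfitted"
        elif train_acc - cv_acc > 0.10:           # low train error, high CV error
            verdict = "likely overfitted"
        else:
            verdict = "reasonable fit"
        print(f"train={train_acc:.2f}  cv={cv_acc:.2f}  ->  {verdict}")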

How Do We Avoid Overfitting Our Models?

It should be standard practice for all data scientists within any organization to use rigorous testing methods when they train a model. At the very least, an entirely separate test set must be used, with no contamination between training and test samples along dimensions such as time or agent_id.
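
One way to enforce that separation, sketched below with scikit-learn's GroupShuffleSplit, is to split by a grouping key such as agent_id so that no agent's calls appear on both sides of the split (the calls.csv file and column names are hypothetical):

    # Keep every row from a given agent on one side of the split so that
    # agent-specific quirks cannot leak from training into test.
    # "calls.csv", "label", and "agent_id" are hypothetical names.
    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    df = pd.read_csv("calls.csv")
    X, y = df.drop(columns=["label", "agent_id"]), df["label"]

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=df["agent_id"]))
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # For time-based contamination, sort by call date instead and hold out
    # the most recent slice as the test set.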

Cross-validation techniques should be further employed for any exploration of the classification model used, the model hyper-parameters, or the feature space.
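
For example, hyper-parameter exploration can be wrapped in cross-validation so that every candidate is scored on held-out folds of the training data; the model and parameter grid below are illustrative:

    # Score every hyper-parameter candidate by 5-fold cross-validation on
    # the training set only; the test set stays untouched until the end.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 200]}
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5)
    search.fit(X_train, y_train)

    print("best params:", search.best_params_)
    print("CV score:   ", search.best_score_)     # estimate of test performance
    print("test score: ", search.score(X_test, y_test))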

As a bonus, cross-validation produces a separate but relevant validation score that serves as an estimate of how the model will perform on the test set.

As we saw with our new client, it takes only one unfortunate experience with poorly tested and misunderstood data for some people to lose their confidence and trust in predictive models, no matter how accurate and faithful to reality those models may be.

What This Means for Call Centers

As can be seen above, predictive models that are overfit provide no value to an organization, but models trained on uncontaminated data and cross-validated can provide insights into the future actions of a client or agent. A well-trained predictive model allows supervisors and administrators to make more informed decisions more quickly, thus improving agent performance.

Whether a call center is focused on service, collections, or sales, predictive models such as RankMiner's predict and prescribe next actions, helping a call center turn micro-level changes into macro-level success.