Why Analyse Risks and Opportunities?
A company’s economic success relies on the decisions of its customers. When customers cancel a subscription (churn) or decide to make a purchase, this has a direct influence on the company’s revenue. With data-driven predictive methods, companies can anticipate these decisions early on and understand their root causes.
When we predict events with a negative impact, we speak of risk prediction. With the same methodological approach, we can also predict events with a positive effect on the company’s success.
Specific scenarios for risk prediction are:
- Churn prediction, where we estimate which customers are likely not to return or to cancel a subscription
- Credit risk scoring, where we estimate the probability of a credit default
Specific scenarios with positive impact are:
- Win-back analysis, where the likelihood of a customer returning after a churn event is estimated
- Campaign optimization, where the conversion rate of campaign participants is to be estimated
There are many different scenarios like the ones above. They all share similar aspects:
- The estimation aims to predict human decisions
- The event of interest is usually rare and valuable (meaning beneficial or costly)
Data Acquisition and Preparation
At the foundation of any predictive project lies data acquisition and the subsequent preparation. For the project to succeed, customer and transaction data have to be combined and brought into a shape that is suitable for analysis. Especially in early analytical projects, this step can make up a substantial part of the overall effort.
When scanning through the available data, the most relevant features need to be selected. Interesting customer properties…
- …enable comparisons between customers.
- …reflect the interests of customers by quantifying accepted offers, customer complaints, or reactions to earlier campaigns.
- …describe the responsiveness of customers through the duration of the relationship, the time since the last contact, or the number of contacts.
During data acquisition it is also necessary to evaluate and exclude data from the analysis:
- Potentially discriminating properties
- Analytically irrelevant properties (like unique identifiers)
- Properties where consistent availability is not guaranteed
- Other properties might be anonymized to reduce unnecessary risks
In the following data preparation step, the data is corrected so that it is better suited for analysis. Here it is often necessary to reduce the complexity of the data (for example, by combining product codes into fewer product groups) and to make the data comparable (for example, by converting time points to intervals).
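Both transformations can be sketched in a few lines of pandas. The column names and values here are purely illustrative, not from a real data set:

```python
import pandas as pd

# Hypothetical raw customer data; column names are illustrative only.
raw = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "product_code": ["A17", "A23", "B05"],
    "last_contact": ["2023-11-02", "2024-01-15", "2023-08-30"],
})

# Reduce complexity: map detailed product codes to coarser product groups
# (here simply the leading letter; a real mapping table would be used in practice).
raw["product_group"] = raw["product_code"].str[0]

# Make data comparable: convert time points to intervals,
# e.g. days since the last contact relative to a reference date.
reference_date = pd.Timestamp("2024-03-01")
raw["days_since_contact"] = (
    reference_date - pd.to_datetime(raw["last_contact"])
).dt.days

print(raw[["customer_id", "product_group", "days_since_contact"]])
```

The interval representation makes customers acquired at different times directly comparable, which a raw timestamp does not.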
Two further aspects must be considered:
- All acquisition and preparation steps need to be implemented as a repeatable process, because new data will be collected in the future and must be processed in the same way.
- These steps will be modified many times: the subsequent analytical steps will uncover issues that need to be fixed here, and newly collected data will reveal new issues that need to be tackled.
The event of interest (purchase decision, churn, etc.) is most likely rare compared to the non-event. This analytical issue can, to some extent, be tackled during data preparation (for example, by using stratified samples) and also during the following modelling phase (by tuning the right parameters or using case weights).
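As a minimal sketch of the data-preparation option, the majority class can be downsampled so that the event of interest is no longer drowned out. The event rate and data are invented for illustration:

```python
import pandas as pd

# Illustrative imbalanced data set: 1 = event of interest (e.g. churn), 0 = non-event.
df = pd.DataFrame({"churned": [1] * 50 + [0] * 950})

# Stratified downsampling during data preparation:
# keep all events, sample a fixed multiple of non-events.
events = df[df["churned"] == 1]
non_events = df[df["churned"] == 0].sample(n=len(events) * 4, random_state=42)
balanced = pd.concat([events, non_events])

print(balanced["churned"].value_counts())
```

The alternative mentioned above, case weights, keeps all rows and instead tells the modelling library to weight events more heavily (for example, `class_weight="balanced"` in scikit-learn estimators).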
Modelling
After a viable analytical data set has been created, a first predictive model can be built. This step needs to respect all regulatory and organizational requirements. Regression models are therefore a popular model type:
- They are available in every data science toolkit
- They can be interpreted easily, which avoids the creation of black-box models
- A fitted model can be calculated very efficiently and is easy to deploy
- They can be converted into a so-called scorecard (relevant for credit risk modelling)
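The interpretability argument can be illustrated with a logistic regression on synthetic data. Everything here, including the feature names, is an assumption for the sake of the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, purely illustrative data: two features standing in for
# customer properties such as days since contact and number of complaints.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Each coefficient is directly interpretable as a log-odds contribution,
# which is what makes regression models attractive under regulatory constraints.
print(dict(zip(["days_since_contact", "n_complaints"], model.coef_[0])))

# Scores for customers: the predicted probability of the event of interest.
probs = model.predict_proba(X[:5])[:, 1]
print(probs)
```

The same coefficients are also the starting point for a scorecard: each one is rescaled into points per feature value, so business users can read the score without running the model.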
Despite these advantages, regression models are limited in the complexity of the data that they can handle. This makes more complex model types viable candidates as well. Those – ranked by increasing complexity – are decision trees (like CART and CHAID), ensemble models (like random forests or boosting), support vector machines, and artificial neural networks.
These models come with fewer restrictions regarding data quality and complexity, but they require a deeper understanding, demand more computing power, and are generally more difficult to analyse.
Evaluation of Model Performance
When the model types have been selected and model candidates have been trained, these can be evaluated in two ways:
- Comparison of the model candidates with each other
- Assessment of a model regarding the scenario
The comparison of multiple model candidates enables us to:
- Get a benchmark for the expected model performance
- Select the most appropriate model type for the given scenario
- Estimate the performance penalty when a different model must be chosen due to external requirements
When evaluating a single model regarding its capability to perform in the given scenario, the initial results might not be satisfactory. The hit rate (the proportion of correctly predicted observations, especially for the event of interest) of a given model will usually be low.
Keep in mind, though: even the best predictive model will not be able to “look inside people’s minds”. Human decisions are influenced by a multitude of factors, most of which will not be represented in the company’s data sources. (Looking at it from another perspective: if you achieve a very high hit rate, you should check your data. It might very well be that the model has processed inputs that were a result of the outcome – and should hence not be part of the data set.)
Instead of evaluating the hit rate, other evaluation techniques (depending on the scenario) should be applied. For example, in win-back analysis and in campaign optimization, it is more common to evaluate the capability of the model to sort the customers by probability. This can be evaluated with gains or lift charts. The sorted data can then be used to allocate a given budget (for incentives or campaign activities) in the most efficient way.
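The values behind a cumulative gains chart can be computed with a few lines of NumPy. The scores and labels below are made up to keep the sketch self-contained:

```python
import numpy as np

def cumulative_gains(y_true, y_score, n_bins=10):
    """Fraction of all events captured when customers are sorted
    by predicted probability (descending), read off at each bin boundary."""
    order = np.argsort(-np.asarray(y_score))
    sorted_y = np.asarray(y_true)[order]
    cum_events = np.cumsum(sorted_y) / sorted_y.sum()
    # Cumulative gain at the end of each bin (e.g. decile for n_bins=10)
    cut_points = [int(len(sorted_y) * (i + 1) / n_bins) - 1 for i in range(n_bins)]
    return cum_events[cut_points]

# Illustrative labels and model scores for ten customers.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.4, 0.6, 0.35, 0.05]

gains = cumulative_gains(y_true, y_score, n_bins=5)
print(gains)  # a well-sorting model captures most events in the early bins
```

A budget can then be allocated by contacting only the top bins, where the model concentrates the events of interest; the lift chart is the same curve divided by the baseline event rate.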
Model Deployment and Monitoring
After a successful development phase, the models are moved into production. This requires integrating the models with the company’s systems so that they can process data and create predictions. The models can be executed on a schedule or on demand, and the calculated predictions need to be stored so that they can be processed further and/or made accessible to the experts.
The integration of the models should ensure that they can be updated easily, so that regular model improvements are possible. This can include approval mechanisms (like the four-eyes principle) and version management.
In production, the models must be monitored continuously. With each execution, the prediction should be stored together with a time stamp and the model version to allow traceability. Through this monitoring, it is possible to estimate when a model update is necessary, guaranteeing reliable operation.
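A minimal sketch of such traceable storage, using an in-memory SQLite database; the table and column names are illustrative, not prescribed by any particular tool:

```python
import datetime
import sqlite3

# Persist each prediction with a time stamp and the model version,
# so that every score remains traceable to the model that produced it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE predictions (
        customer_id   INTEGER,
        score         REAL,
        model_version TEXT,
        predicted_at  TEXT
    )
""")

def store_prediction(customer_id, score, model_version):
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?, ?)",
        (customer_id, score, model_version,
         datetime.datetime.now(datetime.timezone.utc).isoformat()),
    )

store_prediction(101, 0.83, "churn-model-v2.1")
row = conn.execute("SELECT * FROM predictions").fetchone()
print(row)
```

With this record in place, monitoring becomes a query over time: comparing score distributions across model versions and dates shows when drift sets in and an update is due.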
Next Steps
Risk prediction is – as described initially – an important topic for many companies. Fortunately, there are many different solutions, be it with commercial tools like Spotfire, Alteryx Designer, or Statistica, or with custom solutions developed in R or Python.
To decide on the best approach, companies should take into consideration how far automation and integration should go, what level of expertise the data science team and the consumers have, and how the insights should be propagated throughout the organization.
Our team will gladly assist you with these decisions!