Credit Decision Making: The Role of Predictive Models and Regulations in Consumer Credit Risk Management.

Sep 19, 2023

Imagine you are a consumer who wants to buy a house. You apply for a loan from a bank, but the decision of whether or not to grant you the loan is not up to a human loan officer. Instead, it is up to a complex mathematical model called a credit risk model.

Credit risk models are used by banks and other financial institutions to assess the likelihood that a borrower will repay a loan. The models are based on a variety of data, including the borrower’s credit history, income, and employment status. The models help banks make informed lending decisions. By using these models, banks can reduce the risk of default and protect their profits.

This makes credit risk modeling a vital part of consumer lending. In the past, it relied on a mix of expert knowledge and traditional statistical models. In recent years, however, machine learning algorithms such as XGBoost have reshaped the field.

Background: Consumer Loans and How Lenders Make Decisions

When a consumer asks for a loan, the credit institution has to decide whether or not to grant it. The amount of automation in the process varies from case to case. However, the decision is very likely to be informed by scores that estimate the probability that the loan will be repaid as expected.

Scores are routinely used at different stages of the process:

Prescreening: At the prescreen stage, a score computed with a small number of features allows the institution to quickly discard some applications.
Underwriting: At the underwriting stage, a score computed with all the required information gives a more precise basis for the decision.
Portfolio Risk Assessment: After the underwriting stage, scores can be used to assess the risk associated with loans in the portfolio.

Analytics methods have been used for decades to compute these probabilities. For example, the FICO score was introduced in 1989 in the United States and has been widely used in mortgage lending since the mid-1990s. Given the direct impact they have on the institutions’ revenues and on customers’ lives, these predictive models have always been under great scrutiny. Consequently, processes, methods, and skills have been formalized into a highly regulated environment to ensure the sustainable performance of models.

Whether the models are based on expert-made rules, on classical statistical models, or on more recent machine learning algorithms, they all have to comply with similar regulations. Consumer credit risk management can therefore be seen as a precursor of MLOps: it offers parallels with other use cases as well as best practices that can be studied.

At the time a credit decision is made, information about the customer’s historical and current situation is usually available. How much credit does the customer hold? Has the customer ever failed to repay a loan (in credit jargon, is the customer delinquent)? In some countries, organizations called credit bureaus collect this information and make it available to creditors either directly or in the form of a score (like the aforementioned FICO score).

The definition of the target to be predicted is more complex. A customer not repaying as expected is a “bad” outcome in credit risk modeling. In theory, one should wait for the complete repayment to determine a “good” outcome and for the loss charge-off to determine a “bad” outcome. However, it may take a long time to obtain these ultimate figures, and waiting for them would slow the reaction to changing conditions. As a result, trade-offs are usually made, based on various indicators, to declare “bad” outcomes before the losses are certain.
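One trade-off of this kind is to declare a loan “bad” as soon as it reaches a given delinquency level within a fixed observation window. The sketch below illustrates the idea; the 90-day threshold, the 12-month window, and the function name are assumptions chosen for illustration, not values taken from this article or from any regulation.

```python
from typing import Optional, Sequence

# Hypothetical thresholds, chosen only for illustration.
BAD_DAYS_PAST_DUE = 90    # delinquency level treated as a "bad" outcome
OBSERVATION_MONTHS = 12   # performance window after origination


def label_outcome(days_past_due_by_month: Sequence[int]) -> Optional[str]:
    """Label a loan from its monthly days-past-due history.

    Returns "bad" as soon as the threshold is reached inside the window,
    "good" if the full window elapsed without reaching it, and None if the
    loan is too recent to be labeled yet.
    """
    window = days_past_due_by_month[:OBSERVATION_MONTHS]
    if any(dpd >= BAD_DAYS_PAST_DUE for dpd in window):
        return "bad"
    if len(window) < OBSERVATION_MONTHS:
        return None  # outcome still indeterminate
    return "good"


on_time = label_outcome([0] * 12)              # full window, never late
defaulted = label_outcome([0, 0, 30, 60, 90])  # reaches 90 days past due
too_recent = label_outcome([0, 0, 0])          # only three months observed
```

Loans labeled None are usually left out of training until their window closes, which is one reason recent data is under-represented in the training set.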

Model development — The Evolution of Credit Risk Modeling: From Expert Knowledge to XGBoost and the Challenge of Model Interpretability

Historically, credit risk modeling has been based on a mix of rules (“manual feature engineering” in modern ML jargon) and logistic regression. Expert knowledge is vital to creating a good model. Building adapted customer segmentation as well as studying the influence of each variable and the interactions between variables requires enormous time and effort. Combined with advanced techniques like two-stage models with offset, advanced generalized linear models based on the Tweedie distribution, or monotonicity constraints on one side and financial risk management techniques on the other side, this makes the field a playground for actuaries.

Gradient boosting algorithms like XGBoost have reduced the cost of building good models. However, their validation is made more complex by the black box effect: it is hard to be confident that such models give sensible results for every possible input. Nevertheless, credit risk modelers have learned to use and validate these new types of models. They have developed new validation methodologies based, for example, on individual explanations (like Shapley values) to build trust in their models, which is a critical component of MLOps.
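To make the idea of an individual explanation concrete, here is a minimal sketch that computes exact Shapley values by brute force for a single applicant. The toy scoring function, the feature names, and the all-zero baseline are hypothetical. Practitioners typically rely on a dedicated library with efficient approximations rather than this enumeration, whose cost doubles with every added feature.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, List


def shapley_values(
    predict: Callable[[Dict[str, float]], float],
    instance: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition are set to their baseline value.
    """
    features: List[str] = list(instance)
    n = len(features)

    def value(coalition) -> float:
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return predict(x)

    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {f}) - value(members))
        result[f] = total
    return result


# Hypothetical toy score with an interaction term (not a real credit model).
def toy_score(x: Dict[str, float]) -> float:
    return (0.5 * x["utilization"] + 0.25 * x["late_payments"]
            + 0.25 * x["utilization"] * x["late_payments"])


applicant = {"utilization": 1.0, "late_payments": 2.0}
reference = {"utilization": 0.0, "late_payments": 0.0}
contributions = shapley_values(toy_score, applicant, reference)
```

A useful sanity check during validation is that the contributions add up to the difference between the applicant's score and the baseline score.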

Model Bias Considerations — Navigating Selection Bias in Model Development for Inclusive Risk Assessment and Decision-Making

The modeler also has to take into account selection biases, as the model will inevitably be used to reject applicants. As a result, the population to which a loan is granted is not representative of the applicant population.

If a data scientist carelessly trains a new model version only on the population selected by the previous version, the resulting model cannot predict accurately on the rejected population, because that population is absent from the training dataset, even though such predictions are exactly what is expected from the model. This effect is called cherry-picking. As a result, special methods, like reweighting based on the applicant population or calibrating the model based on external data, have to be used.
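As an illustration of reweighting, the sketch below gives each accepted loan a weight equal to the number of applicants in its segment divided by the number of accepted applicants in that segment. The segment names and the data are hypothetical, and the approach rests on a strong assumption: within a segment, accepted applicants behave like rejected ones.

```python
from collections import Counter
from typing import Dict, List, Tuple


def segment_weights(applications: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Weight per segment = applicants in segment / accepted in segment.

    `applications` holds (segment, was_accepted) pairs. Segments with no
    accepted loans get no weight: reweighting cannot recover a population
    that is entirely absent from the training data.
    """
    applied = Counter(segment for segment, _ in applications)
    accepted = Counter(segment for segment, ok in applications if ok)
    return {s: applied[s] / accepted[s] for s in applied if accepted[s] > 0}


# Hypothetical application log: segment label and past accept/reject decision.
log = (
    [("low_risk", True)] * 90 + [("low_risk", False)] * 10
    + [("high_risk", True)] * 20 + [("high_risk", False)] * 80
)
weights = segment_weights(log)
```

Here each accepted high-risk loan would count five times in training, so that the weighted training set matches the segment mix of the applicant population.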

Models that are used for risk assessment and not only to make decisions about granting loans have to produce probabilities and not only yes/no outcomes. Usually, the probability produced directly by prediction models is not accurate. While this is not an issue if data scientists apply thresholding to obtain a binary classification, they will usually need a monotonic transformation called a calibration to recover “true” probabilities as evaluated on historical data.
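One possible monotonic transformation is isotonic regression, which maps raw scores to observed bad rates with a non-decreasing step function. The sketch below implements it with the pool-adjacent-violators algorithm. The article does not say which calibration method is used, so this is only one option among several; the function names and sample data are hypothetical, and scores are assumed to be distinct.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float], outcomes: Sequence[int]) -> List[Tuple[float, float]]:
    """Fit a non-decreasing step function from raw scores to observed bad rates.

    Returns (score_threshold, calibrated_probability) pairs sorted by threshold.
    """
    pairs = sorted(zip(scores, outcomes))
    # Each block: [lowest score in block, sum of outcomes, count].
    blocks: List[List[float]] = []
    for score, outcome in pairs:
        blocks.append([score, float(outcome), 1.0])
        # Merge while the previous block's mean exceeds the latest block's mean.
        while len(blocks) > 1 and blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]:
            _, total, count = blocks.pop()
            blocks[-1][1] += total
            blocks[-1][2] += count
    return [(start, total / count) for start, total, count in blocks]


def calibrate(model: List[Tuple[float, float]], score: float) -> float:
    """Map a raw score to the probability of the block it falls into."""
    thresholds = [start for start, _ in model]
    index = max(bisect_right(thresholds, score) - 1, 0)
    return model[index][1]


# Hypothetical historical data: raw model scores and observed outcomes (1 = bad).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
observed = [0, 0, 1, 0, 1, 1]
calibration = fit_isotonic(raw_scores, observed)
probability = calibrate(calibration, 0.35)
```

Because the transformation never reverses the ordering of scores, it leaves the ranking of applicants unchanged, so any decision based on a score threshold can be reproduced with an equivalent threshold on the calibrated probability.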

The model validation for this use case typically consists of:

Testing Performance on Out-of-Sample Data: Evaluating how well the model performs on data that was not part of its training set, typically drawn from a period after (or sometimes before) the training data.
Examining Performance Across Different Subpopulations: Assessing the model’s performance not just in general but also for specific subgroups within the customer base. These subgroups are often based on factors like revenue segments, and in the era of Responsible AI, they may include variables like gender or other attributes protected by regulations.
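The second check can be sketched as computing the same metric separately for each subgroup of a holdout set. The example below uses AUC; the segment names and holdout rows are hypothetical, and the pairwise computation is quadratic, so it is meant for illustration rather than for large datasets.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def auc(scores: List[float], labels: List[int]) -> Optional[float]:
    """Probability that a random "bad" (1) scores higher than a random "good" (0).

    Ties count as one half. Returns None if one class is missing.
    """
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    if not bad or not good:
        return None
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def auc_by_segment(rows: List[Tuple[str, float, int]]) -> Dict[str, Optional[float]]:
    """Compute AUC separately for each segment; rows are (segment, score, label)."""
    grouped = defaultdict(lambda: ([], []))
    for segment, score, label in rows:
        grouped[segment][0].append(score)
        grouped[segment][1].append(label)
    return {segment: auc(s, y) for segment, (s, y) in grouped.items()}


# Hypothetical holdout rows: (segment, predicted risk score, observed outcome).
holdout = [
    ("segment_a", 0.9, 1), ("segment_a", 0.2, 0), ("segment_a", 0.7, 1), ("segment_a", 0.1, 0),
    ("segment_b", 0.3, 1), ("segment_b", 0.6, 0), ("segment_b", 0.8, 1), ("segment_b", 0.4, 0),
]
per_segment = auc_by_segment(holdout)
```

In this toy data the model ranks segment_a perfectly but does no better than chance on segment_b, a gap that an overall AUC would hide.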

Prepare for Production — Thorough Model Validation: Safeguarding the Integrity of Credit Risk Models in a Deliberate MLOps Life Cycle

Given the significant impact of credit risk models, their validation process involves significant work with regard to the modeling part of the life cycle, and it includes the full documentation of:

Data Usage: A detailed account of the data sources and data used to train the model.
Model and Hypotheses: Documentation of the model itself, including the assumptions and hypotheses made during its construction.
Validation Methodology and Results: Explanation of how the model was validated and the outcomes of this validation.
Monitoring Methodology: Description of how the model’s ongoing performance is monitored. In this context, monitoring encompasses both data and performance changes.

The monitoring methodology in this scenario covers both data drift and performance drift. Because the delay between the prediction and the ground truth is long (typically the duration of the loan plus a few months to take late payments into account), monitoring model performance alone is not enough: data drift also has to be monitored carefully.

For example, should an economic recession occur or should the commercial policy change, it is likely that the applicant population would change in such a way that the model’s performance could not be guaranteed without further validation. Data drift monitoring is usually performed by customer segment with generic statistical metrics that measure distances between probability distributions (like Kolmogorov-Smirnov or Wasserstein distances) and also with metrics that are specific to financial services, like the population stability index and the characteristic stability index. Performance drift is also regularly assessed on subpopulations with generic metrics (AUC) or specific metrics (Kolmogorov-Smirnov, Gini).
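As a sketch of one of these metrics, the population stability index compares the binned distribution of a variable (or of the score itself) in a reference sample with that of a recent sample: it sums (actual share minus expected share) times the logarithm of their ratio over all bins. The binning scheme, the `eps` floor for empty bins, and the sample data below are assumptions made for illustration.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a reference and a recent sample.

    Bin edges are equal-width over the reference range; values outside it
    fall into the first or last bin. `eps` avoids log(0) for empty bins.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]  # inner edges

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical samples: a reference population and a population shifted upward.
reference_sample = [i / 100 for i in range(100)]
shifted_sample = [min(v + 0.3, 0.99) for v in reference_sample]

no_drift = psi(reference_sample, reference_sample)
drift = psi(reference_sample, shifted_sample)
```

The index is zero when both samples fill the bins identically and grows as the recent population moves away from the reference. The characteristic stability index applies the same calculation to an individual input variable rather than to the score.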

The model documentation is usually reviewed by a model risk management (MRM) team in a very formal and standalone process. Such an independent review is a good practice to make sure that the right questions are asked of the model development team. In some critical cases, the validation team may rebuild the model from scratch given the documentation. In some cases, the second implementation is made using an alternative technology to establish confidence in the documented understanding of the model and to highlight unseen bugs deriving from the original toolset.

Complex and time-consuming model validation processes have implications for the entire MLOps life cycle. Quick fixes and rapid model iteration are not possible with such lengthy QA, which leads to a very slow and deliberate MLOps life cycle.

Deploy to Production — Navigating the Divergence: Deploying and Monitoring Models in Financial Organizations

In a typical large financial services organization, the production environment is not only separate from the design environment, but also likely to be based on a different technical stack. The technical stack for critical operations (like transaction validation, but also potentially loan validation) tends to evolve slowly.

Historically, production environments have mainly supported rules and linear models like logistic regression. Some can handle more complex models packaged in formats such as PMML or JAR files. For less critical use cases, Docker deployment or deployment through integrated data science and machine learning platforms may be possible. As a result, the operationalization of the model may involve operations that range from clicking a button to writing a formula based on a Microsoft Word document.

Activity logging of the deployed model is essential for monitoring model performance in such a high-value use case. Depending on the frequency of the monitoring, the feedback loop may be automated or not. For example, automation may not be necessary if the task is performed only once or twice a year and the largest amount of time is spent asking questions of the data. On the other hand, automation might be essential if the assessment is done weekly, which may be the case for short-term loans with durations of a few months.

In summary — Adapting Model Validation Practices for Effective MLOps Across Industries

The financial services sector has honed its approaches to model validation and monitoring over several decades. These practices have evolved to accommodate new modeling technologies, such as gradient boosting methods. Due to their significant impact, the procedures related to managing the life cycle of these models are well-established and have even been integrated into various regulatory frameworks. Consequently, they offer a valuable source of best practices for MLOps in other fields. However, it’s essential to recognize that adaptations may be necessary because the balance between model robustness on one side and cost efficiency, time to achieve results, and team satisfaction on the other may differ in various industries and contexts.