Title: Collection of Tools for PD Rating Model Development and Validation
Description: The goal of this package is to cover the most common steps in probability of default (PD) rating model development and validation. The main procedures available are those that refer to univariate, bivariate, multivariate analysis, calibration and validation. Along with the accompanying 'monobin' and 'monobinShiny' packages, 'PDtoolkit' provides functions suitable for different data transformation and modeling tasks such as: imputations, monotonic binning of numeric risk factors, binning of categorical risk factors, weights of evidence (WoE) and information value (IV) calculations, WoE coding (replacement of risk factor modalities with WoE values), risk factor clustering, area under curve (AUC) calculation and others. Additionally, the package provides a set of validation functions for testing the homogeneity, heterogeneity, discriminatory and predictive power of the model.
Authors: Andrija Djurovic [aut, cre]
Maintainer: Andrija Djurovic <[email protected]>
License: GPL (>= 3)
Version: 1.2.1
Built: 2025-02-19 03:59:17 UTC
Source: https://github.com/andrija-djurovic/pdtoolkit
auc.model
calculates the area under the curve (AUC) for given predicted values and an observed target variable.
auc.model(predictions, observed)
predictions |
Model predictions. |
observed |
Observed values of target variable. |
The command auc.model returns the value of AUC.
bivariate
for automatic bivariate analysis.
suppressMessages(library(PDtoolkit)) data(gcd) #categorize numeric risk factor gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]] #estimate simple logistic regression model lr <- glm(qual ~ maturity.bin, family = "binomial", data = gcd) #calculate auc auc.model(predictions = predict(lr, type = "response", newdata = gcd), observed = gcd$qual)
suppressMessages(library(PDtoolkit)) data(gcd) #categorize numeric risk factor gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]] #estimate simple logistic regression model lr <- glm(qual ~ maturity.bin, family = "binomial", data = gcd) #calculate auc auc.model(predictions = predict(lr, type = "response", newdata = gcd), observed = gcd$qual)
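Since the AUC equals the probability that a randomly chosen bad case receives a higher predicted value than a randomly chosen good case, the package result can be cross-checked against the Mann-Whitney statistic from base R. A minimal sketch, reusing the model from the example above (tie handling may cause negligible differences, and wilcox.test may warn about exact p-values with ties):
suppressMessages(library(PDtoolkit))
data(gcd)
#rebuild the simple model from the example above
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
lr <- glm(qual ~ maturity.bin, family = "binomial", data = gcd)
pred <- predict(lr, type = "response", newdata = gcd)
#AUC from the package function
auc.model(predictions = pred, observed = gcd$qual)
#cross-check: AUC as the normalized Mann-Whitney statistic
#(probability that a bad case outranks a good case)
w <- wilcox.test(pred[gcd$qual == 1], pred[gcd$qual == 0])$statistic
unname(w) / (sum(gcd$qual == 1) * sum(gcd$qual == 0))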
bivariate
returns bivariate statistics for the risk factors supplied in the data frame db.
The implemented procedure expects all risk factors to be categorical, so numeric risk factors should be categorized first. Additionally, the maximum number of groups per risk factor is set to 10; risk factors with more than 10 categories are not processed automatically, but the same statistics can still be produced manually using the woe.tbl and auc.model functions. If either check (risk factor class or number of categories) is triggered, the affected risk factors are reported in the second element of the function output, the info data frame.
Bivariate report (first element of the function output, the results data frame) includes:
rf: Risk factor name.
bin: Risk factor group (bin).
no: Number of observations per bin.
ng: Number of good cases (where target is equal to 0) per bin.
nb: Number of bad cases (where target is equal to 1) per bin.
pct.o: Percentage of observations per bin.
pct.g: Percentage of good cases (where target is equal to 0) per bin.
pct.b: Percentage of bad cases (where target is equal to 1) per bin.
dr: Default rate per bin.
so: Number of all observations.
sg: Number of all good cases.
sb: Number of all bad cases.
dist.g: Distribution of good cases per bin.
dist.b: Distribution of bad cases per bin.
woe: WoE value.
iv.b: Information value per bin.
iv.s: Information value of risk factor (sum of individual bins' information values).
auc: Area under the curve of a simple logistic regression model estimated as y ~ x, where y is the selected target variable and x is the categorical risk factor.
Additional info report (second element of the function output, the info data frame), if produced, includes:
rf: Risk factor name.
reason.code: Takes value 1 if an inappropriate risk factor class is identified and value 2 if the maximum number of categories is exceeded.
comment: Reason description.
bivariate(db, target)
db |
Data frame of risk factors and target variable supplied for bivariate analysis. |
target |
Name of target variable within |
The command bivariate returns a list of two data frames. The first contains the bivariate metrics, while the second reports the results of the validations explained above (risk factor class and number of categories).
See also: woe.tbl and auc.model for manual bivariate analysis.
suppressMessages(library(PDtoolkit))
data(gcd)
#categorize numeric risk factors
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual)[[2]]
gcd$age.bin.1 <- cut(x = gcd$age, breaks = 20)
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual)[[2]]
str(gcd)
#select target variable and categorized risk factors
gcd.bin <- gcd[, c("qual", "age.bin", "maturity.bin", "amount.bin")]
#run bivariate analysis on data frame with only categorical risk factors
bivariate(db = gcd.bin, target = "qual")
#run bivariate analysis on data frame with mixed risk factors (categorical and numeric).
#for this example info table is produced
bivariate(db = gcd, target = "qual")
#run woe table for risk factor with more than 10 modalities
woe.tbl(tbl = gcd, x = "age.bin.1", y = "qual")
#calculate auc for risk factor with more than 10 modalities
lr <- glm(qual ~ age.bin.1, family = "binomial", data = gcd)
auc.model(predictions = predict(lr, type = "response", newdata = gcd),
          observed = gcd$qual)
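Since iv.s and auc are repeated in every row of a risk factor, the first output element can be collapsed to one row per risk factor to rank candidates by predictive strength. A minimal sketch, assuming only the output element and columns listed above (results, rf, iv.s, auc):
suppressMessages(library(PDtoolkit))
data(gcd)
#categorize numeric risk factors
gcd$age.bin      <- ndr.bin(x = gcd$age, y = gcd$qual)[[2]]
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin   <- ndr.bin(x = gcd$amount, y = gcd$qual)[[2]]
gcd.bin <- gcd[, c("qual", "age.bin", "maturity.bin", "amount.bin")]
#bivariate report
bv <- bivariate(db = gcd.bin, target = "qual")[["results"]]
#one row per risk factor: total information value (iv.s) and AUC
rf.rank <- unique(bv[, c("rf", "iv.s", "auc")])
#rank risk factors by information value
rf.rank[order(-rf.rank$iv.s), ]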
boots.vld
performs bootstrap model validation.
The goal of this procedure is to generate the main model performance metrics, such as absolute mean square error, root mean square error or area under curve (AUC), based on a resampling method.
boots.vld(model, B = 1000, seed = 1122)
model |
Model in use, an object of class inheriting from |
B |
Number of bootstrap samples. Default is set to 1000. |
seed |
Random seed needed for ensuring the result reproducibility. Default is 1122. |
The command boots.vld returns a list of two objects. The first object (iter) contains the performance metrics of each iteration. The second object (summary) is a data frame with the averages of the performance metrics across iterations.
suppressMessages(library(PDtoolkit))
data(loans)
#run stepFWD
res <- stepFWD(start.model = Creditability ~ 1, p.value = 0.05, coding = "WoE", db = loans)
#check output elements
names(res)
#extract the final model
final.model <- res$model
#print coefficients
summary(final.model)$coefficients
#print head of coded development data
head(res$dev.db)
#calculate AUC
auc.model(predictions = predict(final.model, type = "response", newdata = res$dev.db),
          observed = res$dev.db$Creditability)
boots.vld(model = final.model, B = 10, seed = 1122)
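A lightweight way to read the output is to compare the bootstrap averages with the in-sample performance of the same model. A minimal sketch, assuming only the iter and summary elements described above (a small B is used for speed; the simple two-factor model is illustrative):
suppressMessages(library(PDtoolkit))
data(loans)
#fit a simple model on raw risk factors
lr <- glm(Creditability ~ `Account Balance` + `Duration of Credit (month)`,
          family = "binomial", data = loans)
#in-sample AUC
auc.model(predictions = predict(lr, type = "response"), observed = loans$Creditability)
#bootstrap validation with a small number of resamples for speed
bv <- boots.vld(model = lr, B = 25, seed = 1122)
#per-iteration metrics and their averages
head(bv[["iter"]])
bv[["summary"]]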
cat.bin
implements a three-stage binning procedure for categorical risk factors.
The first stage is an optional correction for the minimum percentage of observations,
the second is an optional correction for the target (default) rate, and the third is
an optional correction for the maximum number of bins. The last stage implements the
adjacent pooling algorithm (APA), which aims to minimize information loss while iteratively merging bins.
cat.bin( x, y, sc = NA, sc.merge = "none", min.pct.obs = 0.05, min.avg.rate = 0.01, max.groups = NA, force.trend = "modalities" )
x |
Categorical risk factor. |
y |
Numeric target vector (binary). |
sc |
Special case elements. Default value is |
sc.merge |
Define how special cases will be treated. Available options are: |
min.pct.obs |
Minimum percentage of observations per bin. Default is 0.05 or minimum 30 observations. |
min.avg.rate |
Minimum default rate. Default is 0.01 or minimum 1 bad case for |
max.groups |
Maximum number of bins (groups) allowed for analyzed risk factor. If in the first two stages
number of bins is less or equal to selected |
force.trend |
Defines how initial summary table will be ordered. Possible options are: |
The command cat.bin generates a list of two objects. The first object, the data frame summary.tbl, presents a summary table of the final binning, while the second, x.trans, is a vector of the new grouping values.
Anderson, R. (2007). The credit scoring toolkit: theory and practice for retail credit risk management and decision automation, Oxford University Press
suppressMessages(library(PDtoolkit))
data(loans)
#prepare risk factor Purpose for the analysis
loans$Purpose <- ifelse(nchar(loans$Purpose) == 2, loans$Purpose, paste0("0", loans$Purpose))
#artificially add missing values in order to show functions' features
loans$Purpose[1:6] <- NA
#run binning procedure
res <- cat.bin(x = loans$Purpose, y = loans$Creditability, sc = NA,
               sc.merge = "none", min.pct.obs = 0.05, min.avg.rate = 0.05,
               max.groups = NA, force.trend = "modalities")
res[[1]]
#check new risk factor against the original
table(loans$Purpose, res[[2]], useNA = "always")
#repeat the same process with setting max.groups to 4 and force.trend to dr
res <- cat.bin(x = loans$Purpose, y = loans$Creditability, sc = NA,
               sc.merge = "none", min.pct.obs = 0.05, min.avg.rate = 0.05,
               max.groups = 4, force.trend = "dr")
res[[1]]
#check new risk factor against the original
table(loans$Purpose, res[[2]], useNA = "always")
#example of shrinking number of groups for numeric risk factor
#copy existing numeric risk factor to a new one called maturity
loans$maturity <- loans$"Duration of Credit (month)"
#artificially add missing values in order to show functions' features
loans$maturity[1:10] <- NA
#categorize maturity with MAPA algorithm from monobin package
loans$maturity.bin <- cum.bin(x = loans$maturity, y = loans$Creditability, g = 50)[[2]]
table(loans$maturity.bin)
#run binning procedure to decrease number of bins from the previous step
res <- cat.bin(x = loans$maturity.bin, y = loans$Creditability, sc = "SC",
               sc.merge = "closest", min.pct.obs = 0.05, min.avg.rate = 0.01,
               max.groups = 5, force.trend = "modalities")
res[[1]]
#check new risk factor against the original
table(loans$maturity.bin, res[[2]], useNA = "always")
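After binning, the constraints from the first two stages can be checked directly against the recoded vector (the second output element) using base R. A minimal sketch, reusing the Purpose risk factor from the example above:
suppressMessages(library(PDtoolkit))
data(loans)
loans$Purpose <- ifelse(nchar(loans$Purpose) == 2, loans$Purpose, paste0("0", loans$Purpose))
res <- cat.bin(x = loans$Purpose, y = loans$Creditability,
               sc = NA, sc.merge = "none",
               min.pct.obs = 0.05, min.avg.rate = 0.05,
               max.groups = NA, force.trend = "modalities")
x.new <- res[[2]]
#share of observations per final bin (should respect min.pct.obs)
round(prop.table(table(x.new)), 3)
#default rate per final bin (should respect min.avg.rate)
round(tapply(loans$Creditability, x.new, mean), 3)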
cat.slice
implements manual re-coding of character vector values for a given mapping scheme.
This procedure is one of the helper functions which are handy for the model monitoring phase
(i.e. after model implementation).
cat.slice(x, mapping, sc = NA, sc.r = "SC")
x |
Character vector to be re-coded. |
mapping |
Data frame with compulsory columns: |
sc |
Character vector with special case elements. Default value is |
sc.r |
Character vector used for replacement of special cases. If supplied as one
element vector, it will be recycled to the length of |
The command cat.slice returns a vector of re-coded values and special cases.
suppressMessages(library(PDtoolkit))
data(gcd)
x <- gcd$maturity
#artificially add some special values
x[1:5] <- Inf
x[6:7] <- NA
mbin <- cum.bin(x = x, y = gcd$qual, sc.method = "together")
mbin[[1]]
gcd$x <- mbin[[2]]
cb <- cat.bin(x = gcd$x, y = gcd$qual, sc = "SC", sc.merge = "none",
              min.pct.obs = 0.05, min.avg.rate = 0.05)
x <- gcd$x
mapping <- data.frame(x.orig = x, x.mapp = cb[[2]]) %>%
           group_by(x.orig, x.mapp) %>%
           summarise(n = n(), .groups = "drop")
mapping <- data.frame(mapping[, -3])
sc <- cat.slice(x = x, mapping = mapping, sc = NA, sc.r = "SC")
#compare automatic and manual re-coding
table(cb[[2]], useNA = "always")
table(sc, useNA = "always")
confusion.matrix
returns the confusion matrix along with accompanying performance metrics.
confusion.matrix(predictions, observed, cutoff)
predictions |
Model predictions. |
observed |
Observed values of target variable. |
cutoff |
Cutoff value. Single value numeric vector between 0 and 1. |
The command confusion.matrix returns a list of two objects. The first object is the confusion matrix table, while the second is a data frame with the accompanying performance metrics.
suppressMessages(library(PDtoolkit))
data(loans)
#identify numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
#discretized numeric risk factors using mdt.bin from monobin package
loans[, num.rf] <- sapply(num.rf, function(x)
      mdt.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
res <- stepFWD(start.model = Creditability ~ 1, p.value = 0.05, coding = "WoE", db = loans)
names(res)
summary(res$model)$coefficients
loans$model.pred <- predict(res$model, type = "response")
#confusion matrix
confusion.matrix(predictions = predict(res$model, type = "response"),
                 observed = loans$"Creditability", cutoff = 0.5)
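The role of the cutoff can be illustrated by rebuilding the 2x2 table in base R and deriving a few standard metrics from it; the exact set and naming of metrics returned by confusion.matrix may differ. A minimal sketch on a simple two-factor model:
suppressMessages(library(PDtoolkit))
data(loans)
lr <- glm(Creditability ~ `Account Balance` + `Duration of Credit (month)`,
          family = "binomial", data = loans)
pred <- predict(lr, type = "response")
cutoff <- 0.5
#package output
confusion.matrix(predictions = pred, observed = loans$Creditability, cutoff = cutoff)
#manual cross-check with base R
cm <- table(predicted = as.integer(pred > cutoff), observed = loans$Creditability)
cm
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   #true positive rate (defaults caught)
specificity <- cm["0", "0"] / sum(cm[, "0"])   #true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)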
constrained.logit
performs estimation of a logistic regression with constraints on the values of the estimated coefficients.
constrained.logit(db, x, y, lower, upper)
db |
Data set of risk factors and target variable. |
x |
Character vector of risk factors (independent variables) used in logistic regression. |
y |
Character vector of target (dependent variable) used in logistic regression. |
lower |
Numeric vector of lower boundaries of the coefficients. This vector should contain value of the intercept,
therefore number of elements should be equal to number of elements of the argument |
upper |
Numeric vector of upper boundaries of the coefficients. This vector should contain value of the intercept,
therefore number of elements should be equal to number of elements of the argument |
The command constrained.logit returns a list of two vectors. The first vector contains the values of the estimated coefficients, while the second contains the predictions of the constrained logistic regression.
suppressMessages(library(PDtoolkit))
data(loans)
#model 1
reg.1 <- glm(Creditability ~ `Account Balance`, family = "binomial", data = loans)
summary(reg.1)$coefficient
loans$pred.1 <- unname(predict(reg.1, type = "response"))
#model 2
reg.2 <- glm(Creditability ~ `Age (years)`, family = "binomial", data = loans)
summary(reg.2)$coefficient
loans$pred.2 <- unname(predict(reg.2, type = "response"))
#integration
fm <- glm(Creditability ~ pred.1 + pred.2, family = "binomial", data = loans)
summary(fm)$coefficient
fm.pred <- predict(fm, type = "response", newdata = loans)
auc.model(predictions = fm.pred, observed = loans$Creditability)
#constrained integration (regression)
cl.r <- constrained.logit(db = loans, x = c("pred.1", "pred.2"),
                          y = "Creditability",
                          lower = c(-Inf, -Inf, -Inf), upper = c(Inf, 4.5, Inf))
names(cl.r)
cl.r[["beta"]]
auc.model(predictions = cl.r[["prediction"]], observed = loans$Creditability)
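Conceptually, the constrained fit maximizes the usual binomial log-likelihood subject to box constraints on the coefficients. The following hypothetical sketch illustrates that idea with base R's optim and the L-BFGS-B method; it is not necessarily how constrained.logit is implemented internally, and the bounds shown are arbitrary:
suppressMessages(library(PDtoolkit))
data(loans)
#negative log-likelihood of a logistic regression (first column of X is the intercept)
neg.loglik <- function(beta, X, y) {
      p <- 1 / (1 + exp(-drop(X %*% beta)))
      p <- pmin(pmax(p, 1e-10), 1 - 1e-10) #guard against log(0)
      -sum(y * log(p) + (1 - y) * log(1 - p))
}
y <- loans$Creditability
X <- cbind(1, loans$"Age (years)", loans$"Duration of Credit (month)")
#box constraints per coefficient: intercept free, age coefficient forced non-positive
fit <- optim(par = rep(0, 3), fn = neg.loglik, X = X, y = y,
             method = "L-BFGS-B",
             lower = c(-Inf, -1, -Inf), upper = c(Inf, 0, Inf))
fit$par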
create.partitions
performs creation of partitions (aka nested dummy variables).
Used directly in a logistic regression, partitions provide insight into the difference in log-odds between adjacent risk factor bins (groups).
Adjacent bins are selected based on the alphabetic order of the analyzed risk factor's modalities, therefore it is important
to ensure that modality labels are defined in line with the expected monotonicity or any other criterion
considered while engineering the risk factors.
create.partitions(db)
db |
Data set of risk factors to be converted into partitions. |
The command create.partitions returns a list of two objects (data frames).
The first object (partitions) contains the data set with the newly created nested dummy variables.
The second object (info) is a data frame with information on the partitioning process.
A set of quality checks is performed and reported if any of them is triggered. Two of the checks are terminal,
i.e. if triggered, the risk factor is not processed further (fewer than two non-missing groups, or more than 10 modalities),
while the third one only raises a warning, as it usually signals a deviation from the main principles of risk factor processing
(less than 5% of observations per bin).
Scallan, G. (2011). Class(ic) Scorecards: Selecting Characteristics and Attributes in Logistic Regression, Edinburgh Credit Scoring Conference.
suppressMessages(library(PDtoolkit))
data(loans)
#identify numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
#discretized numeric risk factors using cum.bin from monobin package
loans[, num.rf] <- sapply(num.rf, function(x)
      cum.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
loans.p <- create.partitions(db = loans[, num.rf])
head(loans.p[["partitions"]])
loans.p[["info"]]
#bring target to partitions
db.p <- cbind.data.frame(Creditability = loans$Creditability, loans.p[[1]])
#prepare risk factors for stepMIV
db.p[, -1] <- sapply(db.p[, -1], as.character)
#run stepMIV
res <- stepMIV(start.model = Creditability ~ 1, miv.threshold = 0.02,
               m.ch.p.val = 0.05, coding = "dummy", db = db.p)
#check output elements
names(res)
#extract the final model
final.model <- res$model
#print coefficients
summary(final.model)$coefficients
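The interpretation of nested dummies can be seen on a small hypothetical example: each dummy switches on at a given bin and stays on for all higher bins (alphabetic order), so its coefficient in a logistic regression measures the log-odds step between adjacent bins. A minimal sketch with made-up labels:
#nested (cumulative) dummy coding for an ordered, three-level risk factor;
#hypothetical illustration of the partitions idea described above
x <- c("01_low", "02_mid", "03_high", "02_mid", "01_low", "03_high")
d2 <- as.integer(x >= "02_mid")   #1 for bins "02_mid" and above
d3 <- as.integer(x >= "03_high")  #1 for bin "03_high" only
data.frame(x, d2, d3)
#in a logistic regression y ~ d2 + d3, the coefficient of d2 is the log-odds
#difference between bins "02" and "01", and the coefficient of d3 is the
#log-odds difference between bins "03" and "02"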
cutoff.palette
generates a palette of optimized cutoff values based on the confusion matrix performance metrics.
cutoff.palette(predictions, observed, min.pct.obs = 0.05, min.pct.def = 0.01)
predictions |
Model predictions. |
observed |
Observed values of target variable. |
min.pct.obs |
Minimum percentage of observations. Used to select boundaries of cutoff values. Default value is 0.05. |
min.pct.def |
Minimum percentage of default. Used to select boundaries of cutoff values. Default value is 0.01. |
The command cutoff.palette returns a data frame with the minimum and maximum values of each confusion matrix metric along with the optimized cutoff itself.
suppressMessages(library(PDtoolkit))
data(loans)
#identify numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
#discretized numeric risk factors using mdt.bin from monobin package
loans[, num.rf] <- sapply(num.rf, function(x)
      mdt.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
res <- stepFWD(start.model = Creditability ~ 1, p.value = 0.05, coding = "WoE", db = loans)
#run cutoff optimization
cop <- cutoff.palette(predictions = predict(res$model, type = "response"),
                      observed = loans$"Creditability",
                      min.pct.obs = 0.05, min.pct.def = 0.01)
cop
confusion.matrix(predictions = predict(res$model, type = "response"),
                 observed = loans$"Creditability",
                 cutoff = cop$cutoff.max[cop$metric%in%"f1.score"])
decision.tree
runs a customized decision tree algorithm. Customization refers to the minimum
percentage of observations and defaults in each node, the maximum tree depth, a monotonicity condition
at each splitting node and the statistical test (test of two proportions) used for node splitting.
decision.tree(db, rf, target, min.pct.obs = 0.05, min.avg.rate = 0.01, p.value = 0.05, max.depth = NA, monotonicity)
db |
Data frame of risk factors and target variable supplied for interaction extraction. |
rf |
Character vector of risk factor names on which decision tree is run. |
target |
Name of target variable (default indicator 0/1) within db argument. |
min.pct.obs |
Minimum percentage of observations in each leaf. Default is 0.05 or a minimum of 30 observations. |
min.avg.rate |
Minimum percentage of defaults in each leaf. Default is 0.01 or 1 default case. |
p.value |
Significance level of test of two proportions for splitting criteria. Default is 0.05. |
max.depth |
Maximum tree depth. |
monotonicity |
Logical indicator. If |
The command decision.tree returns an object of class cdt. For details on the output elements see the Examples.
suppressMessages(library(PDtoolkit))
data(loans)
#modify risk factors in order to show how the function works with missing values
loans$"Account Balance"[1:10] <- NA
loans$"Duration of Credit (month)"[c(13, 15)] <- NA
tree.res <- decision.tree(db = loans,
                          rf = c("Account Balance", "Duration of Credit (month)"),
                          target = "Creditability",
                          min.pct.obs = 0.05,
                          min.avg.rate = 0.01,
                          p.value = 0.05,
                          max.depth = NA,
                          monotonicity = TRUE)
str(tree.res)
dp.testing
performs testing of the discriminatory power of the model in use, applied to an application portfolio,
against the discriminatory power from the moment of development. Testing is based on the area
under the curve (AUC) from the application portfolio and the development sample, under the assumption that the latter is
deterministic (taken as given) and that the test statistic follows the normal distribution.
The standard error of the AUC for the application portfolio is calculated as proposed by
Hanley and McNeil (see References).
dp.testing(app.port, def.ind, pdc, auc.test, alternative, alpha = 0.05)
app.port |
Application portfolio (data frame) which contains default indicator (0/1) and calibrated probabilities of default (PD) in use. |
def.ind |
Name of the column that represents observed default indicator (0/1). |
pdc |
Name of the column that represent calibrated PD in use. |
auc.test |
Value of tested AUC (usually AUC from development sample). |
alternative |
Alternative hypothesis. Available options are: |
alpha |
Significance level of p-value for hypothesis testing. Default is 0.05. |
Because the test of discriminatory power is usually implemented on an application portfolio, certain prerequisites need to be fulfilled: a model should be developed and a rating scale should be formed. In order to reflect the appropriate role and timing of the test, the simplified example presented covers all steps before the test implementation.
The command dp.testing returns a data frame with the input parameters along with
hypothesis testing metrics such as the estimated difference between the observed (application portfolio) and tested AUC,
the standard error of the observed AUC, the p-value of the testing procedure and the accepted hypothesis.
Hanley J. and McNeil B. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143 (1), pp. 29-36.
suppressMessages(library(PDtoolkit))
data(loans)
#estimate some dummy model
mod.frm <- `Creditability` ~ `Account Balance` + `Duration of Credit (month)` +
           `Age (years)`
lr.mod <- glm(mod.frm, family = "binomial", data = loans)
summary(lr.mod)$coefficients
#model predictions
loans$pred <- unname(predict(lr.mod, type = "response", newdata = loans))
#scale probabilities
loans$score <- scaled.score(probs = loans$pred, score = 600, odd = 50/1, pdo = 20)
#group scores into rating
loans$rating <- sts.bin(x = round(loans$score), y = loans$Creditability, y.type = "bina")[[2]]
#create rating scale
rs <- loans %>%
      group_by(rating) %>%
      summarise(no = n(),
                nb = sum(Creditability),
                ng = sum(1 - Creditability)) %>%
      mutate(dr = nb / no)
rs
#calculate portfolio default rate
sum(rs$dr * rs$no / sum(rs$no))
#calibrate rating scale to central tendency of 27% with minimum PD of 5%
ct <- 0.27
min.pd <- 0.05
rs$pd <- rs.calibration(rs = rs, dr = "dr", w = "no", ct = ct, min.pd = min.pd,
                        method = "log.odds.ab")[[1]]
#check rs
sum(rs$pd * rs$no / sum(rs$no))
#bring calibrated PDs to the development sample
loans <- merge(loans, rs, by = "rating", all.x = TRUE)
#calculate development AUC
auc.dev <- auc.model(predictions = loans$pd, observed = loans$Creditability)
auc.dev
#simulate some dummy application portfolio
set.seed(321)
app.port <- loans[sample(1:nrow(loans), 400), ]
#calculate application portfolio AUC
auc.app <- auc.model(predictions = app.port$pd, observed = app.port$Creditability)
auc.app
#test deterioration of discriminatory power measured by AUC
dp.testing(app.port = app.port, def.ind = "Creditability", pdc = "pd",
           auc.test = 0.7557, alternative = "less", alpha = 0.05)
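The testing logic described above can be reproduced by hand: compute the Hanley-McNeil standard error of the application-portfolio AUC and compare the AUC difference with a normal quantile. A sketch with hypothetical portfolio figures (the exact implementation details of dp.testing may differ):
#Hanley & McNeil (1982) standard error of AUC and a one-sided z-test
hm.se <- function(auc, nb, ng) {
      q1 <- auc / (2 - auc)
      q2 <- 2 * auc^2 / (1 + auc)
      sqrt((auc * (1 - auc) + (nb - 1) * (q1 - auc^2) + (ng - 1) * (q2 - auc^2)) / (nb * ng))
}
#hypothetical application portfolio figures
auc.app  <- 0.74   #observed AUC on the application portfolio
auc.test <- 0.7557 #AUC from the development sample (treated as given)
nb <- 120          #number of defaults in the application portfolio
ng <- 280          #number of non-defaults
se <- hm.se(auc = auc.app, nb = nb, ng = ng)
z  <- (auc.app - auc.test) / se
p.val <- pnorm(z)  #alternative = "less": has discriminatory power deteriorated?
c(se = se, z = z, p.value = p.val)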
embedded.blocks
performs blockwise regression where the predictions of each block's model are used as a
risk factor in the model of the following block.
embedded.blocks( method, target, db, coding = "WoE", blocks, p.value = 0.05, miv.threshold = 0.02, m.ch.p.val = 0.05 )
method |
Regression method applied on each block.
Available methods: |
target |
Name of target variable within |
db |
Modeling data with risk factors and target variable. |
coding |
Type of risk factor coding within the model. Available options are: |
blocks |
Data frame with defined risk factor groups. It has to contain the following columns: |
p.value |
Significance level of p-value for the estimated coefficient. For |
miv.threshold |
MIV (marginal information value) entrance threshold, applicable only for the "stepMIV" method. Only risk factors with MIV higher than the threshold are candidates for the new model. An additional criterion is that the MIV value should significantly separate good from bad cases, as measured by the marginal chi-square test. |
m.ch.p.val |
Significance level of the p-value for the marginal chi-square test, applicable only for the "stepMIV" method. This test additionally supports the MIV value of the candidate risk factor in the final decision. |
The command embedded.blocks returns a list of three objects.
The first object (model) is the list of models of each block (objects of class inheriting from "glm").
The second object (steps) is the data frame with the risk factors selected from each block.
The third object (dev.db) is the list of the blocks' model development databases.
Anderson, R.A. (2021). Credit Intelligence & Modelling, Many Paths through the Forest of Credit Rating and Scoring, OUP Oxford
See also: staged.blocks, ensemble.blocks, stepMIV, stepFWD, stepRPC, stepFWDr and stepRPCr.
suppressMessages(library(PDtoolkit))
data(loans)
#create risk factor priority groups
rf.all <- names(loans)[-1]
set.seed(22)
blocks <- data.frame(rf = rf.all, block = sample(1:3, length(rf.all), rep = TRUE))
blocks <- blocks[order(blocks$block), ]
blocks
#method: stepFWDr
res <- embedded.blocks(method = "stepFWDr", target = "Creditability",
                       db = loans, blocks = blocks, p.value = 0.05)
names(res)
nb <- length(res[["models"]])
res$models[[nb]]
auc.model(predictions = predict(res$models[[nb]], type = "response",
                                newdata = res$dev.db[[nb]]),
          observed = res$dev.db[[nb]]$Creditability)
encode.woe
implements replacement of character vector values with WoE values for a given mapping scheme.
This procedure is one of the helper functions which are handy for the model monitoring phase
(i.e. after model implementation).
encode.woe(x, mapping)
x |
Character vector to be re-coded. |
mapping |
Data frame with compulsory columns: |
The command encode.woe returns a vector of re-coded WoE values.
suppressMessages(library(PDtoolkit))
data(gcd)
mbin <- cum.bin(x = gcd$maturity, y = gcd$qual, sc.method = "together")
mbin[[1]]
table(mbin[[2]], useNA = "always")
gcd$x.mod <- mbin[[2]]
woe.rep <- replace.woe(db = gcd[, c("qual", "x.mod")], target = "qual")
gcd$x.woe <- woe.rep[[1]]$x
mapping <- data.frame(x.mod = gcd$x.mod, x.woe = gcd$x.woe) %>%
           group_by(x.mod, x.woe) %>%
           summarise(n = n(), .groups = "drop")
mapping <- data.frame(mapping[, -3])
ewoe <- encode.woe(x = gcd$x.mod, mapping = mapping)
identical(ewoe, woe.rep[[1]]$x)
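The core of the mapping step is a simple category-to-WoE lookup. A minimal hypothetical sketch of the same idea in base R (bin labels and WoE values are made up; the column names follow the mapping data frame built in the example above):
#manual WoE replacement using a named-vector lookup
mapping <- data.frame(x.mod = c("01 [4,16)", "02 [16,Inf)", "SC"),
                      x.woe  = c(-0.35, 0.42, 0.00))
x <- c("02 [16,Inf)", "01 [4,16)", "SC", "01 [4,16)")
woe.lkp <- setNames(mapping$x.woe, mapping$x.mod)
unname(woe.lkp[x])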
ensemble.blocks
performs blockwise regression where the predictions of each block's model are
integrated into a final model. The final model is estimated as a logistic regression without
any checks on the estimated coefficients (e.g. statistical significance or sign).
ensemble.blocks( method, target, db, coding = "WoE", blocks, p.value = 0.05, miv.threshold = 0.02, m.ch.p.val = 0.05 )
method |
Regression method applied on each block.
Available methods: |
target |
Name of target variable within |
db |
Modeling data with risk factors and target variable. |
coding |
Type of risk factor coding within the model. Available options are: |
blocks |
Data frame with defined risk factor groups. It has to contain the following columns: |
p.value |
Significance level of p-value for the estimated coefficient. For |
miv.threshold |
MIV (marginal information value) entrance threshold, applicable only for the "stepMIV" method. Only risk factors with MIV higher than the threshold are candidates for the new model. An additional criterion is that the MIV value should significantly separate good from bad cases, as measured by the marginal chi-square test. |
m.ch.p.val |
Significance level of the p-value for the marginal chi-square test, applicable only for the "stepMIV" method. This test additionally supports the MIV value of the candidate risk factor in the final decision. |
The command ensemble.blocks returns a list of three objects.
The first object (model) is the list of models of each block (objects of class inheriting from "glm").
The second object (steps) is the data frame with the risk factors selected from each block.
The third object (dev.db) is the list of the blocks' model development databases.
Anderson, R.A. (2021). Credit Intelligence & Modelling, Many Paths through the Forest of Credit Rating and Scoring, OUP Oxford
See also: staged.blocks, embedded.blocks, stepMIV, stepFWD, stepRPC, stepFWDr and stepRPCr.
suppressMessages(library(PDtoolkit))
data(loans)
#create risk factor priority groups
rf.all <- names(loans)[-1]
set.seed(22)
blocks <- data.frame(rf = rf.all, block = sample(1:3, length(rf.all), rep = TRUE))
blocks <- blocks[order(blocks$block), ]
blocks
#method: stepRPCr
res <- ensemble.blocks(method = "stepRPCr", target = "Creditability",
                       db = loans, blocks = blocks, p.value = 0.05)
names(res)
nb <- length(res[["models"]])
res$models[[nb]]
auc.model(predictions = predict(res$models[[nb]], type = "response",
                                newdata = res$dev.db[[nb]]),
          observed = res$dev.db[[nb]]$Creditability)
evrs
calculates the economic benefits of an improved PD model based on the increase in portfolio return.
The implemented algorithm replicates the framework presented in the Reference under the assumption that
the bank adopts a continuous PD rating scale. Despite this assumption, the results are almost identical
to those of the base case portfolio scenarios from the Reference.
evrs( db, pd, benchmark, lgd, target, sigma = NA, r, elasticity, prob.to.leave.threshold, sim.num = 500, seed = 991 )
db |
Data frame with at least the following columns: default indicator (target), PDs of model in use, PDs of benchmark model and LGD values. |
pd |
Name of PD of model in use within db argument. |
benchmark |
Name of PD of benchmark model within db argument. |
lgd |
Name of LGD values within db argument. |
target |
Name of target (default indicator 0/1) within db argument. |
sigma |
Measurement error of model in use. If default value ( |
r |
Risk-free rate. |
elasticity |
Elasticity parameter used to define customer churn in case of loan overpricing. |
prob.to.leave.threshold |
Threshold for customers' probability to leave in case of loan overpricing. |
sim.num |
Number of simulations. Default is 500. |
seed |
Random seed to ensure reproducibility. Default is 991. |
The command evrs returns a list of two elements. The first element is the data frame summary.tbl,
which provides a simulation summary: number of simulations, number of successful simulations,
population size (number of observations of the supplied db data frame), measurement error,
average churn value (number of customers that left the portfolio due to overpricing), average return of the simulated
portfolios, return of the benchmark portfolio and the return difference (the main result of the simulation). The second element is
a numeric vector of the return averages of the simulated portfolios.
Jankowitsch et al. (2007). Modelling the economic value of credit rating systems. Journal of Banking & Finance, Volume 31, Issue 1, doi:10.1016/j.jbankfin.2006.01.003.
suppressMessages(library(PDtoolkit))
data(loans)
#simulate model in use
miu.formula <- Creditability ~ `Age (years)` + `Duration of Credit (month)` +
               `Value Savings/Stocks` + `Purpose`
miu <- glm(miu.formula, family = "binomial", data = loans)
miu.pd <- unname(predict(miu, type = "response", newdata = loans))
#simulate benchmark model with interaction.transformer support
bnm.pack <- stepFWDr(start.model = Creditability ~ 1, p.value = 0.05, db = loans,
                     check.start.model = TRUE, offset.vals = NULL)
bnm <- bnm.pack$model
bnm.pd <- unname(predict(bnm, type = "response", newdata = bnm.pack$dev.db))
#prepare data for evrs function
db <- data.frame("Creditability" = loans$Creditability,
                 pd = miu.pd,
                 pd.benchmark = bnm.pd,
                 lgd = 0.75)
#calculate the difference in portfolio return between the model in use and the benchmark model
res <- evrs(db = db, pd = "pd", benchmark = "pd.benchmark", lgd = "lgd",
            target = "Creditability", sigma = NA, r = 0.03, elasticity = 100,
            prob.to.leave.threshold = 0.5, sim.num = 500, seed = 991)
names(res)
#print simulation summary table
res[["summary.tbl"]]
#portfolio return increase in case of using benchmark model
res[["summary.tbl"]][, "return.difference", drop = FALSE]
#summary of simulated returns
summary(res[["return.sim"]])
fairness.vld
performs fairness validation for a given sensitive attribute and selected outcome.
The sensitive attribute should be a categorical variable with a reasonable number of modalities, while
the outcome can be categorical (e.g. a reject/accept indicator or rating grade) or continuous (e.g. an interest rate or amount).
Depending on the type of model outcome (see argument mod.outcome.type), a chi-square test or a Wald test is applied.
fairness.vld( db, sensitive, obs.outcome, mod.outcome, conditional = NULL, mod.outcome.type, p.value )
db |
Data frame with sensitive attribute, observed outcome, model outcome and conditional attribute. |
sensitive |
Name of sensitive attribute within |
obs.outcome |
Name of observed outcome within |
mod.outcome |
Name of model outcome within |
conditional |
Name of conditional attribute within |
mod.outcome.type |
Type of model outcome. Possible values are |
p.value |
Significance level of applied statistical test (chi-square or Wald test). |
The command fairness.vld returns a list of three data frames.
The first object (SP) provides the results of statistical parity testing.
The second object (CSP) provides the results of conditional statistical parity testing;
this object is returned only if a conditional attribute is supplied.
The third object (EO) provides the results of equal opportunity testing.
Hurlin, Christophe and Perignon, Christophe and Saurin, Sebastien (2022), The Fairness of Credit Scoring Models. HEC Paris Research Paper No. FIN-2021-1411
suppressMessages(library(PDtoolkit))
#build hypothetical model
data(loans)
#numeric risk factors
#num.rf <- sapply(loans, is.numeric)
#num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
num.rf <- c("Credit Amount", "Age (years)")
#discretized numeric risk factors using ndr.bin from monobin package
loans[, num.rf] <- sapply(num.rf, function(x)
      ndr.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
#run stepMIV
rf <- c("Account Balance", "Payment Status of Previous Credit", "Purpose",
        "Value Savings/Stocks", "Credit Amount", "Age (years)",
        "Instalment per cent", "Foreign Worker")
res <- stepMIV(start.model = Creditability ~ 1, miv.threshold = 0.02,
               m.ch.p.val = 0.05, coding = "WoE", coding.start.model = FALSE,
               db = loans[, c("Creditability", rf)])
#print coefficients
summary(res$model)$coefficients
#prepare data frame for fairness validation
db.fa <- data.frame(Creditability = loans$Creditability,
                    mpred = predict(res$model, type = "response", newdata = res$dev.db))
#add hypothetical reject/accept indicator
db.fa$rai <- ifelse(db.fa$mpred > 0.5, 1, 0)
#add hypothetical rating
db.fa$rating <- sts.bin(x = round(db.fa$mpred, 4), y = db.fa$Creditability)[[2]]
#add hypothetical interest rate
ir.r <- seq(0.03, 0.10, length.out = 6)
names(ir.r) <- sort(unique(db.fa$rating))
db.fa$ir <- ir.r[db.fa$rating]
#add hypothetical sensitive attribute
db.fa$sensitive.1 <- ifelse(loans$"Sex & Marital Status"%in%2, 1, 0) #not in a model
db.fa$sensitive.2 <- ifelse(loans$"Age (years)"%in%"03 [35,Inf)", 1, 0) #in a model
#add some attributes for calculation of conditional statistical parity
db.fa$"Credit Amount" <- loans$"Credit Amount"
head(db.fa)
#discrete model outcome - sensitive attribute not in a model
fairness.vld(db = db.fa,
             sensitive = "sensitive.1",
             obs.outcome = "Creditability",
             mod.outcome = "rai",
             conditional = "Credit Amount",
             mod.outcome.type = "disc",
             p.value = 0.05)
##discrete model outcome - sensitive attribute in a model
#fairness.vld(db = db.fa,
#             sensitive = "sensitive.2",
#             obs.outcome = "Creditability",
#             mod.outcome = "rai",
#             conditional = "Credit Amount",
#             mod.outcome.type = "disc",
#             p.value = 0.05)
##continuous outcome - sensitive attribute not in a model
#fairness.vld(db = db.fa,
#             sensitive = "sensitive.1",
#             obs.outcome = "Creditability",
#             mod.outcome = "ir",
#             conditional = "Credit Amount",
#             mod.outcome.type = "cont",
#             p.value = 0.05)
#continuous outcome - sensitive attribute in a model
fairness.vld(db = db.fa,
             sensitive = "sensitive.2",
             obs.outcome = "Creditability",
             mod.outcome = "ir",
             conditional = "Credit Amount",
             mod.outcome.type = "cont",
             p.value = 0.05)
heterogeneity
performs heterogeneity testing of a PD model based on the rating grades.
This test is usually applied to an application portfolio, but it can also be applied to the model development sample.
heterogeneity(app.port, def.ind, rating, alpha = 0.05)
app.port |
Application portfolio (data frame) which contains default indicator (0/1) and ratings in use. |
def.ind |
Name of the column that represents observed default indicator (0/1). |
rating |
Name of the column that represent rating grades in use. |
alpha |
Significance level of p-value for two proportion test. Default is 0.05. |
The testing procedure starts by summarizing the number of observations and defaults per rating grade.
After that, a two proportion test is applied to adjacent rating grades. The tested hypothesis is that
the default rate of grade i is less (or greater) than the default rate of grade i - 1, where i
takes values from 2 to the number of unique grades.
The direction of the alternative hypothesis (less or greater) is determined automatically, based on the
direction of the correlation between the observed default indicator and the rating grades.
Incomplete cases, identified based on the default indicator (def.ind) and rating grade (rating) columns,
are excluded from the summary table and the testing procedure; if identified, a warning is returned.
The command heterogeneity
returns a data frame with the following columns:
rating: Unique values of rating grades from application portfolio.
no: Number of complete observations.
nb: Number of defaults (bad cases) in complete observations.
p.val: Test p-value (two proportion test of adjacent rating grades).
alpha: Selected significance level.
res: Accepted hypothesis.
suppressMessages(library(PDtoolkit))
data(loans)
#estimate some dummy model
mod.frm <- `Creditability` ~ `Account Balance` + `Duration of Credit (month)` +
           `Age (years)` + `Value Savings/Stocks`
lr.mod <- glm(mod.frm, family = "binomial", data = loans)
summary(lr.mod)$coefficients
#model predictions
loans$pred <- unname(predict(lr.mod, type = "response", newdata = loans))
#scale probabilities
loans$score <- scaled.score(probs = loans$pred, score = 600, odd = 50/1, pdo = 20)
#group scores into ratings
loans$rating.1 <- sts.bin(x = round(loans$score), y = loans$Creditability, y.type = "bina")[[2]]
#group probabilities into ratings
loans$rating.2 <- sts.bin(x = round(loans$pred, 4), y = loans$Creditability, y.type = "bina")[[2]]
#simulate dummy application portfolio
set.seed(1984)
app.port <- loans[sample(1:nrow(loans), 400, rep = TRUE), ]
#run heterogeneity test on ratings based on the scaled score
#higher score lower default rate
heterogeneity(app.port = app.port, def.ind = "Creditability",
              rating = "rating.1", alpha = 0.05)
#run test on predicted default rate - direction of the test is changed
heterogeneity(app.port = app.port, def.ind = "Creditability",
              rating = "rating.2", alpha = 0.05)
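A single adjacent-grade comparison can be reproduced with base R's prop.test, which mirrors the two proportion test described above. A sketch with hypothetical counts for two neighbouring grades:
#test of two proportions for a pair of adjacent rating grades
no <- c(250, 180)  #observations in grades i - 1 and i (hypothetical)
nb <- c(20, 35)    #defaults in grades i - 1 and i (hypothetical)
#alternative "less": default rate of grade i - 1 is lower than that of grade i
prop.test(x = nb, n = no, alternative = "less", correct = FALSE)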
hhi
calculates the Herfindahl-Hirschman Index (HHI).
hhi(x)
x |
Numeric vector of input values (e.g. number of observations or sum of exposure per rating grade). |
The command hhi
returns a single-element numeric vector of the HHI value.
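As a quick illustration of the assumed formula (HHI as the sum of squared shares; a sketch, not the package source):
x <- c(100, 250, 400, 150, 100)   #hypothetical number of observations per rating grade
shares <- x / sum(x)
sum(shares^2)                     #HHI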
#simulate PD model and rating scale suppressMessages(library(PDtoolkit)) data(loans) res <- stepFWDr(start.model = Creditability ~ 1, p.value = 0.05, db = loans) mod.predictions <- unname(predict(res$model, type = "response")) rating.scale <- sts.bin(y = loans$Creditability, x = mod.predictions)[[1]] #calculate HHI hhi(x = rating.scale$no)
homogeneity
performs homogeneity testing of a PD model based on the rating grades and a selected segment.
This test is usually applied to the application portfolio, but it can also be applied to the model development sample.
The method requires a sufficient number of observations per segment modality within each rating in order
to produce reliable results; for segment modalities with fewer than 30 observations, the test is not performed.
If the user selects a numeric variable from the application portfolio as a segment, the variable is first grouped
into the selected number of groups (argument segment.num).
homogeneity(app.port, def.ind, rating, segment, segment.num, alpha = 0.05)
app.port |
Application portfolio (data frame) which contains default indicator (0/1), ratings in use and variable used as a segment. |
def.ind |
Name of the column that represents observed default indicator (0/1). |
rating |
Name of the column that represents rating grades in use. |
segment |
Name of the column that represents testing segments. If it is of numeric type, then it is first grouped
into |
segment.num |
Number of groups used for numeric variables supplied as a segment. Only applicable if |
alpha |
Significance level of p-value for two proportion test. Default is 0.05. |
The testing procedure is implemented for each rating separately, comparing the default rate of one segment modality to the default rate of the remaining segment modalities.
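A minimal sketch of this comparison for a single rating grade, using base R prop.test with hypothetical counts (not the package implementation):
nb.segment <- 8; no.segment <- 60     #defaults and observations in the tested segment modality
nb.rest <- 15; no.rest <- 190         #defaults and observations in the remaining modalities
prop.test(x = c(nb.segment, nb.rest), n = c(no.segment, no.rest),
          alternative = "two.sided", correct = FALSE)$p.value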
The command homogeneity
returns a data frame with the following columns:
segment.var: Variable used as a segment.
rating: Unique values of rating grades from the application portfolio.
segment.mod: Tested segment modality. The default rate of this segment modality is compared with the default rate of the rest of the modalities within each rating.
no: Number of observations of the analyzed rating.
nb: Number of defaults (bad cases) of the analyzed rating.
no.segment: Number of observations of the analyzed segment modality.
no.rest: Number of observations of the rest of the segment modalities.
nb.segment: Number of defaults of the analyzed segment modality.
nb.rest: Number of defaults of the rest of the segment modalities.
p.val: Two proportion test (two sided) p-value.
alpha: Selected significance level.
res: Accepted hypothesis.
suppressMessages(library(PDtoolkit)) data(loans) #estimate some dummy model mod.frm <- `Creditability` ~ `Account Balance` + `Duration of Credit (month)` + `Age (years)` + `Value Savings/Stocks` + `Duration in Current address` lr.mod <- glm(mod.frm, family = "binomial", data = loans) summary(lr.mod)$coefficients #model predictions loans$pred <- unname(predict(lr.mod, type = "response", newdata = loans)) #scale probabilities loans$score <- scaled.score(probs = loans$pred, score = 600, odd = 50/1, pdo = 20) #group scores into ratings loans$rating <- ndr.bin(x = round(loans$score), y = loans$Creditability, y.type = "bina")[[2]] #simulate dummy application portfolio (oversample loans data) set.seed(2211) app.port <- loans[sample(1:nrow(loans), 2500, rep = TRUE), ] #run homogeneity test on ratings based on the Credit Amount segments homogeneity(app.port = app.port, def.ind = "Creditability", rating = "rating", segment = "Credit Amount", segment.num = 4, alpha = 0.05)
imp.outliers
replaces a predefined share of the smallest and largest values with less
extreme values. This procedure is applicable only to numeric risk factors.
imp.outliers( db, sc = c(NA, NaN, Inf, -Inf), method = "iqr", range = 1.5, upper.pct = 0.95, lower.pct = 0.05 )
db |
Data frame of risk factors supplied for imputation. |
sc |
Vector of all special case elements. Default values are |
method |
Imputation method. Available options are: |
range |
Determines how far the plot whiskers extend out from the box. If range is positive,
the whiskers extend to the most extreme data point which is no more than range times the
interquartile range from the box. A value of zero causes the whiskers to extend to
the data extremes. Default |
upper.pct |
Upper limit for percentile method. All values above this limit will be replaced by the value
identified at this percentile. Default value is set to |
lower.pct |
Lower limit for percentile method. All values below this limit will be replaced by the value
identified at this percentile. Default value is set to |
This function returns a list of two data frames. The first data frame contains the analyzed risk factors with
imputed values for outliers, while the second data frame presents the imputation report. Using the imputation report,
for each risk factor, the user can inspect the imputation info (info), the imputation method (imputation.method),
the imputed values (imputation.val.upper and imputation.val.lower)
and the number of imputed observations (imputation.num.upper and imputation.num.lower).
suppressMessages(library(PDtoolkit)) data(gcd) gcd$age[1:20] <- NA gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual, sc.method = "separately", y.type = "bina")[[2]] gcd$dummy1 <- NA imput.res.1 <- imp.outliers(db = gcd[, -1], method = "iqr", range = 1.5) #analyzed risk factors with imputed values head(imput.res.1[[1]]) #imputation report imput.res.1[[2]] #percentile method imput.res.2 <- imp.outliers(db = gcd[, -1], method = "percentile", upper.pct = 0.95, lower.pct = 0.05) #analyzed risk factors with imputed values head(imput.res.2[[1]]) #imputation report imput.res.2[[2]]
imp.sc
imputes values for special cases.
imp.sc( db, sc.all = c(NA, NaN, Inf, -Inf), sc.replace = c(NA, NaN, Inf, -Inf), method.num = "automatic", p.val = 0.05 )
db |
Data frame of risk factors supplied for imputation. |
sc.all |
Vector of all special case elements. Default values are |
sc.replace |
Vector of special case element to be replaced. Default values are |
method.num |
Imputation method for numeric risk factors. Available options are: |
p.val |
Significance level of p-value for Pearson normality test. Applicable only if |
This function returns a list of two data frames. The first data frame contains the analyzed risk factors with
imputed values for special cases, while the second data frame presents the imputation report.
Using the imputation report, for each risk factor, the user can inspect the imputation info (info),
the imputation method (imputation.method), the imputed value (imputed.value),
the number of imputed observations (imputation.num) and the imputed mode
(imputed.mode - applicable only to categorical risk factors).
suppressMessages(library(PDtoolkit)) data(gcd) gcd$age[1:20] <- NA gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual, sc.method = "separately", y.type = "bina")[[2]] gcd$dummy1 <- NA #select risk factors for which we want to impute missing values (NA) db.imp <- gcd[, c("age", "age.bin", "dummy1")] colSums(is.na(db.imp)) imput.res <- imp.sc(db = db.imp, method.num = "automatic", p.val = 0.05) #analyzed risk factors with imputed values head(imput.res[[1]]) #imputation report imput.res[[2]]
interaction.transformer
extracts the interaction between the supplied risk factors from a decision tree.
It implements a customized decision tree algorithm that takes into account different conditions such as minimum
percentage of observations and defaults in each node, maximum tree depth and a monotonicity condition
at each splitting node. The Gini index is used as the metric for node splitting.
interaction.transformer( db, rf, target, min.pct.obs, min.avg.rate, max.depth, monotonicity, create.interaction.rf )
db |
Data frame of risk factors and target variable supplied for interaction extraction. |
rf |
Character vector of risk factor names on which decision tree is run. |
target |
Name of target variable (default indicator 0/1) within db argument. |
min.pct.obs |
Minimum percentage of observation in each leaf. |
min.avg.rate |
Minimum percentage of defaults in each leaf. |
max.depth |
Maximum number of splits. |
monotonicity |
Logical indicator. If |
create.interaction.rf |
Logical indicator. If |
The command interaction.transformer
returns a list of two data frames. The first data frame provides
the tree summary. The second data frame is a new risk factor extracted from decision tree.
suppressMessages(library(PDtoolkit)) data(loans) #modify risk factors in order to show how the function works with missing values loans$"Account Balance"[1:10] <- NA loans$"Duration of Credit (month)"[c(13, 15)] <- NA it <- interaction.transformer(db = loans, rf = c("Account Balance", "Duration of Credit (month)"), target = "Creditability", min.pct.obs = 0.05, min.avg.rate = 0.01, max.depth = 2, monotonicity = TRUE, create.interaction.rf = TRUE) names(it) it[["tree.info"]] tail(it[["interaction"]]) table(it[["interaction"]][, "rf.inter"], useNA = "always")
kfold.idx
provides indices for K-fold validation.
kfold.idx(target, k = 10, type, seed = 2191)
target |
Binary target variable. |
k |
Number of folds. If |
type |
Sampling type. Possible options are |
seed |
Random seed needed for ensuring the result reproducibility. Default is 2191. |
The command kfold.idx
returns a list of k folds estimation and validation indices.
suppressMessages(library(PDtoolkit)) data(loans) #good-bad ratio prop.table(table(loans$Creditability)) #random k-folds kf.r <- kfold.idx(target = loans$Creditability, k = 5, type = "random", seed = 2191) lapply(kf.r, function(x) prop.table(table(loans$Creditability[x[[2]]]))) #stratified k-folds kf.s <- kfold.idx(target = loans$Creditability, k = 5, type = "stratified", seed = 2191) lapply(kf.s, function(x) prop.table(table(loans$Creditability[x[[2]]])))
kfold.vld
performs k-fold model cross-validation.
The main goal of this procedure is to generate the main model performance metrics, such as absolute mean
square error, root mean square error or area under curve (AUC), based on a resampling method.
kfold.vld(model, k = 10, seed = 1984)
model |
Model in use, an object of class inheriting from |
k |
Number of folds. If |
seed |
Random seed needed for ensuring the result reproducibility. Default is 1984. |
The command kfold.vld
returns a list of two objects.
The first object (iter) returns the performance metrics of each iteration.
The second object (summary) is a data frame with the averages of the performance metrics across iterations.
suppressMessages(library(PDtoolkit)) data(loans) #run stepFWD res <- stepFWD(start.model = Creditability ~ 1, coding = "WoE", db = loans) #check output elements names(res) #extract the final model final.model <- res$model #print coefficients summary(final.model)$coefficients #print head of coded development data head(res$dev.db) #calculate AUC auc.model(predictions = predict(final.model, type = "response", newdata = res$dev.db), observed = res$dev.db$Creditability) kfold.vld(model = final.model, k = 10, seed = 1984)
The German Credit Data contains data on 20 variables and the classification of whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants. Column names are used as given in the source file. Note that a subset of these data is also available in the 'monobin' package (gcd) and is used for some examples in the 'PDtoolkit' package.
loans
An object of class data.frame
with 1000 rows and 21 columns.
https://online.stat.psu.edu/stat857/node/215/
normal.test
performs multi-period testing of PD model predictive power. This procedure can be applied
at the rating grade level as well as at the portfolio level.
normal.test(pdc, odr, alpha = 0.05)
pdc |
Numeric vector of calibrated probabilities of default (PD). |
odr |
Numeric vector of observed default rates. |
alpha |
Significance level of p-value for implemented tests. Default is 0.05. |
The command normal.test
returns a data frame with the estimated difference between odr and
pdc, the test statistic, the standard error of the test statistic, the selected significance level,
the p-value of the test statistic and, finally, the test result.
Basel Committee on Banking Supervision (2005). Studies on the Validation of Internal Rating Systems, working paper no. 14.
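The following sketch shows one common formulation of such a multi-period test (the average difference between observed default rates and calibrated PDs, standardized by its standard error); it is an assumption for illustration only and not necessarily the exact statistic implemented in normal.test:
pdc <- rep(0.02, 5)                           #calibrated PDs over five periods
odr <- c(0.021, 0.025, 0.027, 0.022, 0.028)   #hypothetical observed default rates
d <- odr - pdc
se <- sd(d) / sqrt(length(d))
z <- mean(d) / se
c(diff = mean(d), stat = z, se = se, p.val = 1 - pnorm(z))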
set.seed(678) normal.test(pdc = rep(0.02, 5), odr = runif(5, 0.02, 0.03))
num.slice
implements manual discretization of a numeric vector for given boundaries. This procedure is one
of the helper functions which are handy for the model monitoring phase (i.e. after model implementation).
num.slice(x, mapping, sc = c(NA, NaN, Inf, -Inf), sc.r = "SC")
x |
Numeric vector to be discretized. |
mapping |
Data frame with compulsory columns: |
sc |
Numeric vector with special case elements. Default values are |
sc.r |
Character vector used for replacement of special cases. If supplied as one
element vector, it will be recycled to the length of |
The command num.slice
returns vector of discretized values and coded special cases.
suppressMessages(library(PDtoolkit)) data(gcd) x <- gcd$maturity #artificially add some special values x[1:5] <- Inf x[6:7] <- NA #perform monotonic grouping in order to get bins' boundaries mbin <- sts.bin(x = x, y = gcd$qual, sc.method = "separately") mbin[[1]] #slice numeric variable sn <- num.slice(x = x, mapping = data.frame(x.min = mbin[[1]]$x.min[-c(1, 2)], x.max = mbin[[1]]$x.max[-c(1, 2)]), sc = c(NA, NaN, Inf, -Inf), sc.r = "SC") #compare automatic and manual binning table(mbin[[2]], useNA = "always") table(sn, useNA = "always")
nzv
procedure aims to identify risk factors with low variability (almost constants). Usually, these risk factors are
investigated by experts and a decision is made whether they should be excluded from the further modeling process. nzv
output report includes the following metrics:
rf: Risk factor name.
rf.type: Risk factor class. This metric is always one of: numeric
or categorical
.
sc.num: Number of special cases.
sc.pct: Percentage of special cases in total number of observations.
cc.num: Number of complete cases.
cc.pct: Percentage of complete cases in total number of observations. Sum of this value and sc.pct
is equal to 1.
cc.unv: Number of unique values in complete cases.
cc.unv.pct: Percentage of unique values in total number of complete cases.
cc.lbl.1: The most frequent value in complete cases.
cc.frq.1: Number of occurrence of the most frequent value in complete cases.
cc.lbl.2: The second most frequent value in complete cases.
cc.frq.2: Number of occurrence of the second most frequent value in complete cases.
cc.fqr: Frequency ratio - the ratio between the occurrence of most frequent and the second most frequent value in complete cases.
ind: Indicator which takes the value of 1
if the percentage of complete cases is less than 10% and the frequency ratio
(cc.fqr) is greater than 19. This value can be used for filtering risk factors that need further expert
investigation, but users are also encouraged to derive their own indicators based on the reported
metrics (illustrated in the sketch below).
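A minimal sketch of how the frequency ratio and the ind flag described above could be derived for a single hypothetical risk factor (not the package source):
x <- c(rep("A", 950), rep("B", 30), rep(NA, 20))   #hypothetical risk factor
cc <- x[!x %in% c(NA, NaN, Inf, -Inf)]             #complete cases
frq <- sort(table(cc), decreasing = TRUE)
cc.fqr <- unname(frq[1] / frq[2])                  #frequency ratio
cc.pct <- length(cc) / length(x)                   #share of complete cases
ind <- as.numeric(cc.pct < 0.10 & cc.fqr > 19)     #rule as stated above
c(cc.fqr = cc.fqr, cc.pct = cc.pct, ind = ind)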
nzv(db, sc = c(NA, NaN, Inf, -Inf))
db |
Data frame of risk factors supplied for near-zero variance analysis. |
sc |
Numeric or character vector with special case elements. Default values are |
The command nzv
returns a data frame with the metrics needed for identification of near-zero variance risk factors.
For details see Description section.
suppressMessages(library(PDtoolkit)) data(loans) #artificially add some special values loans$"Account Balance"[1:10] <- NA rf.s <- nzv(db = loans, sc = c(NA, NaN, Inf, -Inf)) rf.s
power
performs Monte Carlo simulation of the power of the statistical tests used for testing the predictive
ability of a PD rating model. It covers four tests: the binomial, Jeffreys, z-score and Hosmer-Lemeshow test.
The procedure is applied under the assumption that the observed default rate is the true one, and it is
used to check whether calibrated PDs are underestimated for the binomial, Jeffreys and z-score tests.
Therefore, for the cases where the observed default rate is lower than the calibrated PD, the power calculation is not performed and a comment is reported instead.
The Hosmer-Lemeshow test is used to test whether the calibrated PD is the true one, regardless of the difference between the observed and calibrated
portfolio default rate.
power(rating.label, pdc, no, nb, alpha = 0.05, sim.num = 1000, seed = 2211)
rating.label |
Vector of rating labels. |
pdc |
Vector of calibrated probabilities of default (PD). |
no |
Number of observations per rating grade. |
nb |
Number of defaults (bad cases) per rating grade. |
alpha |
Significance level of p-value for implemented tests. Default is 0.05. |
sim.num |
Number of Monte Carlo simulations. Default is 1000. |
seed |
Random seed needed for ensuring the result reproducibility. Default is 2211. |
Because the test of predictive power is usually implemented on the application portfolio, certain prerequisites need to be fulfilled: first, a model should be developed and a rating scale formed. In order to reflect the proper role and the right moment of test application, the simplified example presented covers all steps before the tests are implemented.
The command power
returns a list of two objects. Both are data frames:
the first one presents the power calculation of the tests usually applied at the
rating grade level (binomial, Jeffreys and z-score tests), while the second one presents the results of
the Hosmer-Lemeshow test, which is applied to the complete rating scale.
For both levels of implementation (rating grade or complete scale), if the observed default
rate is less than the calibrated PD, the function returns a comment and the power simulation
is not performed.
suppressMessages(library(PDtoolkit)) data(loans) #estimate some dummy model mod.frm <- `Creditability` ~ `Account Balance` + `Duration of Credit (month)` + `Age (years)` lr.mod <- glm(mod.frm, family = "binomial", data = loans) summary(lr.mod)$coefficients #model predictions loans$pred <- unname(predict(lr.mod, type = "response", newdata = loans)) #scale probabilities loans$score <- scaled.score(probs = loans$pred, score = 600, odd = 50/1, pdo = 20) #group scores into rating loans$rating <- sts.bin(x = round(loans$score), y = loans$Creditability, y.type = "bina")[[2]] #create rating scale rs <- loans %>% group_by(rating) %>% summarise(no = n(), nb = sum(Creditability), ng = sum(1 - Creditability)) %>% mutate(dr = nb / no) rs #calcualte portfolio default rate sum(rs$dr * rs$no / sum(rs$no)) #calibrate rating scale to central tendency of 27% with minimum PD of 5% ct <- 0.27 min.pd <- 0.05 rs$pd <- rs.calibration(rs = rs, dr = "dr", w = "no", ct = ct, min.pd = min.pd, method = "log.odds.ab")[[1]] #check rs sum(rs$pd * rs$no / sum(rs$no)) #simulate some dummy application portfolio set.seed(22) app.port <- loans[sample(1:nrow(loans), 400), ] #summarise application portfolio on rating level ap.summary <- app.port %>% group_by(rating) %>% summarise(no = n(), nb = sum(Creditability), ng = sum(1 - Creditability)) %>% mutate(dr = nb / no) #bring calibrated pd as a based for predictive power testing ap.summary <- merge(rs[, c("rating", "pd")], ap.summary, by = "rating", all.x = TRUE) ap.summary #perform predictive power testing pp.res <- pp.testing(rating.label = ap.summary$rating, pdc = ap.summary$pd, no = ap.summary$no, nb = ap.summary$nb, alpha = 0.05) pp.res power(rating.label = ap.summary$rating, pdc = ap.summary$pd, no = ap.summary$no, nb = ap.summary$nb, alpha = 0.05, sim.num = 1000, seed = 2211)
pp.testing
performs testing of the predictive power of a PD rating model. This procedure should be applied
at the level of the rating scale.
Four tests are implemented: the binomial, Jeffreys, z-score and Hosmer-Lemeshow test.
Only the Hosmer-Lemeshow test refers to the complete rating scale, while the remaining three are implemented at the
rating grade level. The null hypothesis for the binomial, Jeffreys and z-score tests is that the observed default rate
is less than or equal to the calibrated PD (pdc), while for the Hosmer-Lemeshow test it is that
the calibrated PD (pdc) is the true one.
pp.testing(rating.label, pdc, no, nb, alpha = 0.05)
rating.label |
Vector of rating labels. |
pdc |
Vector of calibrated probabilities of default (PD). |
no |
Number of observations per rating grade. |
nb |
Number of defaults (bad cases) per rating grade. |
alpha |
Significance level of p-value for implemented tests. Default is 0.05. |
Because the test of predictive power is usually implemented on the application portfolio, certain prerequisites need to be fulfilled: first, a model should be developed and a rating scale formed. In order to reflect the proper role and the right moment of test application, the simplified example presented covers all steps before the tests are implemented.
The command pp.testing
returns a data frame with the input parameters along with the
p-value of each implemented test and the accepted hypothesis. Because the
Hosmer-Lemeshow test is applied to the complete rating scale, its p-values, as well as its test results,
are identical across the rating grades.
Tasche, D. (2008). Validation of internal rating systems and PD estimates,
The Analytics of Risk Model Validation, Quantitative Finance,
Elsevier B.V., doi:10.1016/B978-075068158-2.50014-7.
Oesterreichische Nationalbank (2004). Rating Models and Validation,
Oesterreichische Nationalbank (OeNB).
suppressMessages(library(PDtoolkit)) data(loans) #estimate some dummy model mod.frm <- `Creditability` ~ `Account Balance` + `Duration of Credit (month)` + `Age (years)` lr.mod <- glm(mod.frm, family = "binomial", data = loans) summary(lr.mod)$coefficients #model predictions loans$pred <- unname(predict(lr.mod, type = "response", newdata = loans)) #scale probabilities loans$score <- scaled.score(probs = loans$pred, score = 600, odd = 50/1, pdo = 20) #group scores into rating loans$rating <- sts.bin(x = round(loans$score), y = loans$Creditability, y.type = "bina")[[2]] #create rating scale rs <- loans %>% group_by(rating) %>% summarise(no = n(), nb = sum(Creditability), ng = sum(1 - Creditability)) %>% mutate(dr = nb / no) rs #calcualte portfolio default rate sum(rs$dr * rs$no / sum(rs$no)) #calibrate rating scale to central tendency of 27% with minimum PD of 5% ct <- 0.33 min.pd <- 0.05 rs$pd <- rs.calibration(rs = rs, dr = "dr", w = "no", ct = ct, min.pd = min.pd, method = "log.odds.ab")[[1]] #checks rs sum(rs$pd * rs$no / sum(rs$no)) #simulate some dummy application portfolio set.seed(11) app.port <- loans[sample(1:nrow(loans), 400), ] #summarise application portfolio on rating level ap.summary <- app.port %>% group_by(rating) %>% summarise(no = n(), nb = sum(Creditability), ng = sum(1 - Creditability)) %>% mutate(dr = nb / no) #bring calibrated pd as a based for predictive power testing ap.summary <- merge(rs[, c("rating", "pd")], ap.summary, by = "rating", all.x = TRUE) ap.summary #perform predictive power testing pp.res <- pp.testing(rating.label = ap.summary$rating, pdc = ap.summary$pd, no = ap.summary$no, nb = ap.summary$nb, alpha = 0.05) pp.res
Predict method for custom decision tree
## S3 method for class 'cdt' predict(object, newdata = NULL, ...)
object |
Custom decision tree model (class cdt). |
newdata |
Optionally, a data frame in which to look for variables with which to predict. If omitted, the fitted predictors are used. |
... |
further arguments passed to or from other methods. |
Returns the average default rate per leaf.
#S3 method for class "cdt" suppressMessages(library(PDtoolkit)) data(loans) tree.res <- decision.tree(db = loans, rf = c("Account Balance", "Duration of Credit (month)"), target = "Creditability", min.pct.obs = 0.05, min.avg.rate = 0.01, p.value = 0.05, max.depth = NA, monotonicity = TRUE) str(tree.res) #predict method - development sample pred.1 <- predict(object = tree.res, newdata = NULL) head(pred.1) auc.model(predictions = pred.1$average, observed = loans$Creditability) #predict method - new data set.seed(321) loans.m <- loans[sample(1:nrow(loans), 500, replace = TRUE), ] pred.2 <- predict(object = tree.res, newdata = loans.m) head(pred.2) auc.model(predictions = pred.2$average, observed = loans.m$Creditability)
psi
calculates the Population Stability Index (PSI) for given base and target vectors. The function can be used for
testing the stability of the final model score, but also for testing risk factor stability (aka the Characteristic Stability Index).
The function also provides so-called critical values based on the z-score (normal distribution assumption) and chi-square
(Chi-square distribution) that can be used as alternatives to the fixed "rule of thumb" thresholds
(10% and 25%). For details see the Reference.
psi(base, target, bin = 10, alpha = 0.05)
base |
Vector of values from the base sample. Usually this is the training (model development) sample. |
target |
Vector of values from the target sample. Usually this is the testing or portfolio application sample. |
bin |
Number of bins. Applied only for numeric base and target and used for discretization of its values. Default is 10. |
alpha |
Significance level used for calculation of statistical critical values
( |
The command psi
returns a list of two data frames. The first data frame contains the values of
PSI along with the statistical critical values for a confidence level of 1 - alpha, while the second data frame
presents the summary table used for the calculation of the overall PSI. For numeric base and target vectors, the summary
table is presented at the bin (bucket) level, while for categorical vectors the modalities of the base and target vectors
are tabulated.
Yurdakul, B. (2018). Statistical Properties of Population Stability Index. Dissertations. 3208.
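For orientation, a sketch of the PSI formula itself (the standard definition is assumed; this is not the package source), given hypothetical base and target shares per bin:
p.base <- c(0.10, 0.25, 0.30, 0.20, 0.15)     #share of base observations per bin
p.target <- c(0.12, 0.22, 0.28, 0.23, 0.15)   #share of target observations per bin
sum((p.target - p.base) * log(p.target / p.base))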
suppressMessages(library(PDtoolkit)) data(loans) #split on training and testing data set set.seed(1122) tt.indx <- sample(1:nrow(loans), 700, replace = FALSE) training <- loans[tt.indx, ] testing <- loans[-tt.indx, ] #calculate psi for numeric risk factor psi(base = training[, "Age (years)"], target = testing[, "Age (years)"], bin = 10, alpha = 0.05) #calculate psi for categorical risk factor psi(base = training[, "Account Balance"], target = testing[, "Account Balance"], bin = 10, alpha = 0.05)
replace.woe
replaces the modalities of a risk factor with the calculated WoE values. This function processes only
categorical risk factors, so it is assumed that numeric risk factors have been previously categorized (a minimal sketch of the replacement logic follows the report description below).
Additional info report (second element of function output - info
data frame), if produced, includes:
rf: Risk factor name.
reason.code: Reason code takes value 1 if an inappropriate class of risk factor is identified. It takes value 2 if the maximum number of categories exceeds 10, and 3 if there is any problem with the weights of evidence (WoE) calculation (usually if any bin contains only good or only bad cases). If reason code 1 or 3 is observed, the risk factor is not processed for WoE replacement.
comment: Reason description.
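A minimal sketch of the replacement logic for one hypothetical categorical risk factor (the standard WoE definition, the natural logarithm of the ratio of the good to bad distributions per modality, is assumed; this is not the package source):
rf <- rep(c("low", "high"), c(6, 4))   #hypothetical risk factor
y <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 1)   #target, 1 = bad (default)
tbl <- data.frame(ng = tapply(1 - y, rf, sum), nb = tapply(y, rf, sum))
tbl$woe <- log((tbl$ng / sum(tbl$ng)) / (tbl$nb / sum(tbl$nb)))
rf.woe <- tbl$woe[match(rf, rownames(tbl))]   #modalities replaced by WoE values
rf.woe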
replace.woe(db, target)
db |
Data frame of categorical risk factors and target variable supplied for WoE coding. |
target |
Name of target variable within |
The command replace.woe
returns a list of two data frames. The first one contains the WoE replacement
of the analyzed risk factors' modalities, while the second data frame reports the results of the above-mentioned
validations regarding the class of the risk factors, the number of modalities and the WoE calculation.
suppressMessages(library(PDtoolkit)) data(gcd) #categorize numeric risk factor gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]] gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual, y.type = "bina")[[2]] gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual, y.type = "bina")[[2]] head(gcd) #replace modalities with WoE values woe.rep <- replace.woe(db = gcd, target = "qual") #results overview head(woe.rep[[1]]) woe.rep[[2]]
rf.clustering
implements correlation-based clustering of risk factors.
The clustering procedure is based on hclust from the stats
package.
rf.clustering(db, metric, k = NA)
db |
Data frame of risk factors supplied for clustering analysis. |
metric |
Correlation metric used for distance calculation. Available options are:
|
k |
Number of clusters. If default value ( |
The function rf.clustering
returns a data frame with: risk factors, clusters assigned and
distance to centroid (ordered from smallest to largest).
The last column (distance to centroid) can be used for selection of one or more risk factors per
cluster.
suppressMessages(library(PDtoolkit)) library(rpart) data(loans) #clustering using common spearman metric #first we need to categorize numeric risk factors num.rf <- sapply(loans, is.numeric) num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf] loans[, num.rf] <- sapply(num.rf, function(x) sts.bin(x = loans[, x], y = loans[, "Creditability"])[[2]]) #replace woe in order to convert to all numeric factors loans.woe <- replace.woe(db = loans, target = "Creditability")[[1]] cr <- rf.clustering(db = loans.woe[, -which(names(loans.woe)%in%"Creditability")], metric = "common spearman", k = NA) cr #select one risk factor per cluster with min distance to centorid cr %>% group_by(clusters) %>% slice(which.min(dist.to.centroid))
rf.interaction.transformer
extracts the interactions from a random forest.
It implements a customized random forest algorithm that takes into account different conditions (for each single decision tree) such as minimum
percentage of observations and defaults in each node, maximum tree depth and a monotonicity condition
at each splitting node. The Gini index is used as the metric for node splitting.
rf.interaction.transformer( db, rf, target, num.rf = NA, num.tree, min.pct.obs, min.avg.rate, max.depth, monotonicity, create.interaction.rf, seed = 991 )
db |
Data frame of risk factors and target variable supplied for interaction extraction. |
rf |
Character vector of risk factor names on which decision tree is run. |
target |
Name of target variable (default indicator 0/1) within db argument. |
num.rf |
Number of risk factors randomly selected for each decision tree. If default value ( |
num.tree |
Number of decision trees used for random forest. |
min.pct.obs |
Minimum percentage of observation in each leaf. |
min.avg.rate |
Minimum percentage of defaults in each leaf. |
max.depth |
Maximum number of splits. |
monotonicity |
Logical indicator. If |
create.interaction.rf |
Logical indicator. If |
seed |
Random seed to ensure result reproducibility. |
The command rf.interaction.transformer
returns a list of two data frames. The first data frame provides
the trees summary. The second data frame is a new risk factor extracted from random forest.
#modify risk factors in order to show how the function works with missing values loans$"Account Balance"[1:10] <- NA loans$"Duration of Credit (month)"[c(13, 15)] <- NA rf.it <- rf.interaction.transformer(db = loans, rf = names(loans)[!names(loans)%in%"Creditability"], target = "Creditability", num.rf = NA, num.tree = 3, min.pct.obs = 0.05, min.avg.rate = 0.01, max.depth = 2, monotonicity = TRUE, create.interaction.rf = TRUE, seed = 579) names(rf.it) rf.it[["tree.info"]] tail(rf.it[["interaction"]]) table(rf.it[["interaction"]][, 1], useNA = "always")
rs.calibration
performs calibration of the observed default rates for a given rating scale.
rs.calibration(rs, dr, w, ct, min.pd, method)
rs |
Rating scale that contain observed default rate and weights used for optimization. |
dr |
Observed default rate per rating. |
w |
Weights, usually number of observations (clients/accounts) per rating. |
ct |
Value of central tendency to which calibration is performed. |
min.pd |
Minimum probability of default (PD) per rating, as constrain for calibration process. |
method |
Calibration method. Available options are |
Method "scaling"
relies on linear rescaling of observed default rate. Rescaling factor is
calculated as a ratio between ct
and observed portfolio default rate.
Method "log.odds.a"
optimize intercept of logit transformation in a way that makes portfolio default
rate equal to selected central tendency (ct
).
Method "log.odds.ab"
optimize intercept and slope of logit transformation in a way that makes
portfolio default rate equal to selected central tendency (ct
).
For respecting selected constrain of minimum PD (min.pd
), two-stage iterative procedure is implemented.
Additional constrain of maximum PD (100%) is also implemented.
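A minimal sketch of the "scaling" idea described above (the constraints on minimum and maximum PD are ignored here; this is an illustration, not the package source):
dr <- c(0.01, 0.05, 0.20)      #hypothetical observed default rate per rating grade
no <- c(500, 300, 200)         #hypothetical number of observations per rating grade
ct <- 0.10                     #target central tendency
port.dr <- sum(dr * no / sum(no))
pd.scaled <- dr * ct / port.dr
sum(pd.scaled * no / sum(no))  #equals ct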
The command rs.calibration
returns a list of two elements. The first element is a
vector of calibrated PDs and the second one is a data frame of optimization parameters.
suppressMessages(library(PDtoolkit)) data(loans) #estimate some dummy model mod.frm <- `Creditability` ~ `Account Balance` + `Duration of Credit (month)` + `Age (years)` lr.mod <- glm(mod.frm, family = "binomial", data = loans) summary(lr.mod)$coefficients #model predictions loans$pred <- unname(predict(lr.mod, type = "response", newdata = loans)) #scale probabilities loans$score <- scaled.score(probs = loans$pred, score = 600, odd = 50/1, pdo = 20) #group scores into rating loans$rating <- sts.bin(x = round(loans$score), y = loans$Creditability, y.type = "bina")[[2]] #create rating scale rs <- loans %>% group_by(rating) %>% summarise(no = n(), nb = sum(Creditability), ng = sum(1 - Creditability)) %>% mutate(dr = nb / no) rs #calcualte portfolio default rate sum(rs$dr * rs$no / sum(rs$no)) #calibrate rating scale to central tendency of 27% with minimum PD of 5% ct <- 0.33 min.pd <- 0.05 #scaling pd.calib.s <- rs.calibration(rs = rs, dr = "dr", w = "no", ct = ct, min.pd = min.pd, method = "scaling") rs$pd.scaling <- pd.calib.s[[1]] #log-odds a pd.calib.a <- rs.calibration(rs = rs, dr = "dr", w = "no", ct = ct, min.pd = min.pd, method = "log.odds.a") rs$pd.log.a <- pd.calib.a[[1]] #log-odds ab pd.calib.ab <- rs.calibration(rs = rs, dr = "dr", w = "no", ct = ct, min.pd = min.pd, method = "log.odds.ab") rs$pd.log.ab <- pd.calib.ab[[1]] #checks rs sum(rs$pd.scaling * rs$no / sum(rs$no)) sum(rs$pd.log.a * rs$no / sum(rs$no)) sum(rs$pd.log.ab * rs$no / sum(rs$no))
scaled.score
performs scaling of the probabilities for a certain set-up. The user has to select
three parameters (score, odd, pdo), while the probabilities (probs) are usually
the predictions of the final model.
scaled.score(probs, score = 600, odd = 50/1, pdo = 20)
probs |
Model predicted probabilities of default. |
score |
Specific score for selected odd (for argument |
odd |
Odd (good/bad) at specific score (for argument |
pdo |
Points for double the odds. Default is 20. |
The command scaled.score
returns a vector of scaled scores.
Siddiqi, N. (2012). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, Inc.
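A sketch of the usual points-to-double-the-odds scaling implied by the arguments above (an assumption for illustration, not necessarily the exact implementation of scaled.score), anchoring score 600 at odds 50:1 with pdo = 20:
probs <- c(0.05, 0.10, 0.30)             #hypothetical model predicted PDs
pts.factor <- 20 / log(2)                #points needed to double the odds
offset <- 600 - pts.factor * log(50/1)   #score at the reference odds
odds <- (1 - probs) / probs              #good/bad odds implied by the PDs
round(offset + pts.factor * log(odds))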
suppressMessages(library(PDtoolkit)) data(loans) #run stepFWD res <- stepFWD(start.model = Creditability ~ 1, p.value = 0.05, coding = "WoE", db = loans) final.model <- res$model summary(final.model)$coefficients #overview of development data base head(res$dev.db) #predict probabilities using the final model loans$probs <- predict(final.model, type = "response", newdata = res$dev.db) #scale probabilities to scores loans$score <- scaled.score(probs = loans$probs, score = 600, odd = 50/1, pdo = 20) #check AUC of probabilities and score auc.model(predictions = loans$probs, observed = loans$Creditability) auc.model(predictions = loans$score, observed = ifelse(loans$Creditability == 0, 1, 0)) #note that higher score indicates lower probability of default
segment.vld
performs model segment validation based on residuals.
The main goal of this procedure is to identify segments where the model in use overestimates or
underestimates the observed default rate. The procedure consists of a few steps. The first step is to
calculate the model residuals (observed default indicator minus estimated probability).
Then, a regression tree is fitted on the obtained residuals for segment identification.
Finally, a one-proportion test is applied in order to test overestimation or underestimation
of the observed default rate within these segments. The results of this validation can indicate the
omission of some important risk factor(s) or some specific sub-portfolio for which the model performs
worse than for the rest of the portfolio.
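A minimal sketch of the residual idea (not the package implementation), using a dummy model and a regression tree with hypothetical control settings:
suppressMessages(library(PDtoolkit)) library(rpart) data(loans)
m <- glm(Creditability ~ `Account Balance`, family = "binomial", data = loans)
loans$res <- loans$Creditability - predict(m, type = "response")   #model residuals
seg.tree <- rpart(res ~ `Duration of Credit (month)` + `Age (years)`, data = loans,
                  control = rpart.control(cp = 0, minbucket = 30))
seg.tree   #leaves with average residual away from zero indicate over/underestimation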
segment.vld(model, db, min.leaf = 0.03, alpha = 0.05)
model |
Model in use, an object of class inheriting from |
db |
Modeling data with risk factors and target variable. Risk factors used for |
min.leaf |
Minimum percentage of observations per leaf. Default is 0.03. |
alpha |
Significance level of p-value for one proportion test. Default is 0.05. |
The command segment.vld
returns a list of three objects.
The first object (segment.model) returns the regression tree results (rpart object).
The second object (segment.testing) is the data frame with the segment overview and testing results.
The third object (segment.rules) is the data frame with the average residual
rate and the rules for segment identification. This element is returned only if segments are
identified, otherwise it is NULL.
suppressMessages(library(PDtoolkit)) library(rpart) data(loans) #run stepFWD res <- stepFWD(start.model = Creditability ~ 1, p.value = 0.05, coding = "WoE", db = loans) #check output elements names(res) #extract the final model final.model <- res$model #print coefficients summary(final.model)$coefficients #run segment validation procedure seg.analysis <- segment.vld(model = final.model, db = res$dev.db, min.leaf = 0.03, alpha = 0.05) #check output elements names(seg.analysis) #print segment model - regression tree seg.analysis$segment.model #print segment summary and statistical testing seg.analysis$segment.testing #print segment identification rules seg.analysis$segment.rules
smote
performs a type of data augmentation for the selected (usually minority) class. In order to process continuous and
categorical risk factors simultaneously, the Heterogeneity Euclidean Overlapping Metric (HEOM) is used in the nearest neighbors
algorithm.
smote( db, target, minority.class, osr, ordinal.rf = NULL, num.rf.const = NULL, k = 5, seed = 81000 )
db |
Data set of risk factors and target variable. |
target |
Name of target variable within |
minority.class |
Value of minority class. It can be numeric or character value, but it has to exist in target variable. |
osr |
Oversampling rate. It has to be numeric value greater than 0 (for example 0.2 for 20% oversampling). |
ordinal.rf |
Character vector of ordinal risk factors. Default value is |
num.rf.const |
Data frame with constrains for numeric risk factors. It has to contain the following columns:
|
k |
Number of nearest neighbors. Default value is 5. |
seed |
Random seed needed for ensuring the result reproducibility. Default is 81000. |
The command smote
returns a data frame with added synthetic observations for the selected minority class.
The data frame contains all variables from the db
data frame plus an additional variable (smote) that serves as an
indicator for distinguishing between original and synthetic observations.
suppressMessages(library(PDtoolkit)) data(loans) #check numeric variables (note that one of variables is target not a risk factor) names(loans)[sapply(loans, is.numeric)] #define constains of numeric risk factors num.rf.const <- data.frame(rf = c("Duration of Credit (month)", "Credit Amount", "Age (years)"), lower = c(4, 250, 19), upper = c(72, 20000, 75), type = c("integer", "numeric", "integer")) num.rf.const #loans$"Account Balance"[990:1000] <- NA #loans$"Credit Amount"[900:920] <- NA loans.s <- smote(db = loans, target = "Creditability", minority.class = 1, osr = 0.05, ordinal.rf = NULL, num.rf.const = num.rf.const, k = 5, seed = 81000) str(loans.s) table(loans.s$Creditability, loans.s$smote) #select minority class loans.mc <- loans.s[loans.s$Creditability%in%1, ] head(loans.mc)
staged.blocks
performs blockwise regression where the predictions of each block's model are used as an
offset for the model of the following block.
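A minimal sketch of the staged/offset idea (not the package implementation): the linear predictor of the first block's model enters the second block's model as an offset, so the second block explains only the remaining signal.
suppressMessages(library(PDtoolkit)) data(loans)
b1 <- glm(Creditability ~ `Account Balance`, family = "binomial", data = loans)
loans$offset.b1 <- unname(predict(b1, type = "link"))   #block 1 linear predictor
b2 <- glm(Creditability ~ `Duration of Credit (month)` + offset(offset.b1),
          family = "binomial", data = loans)
summary(b2)$coefficients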
staged.blocks( method, target, db, coding = "WoE", blocks, p.value = 0.05, miv.threshold = 0.02, m.ch.p.val = 0.05 )
method |
Regression method applied on each block.
Available methods: |
target |
Name of target variable within |
db |
Modeling data with risk factors and target variable. |
coding |
Type of risk factor coding within the model. Available options are: |
blocks |
Data frame with defined risk factor groups. It has to contain the following columns: |
p.value |
Significance level of p-value for the estimated coefficient. For |
miv.threshold |
MIV (marginal information value) entrance threshold, applicable only to the "stepMIV" method. Only the risk factors with MIV higher than the threshold are candidates for the new model. An additional criterion is that the MIV value should significantly separate good from bad cases, as measured by the marginal chi-square test. |
m.ch.p.val |
Significance level of the p-value for the marginal chi-square test, applicable only to the "stepMIV" method. This test additionally supports the MIV value of the candidate risk factor for the final decision. |
The command staged.blocks returns a list of three objects.
The first object (models) is the list of models, one per block (each an object of class inheriting from "glm").
The second object (steps) is the data frame with the risk factors selected from each block.
The third object (dev.db) is the list of each block's model development database.
Anderson, R.A. (2021). Credit Intelligence & Modelling, Many Paths through the Forest of Credit Rating and Scoring, OUP Oxford
embedded.blocks
, ensemble.blocks
, stepMIV
, stepFWD
,
stepRPC
, stepFWDr
and stepRPCr
.
suppressMessages(library(PDtoolkit))
data(loans)
#create risk factor priority groups
rf.all <- names(loans)[-1]
set.seed(22)
blocks <- data.frame(rf = rf.all, block = sample(1:3, length(rf.all), rep = TRUE))
blocks <- blocks[order(blocks$block), ]
blocks
#method: stepFWDr
res <- staged.blocks(method = "stepFWDr",
                     target = "Creditability",
                     db = loans,
                     blocks = blocks)
names(res)
nb <- length(res[["models"]])
res$models[[nb]]
auc.model(predictions = predict(res$models[[nb]], type = "response", newdata = res$dev.db[[nb]]),
          observed = res$dev.db[[nb]]$Creditability)
identical(unname(predict(res$models[[1]], type = "link", newdata = res$dev.db[[1]])),
          res$dev.db[[2]]$offset.vals)
identical(unname(predict(res$models[[2]], type = "link", newdata = res$dev.db[[2]])),
          res$dev.db[[3]]$offset.vals)
stepFWD
performs customized stepwise regression with p-value and trend checks. The trend check compares the observed
trend between the target and the analyzed risk factor with the trend of the estimated coefficients within the
logistic regression. Note that the procedure checks the column names of the supplied db data frame, so some
renaming (replacement of special characters) may occur. For details check the help example.
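The trend check can be thought of as comparing the direction of the observed default rates across the bins of a risk factor with the direction implied by its estimated dummy coefficients. The sketch below illustrates that comparison for a single binned risk factor; it is an approximation of the idea and not the exact check implemented in stepFWD (in stepFWD the model contains other risk factors, so the estimated trend can genuinely differ from the observed one).

#illustration of the trend-check idea (not the exact stepFWD procedure)
suppressMessages(library(PDtoolkit))
data(loans)
loans$maturity.bin <- ndr.bin(x = loans$"Duration of Credit (month)", y = loans$Creditability)[[2]]
#observed trend: default rate per bin
obs.dr <- tapply(loans$Creditability, loans$maturity.bin, mean)
obs.trend <- sign(diff(obs.dr))
#estimated trend: dummy coefficients of a single-factor logistic regression
lr <- glm(Creditability ~ maturity.bin, family = "binomial", data = loans)
est.trend <- sign(diff(c(0, coef(lr)[-1])))   #zero stands for the reference bin
#do both directions agree between consecutive bins?
all(obs.trend == est.trend)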
stepFWD( start.model, p.value = 0.05, coding = "WoE", coding.start.model = TRUE, check.start.model = TRUE, db, offset.vals = NULL )
start.model |
Formula class that represents starting model. It can include some risk factors, but it can be
defined only with intercept ( |
p.value |
Significance level of p-value of the estimated coefficients. For |
coding |
Type of risk factor coding within the model. Available options are: |
coding.start.model |
Logical ( |
check.start.model |
Logical ( |
db |
Modeling data with risk factors and target variable. All risk factors (apart from the risk factors from the starting model) should be categorized and of character type. |
offset.vals |
This can be used to specify an a priori known component to be included in the linear predictor during fitting.
This should be |
The command stepFWD
returns a list of four objects.
The first object (model
), is the final model, an object of class inheriting from "glm"
.
The second object (steps
), is the data frame with risk factors selected at each iteration.
The third object (warnings
), is the data frame with warnings if any observed.
The warnings refer to the following checks: if risk factor has more than 10 modalities,
if any of the bins (groups) has less than 5% of observations and
if there are problems with WoE calculations.
The final, fourth, object dev.db
returns the model development database.
suppressMessages(library(PDtoolkit))
data(loans)
#identify numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
#discretize numeric risk factors using ndr.bin from monobin package
loans[, num.rf] <- sapply(num.rf, function(x)
    ndr.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
res <- stepFWD(start.model = Creditability ~ 1,
               p.value = 0.05,
               coding = "dummy",
               db = loans)
summary(res$model)$coefficients
rf.check <- tapply(res$dev.db$Creditability, res$dev.db$Instalment_per_cent, mean)
rf.check
diff(rf.check)
res$steps
head(res$dev.db)
stepFWDr
performs customized stepwise regression with p-value and trend checks on raw risk factors. The trend check compares
the observed trend between the target and the analyzed risk factor with the trend of the estimated coefficients within the
binomial logistic regression. The difference between stepFWDr and stepFWD is that this function runs stepwise regression
on mixed risk factor types (numeric and/or categorical), while stepFWD accepts only categorical risk factors.
Note that the procedure checks the column names of the supplied db data frame, so some
renaming (replacement of special characters) may occur. For details check the help example.
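For categorical risk factors, a single p-value per factor can be obtained with a joint (multiple) Wald test on the factor's dummy coefficients, which is then compared with the selected threshold (see the p.value argument below). The following is a rough, self-contained sketch of such a test for one categorical risk factor; it is illustrative only and not necessarily the exact test used inside stepFWDr.

#rough sketch of a joint Wald test for the dummies of one categorical risk factor
suppressMessages(library(PDtoolkit))
data(loans)
lr <- glm(Creditability ~ Purpose, family = "binomial", data = loans)
idx <- grep("^Purpose", names(coef(lr)))        #dummy coefficients of the factor
b <- coef(lr)[idx]
V <- vcov(lr)[idx, idx]
w.stat <- as.numeric(t(b) %*% solve(V) %*% b)   #Wald chi-square statistic
p.val <- pchisq(w.stat, df = length(b), lower.tail = FALSE)
c(w.stat = w.stat, p.val = p.val)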
stepFWDr( start.model, p.value = 0.05, db, check.start.model = TRUE, offset.vals = NULL )
start.model |
Formula class that represents starting model. It can include some risk factors, but it can be
defined only with intercept ( |
p.value |
Significance level of the p-value of the estimated coefficients. For numerical risk factors this value is
directly compared to the p-value of the estimated coefficients, while for categorical risk factors a
multiple Wald test is employed and its p-value is used for comparison with the selected threshold (p.value). |
db |
Modeling data with risk factors and target variable. Risk factors can be categorized or continuous. |
check.start.model |
Logical ( |
offset.vals |
This can be used to specify an a priori known component to be included in the linear predictor during fitting.
This should be |
The command stepFWDr
returns a list of four objects.
The first object (model
), is the final model, an object of class inheriting from "glm"
.
The second object (steps
), is the data frame with risk factors selected at each iteration.
The third object (warnings
), is the data frame with warnings if any observed.
The warnings refer to the following checks: if any categorical risk factor has more than 10 modalities or
if any of the bins (groups) has less than 5% of observations.
The final, fourth, object dev.db
returns the model development database.
suppressMessages(library(PDtoolkit))
data(loans)
trf <- c("Creditability", "Account Balance", "Duration of Credit (month)",
         "Age (years)", "Guarantors", "Concurrent Credits")
res <- stepFWDr(start.model = Creditability ~ 1,
                p.value = 0.05,
                db = loans[, trf],
                check.start.model = TRUE,
                offset.vals = NULL)
summary(res$model)$coefficients
rf.check <- tapply(res$dev.db$Creditability, res$dev.db$Guarantors, mean)
rf.check
diff(rf.check)
res$steps
head(res$dev.db)
stepMIV
performs stepwise logistic regression based on MIV.
stepMIV( start.model, miv.threshold, m.ch.p.val, coding, coding.start.model = FALSE, db, offset.vals = NULL )
stepMIV( start.model, miv.threshold, m.ch.p.val, coding, coding.start.model = FALSE, db, offset.vals = NULL )
start.model |
Formula class that represents the starting model. It can include some risk factors, but it can be
defined only with intercept ( |
miv.threshold |
MIV entrance threshold. Only the risk factors with MIV higher than the threshold are candidates for the new model. An additional criterion is that the MIV value should significantly separate good from bad cases, as measured by the marginal chi-square test. |
m.ch.p.val |
Significance level of p-value for marginal chi-square test. This test additionally supports MIV value of candidate risk factor for final decision. |
coding |
Type of risk factor coding within the model. Available options are: |
coding.start.model |
Logical ( |
db |
Modeling data with risk factors and target variable. All risk factors should be categorized and of character type. |
offset.vals |
This can be used to specify an a priori known component to be included in the linear predictor during fitting.
This should be |
The command stepMIV
returns a list of five objects.
The first object (model
), is the final model, an object of class inheriting from "glm"
.
The second object (steps
), is the data frame with risk factors selected at each iteration.
The third object (miv.iter
), is the data frame with iteration details.
The fourth object (warnings
), is the data frame with warnings if any observed.
The warnings refer to the following checks: if risk factor has more than 10 modalities,
if any of the bins (groups) has less than 5% of observations and
if there are problems with WoE calculations.
The final, fifth, object dev.db
returns the model development database.
Scallan, G. (2011). Class(ic) Scorecards: Selecting Characteristics and Attributes in Logistic Regression, Edinburgh Credit Scoring Conference.
suppressMessages(library(PDtoolkit))
data(loans)
##identify numeric risk factors
#num.rf <- sapply(loans, is.numeric)
#num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
##discretize numeric risk factors using ndr.bin from monobin package
#loans[, num.rf] <- sapply(num.rf, function(x)
#    ndr.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
#str(loans)
#run stepMIV
rf <- c("Account Balance", "Payment Status of Previous Credit", "Purpose",
        "Value Savings/Stocks", "Most valuable available asset", "Foreign Worker")
res <- stepMIV(start.model = Creditability ~ 1,
               miv.threshold = 0.02,
               m.ch.p.val = 0.05,
               coding = "WoE",
               coding.start.model = FALSE,
               db = loans[, c("Creditability", rf)])
#check output elements
names(res)
#print model warnings
res$warnings
#extract the final model
final.model <- res$model
#print coefficients
summary(final.model)$coefficients
#print steps of stepwise
res$steps
#print head of all iteration details
head(res$miv.iter)
#print head of coded development data
head(res$dev.db)
#calculate AUC
auc.model(predictions = predict(final.model, type = "response", newdata = res$dev.db),
          observed = res$dev.db$Creditability)
stepRPC
performs customized stepwise regression with p-value and trend checks which additionally takes into account
the order of the supplied risk factors per group when selecting candidates for the final regression model. The trend check
compares the observed trend between the target and the analyzed risk factor with the trend of the estimated coefficients
within the logistic regression. Note that the procedure checks the column names of the supplied db data frame, so some
renaming (replacement of special characters) may occur. For details, please check the help example.
stepRPC( start.model, risk.profile, p.value = 0.05, coding = "WoE", coding.start.model = TRUE, check.start.model = TRUE, db, offset.vals = NULL )
start.model |
Formula class that represents the starting model. It can include some risk factors, but it can be
defined only with intercept ( |
risk.profile |
Data frame with defined risk profile. It has to contain the following columns: |
p.value |
Significance level of p-value of the estimated coefficients. For |
coding |
Type of risk factor coding within the model. Available options are: |
coding.start.model |
Logical ( |
check.start.model |
Logical ( |
db |
Modeling data with risk factors and target variable. All risk factors (apart from the risk factors from the starting model) should be categorized and of character type. |
offset.vals |
This can be used to specify an a priori known component to be included in the linear predictor during fitting.
This should be |
The command stepRPC
returns a list of four objects.
The first object (model
), is the final model, an object of class inheriting from "glm"
.
The second object (steps
), is the data frame with risk factors selected at each iteration.
The third object (warnings
), is the data frame with warnings if any observed.
The warnings refer to the following checks: if risk factor has more than 10 modalities,
if any of the bins (groups) has less than 5% of observations and
if there are problems with WoE calculations.
The final, fourth, object dev.db
returns the model development database.
suppressMessages(library(PDtoolkit))
data(loans)
#identify numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf)%in%"Creditability" & num.rf]
#discretize numeric risk factors using ndr.bin from monobin package
loans[, num.rf] <- sapply(num.rf, function(x)
    ndr.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
#create risk factor priority groups
rf.all <- names(loans)[-1]
set.seed(591)
rf.pg <- data.frame(rf = rf.all, group = sample(1:3, length(rf.all), rep = TRUE))
head(rf.pg)
#bring AUC for each risk factor in order to sort them within groups
bva <- bivariate(db = loans, target = "Creditability")[[1]]
rf.auc <- unique(bva[, c("rf", "auc")])
rf.pg <- merge(rf.pg, rf.auc, by = "rf", all.x = TRUE)
#prioritized risk factors
rf.pg <- rf.pg[order(rf.pg$group, rf.pg$auc), ]
rf.pg <- rf.pg[order(rf.pg$group), ]
rf.pg
res <- stepRPC(start.model = Creditability ~ 1,
               risk.profile = rf.pg,
               p.value = 0.05,
               coding = "WoE",
               db = loans)
summary(res$model)$coefficients
res$steps
head(res$dev.db)
stepRPCr
performs customized stepwise regression with p-value and trend checks on raw risk factors which additionally takes into account
the order of the supplied risk factors per group when selecting candidates for the final regression model. The trend check
compares the observed trend between the target and the analyzed risk factor with the trend of the estimated coefficients.
Note that the procedure checks the column names of the supplied db data frame, so some
renaming (replacement of special characters) may occur. For details, please check the help example.
stepRPCr( start.model, risk.profile, p.value = 0.05, db, check.start.model = TRUE, offset.vals = NULL )
start.model |
Formula class that represents the starting model. It can include some risk factors, but it can be
defined only with intercept ( |
risk.profile |
Data frame with defined risk profile. It has to contain the following columns: |
p.value |
Significance level of the p-value of the estimated coefficients. For numerical risk factors this value is
directly compared to the p-value of the estimated coefficients, while for categorical risk factors a
multiple Wald test is employed and its p-value is used for comparison with the selected threshold (p.value). |
db |
Modeling data with risk factors and target variable. All risk factors (apart from the risk factors from the starting model) should be categorized and of character type. |
check.start.model |
Logical ( |
offset.vals |
This can be used to specify an a priori known component to be included in the linear predictor during fitting.
This should be |
The command stepRPCr
returns a list of four objects.
The first object (model
), is the final model, an object of class inheriting from "glm"
.
The second object (steps
), is the data frame with risk factors selected at each iteration.
The third object (warnings
), is the data frame with warnings if any observed.
The warnings refer to the following checks: if risk factor has more than 10 modalities or
if any of the bins (groups) has less than 5% of observations.
The final, fourth, object dev.db
returns the model development database.
suppressMessages(library(PDtoolkit))
data(loans)
#create risk factor priority groups
rf.all <- names(loans)[-1]
set.seed(6422)
rf.pg <- data.frame(rf = rf.all, group = sample(1:3, length(rf.all), rep = TRUE))
rf.pg <- rf.pg[order(rf.pg$group), ]
head(rf.pg)
res <- stepRPCr(start.model = Creditability ~ 1,
                risk.profile = rf.pg,
                p.value = 0.05,
                db = loans)
summary(res$model)$coefficients
res$steps
head(res$dev.db)
univariate
returns the univariate statistics for risk factors supplied in data frame db
.
For numeric risk factors univariate report includes:
rf: Risk factor name.
rf.type: Risk factor class. This metric is always equal to numeric
.
bin.type: Bin type - special or complete cases.
bin: Bin type. If a sc.method
argument is equal to "together"
, then
bin
and bin.type
have the same value. If the sc.method
argument
is equal to "separately"
, then the bin
will contain all special cases that
exist for analyzed risk factor (e.g. NA
, NaN
, Inf
).
pct: Percentage of observations in each bin
.
cnt.unique: Number of unique values per bin
.
min: Minimum value.
p1, p5, p25, p50, p75, p95, p99: Percentile values.
avg: Mean value.
avg.se: Standard error of the mean.
max: Maximum value.
neg: Number of negative values.
pos: Number of positive values.
cnt.outliers: Number of outliers. Records above or below Q75 +/- 1.5 * IQR, where IQR = Q75 - Q25 (see the short sketch after this list).
sc.ind: Special case indicator. It takes value 1 if the share of special cases exceeds sc.threshold, otherwise 0.
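The cnt.outliers rule above can be reproduced directly with quantile-based fences; a minimal sketch on the gcd data (illustrative, not the package's internal code):

#count records outside Q75 +/- 1.5 * IQR, as described above
suppressMessages(library(PDtoolkit))
data(gcd)
q25 <- quantile(gcd$amount, 0.25, na.rm = TRUE)
q75 <- quantile(gcd$amount, 0.75, na.rm = TRUE)
iqr <- q75 - q25
sum(gcd$amount > q75 + 1.5 * iqr | gcd$amount < q75 - 1.5 * iqr, na.rm = TRUE)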
For categorical risk factors univariate report includes:
rf: Risk factor name.
rf.type: Risk factor class. This metric is equal to one of: character
,
factor
or logical
.
bin.type: Bin type - special or complete cases.
bin: Bin type. If a sc.method
argument is equal to "together"
, then
bin
and bin.type
have the same value. If the sc.method
argument
is equal to "separately"
, then the bin
will contain all special cases that
exist for analyzed risk factor (e.g. NA
, NaN
, Inf
).
pct: Percentage of observations in each bin
.
cnt.unique: Number of unique values per bin
.
sc.ind: Special case indicator. It takes value 1 if the share of special cases exceeds sc.threshold, otherwise 0.
univariate( db, sc = c(NA, NaN, Inf, -Inf), sc.method = "together", sc.threshold = 0.2 )
db |
Data frame of risk factors supplied for univariate analysis. |
sc |
Vector of special case elements. Default values are |
sc.method |
Define how special cases will be treated, all together or in separate bins.
Possible values are "together" and "separately". |
sc.threshold |
Threshold for special cases expressed as percentage of total number of observations.
If |
The command univariate
returns the data frame with explained univariate metrics for numeric,
character, factor and logical class of risk factors.
suppressMessages(library(PDtoolkit))
data(gcd)
gcd$age[100:120] <- NA
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual, y.type = "bina")[[2]]
gcd$age.bin <- as.factor(gcd$age.bin)
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual, y.type = "bina")[[2]]
gcd$all.miss1 <- NaN
gcd$all.miss2 <- NA
gcd$tf <- sample(c(TRUE, FALSE), nrow(gcd), rep = TRUE)
#create date variable to confirm that it will not be processed by the function
gcd$dates <- Sys.Date()
str(gcd)
univariate(db = gcd)
ush.bin
performs U-shape binning. All algorithms from the monobin package are available. Due to the specific nature of the
binning algorithms, it is possible that for some selected knots the algorithm will not be able to find a U-shape. Therefore,
users are encouraged to inspect the results in more detail and to try different binning algorithms.
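One simple way to follow this advice is to loop over several binning methods for the same knot and keep whatever each of them returns; the knot value is taken from the example further down, and the listed methods are assumed to be among the monobin algorithms accepted by the method argument. The loop itself is a convenience sketch rather than part of the package.

#try several binning algorithms for the same knot (convenience sketch)
suppressMessages(library(PDtoolkit))
data(gcd)
methods <- c("ndr.bin", "sts.bin", "pct.bin")
res.l <- lapply(methods, function(m)
      tryCatch(ush.bin(x = gcd$amount, y = gcd$qual, knot = 2992.579, method = m),
               error = function(e) NULL))
names(res.l) <- methods
#inspect the summary tables of the methods that produced a result
lapply(res.l[!sapply(res.l, is.null)], "[[", 1)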
ush.bin( x, y, knot, method, sc = c(NA, Inf, -Inf, NaN), sc.method = "together", g = 20, min.pct.obs = 0.05, min.avg.rate = 0.01, p.val = 0.05, woe.trend = TRUE, woe.gap = 0.1 )
x |
Numeric vector to be binned. |
y |
Numeric target vector (binary). |
knot |
Numeric value of selected knot. Usually the results of |
method |
Binning method. Available options are all from monobin package:
|
sc |
Numeric vector with special case elements. Default values are |
sc.method |
Define how special cases will be treated, all together or separately.
Possible values are |
g |
Number of starting groups. Only needed for |
min.pct.obs |
Minimum percentage of observations per bin. Default is 0.05 or 30 observations. |
min.avg.rate |
Minimum |
p.val |
Threshold for p-value. Only needed for |
woe.trend |
Logical. Only needed for |
woe.gap |
Minimum WoE gap between bins. Only needed for |
The command ush.bin generates a list of two objects. The first object, the data frame summary.tbl,
presents a summary table of the final binning, while the second, x.trans, is a vector of discretized values.
suppressMessages(library(PDtoolkit))
data(gcd)
res <- ush.bin(x = gcd$amount, y = gcd$qual, knot = 2992.579, method = "ndr.bin")
res[[1]]
plot(res[[1]]$dr, type = "l")
ush.test
performs U-shape testing between the target and the analyzed risk factor.
Testing is based on B-spline basis functions and the change of sign of the estimated coefficients.
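The idea can be illustrated with a simplified variant: instead of B-splines, fit a logistic regression on a piecewise-linear (linear spline) basis with a single knot, so that the two estimated coefficients are directly the slopes left and right of the knot; statistically significant coefficients with opposite signs then point towards a U-shape. The knot choice and the basis below are illustrative and differ from the exact construction inside ush.test.

#simplified illustration of the U-shape test idea (not the exact ush.test implementation)
suppressMessages(library(PDtoolkit))
data(gcd)
knot <- median(gcd$amount)            #one illustrative candidate knot
b1 <- pmin(gcd$amount, knot)          #captures the slope left of the knot
b2 <- pmax(gcd$amount - knot, 0)      #captures the slope right of the knot
lr <- glm(gcd$qual ~ b1 + b2, family = "binomial")
cf <- summary(lr)$coefficients[-1, ]
cf
#opposite (and significant) coefficient signs indicate a change of direction around the knot
sign(cf["b1", "Estimate"]) != sign(cf["b2", "Estimate"])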
ush.test( x, y, p.value = 0.05, min.pct.obs = 0.05, min.pct.def = 0.01, g = 20, sc = c(NA, Inf, -Inf, NaN) )
x |
Numeric vector to be tested for U-shape. |
y |
Numeric target vector (binary). |
p.value |
Threshold for p-value of statistical significance of the estimated coefficients next to basis functions. Default is 0.05. |
min.pct.obs |
Minimum percentage of observations per bin. Default is 0.05. |
min.pct.def |
Minimum |
g |
Number of knots used for testing the U-shape (integer). It should take values between 2 and 50 with default value of 20. |
sc |
Numeric vector with special case elements. Default values are |
The command ush.test returns a list of three objects. The first object (candidates)
is the data frame with the summary of tested candidate knots. Using the reported results of this data frame,
the user can conclude whether a U-shape exists at all (rows where the column direction is equal to TRUE) and
check its statistical significance (column significance - TRUE, FALSE).
The second object (optimal) reports the optimal knot value (if it exists).
It is selected as the knot with minimum deviance among all candidates for
which direction and significance are equal to TRUE.
The last, third, object (basis.functions) exports the basis functions for the optimal knot. Basis functions
are exported only if an optimal knot is found.
If no optimal knot is found, users are encouraged to inspect the candidate testing results more closely.
suppressMessages(library(PDtoolkit))
data(gcd)
res <- ush.test(x = gcd$amount, y = gcd$qual)
res
#optimal knot is not found, so a candidate can be defined as follows:
direction.t <- res$candidate[res$candidate$direction, ]
optimal.k <- direction.t$cp[direction.t$deviance%in%min(direction.t$deviance)]
optimal.k
woe.tbl
calculates WoE and information value for a given target variable and risk factor, along with
the accompanying metrics needed for their calculation.
WoE table reports:
bin: Risk factor group (bin).
no: Number of observations per bin.
ng: Number of good cases (where target is equal to 0) per bin.
nb: Number of bad cases (where target is equal to 1) per bin.
pct.o: Percentage of observations per bin.
pct.g: Percentage of good cases (where target is equal to 0) per bin.
pct.b: Percentage of bad cases (where target is equal to 1) per bin.
dr: Default rate per bin.
so: Number of all observations.
sg: Number of all good cases.
sb: Number of all bad cases.
dist.g: Distribution of good cases per bin.
dist.b: Distribution of bad cases per bin.
woe: WoE value.
iv.b: Information value per bin.
iv.s: Information value of risk factor (sum of individual bins' information values).
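In formula terms, the woe and iv.b columns follow the usual convention woe = ln(dist.g / dist.b) and iv.b = (dist.g - dist.b) * woe, with iv.s as their sum (assumed here as the standard definition rather than taken from the package internals). The reported table can be checked against these formulas directly:

#recompute WoE and IV from the distributions reported by woe.tbl
suppressMessages(library(PDtoolkit))
data(gcd)
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual, y.type = "bina")[[2]]
tbl <- woe.tbl(tbl = gcd, x = "age.bin", y = "qual")
woe.check <- log(tbl$dist.g / tbl$dist.b)
iv.check <- sum((tbl$dist.g - tbl$dist.b) * woe.check)
cbind(woe = tbl$woe, woe.check = woe.check)
c(iv.s = unique(tbl$iv.s)[1], iv.check = iv.check)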
woe.tbl(tbl, x, y, y.check = TRUE)
tbl |
Data frame which contains target variable ( |
x |
Selected risk factor. |
y |
Selected target variable. |
y.check |
Logical, if target variable ( |
The command woe.tbl
returns the data frame with WoE and information value calculations along with the accompanying metrics.
bivariate
for automatic bivariate analysis.
suppressMessages(library(PDtoolkit))
data(gcd)
#categorize numeric risk factor
gcd$age.bin <- woe.bin(x = gcd$age, y = gcd$qual, y.type = "bina")[[2]]
#generate the WoE table
woe.tbl(tbl = gcd, x = "age.bin", y = "qual")