Assignment completed for Public Policy Analytics course at the University of Pennsylvania.
Emil City’s (fictional) Department of Housing and Community Development (HCD) offers a robust home repair tax credit program: we provide homeowners who complete our credit program with $5,000 toward home repairs of any kind.
Homeowners who have participated report high satisfaction with the program. Typically, a homeowner sees a $10,000 increase in home value after making a repair with our credit. Even the neighbors benefit - on average, houses near a repaired home see a combined value increase of $56,000.
Despite these clear benefits, HCD has struggled to find homeowners willing to enroll in the program. To increase participation in this program, I am going to use HCD’s previous campaign data to better target program outreach, and thus increase the benefits of the program across Emil City.
# Set up
options(scipen=10000000)
library(tidyverse)
library(kableExtra)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
#library(jtools)
#library(ggstance)
library(sjPlot)
#install.packages("janitor")
library(janitor)
library(gridExtra)
palette2 <- c("#cc4778", "#0d0887")
palette1 <- c("#0d0887", "#7e03a8", "#cc4778", "#f89540", "#f0f921")
palette3 <- c("#f0f921", "#f89540", "#cc4778", "#7e03a8", "#0d0887")
# Read in data
info <- read_csv("C:\\Users\\radams\\src\\PubPolicyAnalytics\\HW4\\508_HW4\\data\\housingSubsidy.csv")
I’m going to engineer a few features from the data set to help with my data exploration and modeling. In particular, I try to build features that make sense in the context of the housing subsidy. For instance, I create a feature that categorizes the season in which the credit is offered. The logic is that spring and summer are often considered the seasons of home improvement, so people may be more interested in the credit during these months.
Although this code is presented before data exploration, most of this feature engineering took place concurrently with data exploration.
info <- info %>%
select(-cons.price.idx) %>%
mutate(
# ID months in spring/summertime
season = (ifelse(month %in% c("apr", "may", "jun", "jul", "aug"), "spring_summer", "fall_winter")),
# tease out four seasons (jan and feb added to winter defensively; they do not appear in this data)
season4 = (case_when(
month %in% c("mar", "apr", "may") ~ "spring",
month %in% c("jun", "jul", "aug") ~ "summer",
month %in% c("sep", "oct", "nov") ~ "fall",
month %in% c("dec", "jan", "feb") ~ "winter")),
# Create age groups
age_group = case_when(
age < 30 ~ "18-29",
age >= 30 & age < 40 ~ "30-39",
age >= 40 & age < 50 ~ "40-49",
age >= 50 & age < 60 ~ "50-59",
age >= 60 & age < 70 ~ "60-69",
age >= 70 ~ "70+"),
# create group for probable FHA mortgage holders (proxy for first time house owners)
first_timers = if_else(age < 40 & mortgage == "yes" & marital %in% c("married", "single"), "first_timers", "not_FHA"),
# Group retirees together
retirees = if_else(job == "retired", "retiree", "not_retired"),
# Contacted or not
contacted_ever = if_else(pdays == 999, "not contacted", "contacted"),
# Amount spent on repairs is less/more than the credit
repairs_amount = case_when(
spent_on_repairs <= 5000 ~ "within_credit",
spent_on_repairs > 5000 ~ "beyond_credit"), # no upper bound, so no NAs for na.omit() to silently drop later
# Group jobs together
job_group = if_else(
(job %in% c("student", "unemployed")), "limited_means", "means"),
# group inflation
inflation_group = if_else(inflation_rate >= 4, "high", "low"),
# group consumer confidence index
cons_conf_group = case_when(
cons.conf.idx < -40 ~ "very_low",
cons.conf.idx >= -40 & cons.conf.idx < -35 ~ "medium_low",
cons.conf.idx >= -35 ~ "low"), # >= closes boundary gaps that previously produced NAs at exactly -40 and -35
# education groups
edu_group = if_else(
education %in% c("university.degree", "professional.course"),"advanced", "other"),
summer_fridays = if_else(season4 == "summer" & day_of_week == "fri", "summer_friday", "not_summer_friday"),
over_50 = if_else(age >= 50, "over_50", "under_50"),
unemp_group = case_when(
unemploy_rate < -2 ~ "very low",
unemploy_rate >= -2 & unemploy_rate < -1 ~ "lower",
unemploy_rate >= -1 & unemploy_rate < 0 ~ "low",
unemploy_rate >= 0 ~ "positive"), # >= closes boundary gaps that previously produced NAs at exactly -2, -1, and 0
# cons_price_group = case_when(
# cons.price.idx < 92.7 ~ "under_92.7",
# cons.price.idx >= 92.7 & cons.price.idx < 93.3 ~ "93ish",
# cons.price.idx >= 93.3 & cons.price.idx < 93.7 ~ "93.5ish",
# cons.price.idx >= 93.5 & cons.price.idx < 94.3 ~ "94ish",
# cons.price.idx >= 94.3 ~ "94+ish"),
log_age = log(age),
#log_unemp = log(unemploy_rate)+1,
#log_cons_conf_idx = log(cons.conf.idx)+1,
#log_cons_price_idx = log(cons.price.idx),
log_inflation = log(inflation_rate),
log_spend = log(spent_on_repairs),
persistent_elders = if_else(age > 60 & previous > 0, "persistent_elder", "non_persistent_elder"),
mature_white_collar = if_else(job %in% c("admin.", "technician") & edu_group == "advanced" & over_50 == "over_50" & previous > 0, "mat_white_collar", "not_white_collar")
)
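Before exploring the data, a few quick tabulations help confirm the engineered categories behave as intended. This is a minimal sanity check using the columns defined above; it simply counts group membership:
# Sanity-check the engineered categories: each group should be non-empty,
# and any uncovered value range would show up here as an NA count.
info %>% count(season4, month)  # months map to the expected seasons
info %>% count(age_group)       # age bins are mutually exclusive
info %>% count(repairs_amount)  # NAs would indicate an uncovered spending range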
Age: Below is a density plot of the original age variable. This density plot shows that the program previously primarily targeted people around 30-40 years old. Despite this, these groups had more “nos” than “yeses”. After about age 60, it appears that more people agreed to enroll in the program. This indicates that a feature built around this older age group may offer a path towards more “yeses”. The second graph illustrates the average age for groups that said “no” and “yes” respectively. It appears that there is not much difference between the two groups according to this bar plot: the “yes” group, at an average age of about 42, tends to be slightly older than the “no” group (average age is about 40).
grid.arrange(ncol=2,
info %>%
dplyr::select(y, age) %>%
gather(Variable, value, -y) %>%
ggplot() +
geom_density(aes(value, color=y), fill = "transparent") +
facet_wrap(~Variable, scales = "free") +
scale_color_manual(values = palette2) +
labs(title = "Feature distributions credit vs.\nno credit",
subtitle = "(continous outcomes)"),
info %>%
dplyr::select(y, age) %>%
gather(Variable, value, -y) %>%
ggplot(aes(y, value, fill=y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(x="y", y="Value",
title = "Feature associations with the\nlikelihood of taking the credit",
subtitle = "(continous outcomes)") +
theme(legend.position = "none"))
Day of the Week: Below is a graph of average responses to the campaign for different days of the week. It appears that there is not much correlation between the day of the week and whether a person opted in to the program.
info %>%
dplyr::select(y, day_of_week) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="y", y="Value",
title = "Feature associations with the likelihood of taking the credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
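To put a number on this, the share of “yes” responses per weekday can be computed directly. A small sketch using the same columns as the plot above:
# Proportion of "yes" responses by day of week; roughly equal shares
# across days would confirm the weak association seen in the plot.
info %>%
group_by(day_of_week) %>%
summarize(yes_share = mean(y == "yes"))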
Season: I created a feature that classifies the month we last contacted a homeowner into a season group: spring/summer or fall/winter. My hypothesis is that more people are interested in taking on home improvement projects during the warmer months. The graph below bears this hypothesis out: more people took the credit during the spring and summer months:
info %>%
dplyr::select(y, season) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="y", y="Value",
title = "Feature associations with the likelihood of taking the credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
First-timers: I created a feature to proxy new homeowners: everyone below 40 years old who has a mortgage and is either married or single. Interestingly, people who were not identified as first-timers took the credit more often.
info %>%
dplyr::select(y, first_timers) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="y", y="Value",
title = "Feature associations with the likelihood of taking the credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Repairs amount: I log-transformed the original variable recording the average amount people spent on their home repairs. The density plot below suggests that as the log of the amount spent on repairs increases, people become less likely to enroll in the credit program.
info %>%
dplyr::select(y, log_spend) %>%
gather(Variable, value, -y) %>%
ggplot() +
geom_density(aes(value, color=y), fill = "transparent") +
facet_wrap(~Variable, scales = "free") +
scale_color_manual(values = palette2) +
labs(title = "Feature distributions credit vs. no credit",
subtitle = "(continous outcomes)")
Persistent Elders: I created a variable to capture homeowners over the age of 60 who had been previously contacted by the campaign. This group was about as likely to enroll in the program as not. While a 50/50 split may not sound impressive, the high proportion of “nos” across the dataset makes it a key insight. However, since the group includes only about 16 homeowners, it wasn’t ultimately useful for the model.
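The size and split of this group can be verified with a quick cross-tabulation (a sketch; per the note above, expect roughly 16 members split about evenly between “yes” and “no”):
# Cross-tab of the persistent-elder flag against the outcome;
# the group is tiny, so its near-50/50 split should be read with caution.
table(info$persistent_elders, info$y)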
info %>%
dplyr::select(y, persistent_elders) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
filter(value == "persistent_elder") %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="y", y="Value",
title = "Feature associations with the likelihood of taking the credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
My first model is the kitchen sink model, which includes all of the original and engineered variables. This is unlikely to be a successful model, but the results can be helpful to understand what may be important for my final model.
info <- info %>%
na.omit() %>%
mutate(y_fct = factor(y, levels = c("yes","no"))) %>%
select(-`...1`)
set.seed(4119)
trainIndex <- createDataPartition(
y = paste(info$education, info$taxLien), p = .65, # 65 training/35 test set
list = FALSE,
times = 1)
subsidyTrain <- info[ trainIndex,]
subsidyTest <- info[-trainIndex,]
# Kitchen sink model
kitchen_sink <- glm(subsidyTrain$y_fct ~ .,
data=subsidyTrain %>%
dplyr::select(-y, -y_numeric, -y_fct), # drop every encoding of the outcome from the predictors
family = binomial(link = "logit"))
#summary(kitchen_sink) # AIC: 1502
#pR2(kitchen_sink) # McFadden: 0.256
kitchen_sink_summary <- tab_model(kitchen_sink, show.reflvl = T, show.intercept = T, p.style = "numeric_stars")
kitchen_sink_summary
Dependent variable: subsidyTrain$y_fct

Predictors | Odds Ratios | CI | p |
---|---|---|---|
(Intercept) | 0.00 | 0.00 – 384302622622991782122466022886440264880600240.00 | 0.504 |
age | 1.03 | 0.85 – 1.26 | 0.749 |
age_group30-39 | 0.62 | 0.66 – 1.28 | 0.204 |
age_group40-49 | 0.52 | 0.56 – 0.48 | 0.275 |
age_group50-59 | 0.27 | 0.31 – 0.23 | 0.099 |
age_group60-69 | 0.18 | 0.23 – 1.54 | 0.117 |
age_group70+ | 0.12 | 0.07 – 0.33 | 0.201 |
campaign | 1.07 | 0.99 – 1.17 | 0.093 |
cons.conf.idx | 1.16 | 0.97 – 1.39 | 0.114 |
cons_conf_groupmedium_low | 2.10 | 1.40 – 14.86 | 0.457 |
cons_conf_groupvery_low | 6.07 | 0.54 – 4.94 | 0.136 |
contacted_evernot contacted | 0.00 | 14.87 – 0.00 | 0.914 |
contacttelephone | 2.30 ** | 1.28 – 4.38 | 0.007 |
day_of_weekmon | 1.23 | 1.11 – 2.14 | 0.462 |
day_of_weekthu | 1.38 | 0.78 – 1.31 | 0.267 |
day_of_weektue | 1.13 | 0.64 – 1.98 | 0.671 |
day_of_weekwed | 1.23 | 0.69 – 2.18 | 0.484 |
educationbasic.6y | 0.68 | 0.73 – 1.63 | 0.370 |
educationbasic.9y | 0.76 | 0.43 – 1.52 | 0.439 |
educationhigh.school | 0.76 | 0.81 – 1.47 | 0.435 |
educationilliterate | 7855649312.24 | 0.00 – Inf | 1.000 |
educationprofessional.course | 0.74 | 0.35 – 1.55 | 0.433 |
educationuniversity.degree | 0.73 | 0.36 – 1.42 | 0.361 |
educationunknown | 0.84 | 0.34 – 2.15 | 0.712 |
first_timersnot_FHA | 0.76 | 0.44 – 0.72 | 0.336 |
inflation_grouplow | 2758457335.03 | 0.00 – NA | 1.000 |
inflation_rate | 0.00 ** | 0.00 – 0.20 | 0.005 |
jobblue-collar | 1.30 | 0.75 – 1.27 | 0.350 |
jobentrepreneur | 1.74 | 0.71 – 5.25 | 0.266 |
jobhousemaid | 0.93 | 0.37 – 2.68 | 0.888 |
jobmanagement | 1.82 | 1.92 – 1.72 | 0.062 |
jobretired | 1.75 | 1.55 – 4.35 | 0.216 |
jobself-employed | 1.94 | 0.87 – 4.79 | 0.126 |
jobservices | 1.20 | 0.67 – 1.15 | 0.548 |
jobstudent | 1.25 | 0.47 – 3.48 | 0.659 |
jobtechnician | 1.11 | 0.68 – 1.82 | 0.675 |
jobunemployed | 1.10 | 0.48 – 2.80 | 0.833 |
jobunknown | 0.96 | 0.23 – 1.78 | 0.958 |
log_age | 1.12 | 0.34 – 1722.52 | 0.976 |
log_inflation | 316.26 | 0.34 – 342550.85 | 0.100 |
maritalmarried | 1.31 | 0.81 – 2.08 | 0.261 |
maritalsingle | 1.08 | 0.62 – 1.88 | 0.777 |
maritalunknown | 1.08 | 0.13 – 7.88 | 0.950 |
mature_white_collarnot_white_collar | 2.31 | 2.58 – 2.06 | 0.228 |
monthaug | 0.96 | 1.15 – 0.84 | 0.972 |
monthdec | 0.69 | 0.08 – 6.26 | 0.746 |
monthjul | 0.57 | 0.06 – 1.43 | 0.630 |
monthjun | 0.28 | 0.32 – 0.31 | 0.262 |
monthmar | 0.47 | 0.15 – 1.47 | 0.192 |
monthmay | 1.03 | 0.48 – 2.18 | 0.930 |
monthnov | 2.69 | 4.14 – 3.28 | 0.253 |
monthoct | 1.26 | 0.22 – 7.24 | 0.792 |
monthsep | 1.47 | 0.29 – 7.60 | 0.643 |
mortgageunknown | 1.27 | 0.46 – 4.30 | 0.672 |
mortgageyes | 1.00 | 0.67 – 0.97 | 0.997 |
pdays | 1.01 | 0.88 – 1.15 | 0.906 |
persistent_elderspersistent_elder | 0.34 | 0.06 – 1.76 | 0.197 |
poutcomenonexistent | 0.31 ** | 0.13 – 0.67 | 0.004 |
poutcomesuccess | 0.30 | 0.06 – 1.62 | 0.160 |
previous | 0.58 * | 0.33 – 0.95 | 0.042 |
repairs_amountwithin_credit | 0.66 | 0.87 – 25.82 | 0.822 |
spent_on_repairs | 1.01 | 0.98 – 1.01 | 0.358 |
summer_fridayssummer_friday | 1.30 | 1.39 – 1.21 | 0.500 |
taxbill_in_phlyes | 1.06 | 1.09 – 1.56 | 0.756 |
taxLienunknown | 0.88 | 0.71 – 0.80 | 0.567 |
taxLienyes | 2988974.39 | 0.00 – Inf | 0.998 |
unemp_grouplower | 0.00 | NA – Inf | 1.000 |
unemp_grouppositive | 22.85 | 26.41 – 606.90 | 0.061 |
unemp_groupvery low | 0.00 | NA – Inf | 1.000 |
unemploy_rate | 2.45 | 2.71 – 2.20 | 0.229 |
Observations | 2671 | | |
R2 Tjur | 0.254 | | |
My final selected model is below. It was selected after an iterative process of comparing 21 different models. This model predicts a homeowner’s decision to enroll in the program from the month of contact, the log-transformed amount the homeowner spends on repairs, and the contact method.
# MODEL 20
model_20_vars = c("month", "log_spend", "contact")
model_20 <- glm(subsidyTrain$y_fct ~ .,
data=subsidyTrain %>%
dplyr::select(all_of(model_20_vars)),
family = binomial(link = "logit"))
#summary(model_20) # AIC: 1488.1
#pR2(model_20) # MF: 0.206
model_summary <- tab_model(model_20, show.reflvl = T, show.intercept = T, p.style = "numeric_stars")
model_summary
Dependent variable: subsidyTrain$y_fct

Predictors | Odds Ratios | CI | p |
---|---|---|---|
(Intercept) | 0.00 *** | 0.00 – 0.00 | <0.001 |
contacttelephone | 1.83 ** | 1.23 – 2.78 | 0.004 |
log_spend | 3886113395056675785266666208.00 *** | 211383397079214022720046.00 – 82352459382323025392242280482264.00 | <0.001 |
monthaug | 0.55 | 0.29 – 0.99 | 0.052 |
monthdec | 0.28 * | 0.08 – 0.91 | 0.036 |
monthjul | 0.46 * | 0.24 – 0.86 | 0.017 |
monthjun | 0.29 *** | 0.15 – 0.55 | <0.001 |
monthmar | 0.18 *** | 0.08 – 0.43 | <0.001 |
monthmay | 1.28 | 0.70 – 2.26 | 0.409 |
monthnov | 0.89 | 0.46 – 1.70 | 0.726 |
monthoct | 0.70 | 0.31 – 1.59 | 0.401 |
monthsep | 0.73 | 0.32 – 1.66 | 0.452 |
Observations | 2671 | | |
R2 Tjur | 0.193 | | |
# Kitchen sink CV
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
kitchen_sink_cvFit <- train(y_fct ~ ., data = info %>%
dplyr::select(-y, -y_numeric),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
#kitchen_sink_cvFit
# Selected model CV
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
model_cvFit <- train(y_fct ~ ., data = info %>%
dplyr::select(y_fct, all_of(model_20_vars)),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
#model_cvFit
The Goodness of Fit metrics for the cross-validations of both my models are presented below. The figures plot the distribution of Area Under the Curve (AUC), Sensitivity (the rate at which the model correctly predicts that people will take the credit - true positives), and Specificity (the rate at which it correctly predicts that people will not take the credit - true negatives) across 100 folds.
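As a reminder of how these metrics are computed from a confusion matrix, here is a minimal sketch with made-up counts (TP, FP, TN, and FN are illustrative placeholders, not output from either model):
# Illustrative counts only -- not model output
TP <- 20;  FN <- 120   # actual "yes" homeowners
TN <- 850; FP <- 40    # actual "no" homeowners
sensitivity <- TP / (TP + FN)  # share of actual enrollees correctly flagged
specificity <- TN / (TN + FP)  # share of actual non-enrollees correctly flagged
c(sensitivity = sensitivity, specificity = specificity)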
Kitchen Sink Model My kitchen sink model resulted in a mean AUC of 0.759, a sensitivity of 0.21, and a specificity of 0.98. The low sensitivity means the model does not generalize well for positives: it misses most homeowners who actually take the credit. Its specificity is very high, so it correctly identifies non-enrollees nearly all of the time. The AUC of 0.759 indicates reasonably good overall discrimination between the two groups.
dplyr::select(kitchen_sink_cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(kitchen_sink_cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#7e03a8") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#f89540", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics - Kitchen Sink Model",
subtitle = "Across-fold mean reprented as dotted lines") +
theme_bw()
Selected Model My selected model resulted in a mean AUC of 0.77, a sensitivity of 0.139, and a specificity of 0.98. The low sensitivity means the model identifies only a small share of the homeowners who actually enroll. Its specificity is very high, so it correctly identifies non-enrollees nearly all of the time. The AUC of 0.77 indicates the model distinguishes enrollees from non-enrollees at a satisfactory rate.
dplyr::select(model_cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(model_cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#7e03a8") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#f89540", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics - Selected Model",
subtitle = "Across-fold mean reprented as dotted lines") +
theme_bw()
Predictions: I split my data set into a 65% training set and a 35% test set. After training my selected model on the training set, I generated predicted probabilities for the test set, classified each homeowner at a probability threshold, and summarized the results in a confusion matrix; the same predicted probabilities drive the ROC curve in the next section.
testProbs <- data.frame(Outcome = as.factor(subsidyTest$y_fct),
Probs = predict(model_20, subsidyTest, type= "response"))
testProbs <-
testProbs %>%
mutate(predOutcome = factor(
ifelse(Probs > 0.95, "yes", "no"),  # classify as "yes" only above the 0.95 threshold
levels = c("yes", "no")))           # order levels to match Outcome for confusionMatrix()
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "yes")
Below is the ROC (Receiver Operating Characteristic) curve for my selected model. The total area under the curve (AUC) is 0.754, which suggests the model separates likely enrollees from non-enrollees at a satisfactory rate - better than flipping a coin would.
From this plot, we see that when the model achieves a true positive rate of about 25% (correctly flagging actual enrollees), its false positive rate is about 10% (wrongly flagging non-enrollees). The best trade-off appears to be around a true positive rate of 80%, where the false positive rate is about 35%.
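To read specific operating points off the curve, pROC’s coords() function can report the sensitivity/specificity pair at a chosen threshold. A sketch, assuming the testProbs object built above:
# Operating point at the 0.95 threshold used later in this analysis,
# plus the threshold pROC rates "best" by Youden's index.
roc_obj <- roc(testProbs$Outcome, testProbs$Probs)
coords(roc_obj, x = 0.95, input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))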
auc(testProbs$Outcome, testProbs$Probs)
ggplot(testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#cc4778") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - Selected Model",
subtitle = "Area under the curve: 0.754")+
theme_bw()
Calculating costs and benefits for a public service good is not as straightforward as it may be for other goods and services. For the purposes of this analysis, I’ve calculated the costs and benefits as follows:
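The df object passed to kable() below is not constructed in the code shown; a minimal reconstruction, with values taken from the rendered table that follows, might look like this:
# Reconstruction of the cost/benefit table object (values from the table below)
df <- tibble::tribble(
  ~Category, ~Description, ~Cost_Equation, ~Cost_per_Enrollee, ~Benefit_Equation, ~Benefit_per_Enrollee, ~Net_Benefit_per_Enrollee,
  "True Positive",  "Cost per person who enrolls in the program, as predicted",                      "2,850 + (5,000 * 25% of enrollees)", 4100, "(10,000 + 56,000) * 25% of enrollees", 16500, 12400,
  "True Negative",  "Cost per person who doesn't enroll in the program, as predicted",               "0",                                  0,    "0",                                    0,     0,
  "False Positive", "Cost per person who doesn't enroll in the program, but was predicted to do so", "2,850",                              2850, "0",                                    0,     -2850,
  "False Negative", "Cost per person who does enroll in the program, but was predicted not to",      "5,000 * 25% of enrollees",           1250, "(10,000 + 56,000) * 25% of enrollees", 16500, 15250
)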
kable(df,
caption = "Cost/Benefit Equations (all figures in USD)",
format.args = list(big.mark = ",")) %>%
kable_classic_2(full_width = F) %>%
kable_styling("striped")
Category | Description | Cost_Equation | Cost_per_Enrollee | Benefit_Equation | Benefit_per_Enrollee | Net_Benefit_per_Enrollee |
---|---|---|---|---|---|---|
True Positive | Cost per person who enrolls in the program, as predicted | 2,850 + (5,000 * 25% of enrollees) | 4,100 | (10,000 + 56,000) * 25% of enrollees | 16,500 | 12,400 |
True Negative | Cost per person who doesn’t enroll in the program, as predicted | 0 | 0 | 0 | 0 | 0 |
False Positive | Cost per person who doesn’t enroll in the program, but was predicted to do so | 2,850 | 2,850 | 0 | 0 | -2,850 |
False Negative | Cost per person who does enroll in the program, but was predicted not to | 5,000 * 25% of enrollees | 1,250 | (10,000 + 56,000) * 25% of enrollees | 16,500 | 15,250 |
Costs: Financial costs are calculated as described in the table above. The cost per person for outreach is $2,850. This is only spent when the model predicts the person will enroll in the program. The cost to provide a credit is $5,000, which will be provided to everyone who completes the credit program (an estimated 25% of people who enroll in the program).
Benefits: The financial benefits of the program only exist for people who complete the program (estimated at 25% of those who enroll). For this group, the benefit is $10,000 in increased home value for each person, and $56,000 of average increased home value for their neighbors.
Net Benefit: The net financial benefit for each group is calculated as the total benefits minus the expected costs for that group.
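As a quick check of the per-enrollee figures above, the true positive case works out as follows (plain arithmetic using the stated $2,850 outreach cost, $5,000 credit, 25% completion rate, and $10,000 + $56,000 value gains):
# True positive, per predicted enrollee
cost_tp    <- 2850 + 0.25 * 5000       # outreach + expected credit = 4,100
benefit_tp <- 0.25 * (10000 + 56000)   # expected home-value gains  = 16,500
benefit_tp - cost_tp                   # net benefit                = 12,400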
Given this logic, the costs and benefits of our selected model are presented below.
cost_benefit_table_model <-
testProbs %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome=="no" & Outcome=="no"]),
True_Positive = sum(n[predOutcome=="yes" & Outcome=="yes"]),
False_Negative = sum(n[predOutcome=="no" & Outcome=="yes"]),
False_Positive = sum(n[predOutcome=="yes" & Outcome=="no"])) %>%
gather(Variable, Count) %>%
mutate(Cost =
ifelse(Variable == "True_Negative", Count * 0,
ifelse(Variable == "True_Positive", ((2850 + (5000*0.25)) * Count), #25% get the 5k credit
ifelse(Variable == "False_Negative", (0 + 5000*0.25) * Count, #25% get the 5k credit
ifelse(Variable == "False_Positive", (2850) * Count, 0)))),
Benefit =
ifelse(Variable == "True_Negative", Count * 0,
ifelse(Variable == "True_Positive",(((10000 + 56000)*0.25) * Count), #25% get the credit, and therefore the 10k house price equity and 56k equity for their neighbors
ifelse(Variable == "False_Negative", (((10000 + 56000)*0.25) * Count),
ifelse(Variable == "False_Positive", (0) * Count, 0)))),
Net_Benefit = case_when( #benefit - cost
Variable == "True_Negative" ~ (0 - 0),
Variable == "True_Positive" ~ ((((10000 + 56000)*0.25) * Count)) - ((2850 + (5000*0.25)) * Count),
Variable == "False_Negative" ~ ((((10000 + 56000)*0.25) * Count) - (0 + 5000*0.25) * Count),
Variable == "False_Positive" ~ (0 - ((2850) * Count)))) %>%
bind_cols(data.frame(Description = c(
"We correctly predicted they would not enroll in the program",
"We correctly predicted they would enroll in the program; 25% completed to receive the credit",
"We predicted the they would not enroll in the program, but the they did enroll; 25% completed to receive the credit",
"We predicted they would enroll in the program, but they did not enroll")))
cost_ben_totals <- adorn_totals(cost_benefit_table_model, where = "row", fill = "", na.rm = TRUE, name = "Total")
kable(cost_ben_totals,
caption = "Cost/Benefit Table, Selected Model. Threshold is 0.95.",
format.args = list(big.mark = ",")) %>%
kable_classic_2(full_width = F) %>%
kable_styling("striped")
Variable | Count | Cost | Benefit | Net_Benefit | Description |
---|---|---|---|---|---|
True_Negative | 845 | 0 | 0 | 0 | We correctly predicted they would not enroll in the program |
True_Positive | 19 | 77,900 | 313,500 | 235,600 | We correctly predicted they would enroll in the program; 25% completed to receive the credit |
False_Negative | 132 | 165,000 | 2,178,000 | 2,013,000 | We predicted they would not enroll in the program, but they did enroll; 25% completed to receive the credit |
False_Positive | 429 | 1,222,650 | 0 | -1,222,650 | We predicted they would enroll in the program, but they did not enroll |
Total | 1,425 | 1,465,550 | 2,491,500 | 1,025,950 | |
iterateThresholds <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
data %>% # use the function argument rather than the global testProbs
mutate(predOutcome = ifelse(Probs > x, "yes", "no")) %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome=="no" & Outcome=="no"]),
True_Positive = sum(n[predOutcome=="yes" & Outcome=="yes"]),
False_Negative = sum(n[predOutcome=="no" & Outcome=="yes"]),
False_Positive = sum(n[predOutcome=="yes" & Outcome=="no"])) %>%
gather(Variable, Count) %>%
mutate(Net_Benefit = #Benefit - Cost
ifelse(Variable == "True_Negative", Count * 0,
ifelse(Variable == "True_Positive", ((((10000 + 56000)*0.25) * Count) - ((2850 + (5000*0.25)) * Count)), # benefit - cost
ifelse(Variable == "False_Negative", ((((10000 + 56000)*0.25) * Count) - ((0 + 5000*0.25) * Count)), #benefit - cost
ifelse(Variable == "False_Positive", (0 - (2850) * Count), 0)))),
Threshold = x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
The plot below illustrates the expected net financial benefit at each threshold the model could use. The threshold is the predicted probability of enrollment above which a homeowner is counted as a predicted “yes”. Since program recruitment is costly, the graph shows how the net financial outcome changes across thresholds.
Said differently, this plot shows how confident we need to be in a homeowner’s likely enrollment for the outreach to be worth its cost. It appears that we need to be extremely sure - almost 100% sure - that someone will enroll in order for the program to see a positive net financial benefit.
whichThreshold_model <- iterateThresholds(testProbs)
whichThreshold_revenue_model <-
whichThreshold_model %>%
group_by(Threshold) %>%
summarize(Net_Benefit = sum(Net_Benefit))
ggplot(whichThreshold_revenue_model)+
geom_line(aes(x = Threshold, y = Net_Benefit), color = "#cc4778")+
geom_vline(xintercept = pull(arrange(whichThreshold_revenue_model, -Net_Benefit)[1,1]), color = "#0d0887")+
labs(title = "Model Revenues By Threshold For Test Sample\nSelected Model",
subtitle = "Vertical Line Denotes Optimal Threshold")+
theme_bw()
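The threshold marked by the vertical line can also be pulled out directly; a small sketch using the summarized table above:
# Threshold with the highest total net benefit, and that benefit
whichThreshold_revenue_model %>%
arrange(desc(Net_Benefit)) %>%
slice(1)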
This plot shows the net financial benefit for each confusion matrix category, depending on the threshold used by the model. True positives yield their greatest net benefit when the threshold is below 0.25, while the benefit from false negatives grows as the threshold increases. True negatives stay flat, since their costs and benefits are always zero. False positives have a negative net benefit until the threshold is close to one.
In sum, this tells us that the maximum net benefit occurs when the threshold is set very high, and in that case, false negatives provide the largest source of financial benefit.
whichThreshold_model %>%
ggplot(.,aes(Threshold, Net_Benefit, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette3[c(5, 1:3)]) +
labs(title = "Net Benefit by confusion matrix type and threshold",
y = "Net Benefit") +
theme_bw() +
guides(colour=guide_legend(title = "Confusion Matrix"))
These graphs show the program’s net financial benefit and the total number of credits provided, based on the threshold the model uses to predict enrollment. As in the previous net benefit vs. threshold graph, these plots tell us that the threshold must be very high (about 0.95 or more) for the program to see a positive financial benefit and to distribute the maximum number of credits.
whichThreshold_netBen <-
whichThreshold_model %>%
mutate(actualEnroll = ifelse(Variable == "True_Positive", (Count * .25),
ifelse(Variable == "False_Negative", Count, 0))) %>%
group_by(Threshold) %>%
summarize(Net_Benefit = sum(Net_Benefit),
Total_Credits = sum(actualEnroll))
grid.arrange(ncol=2,
ggplot()+
geom_point(data = whichThreshold_netBen , aes(x = Threshold, y = Net_Benefit), color = "#cc4778")+
geom_line(data = whichThreshold_netBen , aes(x = Threshold, y = Net_Benefit), color = "#cc4778")+
labs(title = "Net Benefit vs.\nSelected Threshold",
y = "Net Benefit")+
theme_bw(),
ggplot()+
geom_point(data = whichThreshold_netBen , aes(x = Threshold, y = Total_Credits), color = "#0d0887")+
geom_line(data = whichThreshold_netBen , aes(x = Threshold, y = Total_Credits), color = "#0d0887")+
labs(title = "Total Credits Provided vs.\nSelected Threshold",
y = "Total Credits")+
theme_bw())
The above plots are summarized in the table below. As we can see, the net financial benefit only becomes positive once the threshold value is 0.95 - meaning, we need to be at least 95% sure that someone will enroll in the program for the program to have a net positive impact.
whichThreshold_netBen %>%
slice(seq(5, 95, by = 5)) %>% # keep every fifth threshold (0.05 through 0.95)
kable(.,
caption = "Thresholds and Associated Net Benefits and Credits, Selected Model",
format.args = list(big.mark = ",")) %>%
kable_classic_2(full_width = F) %>%
kable_styling("striped")
Threshold | Net_Benefit | Total_Credits |
---|---|---|
0.05 | -1,758,500 | 37.75 |
0.10 | -1,758,500 | 37.75 |
0.15 | -1,758,500 | 37.75 |
0.20 | -1,758,500 | 37.75 |
0.25 | -1,758,500 | 37.75 |
0.30 | -1,738,550 | 42.25 |
0.35 | -1,721,450 | 44.50 |
0.40 | -1,721,450 | 44.50 |
0.45 | -1,670,150 | 52.00 |
0.50 | -1,593,200 | 58.75 |
0.55 | -1,567,550 | 62.50 |
0.60 | -1,473,500 | 73.00 |
0.65 | -1,422,200 | 76.00 |
0.70 | -1,370,900 | 82.75 |
0.75 | -1,288,250 | 90.25 |
0.80 | -1,274,000 | 91.75 |
0.85 | -1,259,750 | 91.75 |
0.90 | -498,800 | 112.00 |
0.95 | 1,025,950 | 136.75 |
In a world where people refuse to answer calls from unknown numbers, and trust in governmental institutions is near an all-time low, it is very difficult to recruit homeowners into a governmental program, whatever its benefits. With this challenge in mind, my model correctly predicts when homeowners will enroll in the program 12.6% of the time, a mild improvement over the program’s previous success rate of 11%.
However, my model provides some significant advantages for the Department. First, it calculates the probability that a homeowner will enroll in the program - a huge advantage over having no model, because we can use that probability to target the highest-probability homeowners. It also quantifies the trade-offs in net financial benefit across those probability thresholds. Using these calculated probabilities, I was able to chart a path for the program to deliver a net positive benefit to the community: at the right threshold, the model yields more than a million dollars’ worth of increased home value, even after accounting for the program’s costs.
With my investigation in mind, I have the following suggestions for the marketing campaign. First, it appears that more people tend to enroll in the program in the months of March and June; it may be worth ramping up outreach then. Also, adults over 55 enrolled in the program at a higher rate than younger adults; I suggest targeting that age group instead of the 30-40 year-olds that the program has tended to target.
It is also important to streamline the credit program: for each of the estimated 75% of enrollees who drop out, the Department loses the $2,850 recruitment cost, and the community forgoes $66,000 in boosted home values. If the dropout rate could be reduced to 50%, the net benefit per predicted enrollee would more than double, from $12,400 to $27,650.
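A back-of-the-envelope check of that figure, using the same cost structure as the cost/benefit table (outreach of $2,850, a $5,000 credit, and $10,000 + $56,000 in value gains per completion):
# Net benefit per predicted enrollee at 25% vs. 50% completion
net_per_enrollee <- function(completion_rate) {
  benefit <- completion_rate * (10000 + 56000)  # expected home-value gains
  cost    <- 2850 + completion_rate * 5000      # outreach + expected credit
  benefit - cost
}
net_per_enrollee(0.25)  # 12,400
net_per_enrollee(0.50)  # 27,650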
While collecting richer data is not always possible, it would be helpful to include spatial information in the dataset, such as the homeowner’s address or zip code. This would let us examine enrollment across space - for example, relating a homeowner’s enrollment to the number of neighbors who previously enrolled in the credit program - and could help the model correctly predict enrollees.
Finally, if the per-contact cost of recruitment can be reduced, the program would see a greater net financial benefit to the community. It would also mean we could reach out to a larger number of people, since our threshold probability of enrollment would not need to be as high.
In summary, this model would help the Department boost its positive net financial impact on the community. However, it is best used to improve the credit program in conjunction with the other organizations within the Department, not in a silo by the data science department alone.