Assignment completed for Public Policy Analytics course at the University of Pennsylvania.

Housing Subsidies

Emil City's (fictional) Department of Housing and Community Development (HCD) offers a robust home repair tax credit program: homeowners who complete the credit program receive $5,000 towards home repairs of any kind.

Homeowners who have participated in the program report high satisfaction with it. Typically, a homeowner sees an increase in home value of $10,000 after making a repair with the credit. Even neighbors benefit: on average, the neighbors of a repaired home see a combined increase in house value of $56,000.

Despite these clear benefits, HCD has struggled to find homeowners willing to enroll in the program. To increase participation in this program, I am going to use HCD’s previous campaign data to better target program outreach, and thus increase the benefits of the program across Emil City.

# Set up
options(scipen=10000000)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
#library(jtools) 
#library(ggstance)
library(sjPlot)
#install.packages("janitor")
library(janitor)
library(gridExtra)

palette2 <- c("#cc4778", "#0d0887")

palette1 <- c("#0d0887", "#7e03a8", "#cc4778", "#f89540", "#f0f921")
palette3 <- c("#f0f921", "#f89540", "#cc4778", "#7e03a8", "#0d0887")

# Read in data
info <- read_csv("C:\\Users\\radams\\src\\PubPolicyAnalytics\\HW4\\508_HW4\\data\\housingSubsidy.csv")

I'm going to engineer a few features from the data set to help with my data exploration and modeling. In particular, I try to build features that make sense in the context of the housing subsidy. For instance, I create a feature that categorizes the season in which the credit is offered. The logic is that spring and summer are often considered the home-improvement seasons, so people may be more interested in the offered credit during those months.

Although this code is presented before the data exploration, most of this feature engineering took place concurrently with data exploration.

info <- info %>%
  select(-cons.price.idx) %>%
  mutate(
    # ID months in spring/summertime
    season = (ifelse(month %in% c("apr", "may", "jun", "jul", "aug"), "spring_summer", "fall_winter")),
    
    # tease out four seasons
    season4 = (case_when(
      month %in% c("mar", "apr", "may") ~ "spring",
      month %in% c("jun", "jul", "aug") ~ "summer",
      month %in% c("sep", "oct", "nov") ~ "fall",
      month %in% c("dec") ~ "winter")),
    
    # Create age groups
    age_group = case_when(
      age < 30 ~ "18-29",
      age >= 30 & age < 40 ~ "30-39",
      age >= 40 & age < 50 ~ "40-49",
      age >= 50 & age < 60 ~ "50-59",
      age >= 60 & age < 70 ~ "60-69",
      age >= 70 ~ "70+"),
    
    # create group for probable FHA mortgage holders (proxy for first time house owners) 
    first_timers = if_else(age < 40 & mortgage == "yes" & marital %in% c("married", "single"), "first_timers", "not_FHA"),
    
    # Group retirees together
    retirees = if_else(job == "retired", "retiree", "not_retired"),
    
    # Contacted or not
    contacted_ever = if_else(pdays == 999, "not contacted", "contacted"),
    
    # Amount spent on repairs is less/more than the credit
    repairs_amount = case_when(
      spent_on_repairs <= 5000 ~ "within_credit",
      spent_on_repairs > 5000 & spent_on_repairs <= 10000 ~ "beyond_credit"),
    
    # Group jobs together
    job_group = if_else(
      (job %in% c("student", "unemployed")), "limited_means", "means"),
    
    # group inflation
    inflation_group = if_else(inflation_rate >= 4, "high", "low"),
    
    # group consumer confidence index
    cons_conf_group = case_when(
      cons.conf.idx < -40 ~ "very_low",
      cons.conf.idx >= -40 & cons.conf.idx < -35 ~ "medium_low",  # >= so boundary values aren't dropped as NA
      cons.conf.idx >= -35 ~ "low"),
    
    # education groups
    edu_group = if_else(
      education %in% c("university.degree", "professional.course"),"advanced", "other"),
    
    summer_fridays = if_else(season4 == "summer" & day_of_week == "fri", "summer_friday", "not_summer_friday"),
    
    over_50 = if_else(age >= 50, "over_50", "under_50"),
    
    unemp_group = case_when(
      unemploy_rate < -2 ~ "very low",
      unemploy_rate >= -2 & unemploy_rate < -1 ~ "lower",  # >= so boundary values aren't dropped as NA
      unemploy_rate >= -1 & unemploy_rate < 0 ~ "low",
      unemploy_rate >= 0 ~ "positive"),
    
    # cons_price_group = case_when(
    #   cons.price.idx < 92.7 ~ "under_92.7",
    #   cons.price.idx >= 92.7 & cons.price.idx < 93.3 ~ "93ish",
    #   cons.price.idx >= 93.3 & cons.price.idx < 93.7 ~ "93.5ish",
    #   cons.price.idx >= 93.5 & cons.price.idx < 94.3 ~ "94ish",
    #   cons.price.idx >= 94.3 ~ "94+ish"),
    
    log_age = log(age),
    
    #log_unemp = log(unemploy_rate)+1,
    
    #log_cons_conf_idx = log(cons.conf.idx)+1,
    
    #log_cons_price_idx = log(cons.price.idx),
    
    log_inflation = log(inflation_rate),
    
    log_spend = log(spent_on_repairs),
    
    persistent_elders = if_else(age > 60 & previous > 0, "persistent_elder", "non_persistent_elder"),
    
    mature_white_collar = if_else(job %in% c("admin.", "technician") & edu_group == "advanced" & over_50 == "over_50" & previous > 0, "mat_white_collar", "not_white_collar")
    )

Correlations

Original variables

Age: Below is a density plot of the original age variable. It shows that the program previously targeted mostly people around 30-40 years old. Despite this, these groups had more “nos” than “yeses”. After about age 60, it appears that more people agreed to enroll in the program, which suggests that a feature built around this older age group may offer a path toward more “yeses”. The second graph shows the average age of the “no” and “yes” groups. There is not much difference between the two: the “yes” group, with an average age of about 42, tends to be slightly older than the “no” group (about 40).

grid.arrange(ncol=2,
info %>%
    dplyr::select(y, age) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions credit vs.\nno credit",
         subtitle = "(continous outcomes)"),
info %>%
  dplyr::select(y, age) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="y", y="Value", 
           title = "Feature associations with the\nlikelihood of taking the credit",
           subtitle = "(continous outcomes)") +
      theme(legend.position = "none"))

Day of the Week: Below is a graph of average responses to the campaign for different days of the week. It appears that there is not much correlation between the day of the week and whether a person opted in to the program.

info %>%
    dplyr::select(y, day_of_week) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

New features

Season: I created a feature that classifies the month we last contacted a homeowner into its associated season: spring/summer or fall/winter. My hypothesis is that more people are interested in taking on home improvement projects during the warmer months. The graph below bears this hypothesis out: more people took the credit in the spring and summer months.

info %>%
    dplyr::select(y, season) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

First-timers: I created a feature to proxy for new homeowners. This category includes everyone below 40 years old who has a mortgage and is either married or single. Interestingly, people who were not identified as first-timers took the credit more often.

info %>%
    dplyr::select(y, first_timers) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Repairs amount: I log-transformed the original variable measuring the amount people spent on their home repairs. It appears that as log repair spending increases, people are less likely to enroll in the credit program.

info %>%
    dplyr::select(y, log_spend) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions credit vs. no credit",
         subtitle = "(continous outcomes)")

Persistent Elders: I created a variable to capture homeowners over the age of 60 who had been previously contacted by the campaign. This group was about as likely to enroll in the program as not. While a 50/50 split may not seem favorable, the high proportion of “nos” across the dataset makes this a key insight. However, since this group contained only about 16 members, it wasn’t ultimately useful for the model.

info %>%
    dplyr::select(y, persistent_elders) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
    filter(value == "persistent_elder") %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
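A quick cross-tabulation confirms both the small group size and the near-even split; a minimal check using the persistent_elders feature built above:

# Cross-tabulate persistent elders against the campaign outcome;
# expect roughly a 50/50 yes/no split within a group of ~16 homeowners.
info %>%
  count(persistent_elders, y) %>%
  tidyr::pivot_wider(names_from = y, values_from = n, values_fill = 0)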

Regressions

Kitchen Sink Model

My first model is the kitchen sink model, which includes all of the original and engineered variables. It is unlikely to be a successful model, but its results can help identify what may be important for my final model.

info <- info %>%
  na.omit() %>%
  mutate(y_fct = factor(y, levels = c("yes","no"))) %>%
  select(-`...1`) # drop the CSV's unnamed row-index column

set.seed(4119)
trainIndex <- createDataPartition(
  y = paste(info$education, info$taxLien), # stratify the split on education x taxLien combinations
  p = .65, # 65% training / 35% test split
                                  list = FALSE,
                                  times = 1)
subsidyTrain <- info[ trainIndex,]
subsidyTest  <- info[-trainIndex,]


# Kitchen sink model
kitchen_sink <- glm(y_fct ~ ., # name the response directly so "." excludes it from the predictors
                  data=subsidyTrain %>% 
                    dplyr::select(-y, -y_numeric),
                  family="binomial" (link="logit"))

#summary(kitchen_sink) # AIC: 1502
#pR2(kitchen_sink) # McFadden: 0.256
kitchen_sink_summary <- tab_model(kitchen_sink, show.reflvl = T, show.intercept = T, p.style = "numeric_stars")

kitchen_sink_summary
Dependent variable: subsidyTrain$y_fct

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.00 | 0.00 – 3.84e+44 | 0.504 |
| age | 1.03 | 0.85 – 1.26 | 0.749 |
| age_group30-39 | 0.62 | 0.66 – 1.28 | 0.204 |
| age_group40-49 | 0.52 | 0.56 – 0.48 | 0.275 |
| age_group50-59 | 0.27 | 0.31 – 0.23 | 0.099 |
| age_group60-69 | 0.18 | 0.23 – 1.54 | 0.117 |
| age_group70+ | 0.12 | 0.07 – 0.33 | 0.201 |
| campaign | 1.07 | 0.99 – 1.17 | 0.093 |
| cons.conf.idx | 1.16 | 0.97 – 1.39 | 0.114 |
| cons_conf_groupmedium_low | 2.10 | 1.40 – 14.86 | 0.457 |
| cons_conf_groupvery_low | 6.07 | 0.54 – 4.94 | 0.136 |
| contacted_evernot contacted | 0.00 | 14.87 – 0.00 | 0.914 |
| contacttelephone | 2.30 ** | 1.28 – 4.38 | 0.007 |
| day_of_weekmon | 1.23 | 1.11 – 2.14 | 0.462 |
| day_of_weekthu | 1.38 | 0.78 – 1.31 | 0.267 |
| day_of_weektue | 1.13 | 0.64 – 1.98 | 0.671 |
| day_of_weekwed | 1.23 | 0.69 – 2.18 | 0.484 |
| educationbasic.6y | 0.68 | 0.73 – 1.63 | 0.370 |
| educationbasic.9y | 0.76 | 0.43 – 1.52 | 0.439 |
| educationhigh.school | 0.76 | 0.81 – 1.47 | 0.435 |
| educationilliterate | 7.86e+09 | 0.00 – Inf | 1.000 |
| educationprofessional.course | 0.74 | 0.35 – 1.55 | 0.433 |
| educationuniversity.degree | 0.73 | 0.36 – 1.42 | 0.361 |
| educationunknown | 0.84 | 0.34 – 2.15 | 0.712 |
| first_timersnot_FHA | 0.76 | 0.44 – 0.72 | 0.336 |
| inflation_grouplow | 2.76e+09 | 0.00 – NA | 1.000 |
| inflation_rate | 0.00 ** | 0.00 – 0.20 | 0.005 |
| jobblue-collar | 1.30 | 0.75 – 1.27 | 0.350 |
| jobentrepreneur | 1.74 | 0.71 – 5.25 | 0.266 |
| jobhousemaid | 0.93 | 0.37 – 2.68 | 0.888 |
| jobmanagement | 1.82 | 1.92 – 1.72 | 0.062 |
| jobretired | 1.75 | 1.55 – 4.35 | 0.216 |
| jobself-employed | 1.94 | 0.87 – 4.79 | 0.126 |
| jobservices | 1.20 | 0.67 – 1.15 | 0.548 |
| jobstudent | 1.25 | 0.47 – 3.48 | 0.659 |
| jobtechnician | 1.11 | 0.68 – 1.82 | 0.675 |
| jobunemployed | 1.10 | 0.48 – 2.80 | 0.833 |
| jobunknown | 0.96 | 0.23 – 1.78 | 0.958 |
| log_age | 1.12 | 0.34 – 1722.52 | 0.976 |
| log_inflation | 316.26 | 0.34 – 342550.85 | 0.100 |
| maritalmarried | 1.31 | 0.81 – 2.08 | 0.261 |
| maritalsingle | 1.08 | 0.62 – 1.88 | 0.777 |
| maritalunknown | 1.08 | 0.13 – 7.88 | 0.950 |
| mature_white_collarnot_white_collar | 2.31 | 2.58 – 2.06 | 0.228 |
| monthaug | 0.96 | 1.15 – 0.84 | 0.972 |
| monthdec | 0.69 | 0.08 – 6.26 | 0.746 |
| monthjul | 0.57 | 0.06 – 1.43 | 0.630 |
| monthjun | 0.28 | 0.32 – 0.31 | 0.262 |
| monthmar | 0.47 | 0.15 – 1.47 | 0.192 |
| monthmay | 1.03 | 0.48 – 2.18 | 0.930 |
| monthnov | 2.69 | 4.14 – 3.28 | 0.253 |
| monthoct | 1.26 | 0.22 – 7.24 | 0.792 |
| monthsep | 1.47 | 0.29 – 7.60 | 0.643 |
| mortgageunknown | 1.27 | 0.46 – 4.30 | 0.672 |
| mortgageyes | 1.00 | 0.67 – 0.97 | 0.997 |
| pdays | 1.01 | 0.88 – 1.15 | 0.906 |
| persistent_elderspersistent_elder | 0.34 | 0.06 – 1.76 | 0.197 |
| poutcomenonexistent | 0.31 ** | 0.13 – 0.67 | 0.004 |
| poutcomesuccess | 0.30 | 0.06 – 1.62 | 0.160 |
| previous | 0.58 * | 0.33 – 0.95 | 0.042 |
| repairs_amountwithin_credit | 0.66 | 0.87 – 25.82 | 0.822 |
| spent_on_repairs | 1.01 | 0.98 – 1.01 | 0.358 |
| summer_fridayssummer_friday | 1.30 | 1.39 – 1.21 | 0.500 |
| taxbill_in_phlyes | 1.06 | 1.09 – 1.56 | 0.756 |
| taxLienunknown | 0.88 | 0.71 – 0.80 | 0.567 |
| taxLienyes | 2.99e+06 | 0.00 – Inf | 0.998 |
| unemp_grouplower | 0.00 | NA – Inf | 1.000 |
| unemp_grouppositive | 22.85 | 26.41 – 606.90 | 0.061 |
| unemp_groupvery low | 0.00 | NA – Inf | 1.000 |
| unemploy_rate | 2.45 | 2.71 – 2.20 | 0.229 |

Observations: 2671
R² Tjur: 0.254
* p<0.05   ** p<0.01   *** p<0.001

Selected Model

My final selected model is below. It was chosen after iteratively comparing 21 different models. The model predicts a homeowner’s decision to enroll in the program from the month they were contacted in, the log-transformed amount they spent on repairs, and how they were contacted.

# MODEL 20
model_20_vars = c("month", "log_spend", "contact")
model_20 <- glm(y_fct ~ .,
                   data=subsidyTrain %>% 
                     dplyr::select(y_fct, all_of(model_20_vars)),
                   family="binomial" (link="logit"))
 
#summary(model_20) # AIC: 1488.1
#pR2(model_20) # MF: 0.206
model_summary <- tab_model(model_20, show.reflvl = T, show.intercept = T, p.style = "numeric_stars")

model_summary
Dependent variable: subsidyTrain$y_fct

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.00 *** | 0.00 – 0.00 | <0.001 |
| contacttelephone | 1.83 ** | 1.23 – 2.78 | 0.004 |
| log_spend | 3.89e+27 *** | 2.11e+23 – 8.24e+31 | <0.001 |
| monthaug | 0.55 | 0.29 – 0.99 | 0.052 |
| monthdec | 0.28 * | 0.08 – 0.91 | 0.036 |
| monthjul | 0.46 * | 0.24 – 0.86 | 0.017 |
| monthjun | 0.29 *** | 0.15 – 0.55 | <0.001 |
| monthmar | 0.18 *** | 0.08 – 0.43 | <0.001 |
| monthmay | 1.28 | 0.70 – 2.26 | 0.409 |
| monthnov | 0.89 | 0.46 – 1.70 | 0.726 |
| monthoct | 0.70 | 0.31 – 1.59 | 0.401 |
| monthsep | 0.73 | 0.32 – 1.66 | 0.452 |

Observations: 2671
R² Tjur: 0.193
* p<0.05   ** p<0.01   *** p<0.001

Cross-validation

# Kitchen sink CV
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

kitchen_sink_cvFit <- train(y_fct ~ ., data = info %>% 
                                   dplyr::select(-y, -y_numeric), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

#kitchen_sink_cvFit


# Selected model CV
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

model_cvFit <- train(y_fct ~ ., data = info %>% 
                                   dplyr::select(y_fct, all_of(model_20_vars)), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

#model_cvFit

Sensitivity and Specificity

The goodness-of-fit metrics for the cross-validations of both of my models are presented below. The figures plot the distribution of Area Under the Curve (AUC), Sensitivity (the rate at which the model correctly predicts that people will take the credit - true positives), and Specificity (the rate at which it correctly predicts that people will not take the credit - true negatives) across 100 folds.

Kitchen Sink Model My Kitchen Sink model resulted in an AUC of 0.759, a sensitivity of 0.21, and a specificity of 0.98. The model does not generalize well with respect to Sensitivity: with a sensitivity this low, it misses most of the homeowners who would actually take the credit. Its Specificity is very high, so it correctly identifies non-enrollees most of the time. Its AUC of 0.759 indicates a satisfactory ratio of true positives to false positives.
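The summary numbers quoted above can be read directly off caret's cross-validation results table; a minimal sketch (the ROC, Sens, and Spec columns come from twoClassSummary):

# Across-fold mean AUC (ROC), Sensitivity, and Specificity for the kitchen sink model
kitchen_sink_cvFit$results %>%
  dplyr::select(ROC, Sens, Spec)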

dplyr::select(kitchen_sink_cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(kitchen_sink_cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#7e03a8") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#f89540", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics - Kitchen Sink Model",
         subtitle = "Across-fold mean reprented as dotted lines") +
    theme_bw()

Selected Model My selected model resulted in an AUC of 0.77, a sensitivity of 0.139, and a specificity of 0.98. This model also does not generalize well with respect to Sensitivity: it fails to identify most of the homeowners who would actually enroll. Its specificity is very high, indicating that it correctly identifies non-enrollees most of the time. Its AUC of 0.77 indicates that the model predicts true positives at a satisfactory rate relative to false positives.

dplyr::select(model_cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(model_cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#7e03a8") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#f89540", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics - Selected Model",
         subtitle = "Across-fold mean reprented as dotted lines") +
    theme_bw()

Predictions: I split my data set into a 65% training set and a 35% testing set. After training my selected model on the training set, I made predictions on the testing set. I then built a confusion matrix of the outcomes in order to create an ROC curve (next section).

testProbs <- data.frame(Outcome = as.factor(subsidyTest$y_fct),
                        Probs = predict(model_20, subsidyTest, type= "response"))
testProbs <- 
  testProbs %>%
  mutate(predOutcome  = factor(
    ifelse(testProbs$Probs > 0.95 , "yes", "no"),
    levels = c("yes", "no"))) # ensure that Probs' levels match Outcome's levels.

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "yes")

ROC Curve

Below is the ROC (Receiver Operating Characteristic) curve for my selected model. The total area under the curve (AUC) is 0.754, which suggests that the model’s ratio of true positives to false positives is satisfactory: it predicts who will take the credit better than flipping a coin would.

From this plot, we see that when the model’s true positive rate (correctly predicting that someone will enroll) is 25%, its false positive rate (incorrectly predicting enrollment for someone who won’t) is about 10%. The best trade-off appears to be where the model achieves a true positive rate of about 80% at a false positive rate of about 35%.
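These operating points can also be read off the curve programmatically with pROC (loaded above); a minimal sketch:

# List sensitivity (true positive rate) and specificity at each candidate
# threshold; the false positive rate is 1 - specificity.
roc_obj <- roc(testProbs$Outcome, testProbs$Probs)
coords(roc_obj, x = "all", ret = c("threshold", "sensitivity", "specificity"))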


auc(testProbs$Outcome, testProbs$Probs)
ggplot(testProbs, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#cc4778") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Selected Model",
       subtitle = "Area under the curve: 0.754")+
  theme_bw()

Cost-Benefit Analysis

Calculating costs and benefits for a public service good is not as straightforward as it may be for other goods and services. For the purposes of this analysis, I’ve calculated the costs and benefits as follows:
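The kable() call below expects a data frame df holding these equations; the chunk that builds it isn't shown, so here is a minimal reconstruction consistent with the rendered table that follows:

# Hypothetical reconstruction of the cost/benefit assumptions (df).
# Assumes $2,850 outreach per predicted enrollee, a $5,000 credit claimed by
# the 25% of enrollees who complete the program, and $66,000 in home-value
# benefit ($10,000 own home + $56,000 for neighbors) per completed repair.
df <- tibble(
  Category = c("True Positive", "True Negative", "False Positive", "False Negative"),
  Description = c(
    "Cost per person who enrolls in the program, as predicted",
    "Cost per person who doesn't enroll in the program, as predicted",
    "Cost per person who doesn't enroll in the program, but was predicted to do so",
    "Cost per person who does enroll in the program, but was predicted not to"),
  Cost_Equation = c("2,850 + (5,000 * 25% of enrollees)", "0", "2,850", "5,000 * 25% of enrollees"),
  Cost_per_Enrollee = c(2850 + 5000 * 0.25, 0, 2850, 5000 * 0.25),
  Benefit_Equation = c("(10,000 + 56,000) * 25% of enrollees", "0", "0", "(10,000 + 56,000) * 25% of enrollees"),
  Benefit_per_Enrollee = c((10000 + 56000) * 0.25, 0, 0, (10000 + 56000) * 0.25),
  Net_Benefit_per_Enrollee = c(12400, 0, -2850, 15250))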

kable(df,
       caption = "Cost/Benefit Equations (all figures in USD)",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2(full_width = F) %>%
  kable_styling("striped")
Cost/Benefit Equations (all figures in USD)

| Category | Description | Cost_Equation | Cost_per_Enrollee | Benefit_Equation | Benefit_per_Enrollee | Net_Benefit_per_Enrollee |
|---|---|---|---|---|---|---|
| True Positive | Cost per person who enrolls in the program, as predicted | 2,850 + (5,000 * 25% of enrollees) | 4,100 | (10,000 + 56,000) * 25% of enrollees | 16,500 | 12,400 |
| True Negative | Cost per person who doesn't enroll in the program, as predicted | 0 | 0 | 0 | 0 | 0 |
| False Positive | Cost per person who doesn't enroll in the program, but was predicted to do so | 2,850 | 2,850 | 0 | 0 | -2,850 |
| False Negative | Cost per person who does enroll in the program, but was predicted not to | 5,000 * 25% of enrollees | 1,250 | (10,000 + 56,000) * 25% of enrollees | 16,500 | 15,250 |

Costs: Financial costs are calculated as described in the table above. The cost per person for outreach is $2,850. This is only spent when the model predicts the person will enroll in the program. The cost to provide a credit is $5,000, which will be provided to everyone who completes the credit program (an estimated 25% of people who enroll in the program).

Benefits: The financial benefits of the program only exist for people who complete the program (estimated at 25% of those who enroll). For this group, the benefit is $10,000 in increased home value for each person, and $56,000 of average increased home value for their neighbors.

Net Benefit: The net financial benefit for each group is calculated as the total benefits minus the expected costs for that group.

Given this logic, the costs and benefits of the selected model are presented below.

cost_benefit_table_model <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome=="no" & Outcome=="no"]),
                True_Positive = sum(n[predOutcome=="yes" & Outcome=="yes"]),
                False_Negative = sum(n[predOutcome=="no" & Outcome=="yes"]),
                False_Positive = sum(n[predOutcome=="yes" & Outcome=="no"])) %>%
       gather(Variable, Count) %>%
       mutate(Cost =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", ((2850 + (5000*0.25)) * Count), #25% get the 5k credit
               ifelse(Variable == "False_Negative", (0 + 5000*0.25) * Count, #25% get the 5k credit
               ifelse(Variable == "False_Positive", (2850) * Count, 0)))),
              Benefit =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",(((10000 + 56000)*0.25) * Count), #25% get the credit, and therefore the 10k house price equity and 56k equity for their neighbors
               ifelse(Variable == "False_Negative", (((10000 + 56000)*0.25) * Count),
               ifelse(Variable == "False_Positive", (0) * Count, 0)))),
              Net_Benefit = case_when( #benefit - cost
                Variable == "True_Negative" ~ (0 - 0),
                Variable == "True_Positive" ~ ((((10000 + 56000)*0.25) * Count)) - ((2850 + (5000*0.25)) * Count),
              Variable == "False_Negative" ~ ((((10000 + 56000)*0.25) * Count) - (0 + 5000*0.25) * Count),
              Variable == "False_Positive" ~ (0 - ((2850) * Count)))) %>%
    bind_cols(data.frame(Description = c(
              "We correctly predicted they would not enroll in the program",
              "We correctly predicted they would enroll in the program; 25% completed to receive the credit",
              "We predicted the they would not enroll in the program, but the they did enroll; 25% completed to receive the credit",
              "We predicted they would enroll in the program, but they did not enroll")))


cost_ben_totals <- adorn_totals(cost_benefit_table_model, where = "row", fill = "", na.rm = TRUE, name = "Total")

kable(cost_ben_totals,
       caption = "Cost/Benefit Table, Selected Model. Threshold is 0.95.",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2(full_width = F) %>%
  kable_styling("striped")
Cost/Benefit Table, Selected Model. Threshold is 0.95.

| Variable | Count | Cost | Benefit | Net_Benefit | Description |
|---|---|---|---|---|---|
| True_Negative | 845 | 0 | 0 | 0 | We correctly predicted they would not enroll in the program |
| True_Positive | 19 | 77,900 | 313,500 | 235,600 | We correctly predicted they would enroll in the program; 25% completed to receive the credit |
| False_Negative | 132 | 165,000 | 2,178,000 | 2,013,000 | We predicted they would not enroll in the program, but they did enroll; 25% completed to receive the credit |
| False_Positive | 429 | 1,222,650 | 0 | -1,222,650 | We predicted they would enroll in the program, but they did not enroll |
| Total | 1,425 | 1,465,550 | 2,491,500 | 1,025,950 | |

Net Benefits Per Threshold

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>% # use the function's argument rather than the global testProbs
      mutate(predOutcome = ifelse(Probs > x, "yes", "no")) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome=="no" & Outcome=="no"]),
                True_Positive = sum(n[predOutcome=="yes" & Outcome=="yes"]),
                False_Negative = sum(n[predOutcome=="no" & Outcome=="yes"]),
                False_Positive = sum(n[predOutcome=="yes" & Outcome=="no"])) %>%
     gather(Variable, Count) %>%
     mutate(Net_Benefit = #Benefit - Cost
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", ((((10000 + 56000)*0.25) * Count) - ((2850 + (5000*0.25)) * Count)), # benefit - cost
               ifelse(Variable == "False_Negative", ((((10000 + 56000)*0.25) * Count) - ((0 + 5000*0.25) * Count)), #benefit - cost
               ifelse(Variable == "False_Positive", (0 - (2850) * Count), 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

The plot below illustrates the expected net financial benefit at each threshold the model could be set to. The threshold is the predicted probability above which a homeowner is classified as a predicted “yes”. Since program recruitment is costly, this graph shows the financial outcome of each threshold the model could use.

Said differently, this plot shows how confident we need to be in a homeowner’s likely enrollment into the program to be worth the cost of the program. It appears that we need to be extremely sure - almost 100% sure - that someone will enroll, in order for the program to see a positive net financial benefit.

whichThreshold_model <- iterateThresholds(testProbs)

whichThreshold_revenue_model <- 
whichThreshold_model %>% 
    group_by(Threshold) %>% 
    summarize(Net_Benefit = sum(Net_Benefit))

  ggplot(whichThreshold_revenue_model)+
  geom_line(aes(x = Threshold, y = Net_Benefit), color = "#cc4778")+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue_model, -Net_Benefit)[1,1]), color = "#0d0887")+
    labs(title = "Model Revenues By Threshold For Test Sample\nSelected Model",
         subtitle = "Vertical Line Denotes Optimal Threshold")+
    theme_bw()
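The optimal threshold marked by the vertical line can also be pulled directly from the summarized table; a minimal sketch:

# Threshold that maximizes total net benefit on the test set
whichThreshold_revenue_model %>%
  arrange(desc(Net_Benefit)) %>%
  slice(1)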

Confusion Matrices for each Threshold

This plot shows the net financial benefit of the program for each confusion matrix outcome, depending on the threshold used by the model. True positives yield the greatest net benefit when the threshold is below 0.25, while the benefit attributed to false negatives grows as the threshold increases. True negatives stay flat, since their costs and benefits are always zero. False positives have a negative net benefit until the threshold approaches one.

In sum, this tells us that the maximum net benefit occurs when the threshold is set very high, and in that case, false negatives provide the largest source of financial benefit.

whichThreshold_model %>%
  ggplot(.,aes(Threshold, Net_Benefit, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette3[c(5, 1:3)]) +    
  labs(title = "Net Benefit by confusion matrix type and threshold",
       y = "Net Benefit") +
  theme_bw() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

Threshold as a function of total revenue and total count of credits

These graphs show the net financial benefits and total credits provided of the credit program, based on the threshold that the model uses to predict program enrollment. Similar to the previous revenue vs. threshold graph, these plots tell us that the threshold must be very high (about 0.95 or more) for the program to see positive financial benefits, and to distribute the maximum number of credits.

whichThreshold_netBen <- 
  whichThreshold_model %>% 
    mutate(actualEnroll = ifelse(Variable == "True_Positive", (Count * .25),
                         ifelse(Variable == "False_Negative", Count, 0))) %>% 
    group_by(Threshold) %>% 
    summarize(Net_Benefit = sum(Net_Benefit),
              Total_Credits = sum(actualEnroll)) 
grid.arrange(ncol=2,

ggplot()+
  geom_point(data = whichThreshold_netBen , aes(x = Threshold, y = Net_Benefit), color = "#cc4778")+
  geom_line(data = whichThreshold_netBen , aes(x = Threshold, y = Net_Benefit), color = "#cc4778")+
  labs(title = "Net Benefit vs.\nSelected Threshold",
       y = "Net Benefit")+
  theme_bw(),

ggplot()+
  geom_point(data = whichThreshold_netBen , aes(x = Threshold, y = Total_Credits), color = "#0d0887")+
  geom_line(data = whichThreshold_netBen , aes(x = Threshold, y = Total_Credits), color = "#0d0887")+
  labs(title = "Total Credits Provided vs.\nSelected Threshold",
       y = "Total Credits")+
  theme_bw())

The above plots are summarized in the table below. As we can see, the net financial benefit only becomes positive once the threshold value is 0.95 - meaning, we need to be at least 95% sure that someone will enroll in the program for the program to have a net positive impact.

whichThreshold_netBen %>%
  slice(seq(5, n(), by = 5)) %>% # keep every fifth threshold (0.05, 0.10, ...)
  kable(.,
       caption = "Thresholds and Associated Net Benefits and Credits, Selected Model",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2(full_width = F) %>%
  kable_styling("striped")
Thresholds and Associated Net Benefits and Credits, Selected Model

| Threshold | Net_Benefit | Total_Credits |
|---|---|---|
| 0.05 | -1,758,500 | 37.75 |
| 0.10 | -1,758,500 | 37.75 |
| 0.15 | -1,758,500 | 37.75 |
| 0.20 | -1,758,500 | 37.75 |
| 0.25 | -1,758,500 | 37.75 |
| 0.30 | -1,738,550 | 42.25 |
| 0.35 | -1,721,450 | 44.50 |
| 0.40 | -1,721,450 | 44.50 |
| 0.45 | -1,670,150 | 52.00 |
| 0.50 | -1,593,200 | 58.75 |
| 0.55 | -1,567,550 | 62.50 |
| 0.60 | -1,473,500 | 73.00 |
| 0.65 | -1,422,200 | 76.00 |
| 0.70 | -1,370,900 | 82.75 |
| 0.75 | -1,288,250 | 90.25 |
| 0.80 | -1,274,000 | 91.75 |
| 0.85 | -1,259,750 | 91.75 |
| 0.90 | -498,800 | 112.00 |
| 0.95 | 1,025,950 | 136.75 |

Conclusion

In a world where people refuse to answer calls from unknown numbers, and trust in governmental institutions is near an all-time low, it is very difficult to recruit homeowners into a governmental program, despite the program’s benefits. With this challenge in mind, my model correctly predicts when homeowners will enroll in the program 12.6% of the time, a mild improvement on the program’s previous success rate of 11%.

However, my model provides some significant advantages for the Department. First, it estimates the probability that a homeowner will enroll in the program. This is a huge advantage over not having a model, because we can use that probability to target the highest-probability homeowners. The model also quantifies the trade-offs in net financial benefits across these probability cutoffs (thresholds). Using these probabilities, I was able to chart a path for the program to have a net positive benefit to the community: the model identifies a targeting strategy that provides more than a million dollars’ worth of increased home value to the community, even after taking the program’s costs into account.

With my investigation in mind, I have the following suggestions for the marketing campaign. First, more people tend to enroll in the program in March and June, so it may be worth ramping up outreach then. Also, adults over 55 enrolled in the program at a higher rate than younger adults; I suggest targeting that age group instead of the 30-40 year-olds the program has tended to target.

It is also important to streamline the credit program itself: for each enrollee who drops out (an estimated 75% of enrollees), the Department loses $2,850 in recruitment costs, and the community loses out on $66,000 in forgone home-value gains. If the dropout rate could be reduced to 50%, the net benefit per enrollee would more than double: from $12,400 to $27,650, following the cost and benefit equations in the table above.
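The arithmetic behind that comparison, as a quick check (using the per-enrollee figures from the cost/benefit table; the 50% completion scenario is hypothetical):

# Net benefit per predicted enrollee at a given completion rate:
# $66,000 in home-value benefit per completed repair, against $2,850 in
# outreach plus the $5,000 credit claimed by completers.
net_benefit_per_enrollee <- function(completion_rate) {
  benefit <- (10000 + 56000) * completion_rate
  cost <- 2850 + 5000 * completion_rate
  benefit - cost
}
net_benefit_per_enrollee(0.25) # $12,400 at the current 25% completion rate
net_benefit_per_enrollee(0.50) # $27,650 if half of enrollees complete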

While better-quality data is not always available, it could be helpful to include spatial information in the dataset, such as the homeowner’s address or zip code. This would let us examine enrollment across space - for example, correlating a homeowner’s enrollment with the number of neighbors who have previously enrolled in the credit program - which could help the model correctly predict enrollees.

Finally, if the cost per contact for recruitment can be reduced, the program would see a greater net financial benefit to the community. It would also mean we could reach out to more people, since our threshold for the predicted probability of enrollment would not need to be as high.

In summary, this model would help the Department boost its positive net financial impact on the community. However, it would be best used to improve the credit program in conjunction with the other organizations within the Department, not in a silo by the data science team alone.