Assignment completed for Public Policy Analytics course at the University of Pennsylvania.

Housing Subsidies

Emil City’s (fictional) Department of Housing and Community Development (HCD) offers a robust home repair tax credit program. We provide homeowners who complete our credit program $5,000 towards home repairs of any kind.

Homeowners who have participated in our program have reported high satisfaction with the program. Typically, homeowners see an increase in their home value of $10,000 from making a repair with our credit. Even their neighbors see a benefit - on average, neighbors near a repaired home have a combined house value increase of $56,000.

Despite these clear benefits, HCD has struggled to find homeowners willing to enroll in the program. To increase participation in this program, I am going to use HCD’s previous campaign data to better target program outreach, and thus increase the benefits of the program across Emil City.

# Set up
options(scipen=10000000)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
#library(jtools) 
#library(ggstance)
library(sjPlot)
#install.packages("janitor")
library(janitor)
library(gridExtra)

palette2 <- c("#cc4778","#0d0887")

palette1 = c("#0d0887", "#7e03a8", "#cc4778", "#f89540", "#f0f921")
palette3 = c("#f0f921", "#f89540", "#cc4778", "#7e03a8", "#0d0887")

# Read in data
info <- read_csv("C:\\Users\\radams\\src\\PubPolicyAnalytics\\HW4\\508_HW4\\data\\housingSubsidy.csv")

I’m going to engineer a few features from the data set to help with my data exploration and modeling. In particular, I try to build features that make sense for the context of the housing subsidy. For instance, I create a feature that categorizes the season that the credit is offered in. The logic behind this is that spring and summer is often considered the season of home improvement, and so people may be more interested in the offered credit during these seasons.

Although this code is presented before data exploration, most of this feature engineering took place concurrently to data exploration.

info <- info %>%
  select(-cons.price.idx) %>%
  mutate(
    # ID months in spring/summertime
    season = (ifelse(month %in% c("apr", "may", "jun", "jul", "aug"), "spring_summer", "fall_winter")),
    
    # tease out four seasons
    season4 = (case_when(
      month %in% c("mar", "apr", "may") ~ "spring",
      month %in% c("jun", "jul", "aug") ~ "summer",
      month %in% c("sep", "oct", "nov") ~ "fall",
      month %in% c("dec") ~ "winter")),
    
    # Create age groups
    age_group = case_when(
      age < 30 ~ "18-29",
      age >= 30 & age < 40 ~ "30-39",
      age >= 40 & age < 50 ~ "40-49",
      age >= 50 & age < 60 ~ "50-59",
      age >= 60 & age < 70 ~ "60-69",
      age >= 70 ~ "70+"),
    
    # create group for probable FHA mortgage holders (proxy for first time house owners) 
    first_timers = if_else(age < 40 & mortgage == "yes" & marital %in% c("married", "single"), "first_timers", "not_FHA"),
    
    # Group retirees together
    retirees = if_else(job == "retired", "retiree", "not_retired"),
    
    # Contacted or not
    contacted_ever = if_else(pdays == 999, "not contacted", "contacted"),
    
    # Amount spent on repairs is less/more than the credit
    repairs_amount = case_when(
      spent_on_repairs <= 5000 ~ "within_credit",
      spent_on_repairs > 5000 & spent_on_repairs <= 10000 ~ "beyond_credit"),
    
    # Group jobs together
    job_group = if_else(
      (job %in% c("student", "unemployed")), "limited_means", "means"),
    
    # group inflation
    inflation_group = if_else(inflation_rate >= 4, "high", "low"),
    
    # group consumer confidence index
    cons_conf_group = case_when(
      cons.conf.idx < -40 ~ "very_low",
      cons.conf.idx > -40 & cons.conf.idx < -35 ~ "medium_low",
      cons.conf.idx > -35 ~ "low"),
    
    # education groups
    edu_group = if_else(
      education %in% c("university.degree", "professional.course"),"advanced", "other"),
    
    summer_fridays = if_else(season4 == "summer" & day_of_week == "fri", "summer_friday", "not_summer_friday"),
    
    over_50 = if_else(age >= 50, "over_50", "under_50"),
    
    unemp_group = case_when(
      unemploy_rate < -2 ~ "very low",
      unemploy_rate > -2 & unemploy_rate < -1 ~ "lower",
      unemploy_rate > -1 & unemploy_rate < 0 ~ "low",
      unemploy_rate > 0 ~ "positive"),
    
    # cons_price_group = case_when(
    #   cons.price.idx < 92.7 ~ "under_92.7",
    #   cons.price.idx >= 92.7 & cons.price.idx < 93.3 ~ "93ish",
    #   cons.price.idx >= 93.3 & cons.price.idx < 93.7 ~ "93.5ish",
    #   cons.price.idx >= 93.5 & cons.price.idx < 94.3 ~ "94ish",
    #   cons.price.idx >= 94.3 ~ "94+ish"),
    
    log_age = log(age),
    
    #log_unemp = log(unemploy_rate)+1,
    
    #log_cons_conf_idx = log(cons.conf.idx)+1,
    
    #log_cons_price_idx = log(cons.price.idx),
    
    log_inflation = log(inflation_rate),
    
    log_spend = log(spent_on_repairs),
    
    persistent_elders = if_else(age > 60 & previous > 0, "persistent_elder", "non_persistent_elder"),
    
    mature_white_collar = if_else(job %in% c("admin.", "technician") & edu_group == "advanced" & over_50 == "over_50" & previous > 0, "mat_white_collar", "not_white_collar")
    )

Correlations

Original variables

Age: Below is a density plot of the original age variable. This density plot shows that the program previously primarily targeted people around 30-40 years old. Despite this, these groups had more “nos” than “yeses”. After about age 60, it appears that more people agreed to enroll in the program. This indicates that a feature built around this older age group may offer a path towards more “yeses”. The second graph illustrates the average age for groups that said “no” and “yes” respectively. It appears that there is not much difference between the two groups according to this bar plot: the “yes” group, at an average age of about 42, tends to be slightly older than the “no” group (average age is about 40).

grid.arrange(ncol=2,
info %>%
    dplyr::select(y, age) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions credit vs.\nno credit",
         subtitle = "(continous outcomes)"),
info %>%
  dplyr::select(y, age) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_manual(values = palette2) +
      labs(x="y", y="Value", 
           title = "Feature associations with the\nlikelihood of taking the credit",
           subtitle = "(continous outcomes)") +
      theme(legend.position = "none"))

Day of the Week: Below is a graph of average responses to the campaign for different days of the week. It appears that there is not much correlation between the day of the week and whether a person opted in to the program.

info %>%
    dplyr::select(y, day_of_week) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

New features

Season: I created a feature that classifies the month we last contacted a houseowner into its associated season: spring/summer or fall/winter. My hypothesis is that more people are interested in taking on house improvement projects during the warmer months. This hypothesis is borne out according to the graph below. More people took the credit in summer months:

info %>%
    dplyr::select(y, season) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

First-timers: I created a feature to proxy new house owners. This category includes all people below 40 years old who have a mortgage and are either married or single. Interestingly, people who were not identified as first timers took the credit more often.

info %>%
    dplyr::select(y, first_timers) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Repairs amount: I log-transformed the original variable of the average amount people spent on their home repairs. It appears that as the logarithmic value of their repairs increases, people are less likely to enroll in the credit program.

info %>%
    dplyr::select(y, log_spend) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_color_manual(values = palette2) +
    labs(title = "Feature distributions credit vs. no credit",
         subtitle = "(continous outcomes)")

Persistent Elders: I created a variable to capture homeowners over the age of 60 who had been previously contacted by the campaign. I found that this group was as likely to enroll in the program as not. While this 50/50 split may not seem favorable, the high proportion of “nos” across the dataset made this a key insight. However, since this group was only about 16 members, it wasn’t ultimately useful for the model.

info %>%
    dplyr::select(y, persistent_elders) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
    filter(value == "persistent_elder") %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free") +
        scale_fill_manual(values = palette2) +
        labs(x="y", y="Value",
             title = "Feature associations with the likelihood of taking the credit",
             subtitle = "Categorical features") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))